Scrapinghub Crawls the Deep Web

“The easiest way to think about Memex is: How can I make the unseen seen?”

— Dan Kaufman, director of the innovation office at DARPA

Scrapinghub is participating in Memex, an ambitious DARPA project that tackles the huge challenge of crawling, indexing, and making sense of areas of the Deep Web: web content that is not indexed by traditional search engines such as Google and Bing. According to current estimates, this content dwarfs Google’s total indexed content by a ratio of almost 20 to 1. It includes all sorts of criminal activity that, until Memex became available, had proven very hard to track down systematically.

The inventor of Memex, Chris White, appeared on 60 Minutes to explain how it works and how it could revolutionize law enforcement investigations:

Video: “New search engine exposes the dark web.” Lesley Stahl and producer Shachar Bar-On got an early look at Memex on 60 Minutes.

Scrapinghub will be participating alongside Cloudera, Elephant Scale and Openindex as part of the Hyperion Gray team. We’re delighted to bring our web scraping expertise and open source projects, such as Scrapy, Splash and Crawl Frontier, to a project that has such a positive impact in the real world.

We hope to share more news regarding Memex and Scrapinghub in the coming months!
