Scrapinghub Crawls the Deep Web

"The easiest way to think about Memex is: How can I make the unseen seen?"

-- Dan Kaufman, director of the innovation office at DARPA

Scrapinghub is participating in Memex, an ambitious DARPA project that tackles the huge challenge of crawling, indexing, and making sense of areas of the Deep Web: web content that traditional search engines such as Google and Bing do not index. By current estimates, this content dwarfs Google’s total indexed content by a ratio of almost 20 to 1. It includes all sorts of criminal activity that, until Memex became available, had proven very hard to track down systematically.

The inventor of Memex, Chris White, appeared on 60 Minutes to explain how it works and how it could revolutionize law enforcement investigations:

 

Video: "New search engine exposes the dark web." Lesley Stahl and producer Shachar Bar-On got an early look at Memex on 60 Minutes.

 

Scrapinghub will be participating alongside Cloudera, Elephant Scale and Openindex as part of the Hyperion Gray team. We’re delighted to bring our web scraping expertise and open-source projects, such as Scrapy, Splash and Crawl Frontier, to a project that has such a positive impact in the real world.

We hope to share more news regarding Memex and Scrapinghub in the coming months!
