Scrapy Tips from the Pros: July 2016

Scrapy is designed to be extensible, with loosely coupled components. You can easily extend Scrapy’s functionality with your own middleware or pipeline.

This makes it easy for the Scrapy community to develop new plugins that improve upon existing functionality, without making changes to Scrapy itself.

In this post we’ll show how you can leverage the DeltaFetch plugin to run incremental crawls.

Incremental Crawls with DeltaFetch

Some crawlers we develop are designed to crawl and fetch the data we need only once. On the other hand, many crawlers have to run periodically in order to keep our datasets up-to-date.

In many of these periodic crawlers, we’re only interested in new pages included since the last crawl. For example, we have a crawler that scrapes articles from a bunch of online media outlets. The spiders are executed once a day and they first retrieve article URLs from pre-defined index pages. Then they extract the title, author, date and content from each article. This approach often leads to many duplicate results and an increasing number of requests each time we run the crawler.

Fortunately, we are not the first ones to have this issue. The community already has a solution: the scrapy-deltafetch plugin. You can use this plugin for incremental (delta) crawls. DeltaFetch’s main purpose is to avoid requesting pages that have already been scraped, even if that happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs from the spider’s start_urls attribute, and to requests generated in the spider’s start_requests method.

DeltaFetch works by intercepting every Item and Request object generated in spider callbacks. For Items, it computes the fingerprint of the request that produced the item (a.k.a. the request fingerprint) and stores it in a local database. For Requests, DeltaFetch computes the request fingerprint and drops the request if it already exists in the database.
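
To make the mechanism concrete, here is a minimal, in-memory sketch of the idea behind such a spider middleware. This is not DeltaFetch’s actual implementation (which persists fingerprints to a local database and also honours start_urls and start_requests); it only illustrates the interception described above:

from scrapy import Request
from scrapy.utils.request import request_fingerprint


class ToyDeltaMiddleware(object):
    """Illustrative only: remember requests that produced items and
    drop future requests whose fingerprint has already been seen."""

    def __init__(self):
        self.seen = set()  # DeltaFetch persists this to disk instead

    def process_spider_output(self, response, result, spider):
        for entry in result:
            if isinstance(entry, Request):
                if request_fingerprint(entry) in self.seen:
                    continue  # this page already produced items: skip it
                yield entry
            else:
                # an item was scraped: remember the request that produced it
                self.seen.add(request_fingerprint(response.request))
                yield entry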

Now let’s see how to set up DeltaFetch for your Scrapy spiders.

Getting Started with DeltaFetch

First, install DeltaFetch using pip:

$ pip install scrapy-deltafetch

Then, you have to enable it in your project’s settings.py file:

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True

DeltaFetch in Action

Our example crawler has a spider that scrapes books.toscrape.com. It navigates through all the listing pages and visits every book’s details page to fetch data such as the book title, description and category. The crawler is executed once a day in order to capture new books that are added to the catalogue. There’s no need to revisit book pages that have already been scraped, because the data collected by the spider typically doesn’t change.
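
For reference, the spider looks roughly like the sketch below. This is a simplified illustration: the CSS selectors are the ones books.toscrape.com uses at the time of writing, and the spider in the repository may differ in its details:

import scrapy


class ToScrapeSpider(scrapy.Spider):
    name = 'toscrape'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # visit every book listed on this catalogue page
        for href in response.css('article.product_pod h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_book)
        # then move on to the next listing page, if there is one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

    def parse_book(self, response):
        yield {
            'title': response.css('div.product_main h1::text').extract_first(),
            'category': response.css('ul.breadcrumb li:nth-child(3) a::text').extract_first(),
            'description': response.css('#product_description + p::text').extract_first(),
        }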

To see DeltaFetch in action, clone this repository, which has DeltaFetch already enabled in settings.py, and then run:

$ scrapy crawl toscrape

Wait until it finishes and then take a look at the stats that Scrapy logged at the end:

2016-07-19 10:17:53 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/stored': 1000,
    ...
    'downloader/request_count': 1051,
    ...
    'item_scraped_count': 1000,
}

Among other things, you’ll see that the spider made 1051 requests to scrape 1000 items and that DeltaFetch stored 1000 request fingerprints. This means that only 51 page requests didn’t generate any items, so those pages will be revisited next time.

Now, run the spider again and you’ll see a lot of log messages like this:

2016-07-19 10:47:10 [toscrape] INFO: Ignoring already visited: 
<GET http://books.toscrape.com/....../index.html>

And in the stats you’ll see that 1000 requests were skipped because items had been scraped from those pages in a previous crawl. This time the spider didn’t extract any items and made only 51 requests, all of them to listing pages from which no items had been scraped before:

2016-07-19 10:47:10 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/skipped': 1000,
    ...
    'downloader/request_count': 51,
}

Changing the Database Key

By default, DeltaFetch uses a request fingerprint to tell requests apart. This fingerprint is a hash computed from the canonical URL, HTTP method and request body.
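
For instance, two requests whose URLs differ only in the order of their query parameters canonicalize to the same form and therefore share a fingerprint. Here’s a quick check using Scrapy’s request_fingerprint utility (the example.com URLs are placeholders):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://www.example.com/product?id=123&ref=home')
r2 = Request('http://www.example.com/product?ref=home&id=123')

# canonicalization sorts the query string, so both fingerprints match
print(request_fingerprint(r1) == request_fingerprint(r2))  # True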

Some websites have several URLs for the same data. For example, an e-commerce site might expose a single product under several URLs that differ only in their query parameters, such as a category listing URL and a promotional URL that both carry the same product id.

Request fingerprints aren’t suitable in these situations, as the canonical URLs differ even though the item is the same. In this example, we could use the product’s ID as the DeltaFetch key.

DeltaFetch allows us to define custom keys by passing a meta parameter named deltafetch_key when initializing the Request:

from scrapy import Request
from w3lib.url import url_query_parameter

...

def parse(self, response):
    ...
    for product_url in response.css('a.product_listing::attr(href)').extract():
        # use the product id from the URL as the DeltaFetch key
        yield Request(
            response.urljoin(product_url),
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page
        )
    ...

This way, DeltaFetch will ignore requests to duplicate pages even if they have different URLs.

Resetting DeltaFetch

If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:

$ scrapy crawl example -a deltafetch_reset=1

Using DeltaFetch on Scrapy Cloud

You can also use DeltaFetch in your spiders running on Scrapy Cloud. You just have to enable the DeltaFetch and DotScrapy Persistence addons in your project’s Addons page. The latter is required to allow your crawler to access the .scrapy folder, where DeltaFetch stores its database.

DeltaFetch is quite handy in situations like the ones we’ve just seen. Keep in mind that DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if those requests were not generated from the spider’s start_urls or start_requests. Pages from which no items were directly scraped will still be crawled every time you run your spiders.

You can check out the project page on GitHub for further information: http://github.com/scrapy-plugins/scrapy-deltafetch

Wrap-up

You can find many interesting Scrapy plugins in the scrapy-plugins page on Github and you can also contribute to the community by including your own plugin there.

If you have a question or a topic that you’d like to see in this monthly column, please drop a comment here letting us know or reach out to us via @scrapinghub on Twitter.

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web. This is so the general public can monitor what their representatives are doing on their behalf.

However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable.

Congress Bills in Peru

For the sake of transparency, the Peruvian Congress provides a website where people can check the list of bills that are being processed, voted on and eventually become law. For each bill, there’s a page with its authorship, title, submission date and a brief summary. These pages are frequently updated as bills move between commissions, get approved and are then published as laws.

By having all of this information online, lawyers and the general public can potentially inspect bills that could be the result of lobbying. In Peruvian history, many laws have been passed that benefited only one specific company or individual.

However, transparency doesn’t guarantee accessibility. The site is very clunky, and the information for each bill is spread across several pages. It displays the bills in a very long list with far too many pages, and until very recently there was no way to search for specific bills.

In the past, if you wanted to find a bill, you would need to look through several pages manually. This is very time-consuming, as around one thousand bills are proposed every year. Not long ago, the site added a search tool, but it’s not user-friendly at all:

The Solution

My lawyer friends from the Peruvian NGOs Hiperderecho.org and Respeto.pe asked me about the possibility of building a web application. Their goal was to organize all the data from the Congress bills, allowing people to easily search and discover bills by keywords, authors and categories.

The first step in building this was to grab all bill data and metadata from the Congress website. Since they don’t provide an API, we had to use web scraping. For that, Scrapy is a champ.

I wrote several Scrapy spiders to crawl the Congress site and download as much data as possible. The spiders wake up every 8 hours and crawl the Congress pages looking for new bills. They parse the data they scrape and save it into a local PostgreSQL database.
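
As an illustration, an item pipeline for this kind of job could look like the sketch below. The table and field names here are made up for the example; the actual project may store its data differently:

import psycopg2


class PostgresBillsPipeline(object):
    """Illustrative pipeline that stores each scraped bill in PostgreSQL."""

    def open_spider(self, spider):
        self.conn = psycopg2.connect(dbname='bills', user='scrapy',
                                     password='secret', host='localhost')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # 'bills' and its columns are hypothetical names for this example
        self.cur.execute(
            "INSERT INTO bills (code, title, summary) VALUES (%s, %s, %s) "
            "ON CONFLICT (code) DO NOTHING",
            (item.get('code'), item.get('title'), item.get('summary')))
        self.conn.commit()
        return item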

Once we had achieved the critical step of getting all the data, it was relatively easy to build a search tool to navigate the 5400+ bills and counting. I used Django to create a simple interface for users, and so ProyectosDeLey.pe was born.

The Findings

All kinds of possibilities are open once we have the data. For example, we could now generate statistics on the status of the bills. We found that of the 5402 proposed bills, only 740 became laws, meaning most of the bills were rejected or forgotten on the pile and never processed.

Quick searches also revealed that many bills are not that useful. A bunch of them are only proposals to turn some specific days into “national days”.

There are proposals for a national day of peace, of “peace consolidation”, of “peace and reconciliation”, of Peruvian Coffee and of Peruvian Cuisine, as well as national days for several other Peruvian products.

There was even more than one bill proposing the celebration of the same thing on the very same day. Organizing the bills into a database and building our search tool allowed people to discover these redundant and unnecessary bills.

Call In the Lawyers

After we aggregated the data into statistics, my lawyer friends found that the majority of bills are approved after only one round of voting. Under Peruvian legislation, the second round of voting for a bill should be waived only under exceptional circumstances.

However, the numbers show that a single round of voting has become the norm, as 88% of the approved bills passed with only one round. The second round of voting was created to compensate for the fact that the Peruvian Congress has only one chamber where all the decisions are made. Members of Congress are also expected to use the time between the first and second votes for further debate and consultation with advisers and outside experts.

Bonus

The nice thing about having such information in a well-structured machine-readable format, is that we can create cool data visualizations, such as this interactive timeline that shows all the events that happened for a given bill:

Another cool thing is that this data allows us to monitor Congress’ activities. Our web app lets users subscribe to an RSS feed to get the latest bills, hot off the Congress press. My lawyer friends use it to issue “Legal Alerts” on social media when a bill threatens to do more harm than good.

Wrap Up

People can build very useful tools with data available on the web. Unfortunately, government data often has poor accessibility and usability, making transparency laws less useful than they should be. The work of volunteers is key to building tools that turn otherwise clunky content into useful data for journalists, lawyers and regular citizens alike. Thanks to open source software such as Scrapy and Django, we can quickly grab the data and create useful tools like this.

See? You can help a lot of people by doing what you love! :-)

Scrapely: The Brains Behind Portia Spiders

Unlike Portia labiata, the hunting spider that feeds on other spiders, our Portia feeds on data. Its namesake is considered the Einstein of the spider world, so we modeled our own creation after the intelligence and visual abilities of this arachnid.

Portia is our visual web scraping tool which is pushing the boundaries of automated data extraction. Portia is completely open source and we welcome all contributors who are interested in collaborating. Now is the perfect time since we’re on the beta launch of Portia 2.0, so please jump in!

You don’t need a programming background to use Portia. Its web-based UI means you can choose the data you want by clicking (annotating) elements on a webpage, as shown below. Doing this creates a sample which Portia later uses to extract information from similar pages.

When taking a look at the brains of Portia, the first component you need to meet is Scrapely.

What is Scrapely?

Portia uses Scrapely to extract structured data from HTML pages. While other commonly used libraries like Parsel (Scrapy’s Selector) and Beautiful Soup use CSS and XPath selectors, Scrapely takes annotated HTML samples as input.

Other libraries work by building a DOM tree based on the HTML source. They then use the given CSS or XPath selectors to find matches in that tree. Scrapely, however, treats an HTML page as a stream of tokens. By building a representation of the page this way, Portia is able to handle any type of HTML, no matter how badly formatted.
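
You can get a feel for this by using Scrapely on its own. The snippet below is a minimal example following the library’s README; the URLs and field values are placeholders:

from scrapely import Scraper

scraper = Scraper()

# train with one example page and the data we expect to extract from it
train_url = 'http://example.com/products/1'
scraper.train(train_url, {'name': 'Sample product', 'price': '19.99'})

# Scrapely then finds similar regions on other, similarly structured pages
print(scraper.scrape('http://example.com/products/2'))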

Machine Learning Data Extraction

To extract data, Portia employs machine learning using the instance-based wrapper induction extraction method implemented by Scrapely. This is an example of supervised learning. You annotate an HTML page to create a set of pre-trained samples that guide the information extraction.

Scrapely reads the streams of tokens from the unannotated pages, and looks for regions that are similar to the sample’s annotations. To decide what should be extracted from new pages, it notes the tags that occur before and after the annotated regions, referred to as the prefix and suffix respectively.

This approach is handy as you don’t need a well-defined HTML page. It instead relies on the order of tags on a page. Another useful feature of this approach is that Scrapely doesn’t need to find a 100% match, and instead looks for the best match. Even if the page is updated and tags are changed, Scrapely can still extract the data.

More importantly, Scrapely will not extract the information if the match isn’t similar enough. This approach helps to reduce false positives.

Portia + Scrapely

Now that you better understand the inner workings of Portia, let’s move on to the front-facing portion.

When you click elements on a page, Portia highlights them. You can give each field a name and choose how it’s extracted. Portia also provides a live preview of the data that will be extracted as you annotate.

Portia passes these annotations to Scrapely, which generates a new HTML page that includes the annotations. You can see the annotated page’s HTML in the screenshot below:

Scrapely then compiles this information into an extraction tree, which looks like this:

Scrapely uses the extraction tree to extract data from the page.

The ContainerExtractor finds the most likely HTML element that contains all the annotations.

Next, the RecordExtractor looks through this element and applies each of the BasicTypeExtractors from the tree, one for each field that you defined.

After trying to match all extractors, Scrapely outputs either an item with the data extracted, or nothing if it couldn’t match any elements that are similar enough to the annotations.

This is how Scrapely works in the background to support the straightforward UI of Portia.

The Future of Data Extraction

You currently need to manually annotate pages with Portia. However, we are developing technologies that will do this for you. Our first tool, the automatic item list extractor, finds list data on a page and automatically annotates the important information contained within.

Another feature we’ve begun working on will let you automatically extract data from commonly found pages such as product, contact, and article pages. It works by using samples created by Portia users (and our team) to build models, so that frequently extracted information will be automatically annotated.

Wrap Up

And that’s Portia in a nutshell! Or at least the machine learning brains… In any case, I hope that you found this an informative peek into the underbelly of our visual web scraping tool. We again invite you to contribute to Portia 2.0 beta since it’s completely open source.

Let us know what else you’d like to learn about how Portia works or our advances in automated data extraction.

 

Introducing Portia2Code: Portia Projects into Scrapy Spiders

We’re thrilled to announce the release of our latest tool, Portia2Code!

With it you can convert your Portia 2.0 projects into Scrapy spiders. This means you can add your own functionality and use Portia’s friendly UI to quickly prototype your spiders, giving you much more control and flexibility.

A perfect example of where you may find this new feature useful is when you need to interact with the web page. You can convert your Portia project to Scrapy, and then use Splash with a custom script to close pop-ups, scroll for more results, fill in forms, and so on.

Read on to learn more about using Portia2Code and how it can fit in your stack. But keep in mind that it only supports Portia 2.0 projects.

Using Portia2Code

First you need to install the portia2code library using:

$ pip install portia2code 

Then you need to download and extract your Portia project. You can do this through the API:

$ curl --user $SHUB_APIKEY: \
    "https://portia-beta.scrapinghub.com/api/projects/$PROJECT_ID/download" > project.zip
$ unzip project.zip -d project

Finally, you can convert your project with:

$ portia_porter project converted_output_dir

Customising Your Spiders

You can change the functionality as you would with a standard Scrapy spider. Portia2code produces spiders that extend Scrapy’s CrawlSpider via a BasePortiaSpider base class, the code for which is included in the downloaded project.

The example below shows you how to make an additional API request when there’s a meta property on the page named ‘metrics’.

In this example, the extended spider is separated out from the original spider. This is to demonstrate the changes that you need to make when modifying the spider. In practice you would make changes to the spider in the same class.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from ..utils.spiders import BasePortiaSpider
from ..utils.processors import Field
from ..utils.processors import Item
from ..items import ArticleItem


class ExampleCom(BasePortiaSpider):
    name = "www.example.com"
    start_urls = [u'http://www.example.com/search/?q=articles']
    allowed_domains = [u'www.example.com']
    rules = [
        Rule(LinkExtractor(allow=(ur'\d{6}'), deny=()), callback='parse_item',
             follow=True)
    ]
    items = [
        [Item(ArticleItem, None, u'#content', [
            Field(u'title', u'.page_title *::text', []),
            Field(u'Article', u'.article *::text', []),
            Field(u'published', u'.date *::text', []),
            Field(u'Authors', u'.authors *::text', []),
            Field(u'pdf', u'#pdf-link::attr(href)', [])])]
    ]


import json
from scrapy import Request
from six.moves.urllib.parse import urljoin


class ExtendedExampleCom(ExampleCom):
    base_api_url = 'https://api.examplemetrics.com/v1/metrics/'
    allowed_domains = [u'www.example.com', u'api.examplemetrics.com']

    def parse_item(self, response):
        for item in super(ExtendedExampleCom, self).parse_item(response):
            score = response.css('meta[name="metrics"]::attr(content)')
            if score:
                yield Request(
                    url=urljoin(self.base_api_url, score.extract()[0]),
                    callback=self.add_score, meta={'item': item})
            else:
                yield item

    def add_score(self, response):
        item = response.meta['item']
        item['score'] = json.loads(response.body)['score']
        return item

What’s happening here?

The page contains a meta tag named ‘metrics’. We join its content attribute with the base URL given by base_api_url to produce the full URL for the metrics request.

The domain of the base_api_url differs from the rest of the site. This means we have to add its domain to the allowed_domains array to prevent it from being filtered.

We want to add an extra field to the items extracted, so the first step is to override the parse_item function. The most important part is to loop over parse_item in the superclass in order to extract the items.

Next we need to check if the meta property ‘metrics’ is present. If that’s the case, we send another request and store the current item in the request meta. Once we receive a response, we use the add_score method that we defined to add the score property from the JSON response, and then return the final item. If the property is not present, we return the item as is.

This is a common pattern in Portia-built spiders. The alternative would be to load these pages in Splash, which greatly increases the time it takes to crawl a site. With this approach, you can download the additional data with a single small request, without having to load scripts and other assets on the page.

How it works

When you build a spider in Portia, the output consists largely of JSON definitions that define how the spider should crawl and extract data.

When you run a spider, the JSON definitions are compiled into a custom Scrapy spider along with trained samples for extraction. The spider uses the Scrapely library with the trained samples to extract from similar pages.

Portia uses unique selectors for each annotated element and builds an extraction tree that can use item loaders to extract the relevant data.

Future Plans

Here are the features that we are planning to add in the future:

  • Load pages using Splash depending on crawl rules
  • Follow links automatically
  • Text data extractors (annotations generated by highlighting text)

Wrap Up

We predict that Portia2Code will make Portia even more useful to those of you who need to scrape data fast and efficiently. Let us know how you will use the new Portia2Code feature by Tweeting at us.

Happy scraping!

Scrapy Tips from the Pros June 2016

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.

Scrapy Tips

Scraping Infinite Scrolling Pages

In the era of single-page apps and tons of AJAX requests per page, a lot of websites have replaced “previous/next” pagination buttons with a fancy infinite scrolling mechanism. Websites using this technique load new items whenever the user scrolls to the bottom of the page (think Twitter, Facebook, Google Images). Even though UX experts argue that infinite scrolling overwhelms users with data, more and more web pages resort to presenting this unending list of results.

When developing our web scrapers, one of the first things we do is look for UI components with links that might lead us to the next page of results. Unfortunately, these links aren’t present on infinite scrolling web pages.

While this scenario might seem like a classic case for a JavaScript engine such as Splash or Selenium, it’s actually a simple fix. Instead of simulating user interaction with such engines, all you have to do is inspect your browser’s AJAX requests when you scroll the target page and then re-create those requests in your Scrapy spider.

Let’s use Spidy Quotes as an example and build a spider to get all the items listed on it.

Inspecting the Page

First things first, we need to understand how the infinite scrolling works on this page and we can do so by using the Network panel in the Browser’s developer tools. Open the panel and then scroll down the page to see the requests that the browser is firing:

Click on a request for a closer look. The browser sends a request to /api/quotes?page=x and then receives a JSON object like this in response:

{
   "has_next":true,
   "page":8,
   "quotes":[
      {
         "author":{
            "goodreads_link":"/author/show/1244.Mark_Twain",
            "name":"Mark Twain"
         },
         "tags":["individuality", "majority", "minority", "wisdom"],
         "text":"Whenever you find yourself on the side of the ..."
      },
      {
         "author":{
            "goodreads_link":"/author/show/1244.Mark_Twain",
            "name":"Mark Twain"
         },
         "tags":["books", "contentment", "friends"],
         "text":"Good friends, good books, and a sleepy ..."
      }
   ],
   "tag":null,
   "top_ten_tags":[["love", 49], ["inspirational", 43], ...]
}

This is the information we need for our spider. All it has to do is generate requests to “/api/quotes?page=x” for increasing values of x until the has_next field becomes false. The best part is that we don’t even have to scrape the HTML contents to get the data we need. It’s all in beautiful, machine-readable JSON.

Building the Spider

Here is our spider. It extracts the target data from the JSON content returned by the server. This approach is easier and more robust than digging into the page’s HTML tree and trusting that layout changes won’t break our spiders.

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

To further practice this tip, you can experiment with building a spider for our blog since it also uses infinite scrolling to load older posts.

Wrap Up

If you were feeling daunted by the prospect of scraping infinite scrolling websites, hopefully you’re feeling a bit more confident now. The next time that you have to deal with a page based on AJAX calls triggered by user actions, take a look at the requests that your browser is making and then replay them in your spider. The response is usually in a JSON format, making your spider even simpler.

And that’s it for June! Please let us know what you would like to see in future columns by reaching out on Twitter. We also recently released a Datasets Catalog, so if you’re stumped on what to scrape, take a look for some inspiration.

This Month in Open Source at Scrapinghub June 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource@scrapinghub.com or on Twitter @scrapinghub

Scrapy 1.1

For those who missed the big news, Scrapy 1.1 is live! It’s the first official release that comes with Python 3 support, so you can go ahead and move your stack over.

The major changes in this release since the RC1 we announced in February include improved HTTPS connections (with proxy support) and better handling of URLs with non-ASCII characters. Make sure you upgrade w3lib to 1.14.2.

We’re very grateful for the feedback we received during the release candidate phase. A huge thanks to all the reporters, reviewers and code/documentation contributors.

If you find anything that’s not working, please take a few minutes to report the issue(s) on GitHub.

Notable limitations still present in this release include:

  • Scrapy 1.1 doesn’t work on Windows under Python 3 (Twisted is not fully ported to Python 3 on Windows, but we’ll keep an eye out for updates to this situation).
  • Scrapy’s FTP, Telnet console, and email do not work in Python 3.

Splash 2.1

Splash 2.1 is out, bringing a number of new features and fixes.

If you’re using the Scrapy-Splash plugin (formerly “scrapyjs”), we encourage you to upgrade to the latest v0.7 version. It includes many goodies that make integrating with Scrapy much easier. Check the latest README for details, especially the scrapy_splash.SplashRequest utility.

Google Summer of Code 2016

We’re thrilled to have 5 students this year:

  • Aron Bordin is working on supporting spiders in other programming languages with Scrapy.
  • Preetwinder Bath is porting Frontera to Python 3.
  • Tamer Tas is working on dockerization and orchestration of Frontera deployments.
  • Avishkar Gupta is replacing PyDispatcher to improve Scrapy’s signaling API performance.
  • Michael Manukyan is adding web scraping helpers for Splash.

We’d like to thank the Python Software Foundation for again having Scrapinghub as a sub-org this year!

Libraries

cssselect maintenance

Scrapy relies on lxml and cssselect for all the XPath and CSS selection awesomeness that we use each and every day at Scrapinghub. We learned that Simon Sapin, the author of the cssselect package, was looking for new maintainers. So we put ourselves forward, and now cssselect is hosted under the Scrapy organization on GitHub. Don’t worry though, Simon is still involved! We’re planning on fixing a few corner cases and maybe working on CSS Selectors Level 4 support. We’ll definitely need assistance with this task, so please reach out if you’re interested in helping out!
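
If you haven’t used cssselect directly, its job is to translate CSS selectors into equivalent XPath expressions, which lxml (and, indirectly, Scrapy) then evaluates:

from cssselect import GenericTranslator

# translate a CSS selector into the XPath expression lxml will run
print(GenericTranslator().css_to_xpath('div.content > a'))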

Dateparser 0.3.5

We released Dateparser 0.3.5 with support for dates in Danish and Japanese. It also handles dates with accents much better, and the library now works with the latest version of python-dateutil.
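
A quick illustration of what the library does (the second example relies on the Japanese support added in this release):

import dateparser

# dateparser detects the language and format on its own
print(dateparser.parse(u'12 minutes ago'))   # relative dates work too
print(dateparser.parse(u'2016年7月13日'))     # Japanese date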

Check the full release notes here.

js2xml

This side project of mine is now hosted under Scrapinghub’s organization on GitHub. It’s a little helper library that converts JavaScript code into an XML tree, which means you can use XPath and CSS selectors to extract data (strings, objects, function arguments, etc.) from JavaScript embedded in HTML (note that it doesn’t interpret the JavaScript). You’d be amazed at how much valuable data is “hidden” in JavaScript inside web pages.

It’s on PyPI and is now Python 3-compatible.
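
Here’s a tiny example of the workflow; the JavaScript snippet is made up, and the exact XML structure is best explored with pretty_print on your own data:

import js2xml

js_code = "var config = {title: 'Sample article', views: 1205};"
parsed = js2xml.parse(js_code)            # lxml tree representing the JS code

print(js2xml.pretty_print(parsed))        # inspect the generated XML
print(parsed.xpath('//string/text()'))    # string literals, e.g. ['Sample article']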

Check this Jupyter/ipython notebook for an overview of what you can do with it and make sure to let us know what you think.

w3lib 1.14.2

We updated our w3lib library to handle non-ASCII URLs better, as part of adding Python 3 support to Scrapy 1.1. We recommend that you upgrade to the latest 1.14.2 version.

parsel 1.0.2

If you’re using Scrapy 1.1, you’re using parsel under the hood. Parsel is Scrapy’s Selectors packaged as an independent library. There’s a new release of parsel that fixes an issue where XPath exceptions were being hidden.
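
Using parsel outside of Scrapy is straightforward:

from parsel import Selector

sel = Selector(text=u'<html><body><h1>Hello, parsel!</h1></body></html>')
print(sel.css('h1::text').extract_first())   # 'Hello, parsel!'
print(sel.xpath('//h1/text()').extract())    # [u'Hello, parsel!']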

Portia

We’ve made some changes to Slybot, the Portia crawler, that include:

  • Re-added nested regions and text data annotations.
  • Selectors now handle comments correctly.
  • Added automatic link following seeded with start URLs and sample URLs.
  • Allowed adjusting the Splash wait time.

For Portia itself:

  • New download API endpoint: GET portia/api/projects/PID/download[/SID]

Most of the recent developments have been taking place in the Portia beta.

The big changes include:

  • Clustering of pages during extraction to decide which sample to use for extraction.
  • Download Portia spider as Scrapy code: GET portia/api/projects/PID/download[/SID]?format=code
  • Uses Django style Storage object for accessing files.
  • Database access more consistent for MySQL backend.
  • Better element overlays; they can now be split across lines.
  • Re-added the toggle CSS option for samples, so you can now annotate hidden elements.
  • UI usable on low resolution screens, thanks to smarter wrapping.
  • Inform user of unpublished changes when using Git backend.

Try out the beta using the nui-develop branch.

Frontera

Frontera 0.5 introduces an improved crawling strategy, new logging and better test coverage.

Mosquitera

Scrapy-mosquitera is a library that helps Scrapy spiders perform more optimal crawls. In its basic form, it’s a collection of matchers and a mixin to narrow down the crawl to a specific date range. However, you can extend it to apply to any domain (URL paths, location filtering, etc.). You can find more details about how it works and how you can create your own matchers in the documentation.

Wrap Up

This concludes the June edition of This Month in Open Source at Scrapinghub. We’re always looking for new contributors, so if you’re interested, feel free to explore our GitHub.

Introducing the Datasets Catalog

Folks using Portia and Scrapy are engaged in a variety of fascinating web crawling projects, so we wanted to provide you with a way to share your data extraction prowess with the world.

With this need in mind, we’re pleased to introduce the latest addition to our Scrapinghub platform: the Datasets Catalog!

This new feature allows you to immediately share the results of your Scrapinghub projects as publicly searchable datasets. Not only is this a great way to collaborate with others, you can also save time by using other people’s datasets in your projects.

As fans of the open data movement, we hope that this new feature will ease the process of disseminating data. Open data has been used to help foster transparency in governmental and corporate systems worldwide. Researchers and developers have also benefited from the mutual sharing of information. A couple of our own engineers have even used open data to power transportation apps and to help journalists expose corruption.

Read on to get some ideas on how to use the Datasets Catalog in your workflow.

The Datasets Catalog at a Glance

We are launching the Datasets Catalog with the following features:

  • Publish the data collected by your Portia or Scrapy spiders/web crawlers as easily accessible datasets
  • Highlight your scraped data and help others locate the information they need by giving each dataset a name and a description
  • Let others discover your datasets through search engines like Google
  • Browse publicly available datasets that other people are sharing: https://app.scrapinghub.com/datasets
  • Choose how to share your dataset using three different privacy settings:
    • Public datasets are accessible by anyone (even those without a Scrapinghub account) and are indexed by search engines
    • Restricted datasets are accessible only to the users that you explicitly grant access (they need to have a Scrapinghub account)
    • Private datasets are accessible only by the members of your organization

How Does it Work?

You can find the new “Datasets” option in the top navigation bar. On the main Datasets Catalog page, you can browse available datasets along with those that you have recently visited.

Publishing your scraped data into complete datasets takes just one click. This tutorial will get you started on publishing and sharing your extracted data.

 

Wrap Up

And there you have it: a way to not only showcase your web crawling and data extraction skills, but also help others with the information that you provide.

We invite you to contribute your datasets and play your part in helping drive the open data movement forward. Reach out to us on Twitter and let us know what datasets you would like to see featured and if you have any recommendations for improving the whole Datasets experience.

We’re excited to see what you come up with!
