
Introducing Scrapy Cloud with Python 3 Support

It’s the end of an era. Python 2 is on its way out with only a few security and bug fixes forthcoming from now until its official retirement in 2020. Given this withdrawal of support and the fact that Python 3 has snazzier features, we are thrilled to announce that Scrapy Cloud now officially supports Python 3.


If you are new to Scrapinghub, Scrapy Cloud is our production platform that allows you to deploy, monitor, and scale your web scraping projects. It pairs with Scrapy, the open source web scraping framework, and Portia, our open source visual web scraper.

Scrapy + Scrapy Cloud with Python 3

I’m sure you Scrapy users are breathing a huge sigh of relief! While Scrapy with official Python 3 support has been around since May, you can now deploy Scrapy spiders that use the fancy new Python 3 features to Scrapy Cloud. You’ll have the beloved extended tuple unpacking, function annotations, keyword-only arguments and much more at your fingertips.
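If you want a feel for what that looks like in spider code, here is a minimal, hypothetical sketch (the site and selectors are only illustrative) that leans on extended tuple unpacking, keyword-only arguments and function annotations:

import scrapy


class Py3DemoSpider(scrapy.Spider):
    # Hypothetical spider: the site and selectors are here only to illustrate syntax.
    name = 'py3-demo'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Extended tuple unpacking: keep the first tag, collect the rest in a list
            first_tag, *other_tags = quote.css('a.tag::text').extract() or ['']
            yield self.build_item(
                text=quote.css('span.text::text').extract_first(),
                author=quote.css('small.author::text').extract_first(),
                tags=[first_tag] + other_tags,
            )

    def build_item(self, *, text: str, author: str, tags: list) -> dict:
        # Keyword-only arguments and function annotations: Python 3 only
        return {'text': text, 'author': author, 'tags': tags}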

Fear not if you are a Python 2 developer and can’t port your spiders’ codebase to Python 3, because Scrapy Cloud will continue supporting Python 2. In fact, Python 2 remains the default unless you explicitly set your environment to Python 3.

Deploying your Python 3 Spiders

Docker support was one of the new features that came along with the Scrapy Cloud 2.0 release in May. It brings more flexibility to your spiders, allowing you to define the kind of runtime environment (AKA stack) they will be executed in.

This configuration is done in your local project’s scrapinghub.yml. There you have to include a stacks section that sets scrapy:1.1-py3 as the stack for your Scrapy Cloud project:

projects:
    default: 99999
stacks:
    default: scrapy:1.1-py3

After doing that, you just have to deploy your project using shub:

$ shub deploy

Note: make sure you are using shub 2.3+ by upgrading it:

$ pip install shub --upgrade

And you’re all done! The next time you run your spiders on Scrapy Cloud, they will run on Scrapy 1.1 + Python 3.

Multi-target Deployment File

If you have a multi-target deployment file, you can define a separate stack for each project ID:

projects:
    default:
        id: 55555
        stack: scrapy:1.1
    py3:
        id: 99999
        stack: scrapy:1.1-py3

This allows you to deploy your local project to whichever Scrapy Cloud project you want, using a different stack for each one:

$ shub deploy py3

This deploys your crawler to project 99999 and uses Scrapy 1.1 + Python 3 as the execution environment.

You can find different versions of the Scrapy stack here.

Wrap Up

We hope that you’re as excited as we are for this newest upgrade to Python 3. If you have further questions or are interested in learning more about the souped up Scrapy Cloud, take a look at our Knowledge Base article.

For those new to our platform, Scrapy Cloud has a forever free subscription, so sign up and give us a try.


What the Suicide Squad Tells Us About Web Data

Web data is a bit like the Matrix. It’s all around us, but not everyone knows how to use it meaningfully. So here’s a brief overview of the many ways that web data can benefit you as a researcher, marketer, entrepreneur, or even multinational business owner.


Since web scraping and web data extraction are sometimes viewed a bit like antiheroes, I’m introducing each of the use cases through characters from the Suicide Squad film. I did my best to pair according to character traits and real world web data uses, so hopefully this isn’t too much of a stretch.

This should be spoiler free, with nothing revealed that you can’t get from the trailers! Fair warning, you’re going to have Ballroom Blitz stuck in your head all day. And if you haven’t seen Suicide Squad yet, hopefully we get you pumped up for this popcorn movie.

Market Research and Predictions: Deadshot

Deadshot’s claim to fame is accurate aim. He can predict bullet trajectories and he never misses a shot. So I paired him with using web data for market research and trend prediction. You can scrape multiple websites for price fluctuation, new products, reviews, and consumer trends. This is an automated process that allows you to quickly and accurately analyze data without needing to manually monitor websites.

Social Media Monitoring: Harley Quinn

Harley Quinn has a sunny personality that remains chipper even when faced with death, destruction, torture, and mayhem. She also always has a witty comeback no matter the situation. These traits go hand-in-hand with how brands should approach social media channels. Extracting web data from social media interactions helps you understand consumer opinions. You can monitor ongoing chatter about your company or your competition and respond in the most positive way possible.

Lead Generation and HR Recruitment: Amanda Waller

This is probably the most obvious pairing since Amanda Waller (played by the wonderful Viola Davis) is the one responsible for assembling the Suicide Squad. She carefully researched and compiled intimate details on all the criminals-turned-reluctant-heroes. This is an aspect of web data that benefits all sales, marketing, recruitment, and HR. With a pre-vetted pool, you’ll have access to qualified leads and decision-makers without needing to wade through the worst of the worst.

Tracking Criminal Activity in the Dark Web: Killer Croc

This sewer-dwelling villain thrives in dark and hidden spaces. He’s used to working underground and in places most people don’t even know exist. This makes Killer Croc the perfect backdrop for the type of web data located in the deep/dark web. The dark web is the part of the internet that is not indexed by search engines (Google, Bing, etc.) and is often a haven for criminal activity. Data scraped from this part of the web is commonly used by law enforcement agencies.

Competitive Pricing: Captain Boomerang

This jewelry thief goes around the world stealing from banks and committing acts of burglary – with a boomerang… Captain Boomerang knows all about pricing and the comparative value of products so he can get the largest bang for his buck. Similarly, web data is a great resource for new companies looking to research their industry and how their prices match up to the competition. And if you are an established company, this is a great way for you to keep track of newcomers and potential market disruptors.

Machine Learning Models: Enchantress

In her 6313 years of existence, the Enchantress has had to cope with changing times, customs, and civilizations. The ability to learn quickly and adapt to new situations is definitely an important part of her continued survival. Likewise, machine learning is a form of artificial intelligence that can learn when given new information. Train your machine learning models using datasets for conducting sentiment analysis, making predictions, and even automating web scraping. Whether you are a SaaS company specializing in developing machine learning technology or someone who needs machine learning analysis, you need to ensure you have up-to-date datasets.

Monitoring Resellers: Colonel Rick Flag

Colonel Rick Flag is a “good guy” whose job is to keep track of the Suicide Squad and kill them if they get out of line. Now obviously your relationship with resellers is not a life-and-death situation, but it can be good to know how your brand is being represented across the internet. Web scraping can help you keep track of reseller customer reviews and any contract violations that might be occurring.

Monitoring Legal Matters and Government Corruption: Katana

Katana the samurai is the enforcer of the Suicide Squad. She is there as an additional check to keep the criminal members in line. Similarly, web data allows reporters, lawyers, and concerned citizens to keep track of government officials, potential corruption charges, and changing legal matters. You can scrape obscure or poorly presented public records and then use that information to create accessible interfaces for easy reference and research.

Web Scraping for Fun: the Joker

I believe the Joker needs no introduction, whether you know this character from Jack Nicholson, Heath Ledger, or the new Jared Leto incarnation. He is unpredictable, has eclectic tastes, and is capable of doing anything. And honestly, this is what web scraping is all about. Whether you want to build a bike sharing app or monitor government corruption, web data provides the backbone for all of your creative endeavors.

Wrap Up

I hope you enjoyed this unorthodox tour of the world of web data! If you’re looking for some mindless fun, Suicide Squad ain’t half bad (it ain’t half good either). If you’re looking to explore how web data fits within your business or personal projects, feel free to reach out to us. And if you’re looking to hate on or defend Suicide Squad, comment below.

P.S. There is no way this movie is worse than Batman v Superman: Dawn of Justice

This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.

If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub.


Scrapy

This past May, Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. And 2 weeks ago, Scrapy reached 15k stars on GitHub, making it the 10th most starred Python project on GitHub! We are very proud of this and want to thank all our users, stargazers and contributors!

What’s coming in Scrapy 1.2 (in a couple of weeks)?

  • The ability to specify the encoding of items in your JSON, CSV or XML output files
  • Creating Scrapy projects in any folder you want (not only the current one)
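For the first item, the encoding will presumably be exposed as a feed export setting. Here's a hedged guess at what that might look like in settings.py (check the Scrapy 1.2 release notes for the final setting name before relying on it):

# settings.py
# Assumed name for the upcoming feed encoding option; verify it against the 1.2 release notes.
FEED_EXPORT_ENCODING = 'utf-8'  # write JSON/CSV/XML feeds as UTF-8 instead of escaped ASCII

As for the second item, it presumably means you'll be able to run something like scrapy startproject myproject path/to/anywhere rather than always having the project created in the current directory.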

Scrapy Plugins

We’re moving various Scrapy middleware and helpers to their own repository under scrapy-plugins home on GitHub. They are all available on PyPI.
Many of these were previously found wrapped inside scrapylib (which will not see a new release).

Here are some of the newly released ones:

  • scrapy-querycleaner: used for cleaning up query parameters in URLs; helpful for when some of them are not relevant (you get the same page with or without them), thus avoiding duplicate page fetches.
  • scrapy-magicfields: automagically add special fields in your scraped items such as timestamps, response attributes, etc.
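If you want to try them, both are regular spider middlewares that you enable from settings.py. The class paths and setting names below are recalled from the plugins' READMEs and should be treated as assumptions; double-check them there before use:

# settings.py
SPIDER_MIDDLEWARES = {
    # Assumed class paths; confirm against each plugin's README
    'scrapy_querycleaner.QueryCleanerMiddleware': 100,
    'scrapy_magicfields.MagicFieldsMiddleware': 200,
}
QUERYCLEANER_REMOVE = 'utm_.*|sessionid'  # assumed setting: regex of query parameters to strip
MAGIC_FIELDS = {'scraped_at': '$time'}    # assumed setting: stamp every item with the scrape time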

Libraries

Dateparser

In mid-June we released version 0.4 of Dateparser with quite a few parsing improvements and new features (as well as several bug fixes). For example, this version introduces its own parser, replacing the one from dateutil. However, we may converge back at some point in the future.

It also handles relative dates in the future e.g. “tomorrow”, “in two weeks”, etc. We also replaced PyYAML with one of its active forks, ruamel.yaml. We hope you enjoy it!
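If you haven't tried it yet, the whole API is essentially a single parse() call; here's a quick illustrative session:

import dateparser  # pip install dateparser

dateparser.parse('tomorrow')             # relative dates in the future now work
dateparser.parse('in two weeks')
dateparser.parse('12 de junho de 2016')  # multilingual parsing, as before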

Fun fact: we caught the attention of Kenneth Reitz with dateparser. And although dateparser didn’t quite solve his issue, “[he] like[s] it a lot” so it made our day😉

w3lib

w3lib v1.15 now has a canonicalize_url(), extracted from Scrapy helpers. You may find it handy when walking in the jungle of non-ASCII URLs in Python 3!
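Here's a quick illustration (the exact output may differ slightly between versions):

>>> from w3lib.url import canonicalize_url
>>> canonicalize_url(u'http://www.example.com/résumé?z=3&a=1')
'http://www.example.com/r%C3%A9sum%C3%A9?a=1&z=3'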

Wrap Up

And that’s it for This Month in Open Source at Scrapinghub August 2016. Open Source is in our DNA and so we’re always working on new projects and improving pre-existing ones. Keep up with us and explore our GitHub. We welcome contributors and we are also hiring, so check out our jobs page!

Meet Parsel: the Selector Library behind Scrapy

We eat our own spider food since Scrapy is our go-to workhorse on a daily basis. However, there are certain situations where Scrapy can be overkill and that’s when we use Parsel. Parsel is a Python library for extracting data from XML/HTML text using CSS or XPath selectors. It powers the scraping API of the Scrapy framework.


Not to be confused with Parseltongue/Parselmouth

We extracted Parsel from Scrapy during Europython 2015 as a part of porting Scrapy to Python 3. As a library, it’s lighter than Scrapy (it relies on lxml and cssselect) and also more flexible, allowing you to use it within any Python program.


Using Parsel

Install Parsel using pip:

pip install parsel

And here’s how you use it. Say you have this HTML snippet in a variable:

>>> html = u'''
... <ul>
...     <li><a href="http://blog.scrapinghub.com">Blog</a></li>
...     <li><a href="http://www.scrapinghub.com">Scrapinghub</a></li>
...     <li class="external"><a href="http://www.scrapy.org">Scrapy</a></li>
... </ul>
... '''

You then import the Parsel library, load it into a Parsel Selector and extract links with an XPath expression:

>>> import parsel
>>> sel = parsel.Selector(html)
>>> sel.xpath("//a/@href").extract()
[u'http://blog.scrapinghub.com', u'http://www.scrapinghub.com', u'http://www.scrapy.org']

Note: Parsel works both in Python 3 and Python 2. If you’re using Python 2, remember to pass the HTML in a unicode object.

Sweet Parsel Features

One of the nicest features of Parsel is the ability to chain selectors. This allows you to chain CSS and XPath selectors however you wish, such as in this example:

>>> sel.css('li.external').xpath('./a/@href').extract()
[u'http://www.scrapy.org']

You can also iterate through the results of the .css() and .xpath() methods since each element will be another selector:

>>> for li in sel.css('ul li'):
...     print(li.xpath('./a/@href').extract_first())
...
http://blog.scrapinghub.com
http://www.scrapinghub.com
http://www.scrapy.org

You can find more examples of this in the documentation.

When to use Parsel

The beauty of Parsel is in its wide applicability. It is useful for a range of situations including:

  • Processing XML/HTML data in an IPython notebook
  • Writing end-to-end tests for your website or app
  • Simple web scraping projects with the Python Requests library (see the sketch right after this list)
  • Simple automation tasks at the command-line
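For instance, the Requests + Parsel combination mentioned above can be as small as this sketch (the URL and selector are just placeholders):

import parsel
import requests

# Fetch a page with Requests, then hand the HTML over to Parsel for extraction
response = requests.get('http://blog.scrapinghub.com')
selector = parsel.Selector(text=response.text)
print(selector.css('a::text').extract()[:5])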

And now, you can also run Parsel with the command-line tool for simple extraction tasks in your terminal. This new development is thanks to our very own Rolando who created parsel-cli.

Install parsel-cli with pip install parsel-cli and play around using the examples below (you need to have curl installed).

The following command will download and extract the list of Academy Award-winning films from Wikipedia:

curl -s https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films |\
    parsel-cli 'table.wikitable tr td i a::text'

You can also get the current top 5 news items from Hacker News using:

curl -s https://news.ycombinator.com |\
    parsel-cli 'a.storylink::attr(href)' | head -n 5

And how about obtaining a list of the latest YouTube videos from a specific channel?

curl -s https://www.youtube.com/user/crashcourse/videos |\
    parsel-cli 'h3 a::attr(href), h3 a::text' |\
    paste -s -d' \n' - | sed 's|^|http://youtube.com|'

Wrap Up

I hope that you enjoyed this little tour of Parsel, and I am looking forward to seeing how these examples spark your imagination when you’re looking for solutions to your HTML parsing needs.

The next time you find yourself wanting to extract data from HTML/XML and don’t need Scrapy and its crawling capabilities, you know what to do: just Parsel it!

Feel free to reach out to us on Twitter and let us know how you use Parsel in your projects.

Scrapy Tips from the Pros: July 2016

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.


Scrapy is designed to be extensible and loosely coupled, so you can extend its functionality with your own middleware or pipeline.

This makes it easy for the Scrapy community to develop new plugins that improve upon existing functionality, without making changes to Scrapy itself.

In this post we’ll show how you can leverage the DeltaFetch plugin to run incremental crawls.

Incremental Crawls with DeltaFetch

Some crawlers we develop are designed to crawl and fetch the data we need only once. On the other hand, many crawlers have to run periodically in order to keep our datasets up-to-date.

In many of these periodic crawlers, we’re only interested in new pages included since the last crawl. For example, we have a crawler that scrapes articles from a bunch of online media outlets. The spiders are executed once a day and they first retrieve article URLs from pre-defined index pages. Then they extract the title, author, date and content from each article. This approach often leads to many duplicate results and an increasing number of requests each time we run the crawler.

Fortunately, we are not the first ones to have this issue. The community already has a solution: the scrapy-deltafetch plugin. You can use this plugin for incremental (delta) crawls. DeltaFetch’s main purpose is to avoid requesting pages that have already been scraped, even if it happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs listed in the spider’s start_urls attribute, and to requests generated in the spider’s start_requests method.

DeltaFetch works by intercepting every Item and Request object generated in spider callbacks. For Items, it computes the related request identifier (a.k.a. fingerprint) and stores it in a local database. For Requests, DeltaFetch computes the request fingerprint and drops the request if it already exists in the database.

Now let’s see how to set up Deltafetch for your Scrapy spiders.

Getting Started with DeltaFetch

First, install DeltaFetch using pip:

$ pip install scrapy-deltafetch

Then, you have to enable it in your project’s settings.py file:

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True

DeltaFetch in Action

This crawler has a spider that crawls books.toscrape.com. It navigates through all the listing pages and visits every book details page to fetch some data like book title, description and category. The crawler is executed once a day in order to capture new books that are included in the catalogue. There’s no need to revisit book pages that have already been scraped, because the data collected by the spider typically doesn’t change.
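The full spider lives in the repository linked below; as a rough sketch (not the repository's exact code), its core logic looks something like this:

import scrapy


class ToScrapeSpider(scrapy.Spider):
    # Simplified sketch of a books.toscrape.com spider; the repository's spider differs in details.
    name = 'toscrape'
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        # Follow every book on the listing page...
        for href in response.css('article.product_pod h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_book)
        # ...and then the next listing page, if there is one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_book(self, response):
        yield {
            'title': response.css('div.product_main h1::text').extract_first(),
            'category': response.css('ul.breadcrumb li:nth-child(3) a::text').extract_first(),
            'description': response.css('#product_description + p::text').extract_first(),
        }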

To see DeltaFetch in action, clone this repository, which has DeltaFetch already enabled in settings.py, and then run:

$ scrapy crawl toscrape

Wait until it finishes and then take a look at the stats that Scrapy logged at the end:

2016-07-19 10:17:53 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/stored': 1000,
    ...
    'downloader/request_count': 1051,
    ...
    'item_scraped_count': 1000,
}

Among other things, you’ll see that the spider did 1051 requests to scrape 1000 items and that DeltaFetch stored 1000 request fingerprints. This means that only 51 page requests haven’t generated items and so they will be revisited next time.

Now, run the spider again and you’ll see a lot of log messages like this:

2016-07-19 10:47:10 [toscrape] INFO: Ignoring already visited: 
<GET http://books.toscrape.com/....../index.html>

And in the stats you’ll see that 1000 requests were skipped because items had been scraped from those pages in a previous crawl. This time the spider didn’t extract any items and made only 51 requests, all of them to listing pages from which no items had been scraped before:

2016-07-19 10:47:10 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/skipped': 1000,
    ...
    'downloader/request_count': 51,
}

Changing the Database Key

By default, DeltaFetch uses a request fingerprint to tell requests apart. This fingerprint is a hash computed based on the canonical URL, HTTP method and request body.
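If you're curious, you can compute these fingerprints yourself with Scrapy's request_fingerprint helper; the URLs below are made up, but they show what ends up being treated as "the same" request:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

fp1 = request_fingerprint(Request('http://www.example.com/product?id=123'))
fp2 = request_fingerprint(Request('http://www.example.com/product?id=123#reviews'))
print(fp1 == fp2)  # True: fragments are dropped during URL canonicalization

fp3 = request_fingerprint(Request('http://www.example.com/product?id=123&ref=home'))
print(fp1 == fp3)  # False: a different query string means a different fingerprint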

Some websites have several URLs for the same data. For example, an e-commerce site could have several different URLs pointing to a single product.

Request fingerprints aren’t suitable in these situations as the canonical URLs will differ despite the item being the same. In such cases, we can use the product’s ID as the DeltaFetch key.

DeltaFetch allows us to define custom keys by passing a meta parameter named deltafetch_key when initializing the Request:

from scrapy import Request
from w3lib.url import url_query_parameter

...

def parse(self, response):
    ...
    # Grab each product link's href and use its "id" query parameter
    # as the DeltaFetch key instead of the default request fingerprint.
    for product_url in response.css('a.product_listing::attr(href)').extract():
        yield Request(
            response.urljoin(product_url),
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page
        )
    ...

This way, DeltaFetch will ignore requests to duplicate pages even if they have different URLs.

Resetting DeltaFetch

If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:

$ scrapy crawl example -a deltafetch_reset=1

Using DeltaFetch on Scrapy Cloud

You can also use DeltaFetch in your spiders running on Scrapy Cloud. You just have to enable the DeltaFetch and DotScrapy Persistence addons in your project’s Addons page. The latter is required to allow your crawler to access the .scrapy folder, where DeltaFetch stores its database.

image00

DeltaFetch is quite handy in situations like the ones we’ve just seen. Keep in mind that DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if those requests were not generated from the spider’s start_urls or start_requests. Pages from which no items were directly scraped will still be crawled every time you run your spiders.

You can check out the project page on GitHub for further information: http://github.com/scrapy-plugins/scrapy-deltafetch

Wrap Up

You can find many interesting Scrapy plugins on the scrapy-plugins page on GitHub, and you can also contribute to the community by including your own plugin there.

If you have a question or a topic that you’d like to see in this monthly column, please drop a comment here letting us know or reach out to us via @scrapinghub on Twitter.

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web. This is so the general public can monitor what their representatives are doing on their behalf.

However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable.

Congress Bills in Peru

For the sake of transparency, the Peruvian Congress provides a website where people can check the list of bills that are being processed, voted on, and that eventually become law. For each bill, there’s a page with its authorship, title, submission date and a brief summary. These pages are frequently updated when bills are moved between commissions, approved and then published as laws.

By having all of this information online, lawyers and the general public can potentially inspect bills that could be the result of lobbying. In Peruvian history, there have been many laws passed that were to benefit only one specific company or individual.


However, transparency doesn’t guarantee accessibility. This site is very clunky, and the information for each bill is spread across several pages. It displays the bills in a very long, heavily paginated list, and until very recently there was no way to search for specific bills.

In the past, if you wanted to find a bill, you would need to look through several pages manually. This is very time consuming as there are around one thousand bills proposed every year. Not long ago, the site added a search tool, but it’s not user-friendly at all.


The Solution

My lawyer friends from the Peruvian NGOs Hiperderecho.org and Respeto.pe asked me about the possibility of building a web application. Their goal was to organize all the data from the Congress bills, allowing people to easily search and discover bills by keywords, authors and categories.

The first step in building this was to grab all bill data and metadata from the Congress website. Since they don’t provide an API, we had to use web scraping. For that, Scrapy is a champ.

I wrote several Scrapy spiders to crawl the Congress site and download as much data as possible. The spiders wake up every 8 hours and crawl the Congress pages looking for new bills. They parse the data they scrape and save it into a local PostgreSQL database.
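The storage side is a regular Scrapy item pipeline. As a rough sketch only (the table, columns and connection details here are placeholders, not the project's real schema), it can look like this with psycopg2:

import psycopg2


class PostgresPipeline(object):
    # Rough sketch of an item pipeline that stores bills in PostgreSQL.
    def open_spider(self, spider):
        self.conn = psycopg2.connect(dbname='bills', user='scraper', password='secret')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # ON CONFLICT requires PostgreSQL 9.5+; it skips bills we've already stored
        self.cur.execute(
            """INSERT INTO bills (code, title, authors, submitted_on)
               VALUES (%s, %s, %s, %s)
               ON CONFLICT (code) DO NOTHING""",
            (item['code'], item['title'], item['authors'], item['submitted_on']),
        )
        return item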

Once we had achieved the critical step of getting all the data, it was relatively easy to build a search tool to navigate the 5400+ bills and counting. I used Django to create a simple interface for users, and so ProyectosDeLey.pe was born.


The Findings

All kinds of possibilities are open once we have the data. For example, we could now generate statistics on the status of the bills. We found that of the 5402 proposed bills, only 740 became laws, meaning most of the bills were rejected or forgotten on the pile and never processed.


Quick searches also revealed that many bills are not that useful. A bunch of them are only proposals to turn some specific days into “national days”.

There are proposals for national days of peace, “peace consolidation”, “peace and reconciliation”, Peruvian Coffee and Peruvian Cuisine, as well as national days for several kinds of Peruvian produce.

There was even more than one bill proposing the celebration of the same thing, on the very same day. Organizing the bills into a database and building our search tool allowed people to discover these redundant and unnecessary bills.

Call In the Lawyers

After we aggregated the data into statistics, my lawyer friends found that the majority of bills are approved after only one round of voting. Under Peruvian legislation, waiving the second round of voting for a bill should happen only under exceptional circumstances.

However, the numbers show that a single round of voting has become the norm, as 88% of the approved bills went through only one round. The second round of voting was created to compensate for the fact that the Peruvian Congress has only one chamber where all the decisions are made. Members of Congress are also expected to use the time between the first and second votes for further debate and consultation with advisers and outside experts.

Bonus

The nice thing about having this information in a well-structured, machine-readable format is that we can create cool data visualizations, such as an interactive timeline that shows all the events that happened to a given bill.


Another cool thing is that this data allows us to monitor Congress’ activities. Our web app allows users to subscribe to an RSS feed in order to get the latest bills, hot off the Congress press. My lawyer friends use it to issue “Legal Alerts” on social media when some of the bills intend to do more wrong than good.

Wrap Up

People can build very useful tools with data available on the web. Unfortunately, government data often has poor accessibility and usability, making the transparency laws less useful than they should be. The work of volunteers is key in order to build tools that turn the otherwise clunky content into useful data for journalists, lawyers and regular citizens as well. Thanks to open source software such as Scrapy and Django, we can quickly grab the data and create useful tools like this.

See? You can help a lot of people by doing what you love!:-)

Scrapely: The Brains Behind Portia Spiders

Unlike Portia labiata, the hunting spider that feeds on other spiders, our Portia feeds on data. Its namesake is considered the Einstein of the spider world, and we modeled our own creation after its intelligence and visual abilities.


Portia is our visual web scraping tool which is pushing the boundaries of automated data extraction. Portia is completely open source and we welcome all contributors who are interested in collaborating. Now is the perfect time since we’re on the beta launch of Portia 2.0, so please jump in!


You don’t need a programming background to use Portia. Its web-based UI means you can choose the data you want by clicking (annotating) elements on a webpage. Doing this creates a sample, which Portia later uses to extract information from similar pages.


When taking a look at the brains of Portia, the first component you need to meet is Scrapely.

What is Scrapely?

Portia uses Scrapely to extract structured data from HTML pages. While other commonly used libraries like Parsel (Scrapy’s Selector) and Beautiful Soup use CSS and XPath selectors, Scrapely takes annotated HTML samples as input.

Other libraries work by building a DOM tree based on the HTML source. They then use the given CSS or XPath selectors to find matches in that tree. Scrapely, however, treats an HTML page as a stream of tokens. By building a representation of the page this way, Portia is able to handle any type of HTML, no matter how badly formatted.
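If you want to play with Scrapely directly, outside of Portia, its basic train/scrape workflow looks roughly like this (the URLs and field values below are placeholders):

from scrapely import Scraper

s = Scraper()

# Train with one example: a page URL plus the data you expect to get from that page
s.train('http://example.com/products/1', {'name': 'Classic Widget', 'price': '$19.99'})

# Scrapely then extracts the same fields from similarly structured pages
# by matching the token stream around the trained regions
print(s.scrape('http://example.com/products/2'))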

Machine Learning Data Extraction

To extract data, Portia employs machine learning using the instance-based wrapper induction extraction method implemented by Scrapely. This is an example of supervised learning. You annotate an HTML page to create a set of pre-trained samples that guide the information extraction.

Scrapely reads the streams of tokens from the unannotated pages, and looks for regions that are similar to the sample’s annotations. To decide what should be extracted from new pages, it notes the tags that occur before and after the annotated regions, referred to as the prefix and suffix respectively.

This approach is handy as you don’t need a well-defined HTML page. It instead relies on the order of tags on a page. Another useful feature of this approach is that Scrapely doesn’t need to find a 100% match, and instead looks for the best match. Even if the page is updated and tags are changed, Scrapely can still extract the data.

More importantly, Scrapely will not extract the information if the match isn’t similar enough. This approach helps to reduce false positives.

Portia + Scrapely

Now that you better understand the inner workings of Portia, let’s move on to the front-facing portion.

When you click elements on a page, Portia highlights them. You can give each field a name and choose how it’s extracted. Portia also provides a live preview of what data will be extracted as you annotate.


Portia passes these annotations to Scrapely, which generates a new HTML page with the annotations embedded in it.


Scrapely then compiles this information into an extraction tree built from nested extractors.


Scrapely uses the extraction tree to extract data from the page.

The ContainerExtractor finds the most likely HTML element that contains all the annotations.

Next, the RecordExtractor looks through this element and applies each of the BasicTypeExtractors from the tree, one for each field that you defined.

After trying to match all extractors, Scrapely outputs either an item with the data extracted, or nothing if it couldn’t match any elements that are similar enough to the annotations.

This is how Scrapely works in the background to support the straightforward UI of Portia.

The Future of Data Extraction

You currently need to manually annotate pages with Portia. However, we are developing technologies that will do this for you. Our first tool, the automatic item list extractor, finds list data on a page and automatically annotates the important information contained within.

Another feature we’ve begun working on will let you automatically extract data from commonly found pages such as product, contact, and article pages. It works by using samples created by Portia users (and our team) to build models so that frequently extracted information will be automatically annotated.

Wrap Up

And that’s Portia in a nutshell! Or at least the machine learning brains… In any case, I hope that you found this an informative peek into the underbelly of our visual web scraping tool. We again invite you to contribute to Portia 2.0 beta since it’s completely open source.

Let us know what else you’d like to learn about how Portia works or our advances in automated data extraction.

 
