
New Changes to Our Scrapy Cloud Platform

We are proud to announce some exciting changes we’ve introduced this week. These changes bring a much more pleasant user experience and several new features, including the addition of Portia to our platform!

Here are the highlights:

An Improved Look and Feel

We have introduced a number of improvements in the way our dashboard looks and feels. This includes a new layout based on Bootstrap 3, a more user-friendly color scheme, the ability to schedule jobs once per month, and a greatly improved spiders page with pagination and search.

Filtering and pagination of spiders:


A new user interface for adding periodic jobs:


A new user interface for scheduling spiders:


And much more!

Your Organization on Scrapy Cloud

You are now able to create and manage organizations in Scrapy Cloud, add members if necessary and from there create new projects under your organization. This will make it much easier for you to manage your projects and keep them all in one place. Also, to make things simpler, our billing system will soon be managed at the organization level rather than per individual user.

You are now able to create projects within the context of an organization; however, other organization members will need to be invited in order to access them. A user can be invited to a project even if that user is not a member of the project’s organization.

Export Items as XML

Due to popular demand, we have added the ability to download your items as XML:


Improvements to Periodic Jobs

We have made several improvements to the way periodic jobs are handled:

  • There is no delay when creating or editing a job. For example, if you create a new job at 11:59 to run at 12:00, it will do so without any trouble.
  • If there is any downtime, jobs that were intended to be scheduled during the downtime will be scheduled automatically once the service is restored.
  • You can now schedule jobs at specific dates in the month.

Portia Now Available in Dash

Last year, we open-sourced our annotation based scraping tool, Portia. We have since been working to integrate it into Dash, and it’s finally here!

We have added an ‘Open in Portia’ button to your projects’ Autoscraping page, so you can now open your Scrapy Cloud projects in Portia. We intend Portia to be a successor to our existing Autoscraping interface, and hope you find it to be a much more pleasant experience. No longer do you have to do a preliminary crawl to begin annotating; you can jump straight in!

Check out this demo of how you can create a spider using Portia and Dash!

Enjoy the new features, and of course if you have any feedback please don’t hesitate to post on our support forum!

Introducing ScrapyRT: An API for Scrapy spiders

We’re proud to announce our new open source project, ScrapyRT! ScrapyRT, short for Scrapy Real Time, allows you to extract data from a single web page via an API using your existing Scrapy spiders.

Why did we start this project?

We needed to be able to retrieve the latest data for a previously scraped page, on demand. ScrapyRT made this easy by allowing us to reuse our spider logic to extract data from a single page, rather than running the whole crawl again.

How does ScrapyRT work?

ScrapyRT runs as a web service and retrieving data is as simple as making a request with the URL you want to extract data from and the name of the spider you would like to use.

Let’s say you were running ScrapyRT on localhost; you could make a request like this:

http://localhost:9080/crawl.json?spider_name=foo&url=http://example.com/product/1

ScrapyRT will schedule a request in Scrapy for the URL specified and use the ‘foo’ spider’s parse method as a callback. The data extracted from the page will be serialized into JSON and returned in the response body. If the spider specified doesn’t exist, a 404 will be returned. The majority of Scrapy spiders will be compatible without any additional programming necessary.
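
For example, here is a minimal client sketch using the requests library. The spider name and URL are the placeholders from the example above, and since the exact structure of the JSON response can vary between ScrapyRT versions, we simply print the raw body:

import requests

# Ask ScrapyRT to crawl a single page with the 'foo' spider
# (spider name and target URL are placeholders).
response = requests.get(
    'http://localhost:9080/crawl.json',
    params={'spider_name': 'foo', 'url': 'http://example.com/product/1'},
)
response.raise_for_status()
print(response.json())  # the extracted data, serialized as JSON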

How do I use ScrapyRT in my Scrapy project?

 > git clone https://github.com/scrapinghub/scrapyrt.git
 > cd scrapyrt
 > pip install -r requirements.txt
 > python setup.py install
 > cd ~/your-scrapy-project
 > scrapyrt

ScrapyRT will be running on port 9080, and you can schedule your spiders per the example shown earlier.

We hope you find ScrapyRT useful and look forward to hearing your feedback!

Comment here or discuss on HackerNews.

Looking back at 2014

One year ago we were looking back at the great 2013 we had, and we realized we would face quite a challenge to match that growth this year. So here are some highlights of what we’ve been up to during the year; let’s see how well we did!

2014 was quite the travelling year for Scrapinghub! We sponsored both the US PyCon in Montreal and the Spanish PyCon in Zaragoza. We’ve also been to Codemotion in Madrid and PythonBrasil. We hope to hit the road during 2015 too, bringing some spider magic to even more cities!

PyCon US Scrapinghub booth

During this year we’ve also continued to work on both new and ongoing Professional Services projects for clients all around the world, and we’re glad to see that our efforts are paying off: we have increased our customer base while maintaining the same quality standards we’ve had since we were just a few guys back in 2010.

Our platform has grown too! There’s been steady effort in getting it to the point where we’ve been able to accommodate the ever-increasing volume of scraping we and our customers have been doing. In 2014 alone Scrapy Cloud has scraped and stored data from over 10 billion pages (more than 5 times the amount we did in 2013!) and an extra 5 billion have passed through Crawlera.

We are excited to see our revenue tripling from last year, and it makes us very proud to have grown organically so far. We can only imagine what we could do with some funding, but we won’t do anything that could jeopardize the way we run the company, which has proven very successful.

Open Source

On the open source front, we’ve been spending a lot of time improving our annotation-based scraping tool, Portia. Our main focus has been on integrating it into our Scrapy Cloud platform, and soon Scrapinghub users will be able to open their Autoscraping projects in Portia. You can see an example of the current Dash integration here. Portia will eventually be the successor to our Autoscraping tool. If you just cannot wait to try Portia, you’re in luck: we open sourced it some time ago (it was a trending Python project on GitHub for a month!), so you can try it locally if you wish!

 

We also have a number of new and interesting open source projects: Dateparser, Crawl Frontier and Splash.

 

  • Dateparser is a parser for human-readable dates that is able to detect and support multiple languages. It can even read text such as “2 weeks ago” and determine the date relative to the current time! The project already has over 300 stars and 17 forks on GitHub.

 

  • Crawl Frontier is a framework for building the frontier part of your web crawler: the component of a crawling system that decides the logic and policies to follow when visiting websites, such as which pages should be crawled next, their priority and ordering, how often pages are revisited, and so on. Although originally designed for use with Scrapy, Crawl Frontier can now be used with any other crawling framework or project you wish to use!

 

  • Splash is a JavaScript rendering service with an HTTP API. It’s currently our in-house solution for crawling JavaScript-powered websites, and we’re hopeful that usage will begin to grow outside of Scrapinghub.

 

Of course, all our other existing open source projects, such as Scrapy and webstruct, have also seen major improvements, something that will keep going during 2015 and beyond.

We’re glad to be able to share all these experiences, numbers and new projects with you, but we know very well that behind every single one of them stands the hard work of our team; to say this wouldn’t have been possible without them is an understatement. And the team has grown: 2014 marks the second year in a row that our team has doubled, and we’re now 85 Scrapinghubbers (we’ll be over 90 by the end of January)!

Green markers represent 2014 new hires. Click for an interactive map of our full team.

So here’s a sincere thank you from the Scrapinghub team to all of our customers and supporters. Thanks for an amazing 2014 and watch out 2015, here we come! Happy New Year!

XPath tips from the web scraping trenches

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors. In case you’re looking for a tutorial, here is an XPath tutorial with nice examples.

In this post, we’ll show you some tips we found valuable when using XPath in the trenches, using Scrapy Selector API for our examples.

Avoid using contains(.//text(), ‘search text’) in your XPath conditions. Use contains(., ‘search text’) instead.

Here is why: the expression .//text() yields a collection of text elements — a node-set. When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first node only.

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> xp = lambda x: sel.xpath(x).extract() # let's type this only once
>>> xp('//a//text()') # take a peek at the node-set
   [u'Click here to go to the ', u'Next Page']
>>> xp('string(//a//text())')  # convert it to a string
   [u'Click here to go to the ']

A node converted to a string, however, concatenates its own text with the text of all its descendants:

>>> xp('//a[1]') # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> xp('string(//a[1])') # converts it to string
[u'Click here to go to the Next Page']

So, in general:

GOOD:

>>> xp("//a[contains(., 'Next Page')]")
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

BAD:

>>> xp("//a[contains(.//text(), 'Next Page')]")
[]

GOOD:

>>> xp("substring-after(//a, 'Next ')")
[u'Page']

BAD:

>>> xp("substring-after(//a//text(), 'Next ')")
[u'']

You can read more detailed explanations about string values of nodes and node-sets in the XPath spec.

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel=Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//li[1]") # get all first LI elements under whatever it is its parent
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//li)[1]") # get the first LI element in the whole document
[u'<li>1</li>']
>>> xp("//ul/li[1]")  # get all first LI elements under an UL parent
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//ul/li)[1]") # get the first LI element under an UL parent in the document
[u'<li>1</li>']

Also,

//a[starts-with(@href, '#')][1] gets a collection of the local anchors that occur first under their respective parents.

(//a[starts-with(@href, '#')])[1] gets the first local anchor in the document.
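
For example, a quick check on a small, made-up snippet:

>>> sel = Selector(text='<p><a href="#a">1</a><a href="#b">2</a></p><p><a href="#c">3</a></p>')
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//a[starts-with(@href, '#')][1]")    # first local anchor under each parent
[u'<a href="#a">1</a>', u'<a href="#c">3</a>']
>>> xp("(//a[starts-with(@href, '#')])[1]")  # first local anchor in the whole document
[u'<a href="#a">1</a>']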

When selecting by class, be as specific as necessary

If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let’s cook up some examples:

>>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
>>> xp = lambda x: sel.xpath(x).extract()

BAD: doesn’t work because there are multiple classes in the attribute

>>> xp("//*[@class='content']")
[]

BAD: gets more than we want

>>> xp("//*[contains(@class,'content')]")
[u'<p class="content-author">Someone</p>']

GOOD:

>>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
[u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']

Read more about what you can do with Scrapy’s Selectors here.

Learn to use all the different axes

It is handy to know how to use the axes; you can follow the examples given in the tutorial to review them quickly.

In particular, you should note that following and following-sibling are not the same thing; this is a common source of confusion. The same goes for preceding and preceding-sibling, and also ancestor and parent.
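
As a quick illustration on a tiny, made-up document: following-sibling stays within the same parent, while following keeps going through the rest of the document:

>>> from scrapy import Selector
>>> sel = Selector(text='<div><p>first</p><span>sibling</span></div><p>outside</p>')
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp('//div/p/following-sibling::*')  # only elements that share the <p> element's parent
[u'<span>sibling</span>']
>>> xp('//div/p/following::*')          # every element after the <p> in document order
[u'<span>sibling</span>', u'<p>outside</p>']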

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content from script and style tags and also skips whitespace-only text nodes. Source: http://stackoverflow.com/a/19350897/2572383
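
For example, a quick demonstration on a small, made-up snippet:

>>> from scrapy import Selector
>>> sel = Selector(text='<script>var x = 1;</script><h1>Title</h1> <p>Some text</p>')
>>> sel.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
[u'Title', u'Some text']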

Do you have another XPath tip?

Please, leave us a comment with your tips or questions. :)

And for everybody who contributed tips and reviewed this article, a big thank you!

Introducing Data Reviews

One of the most time-consuming parts of building a spider is reviewing the scraped data and making sure it conforms to the requirements and expectations of your client or team. This process is so time consuming that, in many cases, it ends up taking more time than writing the spider code itself, depending on how well the requirements are written. To make this process more efficient we have introduced the ability to comment on data directly in Dash (the Scrapinghub UI), right next to the data, instead of relying on other channels (like issue trackers, emails or chat).

 


 

With this new feature you can discuss problems with data right where they appear without having to copy/paste data around, and have a conversation with your client or team until the issue is resolved. This reduces the time spent on data QA, making the whole process more productive and rewarding.

So go ahead, start adding comments to your data (you can comment whole items or individual fields) and let the conversation flow around data! You can mark resolved issues by archiving comments, and you will see jobs with unresolved (unarchived) comments directly on the Jobs Dashboard.

Last, but not least, you have the Data Reviews API to insert comments programmatically. This is useful, for example, to report problems in post-processing scripts that analyze the scraped data.

Happy scraping!

Extracting schema.org microdata using Scrapy selectors and XPath

Web pages are full of data; that is what web scraping is mostly about. But often you want more than data: you want meaning. Microdata markup embedded in HTML source helps machines understand what the pages are about: contact information, product reviews, events etc.

Web authors have several ways to add metadata to their web pages: HTML “meta” tags, microformats, social media meta tags (Facebook’s Open Graph protocol, Twitter Cards).

And then there is Schema.org, an initiative from big players in the search space (Yahoo!, Google, Bing and Yandex) to “create and support a common set of schemas for structured data markup on web pages.”

Here are some stats from the Web Data Commons project on the use of microdata in the Common Crawl web corpus:

In summary, we found structured data within 585 million HTML pages out of the 2.24 billion pages contained in the crawl (26%).

Let’s focus on Schema.org syntax for the rest of this article. The markup looks like this (example from schema.org “Getting started” page):

<div itemscope itemtype ="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

Let’s assume you want to extract this microdata and get the movie item from the snippet above.

Using Scrapy Selectors, let’s first loop over elements with an itemscope attribute (this represents a container for an item’s properties) using .//*[@itemscope], and for each item, get all properties, i.e. elements that have an itemprop attribute:

>>> from scrapy.selector import Selector
>>> selector = Selector(text="""
... <div itemscope itemtype ="http://schema.org/Movie">
...   <h1 itemprop="name">Avatar</h1>
...   <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
...   <span itemprop="genre">Science fiction</span>
...   <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
... </div>""", type="html")
>>> selector.xpath('.//*[@itemscope]')
[<Selector xpath='.//*[@itemscope]' data=u'<div itemscope itemtype="http://schema.o'>]
>>> 

Let’s print out more interesting stuff, like the item’s itemtype and the properties’ values (the text representation of their HTML element):

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print item.xpath('@itemtype').extract()
...     for property in item.xpath('.//*[@itemprop]'):
...         print property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract()
... 
[u'http://schema.org/Movie']
[u'name'] [u'Avatar']
[u'director'] [u'James Cameron']
[u'genre'] [u'Science fiction']
[u'trailer'] [u'Trailer']
>>> 

Hm. The value for trailer isn’t that interesting. We should have selected the href of <a href="../movies/avatar-theatrical-trailer.html">Trailer</a>.
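
As a quick aside, with the selector defined above we could of course pick that attribute up directly:

>>> selector.xpath('.//*[@itemprop="trailer"]/@href').extract()
[u'../movies/avatar-theatrical-trailer.html']

But hard-coding the interesting attribute for every property does not scale, as the extended example below shows.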

For the following extended example markup, taken from Wikipedia,

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
  Director: <span itemprop="name">James Cameron</span> 
(born <time itemprop="birthDate" datetime="1954-08-16">August 16, 1954</time>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

we see that it’s not always the href attribute that is interesting; it might be datetime, or src for images, or content for meta elements… any attribute really.

Therefore we need to get each itemprop element’s text content and its attributes (we use the @* XPath expression for the latter).

Let’s do that on the 2nd HTML snippet:

>>> selector = Selector(text="""
... <div itemscope itemtype="http://schema.org/Movie">
...   <h1 itemprop="name">Avatar</h1>
...   <div itemprop="director" itemscope itemtype="http://schema.org/Person">
...   Director: <span itemprop="name">James Cameron</span> 
... (born <time itemprop="birthDate" datetime="1954-08-16">August 16, 1954</time>)
...   </div>
...   <span itemprop="genre">Science fiction</span>
...   <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
... </div>""", type="html")
>>> for item in selector.xpath('.//*[@itemscope]'):
...     print item.xpath('@itemtype').extract()
...     for property in item.xpath('.//*[@itemprop]'):
...         print property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract(),
...         print property.xpath('@*').extract()
... 
[u'http://schema.org/Movie']
[u'name'] [u'Avatar'] [u'name']
[u'director'] [u'\n  Director: James Cameron \n(born August 16, 1954)\n  '] [u'director', u'', u'http://schema.org/Person']
[u'name'] [u'James Cameron'] [u'name']
[u'birthDate'] [u'August 16, 1954'] [u'birthDate', u'1954-08-16']
[u'genre'] [u'Science fiction'] [u'genre']
[u'trailer'] [u'Trailer'] [u'../movies/avatar-theatrical-trailer.html', u'trailer']
[u'http://schema.org/Person']
[u'name'] [u'James Cameron'] [u'name']
[u'birthDate'] [u'August 16, 1954'] [u'birthDate', u'1954-08-16']
>>> 

Wait a minute! We’re getting only attribute values, not attribute names (and the itemprop value twice for that matter: once from @itemprop and once from @*).

To get names of attributes, one cannot apparently use name() on an attribute… BUT you can do name(@*[i]), i being the attribute position:

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print "Item:", item.xpath('@itemtype').extract()
...     for property in item.xpath('.//*[@itemprop]'):
...         print "Property:",
...         print property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract()
...         for position, attribute in enumerate(property.xpath('@*'), start=1):
...             print "attribute: name=%s; value=%s" % (
...                 property.xpath('name(@*[%d])' % position).extract(),
...                 attribute.extract())
...         print
...     print
... 
Item: [u'http://schema.org/Movie']
Property: [u'name'] [u'Avatar']
attribute: name=[u'itemprop']; value=name

Property: [u'director'] [u'\n  Director: James Cameron \n(born August 16, 1954)\n  ']
attribute: name=[u'itemprop']; value=director
attribute: name=[u'itemscope']; value=
attribute: name=[u'itemtype']; value=http://schema.org/Person

Property: [u'name'] [u'James Cameron']
attribute: name=[u'itemprop']; value=name

Property: [u'birthDate'] [u'August 16, 1954']
attribute: name=[u'itemprop']; value=birthDate
attribute: name=[u'datetime']; value=1954-08-16

Property: [u'genre'] [u'Science fiction']
attribute: name=[u'itemprop']; value=genre

Property: [u'trailer'] [u'Trailer']
attribute: name=[u'href']; value=../movies/avatar-theatrical-trailer.html
attribute: name=[u'itemprop']; value=trailer


Item: [u'http://schema.org/Person']
Property: [u'name'] [u'James Cameron']
attribute: name=[u'itemprop']; value=name

Property: [u'birthDate'] [u'August 16, 1954']
attribute: name=[u'itemprop']; value=birthDate
attribute: name=[u'datetime']; value=1954-08-16


>>> 

There’s still something wrong, right? Duplicate properties.

It’s because the 2nd HTML snippet is using embedded items. Indeed, James Cameron is the director of “Avatar”, but he’s also a Person (yes, he is!). That’s why the director <div> carries its own itemscope attribute.

How can we fix that, selecting only the properties at the current scope and leaving nested properties for when we reach the nested item?

Well, it happens that Scrapy Selectors support some EXSLT extensions, notably the set operations. Here, we’ll use set:difference: we’ll select itemprops under the current itemscope, and then exclude those that are themselves children of another itemscope under the current one.

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print "Item:", item.xpath('@itemtype').extract()
...     for property in item.xpath(
...             """set:difference(.//*[@itemprop],
...                               .//*[@itemscope]//*[@itemprop])"""):
...         print "Property:", property.xpath('@itemprop').extract(),
...         print property.xpath('string(.)').extract()
...         for position, attribute in enumerate(property.xpath('@*'), start=1):
...             print "attribute: name=%s; value=%s" % (
...                 property.xpath('name(@*[%d])' % position).extract(),
...                 attribute.extract())
...         print
...     print
... 
Item: [u'http://schema.org/Movie']
Property: [u'name'] [u'Avatar']
attribute: name=[u'itemprop']; value=name

Property: [u'director'] [u'\n  Director: James Cameron \n(born August 16, 1954)\n  ']
attribute: name=[u'itemprop']; value=director
attribute: name=[u'itemscope']; value=
attribute: name=[u'itemtype']; value=http://schema.org/Person

Property: [u'genre'] [u'Science fiction']
attribute: name=[u'itemprop']; value=genre

Property: [u'trailer'] [u'Trailer']
attribute: name=[u'href']; value=../movies/avatar-theatrical-trailer.html
attribute: name=[u'itemprop']; value=trailer


Item: [u'http://schema.org/Person']
Property: [u'name'] [u'James Cameron']
attribute: name=[u'itemprop']; value=name

Property: [u'birthDate'] [u'August 16, 1954']
attribute: name=[u'itemprop']; value=birthDate
attribute: name=[u'datetime']; value=1954-08-16


>>> 

But wouldn’t it be nice to keep a reference to the Person item as property of the Movie director itemprop? We’d need to uniquely identify items in the markup, maybe the position of each itemscope element, or its number.

You can use the XPath count() function for that, counting other itemscopes that are siblings before (preceding::*[@itemscope]) or ancestors of the current one (ancestor::*[@itemscope]):

>>> for item in selector.xpath('.//*[@itemscope]'):
...     print "Item:", item.xpath('@itemtype').extract()
...     print "ID:", item.xpath("""count(preceding::*[@itemscope])
...                              + count(ancestor::*[@itemscope])
...                              + 1""").extract()
... 
Item: [u'http://schema.org/Movie']
ID: [u'1.0']
Item: [u'http://schema.org/Person']
ID: [u'2.0']
>>> 

Here’s a cleaned up example routine, referencing items by their ID when itemprops are also itemscopes:

>>> import pprint
>>> items = []
>>> for itemscope in selector.xpath('//*[@itemscope][@itemtype]'):
...     item = {"itemtype": itemscope.xpath('@itemtype').extract()[0]}
...     item["item_id"] = int(float(itemscope.xpath("""count(preceding::*[@itemscope])
...                                                  + count(ancestor::*[@itemscope])
...                                                  + 1""").extract()[0]))
...     properties = []
...     for itemprop in itemscope.xpath("""set:difference(.//*[@itemprop],
...                                                       .//*[@itemscope]//*[@itemprop])"""):
...         property = {"itemprop": itemprop.xpath('@itemprop').extract()[0]}
...         if itemprop.xpath('@itemscope'):
...             property["value_ref"] = {
...                 "item_id": int(float(itemprop.xpath("""count(preceding::*[@itemscope])
...                                                      + count(ancestor::*[@itemscope])
...                                                      + 1""").extract()[0]))
...             }
...         else:
...             value = itemprop.xpath('normalize-space(.)').extract()[0]
...             if value:
...                 property["value"] = value
...         attributes = []
...         for index, attribute in enumerate(itemprop.xpath('@*'), start=1):
...             propname = itemprop.xpath('name(@*[%d])' % index).extract()[0]
...             if propname not in ("itemprop", "itemscope"):
...                 attributes.append((propname, attribute.extract()))
...         if attributes:
...             property["attributes"] = dict(attributes)
...         properties.append(property)
...     item["properties"] = properties
...     items.append(item)
... 
>>> pprint.pprint(items)
[{'item_id': 1,
  'itemtype': u'http://schema.org/Movie',
  'properties': [{'itemprop': u'name', 'value': u'Avatar'},
                 {'attributes': {u'itemtype': u'http://schema.org/Person'},
                  'itemprop': u'director',
                  'value_ref': {'item_id': 2}},
                 {'itemprop': u'genre', 'value': u'Science fiction'},
                 {'attributes': {u'href': u'../movies/avatar-theatrical-trailer.html'},
                  'itemprop': u'trailer',
                  'value': u'Trailer'}]},
 {'item_id': 2,
  'itemtype': u'http://schema.org/Person',
  'properties': [{'itemprop': u'name', 'value': u'James Cameron'},
                 {'attributes': {u'datetime': u'1954-08-16'},
                  'itemprop': u'birthDate',
                  'value': u'August 16, 1954'}]}]
>>> 

This dict output is not quite what W3C details in the microdata specs, but it’s close enough to leave the rest to you as an exercise ;-)

You can find a GitHub Gist of the above code here.

Announcing Portia, the open source visual web scraper!

We’re proud to announce the developer release of Portia, our new open source visual scraping tool based on Scrapy. Check out this video:

As you can see, Portia allows you to visually configure what’s crawled and extracted in a very natural way. It provides immediate feedback, making the process of creating web scrapers quicker and easier than ever before!

Portia is available to developers on GitHub. We plan to offer a hosted version on Scrapinghub soon, which will be compatible with Autoscraping and fully integrated with our platform.

Please send us your feedback!
