Browsed by Tag: web scraping

How to Increase Sales with Online Reputation Management

One negative review can cost your business up to 22% of its prospects. That was one of the sobering findings in a study highlighted on Moz last year. With over half of shoppers rating reviews as important to their buying decision, no company, large or small, can afford to ignore stats like these, let alone the reviews themselves. In what follows I’ll let you in on how web scraping can help you stay on top.

What is online reputation management?

Online reputation management is the practice of carefully maintaining and curating your brand’s image by monitoring social media, reviews, and articles about your company. When it comes to online reputation management, you can’t have too much information. This is a critical part of your business strategy that impacts pretty much every level of your organization, from customer service to marketing to sales. BrightLocal found that “84% of people trust online reviews as much as a personal recommendation.” The relationship between brands and customers has become a two-way street because of the multitude of channels for interaction. Hence the rise of influencer and guerrilla marketing tactics.

A key part of online reputation management is highlighting positive reviews to send the message that you are a responsive company that rewards loyal and happy customers. Online reputation management is likewise critical to putting out any potential customer fires. The attrition rate of consumers shoots up to 70% when they stumble across four or more negative articles. You need to be able to act fast to address criticisms and to mitigate escalating issues. Ideally you should not delete negative feedback, but instead show the steps that you are taking to rectify the situation. Besides sparing you an occasional taste of the Streisand effect, this shows that you are responsible, transparent, and not afraid to own up to errors.

How to manage your reputation online

While you could manually monitor social media and review aggregators, in addition to Googling your company for unexpected articles, it’s much more effective to automate the process. There are plenty of companies and services that specialize in this, including:

  1. Sprout Social
  2. Brandwatch
  3. Klear
  4. Sprinklr

If you want complete control over your data and the type of information that you’d like to monitor, web scraping is the most comprehensive and flexible choice.

Web scraping, the man behind the curtain

Web scraping provides reliable and up-to-date web data

There is an inconceivably vast amount of content on the web which was built for human consumption. However, its unstructured nature presents an obstacle for software. So the general idea behind web scraping is to turn this unstructured web content into a structured format for easy analysis.

Automated data extraction takes the tedium out of manual research and allows you to focus on finding actionable insights and implementing them. And this is especially critical when it comes to online reputation management. Respondents to The Social Habit study showed that when customers contact companies through social media for customer support issues, 32% expect a response within 30 minutes and 42% expect a response within 60 minutes. Using web scraping, you could easily have constantly updating data feeds that alert you to comments, help queries, and complaints about your brand on any website, allowing you to take instant action.

You also need to be sure that nothing falls through the cracks. You can easily monitor thousands, if not millions of websites for changes and updates that will impact your company.

Sentiment analysis and review monitoring

Now, a key part of online reputation management is monitoring reviews for positive and negative feedback. Once the extracted web data is in, you can use machine learning to do sentiment analysis. This form of textual analysis categorizes messages as positive or negative, and the more data you use to train the program, the more effective it becomes. This lets you respond quickly to negative reviews while keeping track of positive ones so you can reward customers and highlight loyalty.
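As a rough illustration, here is a minimal sentiment-classifier sketch using scikit-learn. The sample reviews and labels below are made up for the example; in practice you would train on a much larger labelled set of scraped reviews and mentions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of labelled example reviews (1 = positive, 0 = negative).
reviews = [
    "Great product, fast shipping, friendly support",
    "Terrible experience, the item arrived broken",
    "Love it, exactly as described",
    "Awful customer service, never again",
]
labels = [1, 0, 1, 0]

# Bag-of-words features feeding a simple linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reviews, labels)

# Classify freshly scraped mentions as positive (1) or negative (0).
print(model.predict(["The support team resolved my issue in minutes"]))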

Straight From the Horse’s Mouth

Here are two entrepreneurs providing real world examples of how they use online reputation management and review monitoring to increase their business.

The Importance of Review Monitoring

Kent Lewis
President and Founder of Anvil Media, Inc.
http://www.anvilmediainc.com/about/team/kent-lewis

As a career agency professional who has owned my own agency for the past 16 years, I have a few thoughts regarding monitoring reviews and assessing sentiment analysis to move businesses forward:

Monitoring reviews (including sentiment) is essential to your business. Ignoring (negative) reviews can cause undue and unnecessary harm. Since 90% of customers read online reviews before visiting a business, negative reviews can directly affect sales. Conversely, a one-star increase on Yelp leads to a 5-9% increase in business revenue.

Online reviews can be monitored manually (bookmarking and visiting sites like Google My Business, Yelp and others daily or as-needed). However, there are a host of tools available that automate the process. Utilize a mix of free (socialmention.com) and paid tools (Revinate.com) to regularly monitor reviews, in order to address negative reviews and celebrate positive reviews.

While the primary objective for monitoring reviews is identifying and mitigating negative reviews, there are a host of other benefits to capturing and analyzing the data. Harvesting and analyzing the data will provide insights that will improve your products and services. For starters, you can measure and trend sentiment for the brand overall. With additional insights, you can track and monitor reviews for specific products, services or locations. Social media and review sites are the largest (free) focus group in the world. Additionally, you can look at competitors and create benchmarks to track trends over time. Lastly, you can identify superfans that can be nurtured into brand ambassadors.

The sources of data vary by company and industry. Most businesses can be reviewed on Google My Business, Yelp, BBB and Glassdoor (for employees). Each industry has specific sites that also must be monitored, including Expedia and Travelocity for travel & hospitality.

To get maximum value from your monitoring efforts, always look at competitor reviews. Customers are telling you what business you should be in based on their feedback and suggestions for improvement… learn from the entire industry, not just your current or past customers.

Online reputation management, social media, and competitor monitoring

Max Robinson
Owner of Ace Work Gear
http://www.aceworkgear.com

We use tools like Sprout Social, which helps us track mentions on social media for our clients, as this is where the majority of the discussion about their business happens. The main reason that our clients want to track these mentions is that people tend to speak more openly and honestly about their experiences with a business on social media than anywhere else online. This also gives our clients the chance to join in conversations with their customers in a casual manner, whereas interactions on review sites can be far more formal.

We report on the number of mentions, and whether our client is being discussed in a positive or negative manner, as well as what the discussion is specifically related to. We look at 3 main social media platforms – Facebook, Twitter and Reddit. We also monitor mentions of competitors across all of these platforms, as per the request of our clients.

Monitoring the online reputation of competitors

Do not neglect your competitors when monitoring reviews and social media. Keeping track of the online reputation of competitors allows you to:

  1. Correctly position and price your product or service offerings
  2. Snatch customers who are posting dissatisfied reviews and comments about your competition
  3. Launch more effective marketing campaigns that address pain points experienced by customers of your competition
  4. Determine what your competitors are doing right so that you can innovate off of their ideas

And that’s just the tip of the iceberg. Competitive intelligence and having an accurate overview of your industry only serves to help you sell your products more effectively. And to bring it back to online reputation management, having a negative perception of your brand is like shooting yourself in the foot. You’re already at a severe disadvantage, especially when compared to positively reviewed competitors.

How to use online reputation management to increase your sales

In an interview, Don Sorensen, president of Big Blue Robot, shared that one company he worked with was losing an estimated $2 million or more in sales due to a poor online reputation. Don’t let this be you.

  1. The first step is to level the playing field by locating and responding to all of the negative reviews. With a damaged reputation, you should be in crisis mode and monitoring brand mentions around-the-clock so that you are never caught by surprise.
  2. Dominate your search results so that there is little room for people with vendettas to swoop in. This means posting regularly on social media, getting press coverage, and answering questions in forums related to your business or your industry.
  3. Curate your brand’s reputation by having an active blog that carefully frames the benefits of your business, tailored to your audience.

If you are proactive and have a positive reputation or have managed to repair your reputation, then enthusiastic reviews and word of mouth will increase and improve your lead generation prospects. Your sales team should also be fully aware of your online reputation so they can soothe potential concerns or draw attention to success stories.

Wrap up

They say that a good reputation is more valuable than money. Guard yours closely with web data and ensure that you are taking every precaution necessary to retain customers and win over new leads.

Explore ways that you can use web data or chat with one of our representatives to learn more.

How to Build your own Price Monitoring Tool

Computers are great at repetitive tasks. They don’t get distracted, bored, or tired. Automation is how you should approach the tedious tasks that are essential to running a successful business, as well as the mundane responsibilities that come with it. Price monitoring, for example, is a practice that every company should be doing, and it’s a task that readily lends itself to automation.

In this tutorial, I’ll walk you through how to create your very own price monitoring tool from scratch. While I’m approaching this as a careful shopper who wants to make sure I’m getting the best price for a specific product, you could develop a similar tool to monitor your competitors using similar methods.

Why you should be monitoring competitor prices

Price monitoring is basically knowing how your competitors price their products, how your prices fit within your industry, and whether there are any fluctuations that you can take advantage of.

When it comes to mission critical tasks like price monitoring, it’s important to ensure accuracy, obtain up-to-date information, and have the capacity for massive scale. By pricing your products perfectly, you can make sure that your competitors aren’t undercutting you, which makes you more likely to nab customers.

In our article on how web data is used by startups, Max Robinson, owner of Ace Work Gear, shared his thoughts on the importance of price monitoring:

“But it occurred to me that if you aren’t offering competitive prices, then you’re essentially throwing money down the drain. Even if you have good visibility, users will look elsewhere to buy once they’ve seen your prices.”

And that’s part of why automation is so important. You don’t want to miss sudden sales or deals from competitors that might make your offerings less desirable.

Overview

In terms of using price monitoring as a consumer, the key is to be able to take advantage of rapid price drops so you can buy during lightning sales. For this tutorial, I used Scrapy, our open source web scraping framework, and Scrapy Cloud, our fully-featured production environment (there’s a forever free account option). Here is the basic outline of my approach:

  1. Develop web scrapers to periodically collect prices from a list of products and online retailers.
  2. Build a Python script to check whether there are price drops in the most recently scraped data and then send an email alert when there are.
  3. Deploy the project to Scrapy Cloud and schedule periodic jobs to run the spiders and the script every X minutes.

Collecting the Prices

I monitored prices from a couple of online retailers. To scrape the prices, I built one Scrapy spider for each of them. The spiders work by:

  1. Reading a list of product URLs from a JSON file
  2. Scraping the prices for the listed products
  3. Storing the prices in a Scrapy Cloud Collection (an efficient key-value storage)

Here is a sample JSON file with product URLs:

{
    "headsetlogitech": [
        "https://www.retailer1.com/pagefor-logitech-headset",
        "http://www.retailer2.com/pagefor-headset-logitech",
        "http://www.retailer3.com/pagefor-headset-log"
    ],
    "webcamlogitech": [
        "https://www.retailer1.com/pagefor-logitech-webcam",
        "http://www.retailer2.com/pagefor-webcam-logitech",
        "http://www.retailer3.com/pagefor-webcam-log"
    ]
}

If you want to monitor more retailers than the three I implemented, all you need to do is add their URLs to the JSON file and then create the requisite Scrapy spider for each website.

The Spiders

If you are new to the world of Scrapy and web scraping, then I suggest that you check out this tutorial first. When building a spider, you need to pay attention to the layout of each retailer’s product page. For most of these stores, the spider code will be really straightforward, containing only the extraction logic using CSS selectors. In this case, the URLs are read during the spider’s startup.

Here’s an example spider for Best Buy:

class BestbuySpider(BaseSpider):
    name = "bestbuy.com"

    def parse(self, response):
        # The partially-filled item is passed along in the request meta by the base spider.
        item = response.meta.get('item', {})
        item['url'] = response.url
        # Title and price are read straight from the product page with CSS selectors.
        item['title'] = response.css(
            'div#sku-title > h1::text'
        ).extract_first().strip()
        item['price'] = float(response.css(
            'div.price-block ::attr(data-customer-price)'
        ).extract_first(default=0))
        yield item

BaseSpider contains the logic to read the URLs from the JSON file and generate requests. In addition to the spiders, I created an item pipeline to store product data in a Scrapy Cloud collection. You can check out the other spiders that I built in the project repository.
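For context, here is a minimal sketch of what a base spider like that might look like. The file path, the meta key and the way a URL is matched to a retailer are assumptions for the example; the actual implementation in the repository may differ.

import json

import scrapy


class BaseSpider(scrapy.Spider):

    def start_requests(self):
        # resources/urls.json maps each product key to the retailer pages that list it.
        with open('resources/urls.json') as f:
            products = json.load(f)
        for product_key, urls in products.items():
            for url in urls:
                # Each subclass only handles URLs belonging to its own retailer.
                if self.name in url:
                    yield scrapy.Request(
                        url,
                        callback=self.parse,
                        meta={'item': {'product': product_key}},
                    )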

Building the Price Monitoring Script

Now that the spiders have been built, you should start getting product prices that are then stored in a collection. To monitor price fluctuations, the next step is to build a Python script that will pull data from that collection, check if the most recent prices are the lowest in a given time span, and then send an email alert when it finds a good deal.
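As a rough sketch of the core logic, the snippet below checks a price history for drops and emails an alert. The function names and the shape of the data are mine, not the repository’s: the real script reads from the Scrapy Cloud collection, handles the time frame, and exposes its options as command line arguments.

import smtplib
from email.mime.text import MIMEText


def find_deals(price_history, margin=1.0):
    """Return products whose newest price beats every earlier price by at least `margin`."""
    deals = []
    for product, entries in price_history.items():
        # entries: list of (retailer, price) tuples, oldest first.
        *older, (retailer, latest_price) = entries
        if older and all(latest_price <= price - margin for _, price in older):
            deals.append((product, retailer, latest_price))
    return deals


def send_alert(deals, sender, recipient, smtp_host='localhost'):
    body = '\n'.join('{}: now {:.2f} at {}'.format(p, price, r) for p, r, price in deals)
    msg = MIMEText(body)
    msg['Subject'] = 'Price drop alert'
    msg['From'], msg['To'] = sender, recipient
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)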

Here is my model email notification that is sent out when there’s a price drop:

[Screenshot: price drop email alert]

You can find the source code for the price monitor in the project repository. As you might have noticed, there are customizable options via command line arguments. You can:

  • modify the time frame in which the prices are compared to find out whether the latest price is the best of the day, the week, the month, and so forth.
  • set a price margin to ignore insignificant price drops since some retailers have minuscule price fluctuations throughout the day. You probably don’t want to receive an email when the product that you’re interested in drops one cent…

Deployment and Execution

Now that you have the spider(s) and the script, you need to deploy both to Scrapy Cloud, our PaaS for web crawlers.

I scheduled my spiders to collect prices every 30 minutes and the script to check this data at 30 minute intervals as well. You can configure this through your Scrapy Cloud dashboard, easily changing the periodicity depending on your needs.

Check out this video to learn how to deploy Scrapy spiders and this tutorial on how to run a regular Python script on Scrapy Cloud.

How to run this project in your own Scrapy Cloud account:

  • Clone the project:
    • git clone git@github.com:scrapinghub/sample-projects.git
  • Add the products you want to monitor to resources/urls.json
  • Sign up for Scrapy Cloud (it’s free!)
  • Create a project on Scrapy Cloud
  • Deploy your local project to Scrapy Cloud
  • Create a periodic execution job to run each spider
  • Create a periodic execution job to run the monitor script
  • Sit back, relax, and let automation work its magic

Scaling up

This price monitor is a good fit for individuals interested in getting the best deals for their wishlist. However, if you’re looking to scale up and create a reliable tool for monitoring competitors, here are some typical challenges that you will face:

  • Getting prices from online retailers who feature millions of products can be overwhelming. Scraping these sites requires advanced crawling strategies to make sure that you always have hot data that is relevant.
  • Online retailers typically have layout variations throughout their website and the smallest shifts can bring your crawler to a screeching halt. To get around this, you might need to use advanced techniques such as machine learning to help with data discovery.
  • Running into anti-bot software can shut your price gathering activities down. You will need to develop some sophisticated techniques for bypassing these obstacles.

If you’re curious about how to implement or develop an automated price monitoring tool, feel free to reach out with any questions.

Wrap up

To sum up, there’s no reason why you should be manually searching for prices and monitoring competitors. Using Scrapy, Scrapy Cloud, a Python script, and just a little bit of programming know-how, you can easily get your holiday shopping done under budget with deals delivered straight to your inbox.

If you’re looking for a professional-grade competitor and price monitoring service, get in touch!

An Introduction to XPath: How to Get Started

XPath is a powerful language that is often used for scraping the web. It allows you to select nodes or compute values from an XML or HTML document, and it is one of the two languages that you can use to extract web data using Scrapy. The other is CSS, and while CSS selectors are a popular choice, XPath actually allows you to do more.

With XPath, you can extract data based on text elements’ contents, and not only on the page structure. So when you are scraping the web and you run into a hard-to-scrape website, XPath may just save the day (and a bunch of your time!).

This is an introductory tutorial that will walk you through the basic concepts of XPath, crucial to a good understanding of it, before diving into more complex use cases.

Note: You can use the XPath playground to experiment with XPath. Just paste the HTML samples provided in this post and play with the expressions.

The basics

Consider this HTML document:

<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>

XPath handles any XML/HTML document as a tree. This tree’s root node is not part of the document itself: it is in fact the parent of the document element node (<html> in the case of the HTML above). This is what the XPath tree for the HTML document looks like:

[Figure: the XPath tree for the HTML document above]

As you can see, there are many node types in an XPath tree:

  • Element node: represents an HTML element, a.k.a. an HTML tag.
  • Attribute node: represents an attribute of an element node, e.g. the "href" attribute in <a href="http://www.example.com">example</a>.
  • Comment node: represents comments in the document (<!-- ... -->).
  • Text node: represents the text enclosed in an element node (example in <p>example</p>).

Distinguishing between these different types is useful to understand how XPath expressions work. Now let’s start digging into XPath.

Here is how we can select the title element from the page above using an XPath expression:

/html/head/title

This is what we call a location path. It allows us to specify the path from the context node (in this case the root of the tree) to the element we want to select, as we do when addressing files in a file system. The location path above has three location steps, separated by slashes. It roughly means: start from the ‘html’ element, look for a ‘head’ element underneath, and a ‘title’ element underneath that ‘head’. The context node changes in each step. For example, the head node is the context node when the last step is being evaluated.

However, we usually don’t know or don’t care about the full explicit node-by-node path, we just care about the nodes with a given name. We can select them using:

//title

Which means: look in the whole tree, starting from the root of the tree (//) and select only those nodes whose name matches title. In this example, // is the axis and title is the node test.

In fact, the expressions we’ve just seen are using XPath’s abbreviated syntax. Translating //title to the full syntax we get:

/descendant-or-self::node()/child::title

So, // in the abbreviated syntax is short for descendant-or-self, which means the current node or any node below it in the tree. This part of the expression is called the axis and it specifies a set of nodes to select from, based on their direction on the tree from the current context (downwards, upwards, on the same tree level). Other examples of axes are: parent, child, ancestor, etc — we’ll dig more into this later on.

The next part of the expression, node(), is called a node test, and it contains an expression that is evaluated to decide whether a given node should be selected or not. In this case, it selects nodes of all types. Then we have another axis, child, which means go to the child nodes from the current context, followed by another node test, which selects the nodes named title.

So, the axis defines where in the tree the node test should be applied and the nodes that match the node test will be returned as a result.

You can test nodes against their name or against their type.

Here are some examples of name tests:

Expression    Meaning
/html         Selects the node named html, which is under the root.
/html/head    Selects the node named head, which is under the html node.
//title       Selects all the title nodes from the HTML tree.
//h2/a        Selects all a nodes which are directly under an h2 node.

And here are some examples of node type tests:

Expression    Meaning
//comment()   Selects only comment nodes.
//node()      Selects any kind of node in the tree.
//text()      Selects only text nodes, such as "This is the first paragraph".
//*           Selects all nodes, except comment and text nodes.

We can also combine name and node tests in the same expression. For example:

//p/text()

This expression selects the text nodes from inside p elements. In the HTML snippet shown above, it would select “This is the first paragraph.”.
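If you'd rather experiment from Python than from the playground, the parsel library (the selector engine behind Scrapy) can evaluate these expressions against the sample document. A quick sketch:

from parsel import Selector

html = """
<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath('/html/head/title/text()').extract())  # ['My page']
print(sel.xpath('//p/text()').extract())               # ['This is the first paragraph.']
print(sel.xpath('//h2/a').extract())                   # ['<a href="#">page</a>']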

Now, let’s see how we can further filter and specify things. Consider this HTML document:

<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2> ...</li>
    </ul>
  </body>
</html>

Say we want to select only the first li node from the snippet above. We can do this with:

//li[position() = 1]

The expression surrounded by square brackets is called a predicate and it filters the node set returned by //li (that is, all li nodes from the document) using the given condition. In this case it checks each node’s position using the position() function, which returns the position of the current node in the resulting node set (notice that positions in XPath start at 1, not 0). We can abbreviate the expression above to:

//li[1]

Both XPath expressions above would select the following element:

<li>Quote 1</li>

Check out a few more predicate examples:

Expression                       Meaning
//li[position() mod 2 = 0]       Selects the li elements at even positions.
//li[a]                          Selects the li elements which enclose an a element.
//li[a or h2]                    Selects the li elements which enclose either an a or an h2 element.
//li[ a [ text() = "link" ] ]    Selects the li elements which enclose an a element whose text is "link". Can also be written as //li[ a/text()="link" ].
//li[last()]                     Selects the last li element in the document.

So, a location path is basically composed of steps, which are separated by /, and each step can have an axis, a node test and a predicate. Here we have an expression composed of two steps, each one with an axis, a node test and a predicate:

//li[ 4 ]/h2[ text() = "Quote 4 title" ]

And here is the same expression, written using the non-abbreviated syntax:

/descendant-or-self::node()
    /child::li[ position() = 4 ]
        /child::h2[ text() = "Quote 4 title" ]

We can also combine multiple XPath expressions in a single one using the union operator |. For example, we can select all a and h2 elements in the document above using this expression:

//a | //h2

Now, consider this HTML document:

<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
    </ul>
  </body>
</html>

Say we want to select only the a elements whose link points to an HTTPS URL. We can do it by checking their href attribute:

//a[starts-with(@href, "https")]

This expression first selects all the a elements from the document and for each of those elements, it checks whether their href attribute starts with “https”. We can access any node attribute using the @attributename syntax.

Here we have a few additional examples using attributes:

Expression                         Meaning
//a[@href="https://scrapy.org"]    Selects the a elements pointing to https://scrapy.org.
//a/@href                          Selects the value of the href attribute from all the a elements in the document.
//li[@id]                          Selects only the li elements which have an id attribute.

More on Axes

We’ve seen only two types of axes so far:

  • descendant-or-self
  • child

But there’s plenty more where they came from and we’ll see a few examples. Consider this HTML document:

<html>
  <body>
    <p>Intro paragraph</p>
    <h1>Title #1</h1>
    <p>A random paragraph #1</p>
    <h1>Title #2</h1>
    <p>A random paragraph #2</p>
    <p>Another one #2</p>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>

Now we want to extract only the first paragraph after each of the titles. To do that, we can use the following-sibling axis, which selects all the siblings after the context node. Siblings are nodes that are children of the same parent; for example, all child nodes of the body tag are siblings. This is the expression:

//h1/following-sibling::p[1]

In this example, the context node where the following-sibling axis is applied to is each of the h1 nodes from the page.

What if we want to select only the text that is right before the footer? We can use the preceding-sibling axis:

//div[@id='footer']/preceding-sibling::text()[1]

In this case, we are selecting the first text node before the div footer (“A single paragraph, with no markup”).

XPath also allows us to select elements based on their text content. We can use such a feature, along with the parent axis, to select the parent of the p element whose text is “Footer text”:

//p[ text()="Footer text" ]/..

The expression above selects <div id="footer"><p>Footer text</p></div>. As you may have noticed, we used .. here as a shortcut to the parent axis.

As an alternative to the expression above, we could use:

//*[p/text()="Footer text"]

It selects, from all elements, the ones that have a p child whose text is “Footer text”, getting the same result as the previous expression.

You can find additional axes in the XPath specification: https://www.w3.org/TR/xpath/#axes

Wrap up

XPath is very powerful and this post is just an introduction to the basic concepts. If you want to learn more about it, the XPath specification linked above is a good place to start.

And stay tuned, because we will post a series with more XPath tips from the trenches in the following months.

How to Crawl the Web Politely with Scrapy

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners.

In this post we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers.

Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxs, WALL-Es, and R2-D2s rather than an apocalyptic wasteland of HAL 9000s, T-1000s, and Megatrons.

What Makes a Crawler Polite?

  • A polite crawler respects robots.txt
  • A polite crawler never degrades a website’s performance
  • A polite crawler identifies its creator with contact information
  • A polite crawler is not a pain in the buttocks of system administrators

robots.txt

Always make sure that your crawler follows the rules defined in the website’s robots.txt file. This file is usually available at the root of a website (www.example.com/robots.txt) and it describes what a crawler should or shouldn’t crawl according to the Robots Exclusion Standard. Some websites even use the crawlers’ user agent to specify separate rules for different web crawlers:

User-agent: Some_Annoying_Bot
Disallow: /

User-Agent: *
Disallow: /*.json
Disallow: /api
Disallow: /post
Disallow: /submit
Allow: /

Crawl-Delay

Mission critical to having a polite crawler is making sure your crawler doesn’t hit a website too hard. Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.

When a website gets overloaded with more requests than the web server can handle, it might become unresponsive. Don’t be that guy or girl that causes a headache for the website administrators.

User-Agent

However, if you have ignored the cardinal rules above (or your crawler has achieved aggressive sentience), there needs to be a way for the website owners to contact you. You can do this by including your company name and an email address or website in the request’s User-Agent header. For example, Google’s crawler user agent is “Googlebot”.

Scrapinghub Abuse Report Form

Hey folks using our Scrapy Cloud platform! We trust you will crawl responsibly, but to support website administrators, we provide an abuse report form where they can report any misbehaviour from crawlers running on our platform. We’ll kindly pass the message along so that you can modify your crawls and avoid ruining a sysadmin’s day. If your crawlers are turning into Skynet and running roughshod over human law, we reserve the right to halt their crawling activities and thus avert the robot apocalypse.

How to be Polite using Scrapy

Scrapy is a bit like Optimus Prime: friendly, fast, and capable of getting the job done no matter what. However, much like Optimus Prime and his fellow Autobots, Scrapy occasionally needs to be kept in check. So here’s the nitty gritty for ensuring that Scrapy is as polite as can be.

Robots.txt

Crawlers created using Scrapy 1.1+ already respect robots.txt by default. If your crawlers have been generated using a previous version of Scrapy, you can enable this feature by adding this in the project’s settings.py:

ROBOTSTXT_OBEY = True

Then, every time your crawler tries to download a page from a disallowed URL, you’ll see a message like this:

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>

Identifying your Crawler

It’s important to provide a way for sysadmins to easily contact you if they have any trouble with your crawler. If you don’t, they’ll have to dig into their logs and look for the offending IPs.

Be nice to the friendly sysadmins in your life and identify your crawler via the Scrapy USER_AGENT setting. Share your crawler name, company name and a contact email:

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

Introducing Delays

Scrapy spiders are blazingly fast. They can handle many concurrent requests and they make the most of your bandwidth and computing power. However, with great power comes great responsibility.

To avoid hitting the web servers too frequently, you need to use the DOWNLOAD_DELAY setting in your project (or in your spiders). Scrapy will then introduce a random delay ranging from 0.5 * DOWNLOAD_DELAY to 1.5 * DOWNLOAD_DELAY seconds between consecutive requests to the same domain. If you want to stick to the exact DOWNLOAD_DELAY that you defined, you have to disable RANDOMIZE_DOWNLOAD_DELAY.

By default, DOWNLOAD_DELAY is set to 0. To introduce a 5 second delay between requests from your crawler, add this to your settings.py:

DOWNLOAD_DELAY = 5.0
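And if you need the delay to be exactly 5 seconds rather than randomized, as mentioned above, disable the randomization as well:

RANDOMIZE_DOWNLOAD_DELAY = False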

If you have a multi-spider project crawling multiple sites, you can define a different delay for each spider with the download_delay (yes, it’s lowercase) spider attribute:

class MySpider(scrapy.Spider):
    name = 'myspider'
    download_delay = 5.0
    ...

Concurrent Requests Per Domain

Another setting you might want to tweak to make your spider more polite is the number of concurrent requests it will do for each domain. By default, Scrapy will dispatch at most 8 requests simultaneously to any given domain, but you can change this value by updating the CONCURRENT_REQUESTS_PER_DOMAIN setting.

Heads up, the CONCURRENT_REQUESTS setting defines the maximum amount of simultaneous requests that Scrapy’s downloader will do for all your spiders. Tweaking this setting is more about your own server performance / bandwidth than your target’s when you’re crawling multiple domains at the same time.

AutoThrottle to Save the Day

Websites vary drastically in the number of requests they can handle. Adjusting this manually for every website that you are crawling is about as much fun as watching paint dry. To save your sanity, Scrapy provides an extension called AutoThrottle.

AutoThrottle automatically adjusts the delays between requests according to the current web server load. It first calculates the latency from one request. Then it will adjust the delay between requests for the same domain in a way that no more than AUTOTHROTTLE_TARGET_CONCURRENCY requests will be simultaneously active. It also ensures that requests are evenly distributed in a given timespan.

To enable AutoThrottle, just include this in your project’s settings.py:

AUTOTHROTTLE_ENABLED = True

Scrapy Cloud users don’t have to worry about enabling it, because it’s already enabled by default.

There’s a wide range of settings to help you tweak the throttle mechanism, so have fun playing around!
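For reference, here is a slightly fuller settings.py sketch. These are real AutoThrottle settings, but the values shown are just starting points to experiment with, not recommendations:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay
AUTOTHROTTLE_MAX_DELAY = 60.0           # highest delay to back off to when latency spikes
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of parallel requests per remote server
AUTOTHROTTLE_DEBUG = True               # log every throttling decision while you tune things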

Use an HTTP Cache for Development

Developing a web crawler is an iterative process. However, running a crawler to check if it’s working means hitting the server multiple times for each test. To help you avoid this impolite activity, Scrapy provides a built-in middleware called HttpCacheMiddleware. You can enable it by including this in your project’s settings.py:

HTTPCACHE_ENABLED = True

Once enabled, it caches every request made by your spider along with the related response. So the next time you run your spider, it will not hit the server for requests already done. It’s a win-win: your tests will run much faster and the website will save resources.
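If you want finer control over the cache, a few related settings can be tuned too. An illustrative sketch (the values are examples, not recommendations):

HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'               # where cached responses are stored, inside the .scrapy dir
HTTPCACHE_EXPIRATION_SECS = 0             # 0 means cached responses never expire
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]  # don't cache server errors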

Don’t Crawl, use the API

Many websites provide HTTP APIs so that third parties can consume their data without having to crawl their web pages. Before building a web scraper, check if the target website already provides an HTTP API that you can use. If it does, go with the API. Again, it’s a win-win: you avoid digging into the page’s HTML and your crawler gets more robust because it doesn’t need to depend on the website’s layout.

Wrap Up

Let’s all do our part to keep the peace between sysadmins, website owners, and developers by making sure that our web crawling projects are as noninvasive as possible. Remember, we need to band together to delay the rise of our robot overlords, so let’s keep our crawlers, spiders, and bots polite.

To all website owners, help a crawler out and ensure your site has an HTTP API. And remember, if someone using our platform is overstepping their bounds, please fill out an Abuse Report form and we’ll take care of the issue.

For those new to our platform, Scrapy Cloud is forever free and is the peanut butter to Scrapy’s jelly. For our existing Scrapy and Scrapy Cloud users, hopefully you learned a few tips for how to both speed up your crawls and prevent abuse complaints. Let us know if you have any further suggestions in the comment section below!

Incremental Crawls with Scrapy and DeltaFetch

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.

Scrapy is designed to be extensible and loosely coupled with its components. You can easily extend Scrapy’s functionality with your own middleware or pipeline.

This makes it easy for the Scrapy community to develop new plugins that improve upon existing functionality, without making changes to Scrapy itself.

In this post we’ll show how you can leverage the DeltaFetch plugin to run incremental crawls.

Incremental Crawls

Some crawlers we develop are designed to crawl and fetch the data we need only once. On the other hand, many crawlers have to run periodically in order to keep our datasets up-to-date.

In many of these periodic crawlers, we’re only interested in new pages included since the last crawl. For example, we have a crawler that scrapes articles from a bunch of online media outlets. The spiders are executed once a day and they first retrieve article URLs from pre-defined index pages. Then they extract the title, author, date and content from each article. This approach often leads to many duplicate results and an increasing number of requests each time we run the crawler.

Fortunately, we are not the first ones to have this issue. The community already has a solution: the scrapy-deltafetch plugin. You can use this plugin for incremental (delta) crawls. DeltaFetch’s main purpose is to avoid requesting pages that have been already scraped before, even if it happened in a previous execution. It will only make requests to pages where no items were extracted before, to URLs from the spiders’ start_urls attribute or requests generated in the spiders’ start_requests method.

DeltaFetch works by intercepting every Item and Request object generated in spider callbacks. For Items, it computes the related request identifier (a.k.a. fingerprint) and stores it in a local database. For Requests, DeltaFetch computes the request fingerprint and drops the request if it already exists in the database.

Now let’s see how to set up Deltafetch for your Scrapy spiders.

Getting Started with DeltaFetch

First, install DeltaFetch using pip:

$ pip install scrapy-deltafetch

Then, you have to enable it in your project’s settings.py file:

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True

DeltaFetch in Action

The example project has a spider that crawls books.toscrape.com. It navigates through all the listing pages and visits every book details page to fetch data such as the book title, description and category. The crawler is executed once a day in order to capture new books that are added to the catalogue. There’s no need to revisit book pages that have already been scraped, because the data collected by the spider typically doesn’t change.

To see Deltafetch in action, clone this repository, which has DeltaFetch already enabled in settings.py, and then run:

$ scrapy crawl toscrape

Wait until it finishes and then take a look at the stats that Scrapy logged at the end:

2016-07-19 10:17:53 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/stored': 1000,
    ...
    'downloader/request_count': 1051,
    ...
    'item_scraped_count': 1000,
}

Among other things, you’ll see that the spider did 1051 requests to scrape 1000 items and that DeltaFetch stored 1000 request fingerprints. This means that only 51 page requests haven’t generated items and so they will be revisited next time.

Now, run the spider again and you’ll see a lot of log messages like this:

2016-07-19 10:47:10 [toscrape] INFO: Ignoring already visited: 
<GET http://books.toscrape.com/....../index.html>

And in the stats you’ll see that 1000 requests were skipped because items have been scraped from those pages in a previous crawl. Now the spider hasn’t extracted any items and it did only 51 requests, all of them to listing pages from where no items have been scraped before:

2016-07-19 10:47:10 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/skipped': 1000,
    ...
    'downloader/request_count': 51,
}

Changing the Database Key

By default, DeltaFetch uses a request fingerprint to tell requests apart. This fingerprint is a hash computed based on the canonical URL, HTTP method and request body.

Some websites have several URLs for the same data. For example, an e-commerce site could have the following URLs pointing to a single product:

  • www.example.com/product?id=123
  • www.example.com/deals?id=123
  • www.example.com/category/keyboards?id=123
  • www.example.com/category/gaming?id=123

Request fingerprints aren’t suitable in these situations as the canonical URL will differ despite the item being the same. In this example, we could use the product’s ID as the DeltaFetch key.
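To see the problem concretely, you can compare the default fingerprints of two of those URLs. A quick sketch using the request_fingerprint helper that Scrapy shipped at the time (newer Scrapy versions expose a fingerprinter class instead):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://www.example.com/product?id=123')
r2 = Request('http://www.example.com/deals?id=123')

# Different canonical URLs produce different fingerprints,
# even though both pages describe the same product.
print(request_fingerprint(r1) == request_fingerprint(r2))  # False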

DeltaFetch allows us to define custom keys by passing a meta parameter named deltafetch_key when initializing the Request:

from scrapy import Request
from w3lib.url import url_query_parameter

...

def parse(self, response):
    ...
    # Extract the href strings so we can build the custom key from the product id.
    for product_url in response.css('a.product_listing::attr(href)').extract():
        yield Request(
            product_url,
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page
        )
    ...

This way, DeltaFetch will ignore requests to duplicate pages even if they have different URLs.

Resetting DeltaFetch

If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:

$ scrapy crawl example -a deltafetch_reset=1

Using DeltaFetch on Scrapy Cloud

You can also use DeltaFetch in your spiders running on Scrapy Cloud. You just have to enable the DeltaFetch and DotScrapy Persistence addons in your project’s Addons page. The latter is required to allow your crawler to access the .scrapy folder, where DeltaFetch stores its database.

DeltaFetch is quite handy in situations such as the ones we’ve just seen. Keep in mind that DeltaFetch only avoids sending requests to pages that have generated scraped items before, and only if these requests were not generated from the spider’s start_urls or start_requests. Pages from which no items were directly scraped will still be crawled every time you run your spiders.

You can check out the project page on github for further information: http://github.com/scrapy-plugins/scrapy-deltafetch

Wrap-up

You can find many interesting Scrapy plugins in the scrapy-plugins page on Github and you can also contribute to the community by including your own plugin there.

If you have a question or a topic that you’d like to see in this monthly column, please drop a comment here letting us know or reach out to us via @scrapinghub on Twitter.

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web. This is so the general public can monitor what their representatives are doing on their behalf.

However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable.

Congress Bills in Peru

For the sake of transparency, Peruvian Congress provides a website where people can check the list of bills that are being processed, voted and eventually become law. For each bill, there’s a page with its authorship, title, submission date and a brief summary. These pages are frequently updated when bills are moved between commissions, approved and then published as laws.

By having all of this information online, lawyers and the general public can potentially inspect bills that could be the result of lobbying. In Peruvian history, many laws have been passed that benefited only one specific company or individual.

However, having transparency doesn’t mean it’s accessible. This site is very clunky, and the information for each bill is spread across several pages. It displays the bills in a very long list with far too many pages, and until very recently there has been no way to search for specific bills.

In the past, if you wanted to find a bill, you would need to look through several pages manually. This is very time consuming as there are around one thousand bills proposed every year. Not long ago, the site added a search tool, but it’s not user-friendly at all:

[Screenshot: the Congress site's search tool]

The Solution

My lawyer friends from the Peruvian NGOs Hiperderecho.org and Respeto.pe asked me about the possibility of building a web application. Their goal was to organize all the data from the Congress bills, allowing people to easily search and discover bills by keywords, authors and categories.

The first step in building this was to grab all bill data and metadata from the Congress website. Since they don’t provide an API, we had to use web scraping. For that, Scrapy is a champ.

I wrote several Scrapy spiders to crawl the Congress site and download as much data as possible. The spiders wake up every 8 hours and crawl the Congress pages looking for new bills. They parse the data they scrape and save it into a local PostgreSQL database.

Once we had achieved the critical step of getting all the data, it was relatively easy to build a search tool to navigate the 5400+ bills and counting. I used Django to create a simple interface for users, and so ProyectosDeLey.pe was born.

The Findings

All kinds of possibilities are open once we have the data. For example, we could now generate statistics on the status of the bills. We found that of the 5402 proposed bills, only 740 became laws, meaning most of the bills were rejected or forgotten on the pile and never processed.

Quick searches also revealed that many bills are not that useful. A bunch of them are only proposals to turn some specific days into “national days”.

There are proposals for national day of peace, “peace consolidation”, “peace and reconciliation”, Peruvian Coffee, Peruvian Cuisine, and also national days for several Peruvian produce.

There was even more than one bill proposing the celebration of the same thing, on the very same day. Organizing the bills into a database and building our search tool allowed people to discover these redundant and unnecessary bills.

Call In the Lawyers

After we aggregated the data into statistics, my lawyer friends found that the majority of bills are approved after only one round of voting. In the Peruvian legislation, dismissal of the second round of voting for any bill should be carried out only under exceptional circumstances.

However, the numbers show that the use of one round of voting has become the norm, as 88% of the bills approved were only voted on once. The second round of voting was created to compensate for the fact that the Peruvian Congress has only one chamber where all the decisions are made. It’s also expected that members of Congress should use the time between the first and second votes for further debate and consultation with advisers and outside experts.

Bonus

The nice thing about having such information in a well-structured machine-readable format, is that we can create cool data visualizations, such as this interactive timeline that shows all the events that happened for a given bill:

[Screenshot: interactive timeline of events for a bill]

Another cool thing is that this data allows us to monitor Congress’ activities. Our web app allows users to subscribe to an RSS feed in order to get the latest bills, hot off the Congress press. My lawyer friends use it to issue “Legal Alerts” on social media when a bill threatens to do more harm than good.

Wrap Up

People can build very useful tools with data available on the web. Unfortunately, government data often has poor accessibility and usability, making the transparency laws less useful than they should be. The work of volunteers is key in order to build tools that turn the otherwise clunky content into useful data for journalists, lawyers and regular citizens as well. Thanks to open source software such as Scrapy and Django, we can quickly grab the data and create useful tools like this.

See? You can help a lot of people by doing what you love! 🙂

Scraping Infinite Scrolling Pages

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics.

In the era of single page apps and tons of AJAX requests per page, a lot of websites have replaced “previous/next” pagination buttons with a fancy infinite scrolling mechanism. Websites using this technique load new items whenever the user scrolls to the bottom of the page (think Twitter, Facebook, Google Images). Even though UX experts maintain that infinite scrolling can overwhelm users with data, we’re seeing an increasing number of web pages resorting to presenting results as one unending list.

When developing our web scrapers, one of the first things we do is look for UI components with links that might lead us to the next page of results. Unfortunately, these links aren’t present on infinite scrolling web pages.

While this scenario might seem like a classic case for a JavaScript engine such as Splash or Selenium, it’s actually a simple fix. Instead of simulating user interaction with such engines, all you have to do is inspect your browser’s AJAX requests when you scroll the target page and then re-create those requests in your Scrapy spider.

Let’s use Spidy Quotes as an example and build a spider to get all the items listed on it.

Inspecting the Page

First things first, we need to understand how the infinite scrolling works on this page and we can do so by using the Network panel in the Browser’s developer tools. Open the panel and then scroll down the page to see the requests that the browser is firing:

[Screenshot: AJAX requests in the browser's Network panel]

Click on a request for a closer look. The browser sends a request to /api/quotes?page=x and then receives a JSON object like this in response:

{
   "has_next":true,
   "page":8,
   "quotes":[
      {
         "author":{
            "goodreads_link":"/author/show/1244.Mark_Twain",
            "name":"Mark Twain"
         },
         "tags":["individuality", "majority", "minority", "wisdom"],
         "text":"Whenever you find yourself on the side of the ..."
      },
      {
         "author":{
            "goodreads_link":"/author/show/1244.Mark_Twain",
            "name":"Mark Twain"
         },
         "tags":["books", "contentment", "friends"],
         "text":"Good friends, good books, and a sleepy ..."
      }
   ],
   "tag":null,
   "top_ten_tags":[["love", 49], ["inspirational", 43], ...]
}

This is the information we need for our spider. All it has to do is generate requests to “/api/quotes?page=x” for an increasing x until the has_next field becomes false. The best part of this is that we don’t even have to scrape the HTML contents to get the data we need. It’s all in a beautiful machine-readable JSON.

Building the Spider

Here is our spider. It extracts the target data from the JSON content returned by the server. This approach is easier and more robust than digging into the page’s HTML tree, trusting that layout changes will not break our spiders.

import json
import scrapy
 
 
class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5
 
    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

To further practice this tip, you can experiment with building a spider for our blog since it also uses infinite scrolling to load older posts.

Wrap Up

If you were feeling daunted by the prospect of scraping infinite scrolling websites, hopefully you’re feeling a bit more confident now. The next time that you have to deal with a page based on AJAX calls triggered by user actions, take a look at the requests that your browser is making and then replay them in your spider. The response is usually in a JSON format, making your spider even simpler.

And that’s it for June! Please let us know what you would like to see in future columns by reaching out on Twitter. We also recently released a Datasets Catalog, so if you’re stumped on what to scrape, take a look for some inspiration.

How to Debug your Scrapy Spiders

Welcome to Scrapy Tips from the Pros! Every month we release a few tricks and hacks to help speed up your web scraping and data extraction activities. As the lead Scrapy maintainers, we have run into every obstacle you can imagine so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with suggestions for future topics.

Your spider isn’t working and you have no idea why. One way to quickly spot potential issues is to add a few print statements to find out what’s happening. This is often my first step and sometimes all I need to do to uncover the bugs that are preventing my spider from running properly. If this method works for you, great, but if it’s not enough, then read on to learn about how to deal with the nastier bugs that require a more thorough investigation. In this post, I’ll introduce you to the tools that should be in the toolbelt of every Scrapy user when it comes to debugging spiders.

Scrapy Shell is your Best Friend

Scrapy Shell is a full-featured Python shell loaded with the same context that you would get in your spider callback methods. You just have to provide a URL and Scrapy Shell will let you interact with the same objects that your spider handles in its callbacks, including the response object.

$ scrapy shell http://blog.scrapinghub.com
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f0638a2cbd0>
[s]   item       {}
[s]   request    <GET http://blog.scrapinghub.com>
[s]   response   <200 https://blog.scrapinghub.com/>
[s]   settings   <scrapy.settings.Settings object at 0x7f0638a2cb50>
[s]   spider     <DefaultSpider 'default' at 0x7f06371f3290>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

After loading it, you can start playing around with the response in order to build the selectors to extract the data that you need:

>>> response.css("div.post-header > h2 ::text").extract()
...

If you’re not familiar with Scrapy Shell, give it a try. It fits perfectly into your development workflow, right after inspecting the page in your browser: you can create and test your spider’s extraction rules interactively and copy them into your spider’s code once they work.

Learn more about Scrapy Shell through the official documentation.
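Once a selector behaves the way you want in the shell, carrying it over to a spider is mostly copy and paste. Here is a minimal sketch; the spider name is made up for the example:

import scrapy


class BlogTitlesSpider(scrapy.Spider):
    name = 'blog_titles'  # hypothetical spider, just to show the workflow
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # the exact expression that was tested interactively in Scrapy Shell
        for title in response.css('div.post-header > h2 ::text').extract():
            yield {'title': title}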

Start Scrapy Shell from your Spider Code

If your spider has been behaving unexpectedly for certain responses, you can quickly see what’s happening using the scrapy.shell.inspect_response method in your spider code. This will open a Scrapy shell session that will let you interact with the current response object.

For example, imagine that your spider is not extracting the expected amount of items from certain pages and you want to see what’s wrong with the response returned by the website:

from scrapy.shell import inspect_response
 
class BlogSpider(scrapy.Spider):
    ...
    def parse(self, response):
        if len(response.css('div.post-header > h2 ::text')) >= EXPECTED:
            # generate the items
            ...
        else:
            # fewer items than expected: open Scrapy Shell to inspect this response
            inspect_response(response, self)
        ...

Once the execution hits the inspect_response call, Scrapy Shell is opened and you can interact with the response to see what’s happening.

Quickly Attaching a Debugger to your Spider

Another approach to debugging spiders is to use a regular Python debugger such as pdb or PuDB. I use PuDB because it’s quite a powerful yet easy-to-use debugger and all I need to do to activate it is to put this code in the line where I want a breakpoint:

import pudb; pudb.set_trace()

And when the breakpoint is reached, PuDB opens up a cool text-mode UI in your terminal that will bring back fond memories from the old days of using the Turbo Pascal debugger.

You can install PuDB using pip:

$ pip install pudb

Check out this video where our very own @eliasdorneles demonstrates a few tips on how to use PuDB: https://vimeo.com/166584837

Scrapy parse CLI command

There are certain scraping projects where you need your spiders to run for a long time. However, after a few hours of running, you might sadly see in the logs that one of your spiders had issues scraping specific URLs. You want to debug the spider, but you certainly don’t want to run the whole crawling process again and have to wait until that specific callback is called for that specific URL so that you can start your debugger.

Don’t worry, the parse command from Scrapy CLI is here to save the day! You just need to provide the spider name, the callback from the spider that should be used and the URL that you want to parse:

$ scrapy parse http://blog.scrapinghub.com/comments/bla --spider blog -c parse_comments

In this case, Scrapy is going to call the parse_comments method from the blog spider to parse the blog.scrapinghub.com/comments/bla URL. If you don’t specify the spider, Scrapy will search for a spider capable of handling this URL in your project based on the spiders’ allowed_domains settings.

It will then show you a summary of your callback’s execution:

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'comments': [
    {'content': u"I've seen this language ...",
     'username': u'forthemostpart'},
    {'content': u"It's a ...",
     'username': u'YellowAfterlife'},
    ...
    {'content': u"There is a macro for ...",
    'username': u'mrcdk'}]}]
# Requests  -----------------------------------------------------------------
[]

You can also attach a debugger inside the method to help you figure out what’s happening (see the previous tip).

Scrapy fetch and view commands

Inspecting page contents in the browser can be deceiving, since the browser’s JavaScript engine may render content that the Scrapy downloader will not. If you want to quickly check exactly how a page will look when downloaded by Scrapy, you can use these commands:

  • fetch: downloads the HTML using Scrapy Downloader and prints to stdout.
  • view: downloads the HTML using Scrapy Downloader and opens it with your default browser.

Examples:

$ scrapy fetch http://blog.scrapinghub.com > blog.html
$ scrapy view http://scrapy.org

Post-Mortem Debugging Your Spiders with the --pdb Option

Writing fail-proof software is nearly impossible. The situation is even worse for web scrapers, since they deal with web content that is frequently changing (and breaking). It’s better to accept that our spiders will eventually fail and to make sure that we have the tools to quickly understand why they broke and to fix them as soon as possible.

Python tracebacks are great, but in some cases they don’t provide us with enough information about what happened in our code. This is where post-mortem debugging comes into play. Scrapy provides the --pdb command line option that fires a pdb session right where your crawler has broken, so you can inspect its context and understand what happened:

$ scrapy crawl blog -o blog_items.jl --pdb

If your spider dies due to a fatal exception, the pdb debugger will open and you can thoroughly inspect its cause of death.

Wrap-up

And that’s it for the Scrapy Tips from the Pros May edition. Some of these debugging tips are also available in the official Scrapy documentation.

Please let us know what you’d like to see in the future since we’re here to help you scrape the web more effectively. We’ll see you next month!

 

Scrapy Tips from the Pros: March 2016 Edition



Welcome to the March Edition of Scrapy Tips from the Pros! Each month we’ll release a few tips and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we’ll cover how to use a cookiejar with the CookiesMiddleware to get around websites that won’t allow you to crawl multiple pages at the same time using the same cookie. We’ll also share a handy tip on how to use multiple fallback XPath/CSS expressions with item loaders to get data from websites more reliably.

Students reading this: we are participating in Google Summer of Code 2016 and some of our project ideas involve Scrapy! If you’re interested, take a look at our ideas and remember to apply before Friday, March 25!

If you are not a student, please share this with your student friends. They could get a summer stipend and we might even hire them at the end.

Work Around Sites With Weird Session Behavior Using a CookieJar

Websites that store your UI state on their server’s sessions are a pain to navigate, let alone scrape. Have you ever run into websites where one tab affects the other tabs open on the same site? Then you’ve probably run into this issue.

While this is frustrating for humans, it’s even worse for web crawlers. It can severely hinder a web crawling session. Unfortunately, this is a common pattern for ASP.Net and J2EE-based websites. And that’s where cookiejars come in. While the cookiejar is not a frequent need, you’ll be so glad that you have it for those unexpected cases.

When your spider crawls a website, Scrapy automatically handles the cookie for you, storing and sending it in subsequent requests to the same site. But, as you may know, Scrapy requests are asynchronous. This means that you probably have multiple requests being handled concurrently to the same website while sharing the same cookie. To avoid having requests affect each other when crawling these types of websites, you must set different cookies for different requests.

You can do this by using a cookiejar to store separate cookies for different pages in the same website. The cookiejar is just a key-value collection of cookies that Scrapy keeps during the crawling session. You just have to define a unique identifier for each of the cookies that you want to store and then use that identifier when you want to use that specific cookie.

For example, say you want to crawl multiple categories on a website, but this website stores the data related to the category that you are crawling/browsing in the server session. To crawl the categories concurrently, you would need to create a cookie for each category by passing the category name as the identifier to the cookiejar meta parameter:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    urls = [
        'http://www.example.com/category/photo',
        'http://www.example.com/category/videogames',
        'http://www.example.com/category/tablets'
    ]

    def start_requests(self):
        for url in self.urls:
            # use the category name as the cookiejar key so that each
            # category gets its own, separate cookie
            category = url.split('/')[-1]
            yield scrapy.Request(url, meta={'cookiejar': category})

Three different cookies will be managed in this case (‘photo’, ‘videogames’ and ‘tablets’). A new cookie is created whenever you pass a nonexistent key as the cookiejar meta value (for example, when a category hasn’t been visited yet). When the key you pass already exists, Scrapy uses the corresponding cookie for that request.

So, if you want to reuse the cookie that has been used to crawl the ‘videogames’ page, for example, you just need to pass ‘videogames’ as the unique key to the cookiejar. Instead of creating a new cookie, it will use the existing one:

yield scrapy.Request('http://www.example.com/atari2600', meta={'cookiejar': 'videogames'})
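One detail worth remembering: the cookiejar meta key is not sticky, so if your callback yields follow-up requests that should keep using the same cookie, you have to pass the key along yourself. A small sketch of what that looks like inside a callback (the link selector is just an example):

def parse(self, response):
    for href in response.css('a.product ::attr(href)').extract():
        # pass the same cookiejar key along so the follow-up request reuses this cookie
        yield scrapy.Request(
            response.urljoin(href),
            callback=self.parse,
            meta={'cookiejar': response.meta['cookiejar']},
        )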

Adding Fallback CSS/XPath Rules

Item Loaders are useful when you need to accomplish more than simply populating a dictionary or an Item object with the data collected by your spider. For example, you might need to add some post-processing logic to the data that you just collected, anything from something as simple as capitalizing every word in a title to more complex operations. With an ItemLoader, you can decouple this post-processing logic from the spider in order to have a more maintainable design.
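As a quick illustration of that decoupling (the loader name and processors below are our own example, not part of the tip that follows), a title-cleaning step can live in the loader instead of the spider callback:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


class TitleCasingItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # post-processing happens here, not in the spider: strip whitespace, then title-case
    name_in = MapCompose(lambda value: value.strip().title())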

This tip shows you how to add extra functionality to an Item Loader. Let’s say that you are crawling Amazon.com and extracting the price for each product. You can use an Item Loader to populate a ProductItem object with the product data:

import scrapy
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()


class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        ...

    def parse_product(self, response):
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_css('price', '#priceblock_ourprice ::text')
        loader.add_css('name', '#productTitle ::text')
        loader.add_value('url', response.url)
        yield loader.load_item()

This method works pretty well, unless the scraped product is a deal. This is because Amazon represents deal prices in a slightly different format than regular prices. While the price of a regular product is represented like this:

<span id="priceblock_ourprice" class="a-size-medium a-color-price">
    $699.99
</span>

The price of a deal is shown slightly differently:

<span id="priceblock_dealprice" class="a-size-medium a-color-price">
    $649.99
</span>

A good way to handle situations like this is to add a fallback rule for the price field in the Item Loader: a rule that is applied only if the previous rules for that field have failed. To accomplish this, you can add an add_fallback_css method to your loader:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class AmazonItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    def get_collected_values(self, field_name):
        return (self._values[field_name]
                if field_name in self._values
                else self._values.default_factory())

    def add_fallback_css(self, field_name, css, *processors, **kw):
        if not any(self.get_collected_values(field_name)):
            self.add_css(field_name, css, *processors, **kw)

As you can see, the add_fallback_css method will use the CSS rule if there are no previously collected values for that field. Now, we can change our spider to use AmazonItemLoader and then add the fallback CSS rule to our loader:

def parse_product(self, response):
    loader = AmazonItemLoader(item=ProductItem(), response=response)
    loader.add_css('price', '#priceblock_ourprice ::text')
    loader.add_fallback_css('price', '#priceblock_dealprice ::text')
    loader.add_css('name', '#productTitle ::text')
    loader.add_value('url', response.url)
    yield loader.load_item()

This tip can save you time and make your spiders much more robust. If one CSS rule fails to get the data, there will be other rules that can be applied which will extract the data you need.
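Since the section title mentions XPath as well: the same pattern carries over unchanged. Here is a sketch of an add_fallback_xpath counterpart that extends the AmazonItemLoader above (the class and method names are ours, not part of Scrapy):

class AmazonXPathFallbackLoader(AmazonItemLoader):
    def add_fallback_xpath(self, field_name, xpath, *processors, **kw):
        # same check as add_fallback_css, but driven by an XPath expression
        if not any(self.get_collected_values(field_name)):
            self.add_xpath(field_name, xpath, *processors, **kw)

It is used just like the CSS version:

loader.add_xpath('price', '//span[@id="priceblock_ourprice"]/text()')
loader.add_fallback_xpath('price', '//span[@id="priceblock_dealprice"]/text()')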

If Item Loaders are new to you, check out the documentation.

Wrap Up

And there you have it! Please share any and all problems that you’ve run into while web scraping and extracting data. We’re always on the lookout for new tips and hacks to share in our Scrapy Tips from the Pros monthly column. Hit us up on Twitter or Facebook and let us know if we’ve helped your workflow.

And if you haven’t yet, give Portia, our open source visual web scraping tool, a try. We know you’re attached to Scrapy, but it never hurts to experiment with your stack 😉

Please apply to join us for Google Summer of Code 2016 by Friday, March 25!

Migrate your Kimono Projects to Portia


Heads up, Kimono Labs users!

Today, we are releasing a tool to help you migrate your Kimono projects to Portia.


All you have to do is provide your Kimono credentials and let it convert your Kimono projects into Portia projects.

You will then be able to run those projects on Scrapy Cloud or on your own Portia instance, since Portia is open source.

Stay tuned for the Portia 2.0 beta release coming soon!

Portia 2.0 comes with a brand new user interface that we designed based on a meticulous study of real-world users led by our UX team, and of course all existing Portia spiders will continue to be supported.

It also includes some long-awaited features:

  • Automatic detection of similar items on a page
  • Extraction of items with nested structure
  • A single attribute can now be assigned to multiple fields

We’re excited to ensure that your work continues uninterrupted.
