
Scrapy Tips from the Pros: May 2016 Edition

Welcome to Scrapy Tips from the Pros! Every month we release a few tricks and hacks to help speed up your web scraping and data extraction activities. As the lead Scrapy maintainers, we have run into every obstacle you can imagine, so don’t worry: you’re in great hands. Feel free to reach out to us on Twitter or Facebook with suggestions for future topics.


How to Debug Your Spiders

Your spider isn’t working and you have no idea why. One way to quickly spot potential issues is to add a few print statements to find out what’s happening. This is often my first step and sometimes all I need to do to uncover the bugs that are preventing my spider from running properly. If this method works for you, great, but if it’s not enough, then read on to learn about how to deal with the nastier bugs that require a more thorough investigation. In this post, I’ll introduce you to the tools that should be in the toolbelt of every Scrapy user when it comes to debugging spiders.

Scrapy Shell is your Best Friend

Scrapy Shell is a full-featured Python shell loaded with the same context that you would get in your spider callback methods. You just have to provide a URL and Scrapy Shell will let you interact with the same objects that your spider handles in its callbacks, including the response object.

$ scrapy shell http://blog.scrapinghub.com
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f0638a2cbd0>
[s]   item       {}
[s]   request    <GET http://blog.scrapinghub.com>
[s]   response   <200 https://blog.scrapinghub.com/>
[s]   settings   <scrapy.settings.Settings object at 0x7f0638a2cb50>
[s]   spider     <DefaultSpider 'default' at 0x7f06371f3290>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

After loading it, you can start playing around with the response in order to build the selectors to extract the data that you need:

>>> response.css("div.post-header > h2 ::text").extract()
...

If you’re not familiar with Scrapy Shell, give it a try. It’s a perfect fit for your development workflow, sitting right after the page inspection in the browser: you can create and test your spider’s extraction rules there and, once they work, copy them into your spider’s code.
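For instance, once the selector above works in the Shell, a minimal callback using it might look like this (just a sketch; the item key is made up):

def parse(self, response):
    # The same selector that was tested in Scrapy Shell above.
    for title in response.css("div.post-header > h2 ::text").extract():
        yield {'title': title}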

Learn more about Scrapy Shell through the official documentation.

Start Scrapy Shell from your Spider Code

If your spider has been behaving unexpectedly for certain responses, you can quickly see what’s happening using the scrapy.shell.inspect_response method in your spider code. This will open a Scrapy shell session that will let you interact with the current response object.

For example, imagine that your spider is not extracting the expected amount of items from certain pages and you want to see what’s wrong with the response returned by the website:


from scrapy.shell import inspect_response

class BlogSpider(scrapy.Spider):
    ...
    def parse(self, response):
        if len(response.css('div.post-header > h2 ::text')) >= EXPECTED:
            ...  # generate the items
        else:
            inspect_response(response, self)
        ...

Once the execution hits the inspect_response call, Scrapy Shell is opened and you can interact with the response to see what’s happening.

Quickly Attaching a Debugger to your Spider

Another approach to debugging spiders is to use a regular Python debugger such as pdb or PuDB. I use PuDB because it’s quite a powerful yet easy-to-use debugger and all I need to do to activate it is to put this code in the line where I want a breakpoint:

import pudb; pudb.set_trace()
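For example, dropping the breakpoint into a spider callback might look like this (a sketch; the spider and selector are illustrative):

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        import pudb; pudb.set_trace()  # execution pauses here and PuDB takes over the terminal
        for title in response.css('div.post-header > h2 ::text').extract():
            yield {'title': title}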

And when the breakpoint is reached, PuDB opens up a cool text-mode UI in your terminal that will bring back fond memories from the old days of using the Turbo Pascal debugger.

Take a look:

[Screenshot: PuDB's text-mode debugger UI]

You can install PuDB using pip:

$ pip install pudb

Check out this video where our very own @eliasdorneles demonstrates a few tips on how to use PuDB: https://vimeo.com/166584837

Scrapy parse CLI command

There are certain scraping projects where you need your spiders to run for a long time. However, after a few hours of running, you might sadly see in the logs that one of your spiders had issues scraping specific URLs. You want to debug the spider, but you certainly don’t want to run the whole crawling process again and have to wait until that specific callback is called for that specific URL so that you can start your debugger.

Don’t worry, the parse command from Scrapy CLI is here to save the day! You just need to provide the spider name, the callback from the spider that should be used and the URL that you want to parse:

$ scrapy parse https://blog.scrapinghub.com/comments/bla --spider blog -c parse_comments

In this case, Scrapy is going to call the parse_comments method from the blog spider to parse the blog.scrapinghub.com/comments/bla URL. If you don’t specify the spider, Scrapy will search for a spider capable of handling this URL in your project based on the spiders’ allowed_domains settings.
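For reference, the parse_comments callback being exercised here might look roughly like the following (purely illustrative; the real blog spider's selectors aren't shown in this post, so the CSS classes below are made up):

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blog'
    allowed_domains = ['blog.scrapinghub.com']

    def parse_comments(self, response):
        # Build one item holding all the comments found on the page.
        yield {
            'comments': [
                {'content': comment.css('.content ::text').extract_first(),
                 'username': comment.css('.username ::text').extract_first()}
                for comment in response.css('div.comment')
            ]
        }

Running the scrapy parse command above would invoke this callback on the downloaded response.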

It will then show you a summary of your callback’s execution:

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'comments': [
    {'content': u"I've seen this language ...",
     'username': u'forthemostpart'},
    {'content': u"It's a ...",
     'username': u'YellowAfterlife'},
    ...
    {'content': u"There is a macro for ...",
    'username': u'mrcdk'}]}]
# Requests  -----------------------------------------------------------------
[]

You can also attach a debugger inside the method to help you figure out what’s happening (see the previous tip).

Scrapy fetch and view commands

Inspecting page content in the browser can be deceiving, since the browser's JavaScript engine may render content that the Scrapy downloader will not. If you want to quickly check exactly how a page will look when downloaded by Scrapy, you can use these commands:

  • fetch: downloads the HTML using Scrapy Downloader and prints to stdout.
  • view: downloads the HTML using Scrapy Downloader and opens it with your default browser.

Examples:

$ scrapy fetch http://blog.scrapinghub.com > blog.html
$ scrapy view http://scrapy.org

Post-Mortem Debugging of Spiders with the --pdb Option

Writing fail-proof software is nearly impossible. This situation is worse for web scrapers, since they deal with web content that is frequently changing (and breaking). It’s better to accept that our spiders will eventually fail and to make sure that we have the tools to quickly understand why they broke and to fix them as soon as possible.

Python tracebacks are great, but in some cases they don’t provide us with enough information about what happened in our code. This is where post-mortem debugging comes into play. Scrapy provides the --pdb command line option that fires a pdb session right where your crawler has broken, so you can inspect its context and understand what happened:

$ scrapy crawl blog -o blog_items.jl --pdb

If your spider dies due to a fatal exception, the pdb debugger will open and you can thoroughly inspect its cause of death.

Wrap-up

And that’s it for the Scrapy Tips from the Pros May edition. Some of these debugging tips are also available in the official Scrapy documentation.

Please let us know what you’d like to see in the future since we’re here to help you scrape the web more effectively. We’ll see you next month!

 

Scrapy + MonkeyLearn: Textual Analysis of Web Data

We recently announced our integration with MonkeyLearn, bringing machine learning to Scrapy and Portia. MonkeyLearn offers numerous text analysis services via its API. Since there are so many uses for this platform addon, we’re launching a series of tutorials to help get you started.


To kick off the MonkeyLearn Addon Tutorial series, let’s start with something we can all identify with: shopping. Whether you need to buy something for yourself, friends or family, or even the office, you need to evaluate cost, quality, and reviews. And when you’re working on a budget of both money and time, it can be helpful to automate the process with web scraping.

When scraping shopping and e-commerce sites, you’re most likely going to want product categories. Typically, you’d do this using the breadcrumbs. However, the challenge comes when you want to scrape several websites at once while keeping categories consistent throughout.

This is where MonkeyLearn comes in. You can use their Retail Classifier to classify products based on their descriptions, taking away the ambiguity of varied product categories.

This post will walk you through how to use MonkeyLearn’s Retail Classifier through the MonkeyLearn addon on Scrapy Cloud to scrape and categorise products from an online retailer.

Say Hello to Scrapy Cloud 2.0

For those new readers, Scrapy Cloud is our cloud-based platform that lets you easily deploy and run Scrapy and Portia web spiders without needing to deal with servers, libraries and dependencies, scheduling, storage, or monitoring. Scrapy Cloud recently underwent an upgrade and now features Docker support and a whole host of other updates.

In this tutorial, we’re using Scrapy to crawl and extract data. Scrapy’s decoupled architecture lets you use ready-made integrations for your spiders. The MonkeyLearn addon implements a Scrapy middleware. The addon takes every item scraped and sends the fields of your choice to MonkeyLearn for analysis. The classifier then stores the resulting category in another field of your choice. This lets you classify items without any extra code.

If you are a new user, sign up for Scrapy Cloud for free to continue on with this addon tutorial.

Meet MonkeyLearn’s Retail Classifier

We’ll begin by trying out the MonkeyLearn Retail Classifier with a sample description:

Enjoy speedy Wi-Fi around your home with this NETGEAR Nighthawk X4 AC2350 R7500-100NAS router, which features 4 high-performance antennas and Beamforming+ technology for optimal wireless range. Dynamic QoS prioritization automatically adjusts bandwidth.

Paste this sample into the test form under the Sandbox > Classify tab and hit Submit.

You should get results like the following:

[Screenshot: the classifier's predicted categories and probabilities]

MonkeyLearn’s engine analyzed the description and identified that the product belongs in the Electronics / Computers / Networking / Routers categories. As a bonus, it specifies how sure it is of its predictions.

The same example using curl would be:

curl --data '{"text_list": ["Enjoy speedy Wi-Fi around your home with this NETGEAR Nighthawk X4 AC2350 R7500-100NAS router, which features 4 high-performance antennas and Beamforming+ technology for optimal wireless range. Dynamic QoS prioritization automatically adjusts bandwidth."]}' \
-H "Authorization:Token <YOUR TOKEN GOES HERE>" \
-H "Content-Type: application/json" \
-D - \
"https://api.monkeylearn.com/v2/classifiers/cl_oFKL5wft/classify/?"

You can sign up for free on MonkeyLearn and replace <YOUR TOKEN GOES HERE> with your own API token to play with the retail classifier further.
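If you prefer Python to curl, here is a rough equivalent using the requests library (a sketch; it assumes requests is installed and simply mirrors the endpoint and payload from the curl command above):

import requests

MONKEYLEARN_TOKEN = '<YOUR TOKEN GOES HERE>'

description = (
    "Enjoy speedy Wi-Fi around your home with this NETGEAR Nighthawk X4 "
    "AC2350 R7500-100NAS router, which features 4 high-performance antennas "
    "and Beamforming+ technology for optimal wireless range. Dynamic QoS "
    "prioritization automatically adjusts bandwidth."
)

response = requests.post(
    'https://api.monkeylearn.com/v2/classifiers/cl_oFKL5wft/classify/',
    json={'text_list': [description]},
    headers={'Authorization': 'Token {}'.format(MONKEYLEARN_TOKEN)},
)
print(response.json())  # the predicted categories and their probabilities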

Using MonkeyLearn with a Scrapy Cloud Project

Now we are going to deploy a Scrapy project to Scrapy Cloud and use the MonkeyLearn addon to categorize the scraped data. You can clone this project, or build your own spiders, and follow the steps described below.

1. Build your Scrapy spiders

For this tutorial, we built a spider for a fictional e-commerce website. The spider is pretty straightforward so that you can easily clone the whole project and try it by yourself. However, you should be aware of some details when building a spider from scratch to use with the MonkeyLearn addon.

First, the addon requires your spiders to generate Item objects from a pre-defined Item class. In our case, it’s the ProductItem class:

class ProductItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    category = scrapy.Field()

Second, you have to declare an additional field in your Item class where the MonkeyLearn addon will store the analysis results. For our spider, these results will be stored in the category field of each of the items scraped.
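For reference, a spider yielding these items might look roughly like this (a sketch: the start URL, the selectors and the items module path are illustrative, and the category field is left unset so the addon can fill it in):

import scrapy

from myproject.items import ProductItem  # adjust to your project's items module

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['http://example.com/products']  # hypothetical listing page

    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['url'] = response.urljoin(product.css('a ::attr(href)').extract_first())
            item['title'] = product.css('h2 ::text').extract_first()
            item['description'] = product.css('p.description ::text').extract_first()
            # 'category' stays empty: the MonkeyLearn addon stores the classifier's result there.
            yield item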

2. Set up shub

shub is the command line tool for managing your Scrapy Cloud services, and you will use it to deploy your Scrapy projects there. You can install it with pip:

$ pip install shub

Now authenticate yourself on Scrapy Cloud:

$ shub login
Enter your API key from https://dash.scrapinghub.com/account/apikey
API key: <YOUR SCRAPINGHUB API KEY GOES HERE>
Validating API key...
API key is OK, you are logged in now.

You can get your API key in your account profile page.

3. Deploy your project

First go to Scrapy Cloud’s web dashboard and create a project there.

Then return to your command line terminal, go to your local project’s folder and run shub deploy. It will ask you what the target project id is (i.e. the Scrapy Cloud project that you want to deploy your spider to). You can get this information through the Code & Deploys link on your project page.

$ cd product-crawler
$ shub deploy
Target project ID: <YOUR PROJECT ID>

Now your project is ready to run in Scrapy Cloud.

4. Enable the MonkeyLearn addon on Scrapy Cloud

Note that before you enable the addon, you have to create an account on MonkeyLearn.

To enable the addon, head to the Addons Setup section in your Scrapy Cloud project’s settings:

[Screenshot: the Addons Setup section]

You can configure the addon with the following settings:

  • MonkeyLearn token: your MonkeyLearn API token. You can access it from your account settings on the MonkeyLearn website.
  • MonkeyLearn field to process: a list of item text fields (separated by commas) that will be used as input for the classifier. In this tutorial it is: title,description.
  • MonkeyLearn field output: the name of the new field that will be added to your items in order to store the categories returned by the classifier.
  • MonkeyLearn module: the id of the classifier that you are going to use. In this tutorial, the id is ‘cl_oFKL5wft’.
  • MonkeyLearn batch size: the number of items the addon will batch together before sending them to MonkeyLearn for analysis.

You can find the id of any classifier in the URL:

[Screenshot: the classifier id highlighted in the MonkeyLearn URL]

When you’re done filling out all the fields, the addon configuration should look something like this:

[Screenshot: the completed MonkeyLearn addon configuration]

5. Run your Spiders

Now that you have the Retail Classifier enabled, run the spider by going to your project’s Jobs page. Click ‘Run Spider’, select the spider and then confirm.

Give the spider a couple of minutes to gather results. You can then view the job’s items and you should see that the category field has been filled by MonkeyLearn:

[Screenshot: scraped items with the category field filled in by MonkeyLearn]

You can then download the results as a JSON or XML file and group the products by the categories and probabilities returned by the addon.

Wrap Up

Using MonkeyLearn’s Retail Classifier with Scrapy or Portia on Scrapy Cloud allows you to immediately analyze your data for easier categorization and analysis. So the next time you’ve got a massive list of people to shop for, try using immediate textual analysis with web scraping to simplify the process.

We’ll continue the series with walkthroughs on using the MonkeyLearn addon for language detection, sentiment analysis, keyword extraction, or any custom classification or extraction that you may need personally or professionally. We’ll explore different uses and hopefully help you make the most of this new platform integration.

If you haven’t already, sign up for MonkeyLearn (for free) and sign up for the newly upgraded Scrapy Cloud (for free) and get to experimenting.

Introducing Scrapy Cloud 2.0

Scrapy Cloud has been with Scrapinghub since the beginning, but we decided some spring cleaning was in order. To that end, we’re proud to announce Scrapy Cloud 2.0!

This overhaul will help you improve and scale your web scraping projects. Among other perks, our upgraded cloud-based platform includes a brand new and much more flexible architecture based on containers.

While much of this upgrade is behind the scenes, what matters most to you are the pricing changes, the introduction of Docker support, and a more efficient allocation of resources.

Without further ado, let’s dive right into how Scrapy Cloud 2.0 will positively (hopefully) affect you!

What’s NOT Changing

I know this is a weird one to start with, but we wanted to show you how the awesome base of Scrapy Cloud 1.0 will still be a part of the 2.0 release. You will still be able to:

  • Deploy and schedule your spiders
  • Download your results in friendly formats such as CSV and JSON
  • Monitor your crawls and view logs and results in a clean, easy-to-use UI
  • Use our Portia integration to scrape data straight from your web browser
  • Configure Machine Learning addons like MonkeyLearn for extra functionality

And the major feature that is NOT changing is that you can still use Scrapy Cloud for free with all the perks that you’ve been enjoying including unlimited team members, unlimited projects, and no credit card required.

So for those who are worried, rest assured, this is just an improvement on the awesome base that made Scrapy Cloud a staple of Scrapinghub.

New Pricing Model

First things first, the price tag. A major benefit of Scrapy Cloud 2.0 is our new pricing model. We’re switching to customizable subscriptions so that you have greater flexibility in picking the number of container units (computing resources) that you need to run your jobs. There are no more standardized plans; you just build a subscription that is customized to you. Plus, this model will save you money since you can tailor monthly units to your scraping needs.

Each Container Unit provides 1GB of RAM, 2.5GB of disk space and roughly 3x the computing power of similar plans from Scrapy Cloud 1.0. One unit costs $9 per month and you can purchase as many units as you like, allocate them to your organization’s spiders and upgrade whenever you want.

When you sign up for our platform, you are immediately enrolled in a free subscription that contains 1 unit. Each job running on this free account will be able to run for up to 24 hours and has a data retention of 7 days.

And much as we love talking with our customers, Scrapy Cloud 2.0 makes upgrading your account easier and without needing to go through support. All you have to do is go to the Billing Portal for your organization (accessible via the billing page of your organization profile):

[Screenshot: the organization's Billing page]

And then change the amount of Container Units to fit your needs, as shown in the screenshot below:

[Screenshot: adjusting the number of Container Units in the Billing Portal]

Once you purchase container units, your data retention period will be increased to 120 days and there will be no more limits to the running time.

Resource Allocation

Scrapy Cloud 2.0 features a new resource management model which provides you with more resources for the same price. For example, using Scrapy Cloud 1.0 would cost $150 for a worker with 3.45GB of RAM and 7 computing units. With Scrapy Cloud 2.0, you get 16GB of RAM and 12 computing units for a lower price ($144).

Plus, you can allocate your units (resources) however you need. This includes giving a lot of resources to big, complex jobs and making sure smaller jobs only use the resources they need.

You can assign units to groups and then move your projects from one group to another whenever you want to adjust resources:

[Screenshot: assigning units to groups on the Kumo Groups page]

With this new release, your spiders run in containers isolated from the other jobs running on our platform. This means that your performance will not be affected by other resource consuming jobs. Disk quotas are also applied to ensure that a job does not consume all the disk space from a server, which would affect other jobs running on the same server. Long story short, no more DoS!

Docker Support

With Scrapy Cloud 2.0, your spiders run inside Docker containers. This new resource management engine ensures that you get the most out of the resources available for your subscription.

Scrapy Cloud 2.0 is fully backwards compatible with the previous version. We did our best to keep your current workflow untouched, so you can still use the shub command line tool to deploy your project and dependencies without the hassle of building Docker images. Just make sure to upgrade shub to version 2.1.1+ by running: pip install shub --upgrade.

And here’s some good news for power users: you can deploy your tailor-made Docker images to Scrapy Cloud 2.0. Dependencies are no longer a big deal since you can now compile libraries, install anything you want in your Docker image and then deploy it to our platform. This feature is currently only available for paying customers, but it will soon be released for free users as well.

Check out this walkthrough if you are a paying user and want to deploy custom Docker images to Scrapy Cloud.

Scrapy Cloud Stacks

Scrapy Cloud 2.0 introduces Stacks, a set of predefined Docker images that users can select when deploying a project. The inclusion of Stacks brings together the best of two worlds: you can deploy your project without having to build your own image and you can also choose the kind of environment where you want to run your project.

In Scrapy Cloud terms, a Stack is a pre-built Docker image that you can select when you are deploying a project. This will be used as the environment for this project. In this first release, there are two Stacks available:

  • Hworker: this is the Stack that’s going to be used if you just run a regular shub deploy. It provides backwards compatibility with the legacy platform, so your current projects should run without any issues.
  • Scrapy: this is a minimalist Stack featuring the latest stable versions of Scrapy along with all the basic requirements that you need to run a full featured Scrapy spider. You can choose this Stack if you’re having issues with outdated packages in your projects.

Here is an example scrapinghub.yml file using the Scrapy 1.1 Stack and passing additional requirements to be installed in the image via requirements.txt:

projects:
    default: 123
stacks:
    default: scrapy:1.1
requirements_file: requirements.txt

Just make sure that you are using shub >= 2.1.1 to be able to deploy your project based on the scrapinghub.yml file above.

In the near future, we are going to provide additional Stacks. For example, if you want to run a crawler that you built using a different crawling framework or programming language, we are looking to have a built-in Stack for that.

Heads Up Developers

There are some things that you should be aware of before migrating your projects to Scrapy Cloud 2.0:

  • Your spiders will now run from Germany. That means their requests will no longer go through US IP addresses. If your requests are getting redirected to a localized version of a website that breaks your spiders, consider using Crawlera. It is our smart proxy service that allows you to use location-based IPs in your crawlers, among other features.
  • Spiders can no longer write arbitrary data in the project folder. You must write to the /scrapy/ folder instead. This is the current directory for spiders running in Scrapy Cloud 2.0.
  • When you update your eggs, you must now also re-deploy your project. However, you can handle your dependencies using requirements.txt, which will be the only way to do this in the future.
  • Your job can’t write millions of files to the disk because there are limited disk quotas per job (256k inodes). If you need to save large amounts of data, consider using an external storage such as S3.


Roll Out Timeline

Scrapy Cloud 2.0 is officially live. From today onward, the old model will not be available to new users and all free organizations will be migrated automatically.

All Scrapy Cloud projects will be upgraded to Scrapy Cloud 2.0 within the year. As a loyal customer, you have the opportunity to upgrade immediately to experience dramatically improved performance. Just log into your Scrapy Cloud account and click the “Migrate Now” button at the top of the page:

[Screenshot: the "Migrate Now" button]

Wrap Up

And that’s Scrapy Cloud 2.0 in a nutshell! We’re going to be tinkering a bit more in the coming months so keep your eyes peeled for an even more major upgrade.


A (not so) Short Story on Getting Decent Internet Access


This is a tale of trial, tribulation, and triumph. It is the story of how I overcame obstacles including an inconveniently placed grove of eucalyptus trees, armed with little more than a broom and a pair of borrowed binoculars, to establish a stable internet connection.

I am a remote worker, and access to stable internet ranks right up there with food and shelter. With the rise of digital nomads and distributed companies, I am sure many of you can identify with the frustration of nonexistent or slow internet. Read on to learn more than you ever thought you would about how to MacGyver your very own homemade fiber to the home (FTTH) connection.

Searching for a connection

This adventure took place in September 2015. My wife and I decided to move from Montevideo, Uruguay to our hometown Durazno. We bought a cozy house on the outskirts of the city, about 3 km away from the downtown.

The first thing we checked was the availability of service from our main ISP, Antel. They had 2 options:

  • Fiber to the home (FTTH)
  • Mobile access (LTE)

I walked around the area in an attempt to find a Network Access Point (NAP). The NAP is a part of the ISP’s infrastructure and the presence of one indicates that the zone is covered with FTTH. I know what NAPs look like, specifically the ones used by Antel, so it was a fast way to determine if I would have access to FTTH coverage. To my surprise, I was unable to find one and confirmed via their map that there were none available.

Copper lines were also a bust because my house was too far from the closest node. Besides, the speed was close to a 56K dial-up modem and there were many disconnection issues. My neighbour had the same service and cancelled it. He said it was useless and had trouble despite only using it for email and day-to-day web access.

I then bought a dongle and tested Antel’s LTE/4G service with both my laptop and smartphone. The results were appalling. The upload speed was 128-256Kbps and was by far the biggest let down.

3G coverage was also terrible and I experienced many disconnections. This put mobile technology off the table.

In case you care to see details, here are the speed tests I performed onsite.

Challenge accepted

I looked deeper into Antel’s FTTH deployed zone and saw that it was only 1.5km away in a straight line. I figured it was worth a shot to try creating a wireless Point-to-Point link.

Planning the link

Point-to-Point Link Requirements

Line of Sight: I needed to put the 2 radios in a place where they could see each other. Since I was going to use an unlicensed frequency (2.4GHz or 5GHz), this was a difficult requirement.

The second necessity was determining the Fresnel zone clearance. Basically, you need to mount the two radios high enough that obstructions don't intrude into the Fresnel zone, which would cause interference and poor link quality.

Point-to-Point Link Issues

The most destructive force for radio signals is water (radio signals just don't get through easily, or at all) and trees are full of water. I realized that the eucalyptus grove (labeled "Major issue" in the image below) was right in the signal path. Diving into Google Earth and Google Maps gave me some insight on dodging the eucalyptus grove and other troublesome surface heights. I then stood on the roof of my house with some borrowed binoculars to help determine a placement for the remote endpoint.

[Map: the signal path between the house ("No INET") and the water tank ("FTTH"), with the eucalyptus grove labeled "Major issue"]

Through the binoculars, I spotted a white balloon that looked like some kind of tank in the FTTH zone. I drove up and realized that it was a water tank situated near a large warehouse for trucks. It made sense to put the radio on top of this tank to benefit from its height (labeled “FTTH” in the image above) for the Fresnel Zone clearance.

I rang up the owner and arranged a meeting with him to explain the situation. I found it difficult to explain what I needed and he had no idea what I was talking about, so he took some convincing.

The owner’s main concern was that I may interfere with his internet access. I reassured him and explained that I would be using a service different to the one he was using. He had nothing to worry about as I only needed the height of his tower.

He agreed and let me perform some tests to check the viability of my plan.

[Photo: the water tank]

The Parameters

Google Earth revealed:

  • 93m of surface height at the tank + 7m for the height of the tower (labeled “FTTH”) = 100m
  • My house was 104m (labeled “No INET” in Google Earth picture) + 4m for my house’s small warehouse that would store the radio = 108m

The major concern was that at the halfway point the surface height of the land was 103m, and I needed at least 4.4m of Fresnel zone clearance on top of that, which meant the path had to clear 107.4m.
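For the curious, the first Fresnel zone radius at the midpoint of a link depends only on the link distance and the frequency. A quick back-of-the-envelope check (assuming a roughly 1.5 km path and the 5.8 GHz band that the M5 gear operates in) lands right around the 4.4m figure above:

import math

def fresnel_radius(distance_m, frequency_hz):
    # Radius of the first Fresnel zone at the midpoint of the link, in meters.
    wavelength = 3e8 / frequency_hz  # speed of light / frequency
    return math.sqrt(wavelength * distance_m / 4.0)

print(round(fresnel_radius(1500, 5.8e9), 1))  # -> 4.4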

There was nothing I could do to boost the height of the tank, so I had to hope for the best and test out the connection.

Testing the connection

I borrowed 2 Ubiquiti NanoStation Loco M5s (5GHz band) to test the link. I knew that if I could establish a link with these smaller capacity devices, then I was on the right track.

At the FTTH point (the water tank) I placed a Nano.

[Photo: mounting the NanoStation at the water tank]

I put the other one onto a broomstick. I then wandered around my property (including the roof of the house and the warehouse) with this makeshift aligning tool, waving it around and trying to find the best connection spot.

I was shocked when I finally managed to get a one-way 90Mbps while holding the broom and thought, this might actually work!


I used iperf to test the connection. It runs in a server/client model and I ran the server at the “FTTH” point and then the client at the “No INET” point.


Here are the iperf results:

[Screenshot: iperf output]

Choosing Hardware

I decided to commit to the Point-to-Point link as the solution to my internet woes. The next step was figuring out the right pair of antennas to get the most out of this link.

Product       | NSM5                     | NSM5 loco                  | NanoBeam M5              | LiteBeam M5
Gain          | 16 dBi                   | 13 dBi                     | 16 dBi                   | 23 dBi
Max Power     | 8W                       | 5.5W                       | 6W                       | 4W
Wind Survival | 200 Km/h                 | 200 Km/h                   |                          |
Weight        | 0.4 Kg                   | 0.18 Kg                    | 0.320 Kg                 | 0.750 Kg
CPU           | MIPS 24KC (400 MHz)      | MIPS 24KC (400 MHz)        | MIPS 74KC (560 MHz)      | MIPS 74K
Memory        | 32 MB (SDRAM)            | 32 MB (SDRAM)              | 64 MB (DDR2)             | 64 MB
Network       | 2 x 10/100 eth0          | 1 x 10/100 eth0            | 1 x 10/100 eth0          | 1 x 10/100 eth0
Sensitivity   | 54M -> -75dBm (802.11a)  | 54M -> -75dBm (802.11b/g)  | 54M -> -75dBm (802.11a)  | 54M -> -84dBm (802.11a)
Amazon Price  | USD 88                   | USD 62                     | USD 69                   | USD 55
Part Name     | NSM5                     | LOCOM5                     | NBE-M5-16                | LBE-M5-23
MIMO          |                          |                            | MIMO 2×2                 | SISO 1×1

I picked the NanoBeam, one of the newest products on the Airmax line. I chose it because it was a MIMO 2×2 radio (compared to the LiteBeam) and had a nice antenna gain. It also featured a newer CPU, DDR2 memory, and it was also pretty light and had very high wind survivability.

Installing the link

List of Materials:

  • 2 x NanoBeam M5 ($139 + $40 [courier] )
  • 1 x APC UPS (used at home)
  • 1 x ZTE F660 ONT (provided by ISP)
  • 1 x watertight enclosure ($60)
  • 50m of Cat6 STP (~ $35)
  • 1 x TP Link WDR 3600 (with OpenWRT v14.07 “Barrier Breaker” – this was my home router)
  • apcupsd (to monitor the UPS)

Total Cost: $274

I enrolled in a basic fiber plan at the “FTTH” point (30/4 Mbps) for $25 a month.


Wrap Up

I’ve yet to experience any lags or disconnection issues since I set up the Point-to-Point FTTH link. I’ve even run Netflix and Spotify at the same time to test the connection, all while downloading several things on the computer. If a house is built right in the path of the signal, I’ll need to increase the height if I want the link to continue operating. But, so far so good!


SUCCESS!

Just goes to show what a little perseverance can do when paired with a broomstick and borrowed binoculars.

Technical details

For those who are especially interested in the subject, here are the nitty gritty details on signal strength along with other information:

[Screenshot: link status details, including signal strength]

After performing a full duplex "Network Speed Test" with NanoBeam's own speed test tool:

[Screenshot: full-duplex speed test results]

The same test, but one-way only:

[Screenshot: one-way speed test results]

ISP speed test:

[Screenshot: ISP speed test results]


Scrapy Tips from the Pros: April 2016 Edition


Welcome to the April Edition of Scrapy Tips from the Pros. Each month we’ll release a few tricks and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we only have one tip for you, but it’s a doozy! So if you ever find yourself scraping an ASP.Net page where you need to submit data through a form, step back and read this post.

Dealing with ASP.Net Pages, PostBacks and View States

Websites built using ASP.Net technologies are typically a nightmare for web scraping developers, mostly due to the way they handle forms.

These types of websites usually send state data in requests and responses in order to keep track of the client’s UI state. Think about those websites where you register by going through many pages while filling your data in HTML forms. An ASP.Net website would typically store the data that you filled out in the previous pages in a hidden field called “__VIEWSTATE” which contains a huge string like the one shown below:

[Screenshot: a __VIEWSTATE hidden field containing a huge Base64-encoded string]

I’m not kidding, it’s huge! (dozens of kB sometimes)

This is a Base64 encoded string representing the client UI state and contains the values from the form. This setup is particularly common for web applications where user actions in forms trigger POST requests back to the server to fetch data for other fields.

The __VIEWSTATE field is passed around with each POST request that the browser makes to the server. The server then decodes and loads the client’s UI state from this data, performs some processing, computes the value for the new view state based on the new values and renders the resulting page with the new view state as a hidden field.

If the __VIEWSTATE is not sent back to the server, you are probably going to see a blank form as a result because the server completely lost the client’s UI state. So, in order to crawl pages resulting from forms like this, you have to make sure that your crawler is sending this state data with its requests, otherwise the page will not load what it’s expected to load.

Here’s a concrete example so that you can see firsthand how to handle these types of situations.

Scraping a Website Based on ViewState

The scraping guinea pig today is spidyquotes.herokuapp.com/search.aspx. SpidyQuotes lists quotes from famous people and its search page allows you to filter quotes by author and tag:

[Screenshot: the SpidyQuotes search page]

A change in the Author field fires up a POST request to the server to fill the Tag select box with the tags that are related to the selected author. Clicking Search brings up any quotes that fit the tag from the selected author:

[Screenshot: quotes returned for the selected author and tag]

In order to scrape these quotes, our spider has to simulate the user interaction of selecting an author, a tag and submitting the form. Take a closer look at each step of this flow by using the Network Panel that you can access through your browser’s Developer Tools. First, visit spidyquotes.herokuapp.com/search.aspx and then load the tool by pressing F12 or Ctrl+Shift+I (if you are using Chrome) and clicking on the Network tab.


Select an author from the list and you will see that a request to “/filter.aspx” has been made. Clicking on the resource name (filter.aspx) leads you to the request details where you can see that your browser sent the author you’ve selected along with the __VIEWSTATE data that was in the original response from the server.


Choose a tag and click Search. You will see that your browser sent the values selected in the form along with a __VIEWSTATE value different from the previous one. This is because the server included some new information in the view state when you selected the author.


Now you just need to build a spider that does the exact same thing that your browser did.

Building your Spider

Here are the steps that your spider should follow:

  1. Fetch spidyquotes.herokuapp.com/filter.aspx
  2. For each Author found in the form’s authors list:
    • Create a POST request to /filter.aspx passing the selected Author and the __VIEWSTATE value
  3. For each Tag found in the resulting page:
    • Issue a POST request to /filter.aspx passing the selected Author, selected Tag and view state
  4. Scrape the resulting pages

Coding the Spider

Here’s the spider I developed to scrape the quotes from the website, following the steps just described:

import scrapy

class SpidyQuotesViewStateSpider(scrapy.Spider):
    name = 'spidyquotes-viewstate'
    start_urls = ['http://spidyquotes.herokuapp.com/search.aspx']
    download_delay = 1.5

    def parse(self, response):
        for author in response.css('select#author > option ::attr(value)').extract():
            yield scrapy.FormRequest(
                'http://spidyquotes.herokuapp.com/filter.aspx',
                formdata={
                    'author': author,
                    '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first()
                },
                callback=self.parse_tags
            )

    def parse_tags(self, response):
        for tag in response.css('select#tag > option ::attr(value)').extract():
            yield scrapy.FormRequest(
                'http://spidyquotes.herokuapp.com/filter.aspx',
                formdata={
                    'author': response.css(
                        'select#author > option[selected] ::attr(value)'
                    ).extract_first(),
                    'tag': tag,
                    '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first()
                },
                callback=self.parse_results,
            )

    def parse_results(self, response):
        for quote in response.css("div.quote"):
            yield {
                'quote': quote.css('span.content ::text').extract_first(),
                'author': quote.css('span.author ::text').extract_first(),
                'tag': quote.css('span.tag ::text').extract_first(),
            }

Step 1 is done by Scrapy, which reads start_urls and generates a GET request to /search.aspx.

The parse() method is in charge of Step 2. It iterates over the Authors found in the first select box and creates a FormRequest to /filter.aspx for each Author, simulating the user clicking on every element in the list. It is important to note that the parse() method reads the __VIEWSTATE field from the form it receives and passes it back to the server, so that the server can keep track of where we are in the page flow.

Step 3 is handled by the parse_tags() method. It’s pretty similar to the parse() method as it extracts the Tags listed and creates POST requests passing each Tag, the Author selected in the previous step and the __VIEWSTATE received from the server.

Finally, in Step 4 the parse_results() method parses the list of quotes presented by the page and generates items from them.

Simplifying your Spider Using FormRequest.from_response()

You may have noticed that before sending a POST request to the server, our spider extracts the pre-filled values that came in the form it received from the server and includes these values in the request it’s going to create.

We don’t need to manually code this since Scrapy provides the FormRequest.from_response() method. This method reads the response object and creates a FormRequest that automatically includes all the pre-filled values from the form, along with the hidden ones. This is how our spider’s parse_tags() method looks:

def parse_tags(self, response):
    for tag in response.css('select#tag > option ::attr(value)').extract():
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'tag': tag},
            callback=self.parse_results,
        )

So, whenever you are dealing with forms containing some hidden fields and pre-filled values, use the from_response method because your code will look much cleaner.

Wrap Up

And that’s it for this month! You can read more about ViewStates here. We hope you found this tip helpful and we’re excited to see what you can do with it. We’re always on the lookout for new hacks to cover, so if you have any obstacles that you’ve faced while scraping the web, please let us know.

Feel free to reach out on Twitter or Facebook with what you’d like to see in the future.

 

Machine Learning with Web Scraping: New MonkeyLearn Addon

Say Hello to the MonkeyLearn Addon

We deal in data. Vast amounts of it. But while we’ve been traditionally involved in providing you with the data that you need, we are now taking it a step further by helping you analyze it as well.

To this end, we’d like to officially announce the MonkeyLearn integration for Scrapy Cloud. This feature will bring machine learning technology to the data that you extract through Scrapy and Portia. We also offer a MonkeyLearn Scrapy Middleware so you can use it on your own platform.

Scrapinghub-MonkeyLearn-Addon-02

MonkeyLearn is a classifier service that lets you analyze text. It provides machine learning capabilities like categorizing products or sentiment analysis to figure out if a customer review is positive or negative.

You can use MonkeyLearn as an addon for Scrapy Cloud. It only takes a minute to enable and once this is done, your items will flow through a pipeline directly into MonkeyLearn’s service. You specify which field you want to analyze, which MonkeyLearn classifier to apply to it, and which field it should output the result to.
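To give a feel for the idea, here is a rough sketch of how you might wire something similar up by hand as a Scrapy item pipeline. This is not the addon's actual code: the settings names and field names are illustrative, it classifies items one at a time instead of batching them, and it reuses the v2 classify endpoint from the curl example in the retail classifier tutorial above.

import requests

class MonkeyLearnPipeline(object):
    # Sketch of a pipeline that classifies one item field via MonkeyLearn.

    api_url = 'https://api.monkeylearn.com/v2/classifiers/{}/classify/'

    def __init__(self, token, classifier_id, text_field, output_field):
        self.token = token
        self.classifier_id = classifier_id
        self.text_field = text_field      # e.g. 'description'
        self.output_field = output_field  # e.g. 'category'

    @classmethod
    def from_crawler(cls, crawler):
        # Setting names below are made up for this sketch.
        s = crawler.settings
        return cls(s.get('MONKEYLEARN_TOKEN'),
                   s.get('MONKEYLEARN_CLASSIFIER'),
                   s.get('MONKEYLEARN_TEXT_FIELD'),
                   s.get('MONKEYLEARN_OUTPUT_FIELD'))

    def process_item(self, item, spider):
        result = requests.post(
            self.api_url.format(self.classifier_id),
            json={'text_list': [item.get(self.text_field, '')]},
            headers={'Authorization': 'Token {}'.format(self.token)},
        )
        # Store whatever the classifier returned in the output field.
        item[self.output_field] = result.json()
        return item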

Say you were involved in the production of Batman vs Superman and you’re interested in how people reacted to your high budget movie. You could use Scrapy or Portia to track mentions of this film across the web and then use MonkeyLearn to perform sentiment analysis on the samples that you collect. But don’t get too excited because you might not like the results of your search…


There are so many ways that you can use this addon with our platform, so we’ll be featuring a series of tutorials that will help you make the most out of this partnership.

Getting Started with MonkeyLearn

MonkeyLearn provides public modules that are already ready to go or you can create your own text analysis module by training a custom machine learning model.

For example, using traditional sentiment analysis on comments filled with trolls would return a 100% negative rating. To develop a “Troll Finder” you would need to create a custom model with a higher tolerance for the extreme negativity. You could create categories like “troll”, “ubertroll”, and “trollmaster” for further categorization. Check out MonkeyLearn’s tutorial to help you through this task.

Before you get started with the MonkeyLearn addon on Scrapy, you first need to sign up for the MonkeyLearn service. They offer a free account, so you don’t need to worry about the cash monies. Once you’ve signed up, you’ll be taken to your dashboard:

[Screenshot: the MonkeyLearn dashboard]

Click the “Explore” option in the top menu to check out the whole range of ready-made classifiers that you can apply to the scraped data. There are a ton of different options to choose from including sentiment analysis for product reviews, language detectors, and extractors for useful data such as phone numbers and addresses.


Choose the classifier that you’re interested in and make a note of its ID. You can find the ID in the URL:

[Screenshot: the classifier ID in the MonkeyLearn URL]

And now that you’re all set on the MonkeyLearn side, it’s time to head back over to Scrapy Cloud.

Addon Walkthrough

You can access the MonkeyLearn addon through your dashboard. Navigate to Addons Setup:

[Screenshot: the Addons Setup page]

Enable the addon and click Configure:

[Screenshot: enabling the MonkeyLearn addon]

Head down to Settings:

[Screenshot: the addon's Settings section]

To configure the addon, you need to set your MonkeyLearn API key, specify the classifier you want to use and the field in which you want the result to be stored. You’ll need the classifier ID you chose earlier from the MonkeyLearn platform.

MonkeyLearn reads the content from the fields you’ve specified, performs the classification task on the data, and returns the result of the classification/analysis in the field that you defined as the categories field.

For example, in order to detect the category of a movie based on its title, you would add the ID of the module you want to use in the first text box, your authorization token in the second, the item field you want to analyze (title, in our case) in the third, and the name of the field that will store the results from MonkeyLearn in the fourth.


And you’re all done! Locked and loaded and ready to go with MonkeyLearn.

Using MonkeyLearn with Scrapy and Portia

Since the MonkeyLearn addon is a part of Scrapy Cloud, it works with both Scrapy (Python-based web scraping framework) and Portia (visual web scraping tool). These data extraction tools are both open source, so you can easily run both on your own system.

The addon means you don’t need to worry about learning MonkeyLearn’s API and how to route requests manually. If you need to use MonkeyLearn outside of Scrapy Cloud, you can use the middleware for the same purpose.

When to use MonkeyLearn

We’re really excited about this integration because it is a huge step in closing the gap between data acquisition and analysis.

MonkeyLearn offers a range of text analysis services including:

  • Classifying products into categories based on their name
  • Detecting the language of text
  • Sentiment analysis
  • Keyword extractor
  • Taxonomy classifier
  • News categorizer
  • Entity extraction

We’ll delve into what you can do with each of these tools in future tutorials. For now, feel free to experiment and explore this integration in your web scraping projects.

Wrap Up

Data and textual analysis is more efficient by combining MonkeyLearn’s machine learning capabilities with our data extraction platform. Whether you are using this for personal projects (tracking and monitoring advance reviews for Captain America: Civil War Team Cap) or for professional tasks, we’re excited to see what you come up with.

Keep your eyes peeled, the first tutorial will walk you through using the Retail Classifier with the MonkeyLearn addon. Sign up for free for Scrapy Cloud and for MonkeyLearn and give this addon a whirl.

Mapping Corruption in the Panama Papers with Open Data

We are at a point in the digital age where corruption is increasingly difficult to hide. Information leaks are abundant and shocking.

We rely on whistleblowers for many of these leaks. They have access to confidential information that’s impossible to obtain elsewhere. However, we also live in a time where data is more open and accessible than at any other point in history. With the rise of Open Data, people can no longer shred away their misdeeds. Nothing is ever truly deleted from the internet.

It might surprise you how many insights into corruption and graft are hiding in plain sight through openly available information. The only barriers are clunky websites, inexperience in data extraction, and unfamiliarity with data analysis tools.

We now collectively have the resources to produce our own Panama Papers. Not just as one offs, but as regular accountability checks to those in situations of power. This is especially the case if we combine our information to create further links.

One example of this democratization of information is a recent project in Peru called Manolo and its intersection with the Panama Papers. Manolo used web scraping of open data to collect information on Peruvian government officials and lobbyists.

Manolo

Manolo is a web application that uses Scrapy to extract records (2.2 million so far) of the visitors frequenting several Peruvian state institutions. It then repackages the data into an easily searchable interface, unlike the government websites.

Peruvian journalists frequently use Manolo. It has even helped them uncover illegal lobbying by tracking visits to specific government officials by construction company representatives who are currently under investigation.

Developed by Carlos Peña, a Scrapinghub engineer, Manolo is a prime example of what a private citizen can accomplish. By opening access to the Peruvian media, this project has opened up a much needed conversation about transparency and accountability in Peru.

[Screenshot: the clunky government website]

[Screenshot: Scrapy working its magic]

[Screenshot: the extracted data in a structured format]

[Screenshot: the final transformation into Manolo]

Cross-Referencing Datasets

With leaks like the Panama Papers as a starting point, web scraping can be used to build datasets to discover wrongdoing and to call out corrupt officials.

For example, you could cross-reference names and facts from the Panama Papers with the data that you retrieve via web scraping. This would give you more context and could lead you to new findings.

We actually tested this out ourselves with Manolo. One of the names found in the Panama Papers is Virgilio Acuña Peralta, currently a Peruvian congressman. We found his name in Manolo’s database since he visited the Ministry of Mining last year.


According to the Peruvian news publication Ojo Público, Acuña wanted to use Mossack Fonseca to reactivate an offshore company that he could use to secure construction contracts with the Peruvian state. As a congressman, this is illegal. In Peru, there are efforts to investigate Virgilio Acuña and his brother, who recently ran for president, for money laundering.

[Photo: Virgilio Acuña, congressman (middle); on the right, Cesar Acuña, campaigning for the Peruvian presidency last month. Photo via @Ojo_Publico]

Another name reported in the Panama Papers by Ojo Público was Jaime Carbajal Pérez, a close associate of former Peruvian president Alan García.

Ojo Público states that in 2008, Carbajal, along with colleague Percy Uriarte and others, bought the offshore company Winscombe Management Corp. from Mossack Fonseca. Carbajal and Uriarte own a business that sells books to state-run schools. A further plot twist is that the third owner of the bookstore, José Antonio Chang, was the Minister of Education from 2006 to 2011 and the Prime Minister from 2010 to 2011.

A quick search of the Manolo database reveals that Percy Uriarte visited the Peruvian Ministry of Education 45 times between 2013 and 2015. IDL-Reporteros, another Peruvian news outlet, reported that the company led by Carbajal illegally sold books to the Peruvian government in 2010. He used a front company for these transactions since he was forbidden by law to engage in contracts with the state due to his close association with the former president.

image05

Data Mining

Data from previous leaks such as the Saudi Cables, Swiss Leaks and the Offshore Leaks have been released publicly. In many cases, the data has been indexed and catalogued, making it easy for you to navigate and search it. And now we’re just waiting on the full data dump from the Panama Papers leaks.

You can use information retrieval techniques to dig deeper into the leaks. You could find matching records, sift through the data by tagging specific words, parts of speech, or phrases, and identify different entities like institutions or people.

Creating Open Data

If the information is not readily available in a convenient database, then you can always explore open data yourself:

  • Identify sources of open data that have the information you need. These are often government websites and public records.
  • Scrape the data. You can do this in two ways: using a visual web scraper like Portia (no programming needed) or a framework like Scrapy that allows you to customize your code. Big bonus is that these are both open source tools and completely free.
  • Download your data and import into your favorite data analysis software. Then start digging!
  • Package your findings by creating visual representations of the data with tools like Tableau or Plot.ly.

We actually offer platform integrations with Machine Learning programs like BigML and MonkeyLearn. We’re looking into integrating more data tools later this year, so keep an eye out!

Wrap Up

Data and corruption are everywhere and they can both seem difficult to access. That's where web scraping comes in. We are at a point where citizens have the tools necessary to hold elected officials, businesses, and folks in power accountable for illegal and corrupt actions. By increasing transparency and creating further links between information leaks and readily available data, we can remove the loopholes where firms like Mossack Fonseca exist.

Frameworks like Scrapy are great for those who know how to code, but there shouldn’t be a barrier to acquiring data. Journalists who wish to take advantage of these vast sources of information can use visual web scrapers like Portia to get the data for current and future investigations.
