
Deploy your Scrapy Spiders from GitHub

Up until now, your deployment process using Scrapy Cloud has probably been something like this: code and test your spiders locally, commit and push your changes to a GitHub repository, and finally deploy them to Scrapy Cloud using shub deploy. However, having the development and the deployment processes in isolated steps might bring you some issues, such as unversioned and outdated code running in production.

The good news is that, from now on, you can have your code automatically deployed to Scrapy Cloud whenever you push changes to a GitHub repository. All you have to do is connect your Scrapy Cloud project with a repository branch and voilà!

Scrapy Cloud’s new GitHub integration will help you ensure that your code repository and your deployment environments are always in sync, getting rid of the error-prone manual deployment process and also speeding up the development cycle.

Check out how to set up automatic deploys in your projects:

If you are not that into videos, have a look at this guide.

Improving your workflow with the GitHub integration

You could use this feature to set up a multi-stage deploy workflow integrated with your repository. Let’s say you have a repo called foobar-crawler, with three main branches — development, staging and master — and you need one deployment environment for each one.

You create one Scrapy Cloud project for each branch:

  • foobar-dev
  • foobar-staging
  • foobar

And connect each of these projects with a specific branch from your foobar-crawler repository, as shown below for the development one:

Then, every time you push changes to one of these branches, the code is automatically deployed to the proper environment.

Wrapping up

If you have any feedback regarding this feature or the whole platform, leave us a comment.

Start deploying your Scrapy spiders from GitHub now.

Sign up for free

Looking Back at 2016

We started 2016 with an eye on blowing 2015 out of the water. Mission accomplished.

Together with our users, we crawled more in 2016 than the rest of Scrapinghub’s history combined: a whopping 43.7 billion web pages, resulting in 70.3 billion scraped records! Great work everyone!

In what follows, we’ll give you a whirlwind tour of what we’ve been up to in 2016, along with a quick peek at what you can expect in 2017.

Platform

Scrapy Cloud

It’s been a great year for Scrapy Cloud as we wrap up with massive growth in our yearly platform signups:

We proudly announced our biggest platform upgrade to date this year with the launch of Scrapy Cloud 2.0. Alongside technical improvements like Docker and Python 3 support, this upgrade introduced an improved pricing model that is less expensive while letting you customize Scrapy Cloud based on your resource needs.

As we move into 2017, we will continue to focus on expanding the technical capabilities of our platform such as:

  • Support for non-Scrapy spiders: We’re well aware that there are alternatives to Scrapy in the wild. If you’ve got a crawler created using another framework, be it in Python or another language, you’ll soon be able to run it in our cloud-based platform.
  • GitHub integration: You’ll soon be able to sign up and easily deploy from GitHub. We’ll support automatic deploys shortly after: push an update to your GitHub repo and it will automatically be reflected within Scrapinghub.

Heads up, Scrapy Cloud is in for a massive change this year, so stay tuned!

Crawlera

Crawlera is our other flagship product and it basically helps your crawls continue uninterrupted. We were thrilled to launch our Crawlera Dashboard this year, which gives you the ability to visualize how you are using the product, to examine what sites and specific URLs you are targeting, and to manage multiple accounts.

Portia

The main goal of Portia is to lower the barrier of entry to web data extraction and to increase the democratization of data (it’s open source!).

Portia got a lot of love this year with the beta release of Portia 2.0. This 2.0 update includes new features like simple extraction of repeated data, loading start URLs from a feed, the option to download Portia projects as Python code, and the use of CSS selectors to extract specific data.

Next year we’re going to be bringing a host of new features that will make Portia an even more valuable tool for developers and non-developers alike.

Data Science

While we had been engaging in data science activities since our earliest days, 2016 saw us formalize a Data Science team proper. We’re continuing to push the envelope for machine learning data extraction, so get pumped for some really exciting developments in 2017!

Open Source

Scrapinghub has been as committed to open source as ever in 2016. Running Scrapinghub relies on a lot of open source software so we do our best to pay it forward by providing high quality and useful software to the world. Since nearly 40 open source projects maintained by Scrapinghub staff saw new releases this year, we’ll just give you the key highlights.

Scrapy

Scrapy is the most well-known project that we maintain and 2016 saw our first Python 3 compatible version (version 1.1) back in May (running it on Windows is still a challenge, we know, but bear with us). Scrapy is now the 11th most starred Python project on GitHub! 2017 should see it get many new features to keep it the best tool you have (we think) to tackle any web scraping project, so keep sending it some GitHub star love and feature requests!

Splash

Splash, our headless browser with an HTTP interface, hit a major milestone a few weeks ago with the addition of the long-awaited web scraping helpers: CSS selectors, form filling, interacting with DOM nodes… This 2.3 release came after a steady series of improvements and a successful Google Summer of Code (GSoC) project this summer by our student Michael Manukyan.

Dateparser

Our “little” library to help with dates in natural language got a bit of attention on GitHub when it was picked up as a dependency for Kenneth Reitz’ latest project, Maya. We’re quite proud of this little library :). Keep the bug reports coming if you find any, and if you can, please help us support even more languages.

Frontera

Frontera is the framework we built to allow you to implement distributed crawlers in Python. It provides scaling primitives and crawl frontier capabilities. 2016 brought us 11 releases including support for Python 3! A huge thank you to Preetwinder Bath, one of our GSoC students, who helped us to improve test coverage and made sure that all of the parts of Frontera support Python 3.

Google Summer of Code

As in 2014 and 2015, Scrapinghub participated in GSoC 2016 under the Python Software Foundation umbrella. We had four students complete their projects and two of them got their contribution merged into the respective code base (see the “Frontera” and “Splash” sections above). Another completed project was related to Scrapy performance improvements and is close to being integrated. The last one is a standalone set of helpers to use Scrapy with other programming languages. To our students, Aron, Preet, Michael, and Avishkar, thank you all very much for your contributions!

Conferences

Conferences are always a great opportunity to learn new skills, showcase our projects, and, of course, hang out with our clients, users, and coworkers. As a remote staff, we don’t have the opportunity to meet each other in person often, so tech conferences are always a great way to strengthen ties. The traveling Scrapinghubbers thoroughly enjoyed sharing their knowledge and web scraping experiences through presentations, tutorials, and workshops.

Check out some of the talks that were given this year:

Scrapinghubber Community

Being a fully remote company, we’re thrilled to confirm that we have Scrapinghubbers in almost every continent (we’re *just* missing Antarctica, any Scrapy hackers out there?).

Aside from the conferences we attended this year, we had a few localized team retreats. The Professional Service managers got together in Bangkok, Thailand; the Sales team had a retreat in Buenos Aires, Argentina, where asado dominated the show; and the Uruguayan Scrapinghubbers got together for an end-of-year meetup in La Paloma, Uruguay, hosted by Daniel, our resident hacker/surfer.

Currently our Crawlera team is having a meetup in Poznan, Poland, paving the way for what will become the next version of our flagship smart downloader product.

Wrap Up

And that’s it for 2016! From the whole team at Scrapinghub, I’d like to wish you happy holidays and the best wishes for the start of 2017. We’ve got a lot of exciting events lined up next year, so get ready!

How to Increase Sales with Online Reputation Management

One negative review can cost your business up to 22% of its prospects. This was one of the sobering findings in a study highlighted on Moz last year. With over half of shoppers rating reviews as important in their buying decision, no company large or small can afford to ignore stats like these – let alone the reviews themselves. In what follows I’ll let you in on how web scraping can help you stay on top.

What is online reputation management

Online reputation management is carefully maintaining and curating your brand’s image by monitoring social media, reviews, and articles about your company. When it comes to online reputation management, you can’t have too much information. This is a critical part of your business strategy that impacts pretty much every level of your organization from customer service to marketing to sales. BrightLocal found that, “84% of people trust online reviews as much as a personal recommendation.” The relationship between brands and customers has become a two-way street because of the multitude of channels for interaction. Hence the rise of influencer and guerilla marketing tactics.

A key part of online reputation management is highlighting positive reviews to send the message that you are a responsive company that rewards loyal and happy customers. Online reputation management is likewise critical to putting out any potential customer fires. The attrition rate of consumers shoots up to 70% when they stumble across four or more negative articles. You need to be able to act fast to address criticisms and to mitigate escalating issues. Ideally you should not delete negative feedback, but instead show the steps that you are taking to rectify the situation. Besides sparing you an occasional taste of the Streisand effect, this shows that you are responsible, transparent, and not afraid to own up to errors.

How to manage your reputation online

While you could manually monitor social media and review aggregators, in addition to Googling your company for unexpected articles, it’s much more effective to automate the process. There are a lot of companies and services that specialize in this area, including:

  1. Sprout Social
  2. Brandwatch
  3. Klear
  4. Sprinklr

If you want complete control over your data and the type of information that you’d like to monitor, web scraping is the most comprehensive and flexible choice.

Web scraping, the man behind the curtain

Web scraping provides reliable and up-to-date web data

There is an inconceivably vast amount of content on the web which was built for human consumption. However, its unstructured nature presents an obstacle for software. So the general idea behind web scraping is to turn this unstructured web content into a structured format for easy analysis.

Automated data extraction smooths the tedious manual aspect of research and allows you to focus on finding actionable insights and implementing them. And this is especially critical when it comes to online reputation management. Respondents to The Social Habit study showed that when customers contact companies through social media for customer support issues, 32% expect a response within 30 minutes and 42% expect a response within 60 minutes. Using web scraping, you could easily have constantly updating data feeds that alert you to comments, help queries, and complaints about your brand on any website, allowing you to take instant action.

You also need to be sure that nothing falls through the cracks. You can easily monitor thousands, if not millions of websites for changes and updates that will impact your company.

Sentiment analysis and review monitoring

Now, a key part of online reputation management is monitoring reviews for positive and negative feedback. Once the extracted web data is in, you can use machine learning to do sentiment analysis. This form of textual analysis can categorize messages as positive or negative, and the more data you use to train the program, the more effective it becomes. This is a great method for being able to quickly respond to negative reviews while keeping track of positive reviews to reward customers and highlight loyalty.
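If you are curious what that training step can look like in practice, here is a toy sketch using scikit-learn; the example reviews and labels below are made up for illustration, and a useful classifier needs thousands of labeled examples:

# Toy sentiment classifier sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = positive review, 0 = negative review.
reviews = [
    "Great service and fast shipping, would buy again",
    "Terrible support, my order arrived broken",
]
labels = [1, 0]

# TF-IDF features feeding a simple linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["the support team was terrible"]))  # e.g. [0], i.e. negative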

Straight From the Horse’s Mouth

Here are two entrepreneurs providing real world examples of how they use online reputation management and review monitoring to increase their business.

The Importance of Review Monitoring

Kent Lewis
President and Founder of Anvil Media, Inc.
http://www.anvilmediainc.com/about/team/kent-lewis

As a career agency professional who has owned my own agency for the past 16 years, I have a few thoughts regarding monitoring reviews and assessing sentiment to move businesses forward:

Monitoring reviews (including sentiment) is essential to your business. Ignoring (negative) reviews can cause undue and unnecessary harm. Since 90% of customers read online reviews before visiting a business, negative reviews can directly affect sales. Conversely, a one-star increase on Yelp leads to a 5-9% increase in business revenue.

Online reviews can be monitored manually (bookmarking and visiting sites like Google My Business, Yelp and others daily or as-needed). However, there are a host of tools available that automate the process. Utilize a mix of free (socialmention.com) and paid tools (Revinate.com) to regularly monitor reviews, in order to address negative reviews and celebrate positive reviews.

While the primary objective for monitoring reviews is identifying and mitigating negative reviews, there are a host of other benefits to capturing and analyzing the data. Harvesting and analyzing the data will provide insights that will improve your products and services. For starters, you can measure and trend sentiment for the brand overall. With additional insights, you can track and monitor reviews for specific products, services or locations. Social media and review sites are the largest (free) focus group in the world. Additionally, you can look at competitors and create benchmarks to track trends over time. Lastly, you can identify superfans that can be nurtured into brand ambassadors.

The sources of data vary by company and industry. Most businesses can be reviewed on Google My Business, Yelp, BBB and Glassdoor (for employees). Each industry has specific sites that also must be monitored, including Expedia and Travelocity for travel & hospitality.

To get maximum value from your monitoring efforts, always look at competitor reviews. Customers are telling you what business you should be in based on their feedback and suggestions for improvement… learn from the entire industry, not just your current or past customers.

Online reputation management, social media, and competitor monitoring

Max Robinson
Owner of Ace Work Gear
http://www.aceworkgear.com

We use tools like Sprout Social, which help us track mentions on social media for our clients, as this is where the majority of the discussion about their business happens. The main reason that our clients want to track these mentions is that people tend to speak more openly and honestly about their experiences with a business on social media than anywhere else online. This also gives our clients the chance to join in conversations with their customers in a casual manner, whereas interactions on review sites can be far more formal.

We report on the number of mentions, and whether our client is being discussed in a positive or negative manner, as well as what the discussion is specifically related to. We look at 3 main social media platforms – Facebook, Twitter and Reddit. We also monitor mentions of competitors across all of these platforms, as per the request of our clients.

Monitoring the online reputation of competitors

Do not neglect your competitors when monitoring reviews and social media. Keeping track of the online reputation of competitors allows you to:

  1. Correctly position and price your product or service offerings
  2. Snatch customers who are posting dissatisfied reviews and comments about your competition
  3. Launch more effective marketing campaigns that address pain points experienced by customers of your competition
  4. Determine what your competitors are doing right so that you can innovate off of their ideas

And that’s just the tip of the iceberg. Competitive intelligence and having an accurate overview of your industry only serves to help you sell your products more effectively. And to bring it back to online reputation management, having a negative perception of your brand is like shooting yourself in the foot. You’re already at a severe disadvantage, especially when compared to positively reviewed competitors.

How to use online reputation management to increase your sales

In an interview with Don Sorensen, president of Big Blue Robot, he shared that one company he worked with was losing an estimated $2 million or more in sales due to a poor online reputation. Don’t let this be you.

  1. The first step is to level the playing field by locating and responding to all of the negative reviews. With a damaged reputation, you should be in crisis mode and monitoring brand mentions around-the-clock so that you are never caught by surprise.
  2. Dominate your search results so that there is little room for people with vendettas to swoop in. This means posting regularly on social media, getting press coverage, and answering questions in forums related to your business or your industry.
  3. Curate your brand’s reputation by having an active blog that carefully frames the benefits of your business, tailored to your audience.

If you are proactive and have a positive reputation or have managed to repair your reputation, then enthusiastic reviews and word of mouth will increase and improve your lead generation prospects. Your sales team should also be fully aware of your online reputation so they can soothe potential concerns or draw attention to success stories.

Wrap up

They say that a good reputation is more valuable than money. Guard yours closely with web data and ensure that you are taking every precaution necessary to retain customers and win over new leads.

Explore ways that you can use web data or chat with one of our representatives to learn more.

How to Build your own Price Monitoring Tool

Computers are great at repetitive tasks. They don’t get distracted, bored, or tired. Automation is how you should be approaching tedious tasks that are absolutely essential to becoming a successful business or when carrying out mundane responsibilities. Price monitoring, for example, is a practice that every company should be doing, and is a task that readily lends itself to automation.

In this tutorial, I’ll walk you through how to create your very own price monitoring tool from scratch. While I’m approaching this as a careful shopper who wants to make sure I’m getting the best price for a specific product, you could develop a similar tool to monitor your competitors using similar methods.

Why you should be monitoring competitor prices

Price monitoring is basically knowing how your competitors price their products, how your prices fit within your industry, and whether there are any fluctuations that you can take advantage of.

When it comes to mission critical tasks like price monitoring, it’s important to ensure accuracy, obtain up-to-date information, and have the capacity for massive scale. By pricing your products perfectly, you can make sure that your competitors aren’t undercutting you, which makes you more likely to nab customers.

In our article on how web data is used by startups, Max Robinson, owner of Ace Work Gear, shared his thoughts on the importance of price monitoring:

“But it occurred to me that if you aren’t offering competitive prices, then you’re essentially throwing money down the drain. Even if you have good visibility, users will look elsewhere to buy once they’ve seen your prices.”

And that’s part of why automation is so important. You don’t want to miss sudden sales or deals from competitors that might make your offerings less desirable.

Overview

In terms of using price monitoring as a consumer, the key is to be able to take advantage of rapid price drops so you can buy during lightning sales. For this tutorial, I used Scrapy, our open source web scraping framework, and Scrapy Cloud, our fully-featured production environment (there’s a forever free account option). Here is the basic outline of my approach:

  1. Develop web scrapers to periodically collect prices from a list of products and online retailers.
  2. Build a Python script to check whether there are price drops in the most recently scraped data and then send an email alert when there are.
  3. Deploy the project to Scrapy Cloud and schedule periodic jobs to run the spiders and the script every X minutes.

Collecting the Prices

I monitored prices from a few online retailers. To scrape the prices, I built one Scrapy spider for each of these. The spiders work by:

  1. Reading a list of product URLs from a JSON file
  2. Scraping the prices for the listed products
  3. Storing the prices in a Scrapy Cloud Collection (an efficient key-value storage)

Here is a sample JSON file with product URLs:

{
    "headsetlogitech": [
        "https://www.retailer1.com/pagefor-logitech-headset",
        "http://www.retailer2.com/pagefor-headset-logitech",
        "http://www.retailer3.com/pagefor-headset-log"
    ],
    "webcamlogitech": [
        "https://www.retailer1.com/pagefor-logitech-webcam",
        "http://www.retailer2.com/pagefor-webcam-logitech",
        "http://www.retailer3.com/pagefor-webcam-log"
    ]
}

If you want to monitor more retailers than the three I implemented, all you need to do is add their URLs to the JSON file and then create the requisite Scrapy spider for each website.

The Spiders

If you are new to the world of Scrapy and web scraping, then I suggest that you check out this tutorial first. When building a spider, you need to pay attention to the layout of each retailer’s product page. For most of these stores, the spider code will be really straightforward, containing only the extraction logic using CSS selectors. In this case, the URLs are read during the spider’s startup.

Here’s an example spider for Best Buy:

# BaseSpider (defined in the project) reads the URL list from the JSON file
# and generates the initial requests; this subclass only adds the extraction logic.
class BestbuySpider(BaseSpider):
    name = "bestbuy.com"

    def parse(self, response):
        # The item dict is seeded by BaseSpider via the request meta
        item = response.meta.get('item', {})
        item['url'] = response.url
        item['title'] = response.css(
            'div#sku-title > h1::text'
        ).extract_first().strip()
        item['price'] = float(response.css(
            'div.price-block ::attr(data-customer-price)'
        ).extract_first(default=0))
        yield item

BaseSpider contains the logic to read the URLs from the JSON file and generate requests. In addition to the spiders, I created an item pipeline to store product data in a Scrapy Cloud collection. You can check out the other spiders that I built in the project repository.
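For reference, here is a rough sketch of what that base spider and pipeline can look like; the class names, file paths, environment variables, and collection name below are assumptions for illustration, not the exact code from the repository:

import json

import scrapy


class BaseSpider(scrapy.Spider):
    """Reads resources/urls.json and yields one request per product URL
    that belongs to this spider's retailer (matched by spider name)."""

    def start_requests(self):
        with open('resources/urls.json') as f:
            products = json.load(f)
        for product_name, urls in products.items():
            for url in urls:
                if self.name in url:
                    # Seed the item with the product name so parse() can reuse it
                    yield scrapy.Request(url, meta={'item': {'product': product_name}})

And a minimal item pipeline sketch that writes each scraped price into a Scrapy Cloud collection via the python-scrapinghub client:

import os
from datetime import datetime

from scrapinghub import ScrapinghubClient


class CollectionStoragePipeline(object):
    """Stores each item in a key-value collection named 'prices'."""

    def open_spider(self, spider):
        client = ScrapinghubClient(os.environ['SHUB_APIKEY'])        # assumed env var
        project = client.get_project(os.environ['SHUB_PROJECT_ID'])  # assumed env var
        self.store = project.collections.get_store('prices')         # assumed name

    def process_item(self, item, spider):
        key = '{}-{}-{}'.format(item['product'], spider.name,
                                datetime.utcnow().isoformat())
        self.store.set({'_key': key, 'value': dict(item)})
        return item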

Building the Price Monitoring Script

Now that the spiders have been built, you should start getting product prices that are then stored in a collection. To monitor price fluctuations, the next step is to build a Python script that will pull data from that collection, check if the most recent prices are the lowest in a given time span, and then send an email alert when it finds a good deal.
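Here is a minimal sketch of that comparison and alerting logic, assuming each price entry fetched from the collection is a dict with 'title', 'price' and 'url' keys, and that alerts go out through Gmail's SMTP server (both are assumptions for illustration):

import smtplib
from email.mime.text import MIMEText

PRICE_MARGIN = 1.00  # ignore drops smaller than this (configurable)


def find_deal(entries):
    """Return the newest entry if its price beats all earlier prices
    by more than PRICE_MARGIN, otherwise None."""
    latest, previous = entries[-1], entries[:-1]
    if previous and latest['price'] + PRICE_MARGIN < min(e['price'] for e in previous):
        return latest
    return None


def send_alert(deal, sender, password, recipient):
    msg = MIMEText('{title} is now {price}: {url}'.format(**deal))
    msg['Subject'] = 'Price drop alert'
    msg['From'], msg['To'] = sender, recipient
    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
        server.login(sender, password)
        server.send_message(msg)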

Here is my model email notification that is sent out when there’s a price drop:

[Screenshot: the price drop alert email]

You can find the source code for the price monitor in the project repository. As you might have noticed, there are customizable options via command line arguments. You can:

  • modify the time frame in which the prices are compared to find out whether the latest price is the best of the day, the week, the month, and so forth.
  • set a price margin to ignore insignificant price drops since some retailers have minuscule price fluctuations throughout the day. You probably don’t want to receive an email when the product that you’re interested in drops one cent…

Deployment and Execution

Now that you have the spider(s) and the script, you need to deploy both to Scrapy Cloud, our PaaS for web crawlers.

I scheduled my spiders to collect prices every 30 minutes and the script to check this data at 30 minute intervals as well. You can configure this through your Scrapy Cloud dashboard, easily changing the periodicity depending on your needs.


Check out this video to learn how to deploy Scrapy spiders and this tutorial on how to run a regular Python script on Scrapy Cloud.

How to run this project in your own Scrapy Cloud account:

  • Clone the project:
    • git clone git@github.com:scrapinghub/sample-projects.git
  • Add the products you want to monitor to resources/urls.json
  • Sign up for Scrapy Cloud (it’s free!)
  • Create a project on Scrapy Cloud
  • Deploy your local project to Scrapy Cloud
  • Create a periodic execution job to run each spider
  • Create a periodic execution job to run the monitor script
  • Sit back, relax, and let automation work its magic

Scaling up

This price monitor is a good fit for individuals interested in getting the best deals for their wishlist. However, if you’re looking to scale up and create a reliable tool for monitoring competitors, here are some typical challenges that you will face:

  • Getting prices from online retailers who feature millions of products can be overwhelming. Scraping these sites requires advanced crawling strategies to make sure that you always have fresh, relevant data.
  • Online retailers typically have layout variations throughout their website and the smallest shifts can bring your crawler to a screeching halt. To get around this, you might need to use advanced techniques such as machine learning to help with data discovery.
  • Running into anti-bot software can shut your price gathering activities down. You will need to develop some sophisticated techniques for bypassing these obstacles.

If you’re curious about how to implement or develop an automated price monitoring tool, feel free to reach out with any questions.

Tell us about your needs

Wrap up

To sum up, there’s no reason why you should be manually searching for prices and monitoring competitors. Using Scrapy, Scrapy Cloud, a Python script, and just a little bit of programming know-how, you can easily get your holiday shopping done under budget with deals delivered straight to your inbox.

If you’re looking for a professional-grade competitor and price monitoring service, get in touch!

How You Can Use Web Data to Accelerate Your Startup

In just the US alone, there were 27 million individuals running or starting a new business in 2015. With this fiercely competitive startup scene, business owners need to take advantage of every resource available, especially given a high probability of failure. Enter web data. Web data is abundant and those who harness it can do everything from keeping an eye on competitors to ensuring customer satisfaction.


Web Data and Web Scraping

You can get web data through a process called web scraping. Since websites are created in a human-readable format, software can’t meaningfully analyze this information. While you could manually (read: the time-consuming route) input this data into a format more palatable to programs, web scraping automates this process and eliminates the possibility of human error.

How You Can Use Web Data

If you’re new to the world of web data or looking for creative ways to channel this resource, here are three real world examples of entrepreneurs who use scraped data to accelerate their startups.

Web Data for Price Monitoring


Max Robinson
Owner of Ace Work Gear
http://www.aceworkgear.com

The key to staying ahead of your competitors online is to have excellent online visibility, which is why we invest so much in paid advertising (Google Adwords). But it occurred to me that if you aren’t offering competitive prices, then you’re essentially throwing money down the drain. Even if you have good visibility, users will look elsewhere to buy once they’ve seen your prices.

Although I used to spend hours scrolling through competitor sites to make sure that I was matching all of their prices, it took far too long and probably wasn’t the best use of my time. So instead, I started scraping websites and exporting the pricing information into easily readable spreadsheets.

This saves me huge amounts of time, but also saves my copywriter time as they don’t have to do as much research. We usually outsource the scraping, as we don’t really trust ourselves to do it properly! The most important aspect of this process is having the data in an easily readable format. Spreadsheets are great, but even they can become too muddled up with unnecessary information.

Enriched Web Data for Lead Generation


Chris McCarron
Founder of GoGoChimp
https://www.gogochimp.com

We use a variety of different sources and data to get our clients more leads and sales. This is really beneficial to our clients that include national and international brands who all use this information to specifically target an audience, boost conversions, increase engagement and/or reduce customer acquisition costs.

Web data can help you know which age groups, genders, locations, and devices convert the best. If you have existing analytics already in place, you can enrich this data with data from around the web, like reviews and social media profiles, to get a more complete picture. You’ll be able to use this enriched web data to tailor your website and your brand’s message so that it instantly connects to who your target customer is.

For example, by using these techniques, we estimate that our client Super Area Rugs will increase their annual revenue by $450,000.

Web Data for Competitor Monitoring


Mike Catania
CTO of PromotionCode.org
http://www.promotioncode.org

The coupon business probably seems docile from the outside but the reality is that many sites are backed by tens of millions of dollars in venture capital and there are only so many offers to go around. That means exclusive deals can easily get poached by competitors. So we use scraping to monitor our competition to ensure they’re not stealing coupons from our community and reposting them elsewhere.

Both the IT and Legal departments use this data–in IT, we use it more functionally, of course. Legal uses it as research before moving ahead with cease and desist orders.

Wrap Up

And there you have it. Real use cases of web data helping companies with competitive pricing, competitor monitoring, and increasing conversion for sales. Keep in mind that it’s not just about having the web data, it’s also about quality and using a reputable company to provide you with the information you need to increase your revenue.

Please share any other ways that web data has helped you in the comments below.

Explore more ways that you can use web data or chat with one of our representatives to learn more.

An Introduction to XPath: How to Get Started

XPath is a powerful language that is often used for scraping the web. It allows you to select nodes or compute values from an XML or HTML document and is actually one of the languages that you can use to extract web data using Scrapy. The other is CSS and while CSS selectors are a popular choice, XPath can actually allow you to do more.

With XPath, you can extract data based on text elements’ contents, and not only on the page structure. So when you are scraping the web and you run into a hard-to-scrape website, XPath may just save the day (and a bunch of your time!).

This is an introductory tutorial that will walk you through the basic concepts of XPath, crucial to a good understanding of it, before diving into more complex use cases.

Note: You can use the XPath playground to experiment with XPath. Just paste the HTML samples provided in this post and play with the expressions.

The basics

Consider this HTML document:

<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>

XPath handles any XML/HTML document as a tree. This tree’s root node is not part of the document itself. It is in fact the parent of the document element node (<html> in case of the HTML above). This is what the XPath tree for the HTML document looks like:

[Figure: the XPath tree for the HTML document above]

As you can see, there are many node types in an XPath tree:

  • Element node: represents an HTML element, a.k.a. an HTML tag.
  • Attribute node: represents an attribute from an element node, e.g. the "href" attribute in <a href="http://www.example.com">example</a>.
  • Comment node: represents comments in the document (<!-- … -->).
  • Text node: represents the text enclosed in an element node (example in <p>example</p>).

Distinguishing between these different types is useful to understand how XPath expressions work. Now let’s start digging into XPath.

Here is how we can select the title element from the page above using an XPath expression:

/html/head/title

This is what we call a location path. It allows us to specify the path from the context node (in this case the root of the tree) to the element we want to select, as we do when addressing files in a file system. The location path above has three location steps, separated by slashes. It roughly means: start from the ‘html’ element, look for a ‘head’ element underneath, and a ‘title’ element underneath that ‘head’. The context node changes in each step. For example, the head node is the context node when the last step is being evaluated.

However, we usually don’t know or don’t care about the full explicit node-by-node path, we just care about the nodes with a given name. We can select them using:

//title

Which means: look in the whole tree, starting from the root of the tree (//) and select only those nodes whose name matches title. In this example, // is the axis and title is the node test.

In fact, the expressions we’ve just seen are using XPath’s abbreviated syntax. Translating //title to the full syntax we get:

/descendant-or-self::node()/child::title

So, // in the abbreviated syntax is short for descendant-or-self, which means the current node or any node below it in the tree. This part of the expression is called the axis and it specifies a set of nodes to select from, based on their direction on the tree from the current context (downwards, upwards, on the same tree level). Other examples of axes are: parent, child, ancestor, etc — we’ll dig more into this later on.

The next part of the expression, node(), is called a node test, and it contains an expression that is evaluated to decide whether a given node should be selected or not. In this case, it selects nodes of all types. Then we have another axis, child, which means go to the child nodes from the current context, followed by another node test, which selects the nodes named title.

So, the axis defines where in the tree the node test should be applied and the nodes that match the node test will be returned as a result.

You can test nodes against their name or against their type.

Here are some examples of name tests:

  • /html: Selects the node named html, which is under the root.
  • /html/head: Selects the node named head, which is under the html node.
  • //title: Selects all the title nodes from the HTML tree.
  • //h2/a: Selects all a nodes which are directly under an h2 node.

And here are some examples of node type tests:

  • //comment(): Selects only comment nodes.
  • //node(): Selects any kind of node in the tree.
  • //text(): Selects only text nodes, such as “This is the first paragraph”.
  • //*: Selects all nodes, except comment and text nodes.

We can also combine name and node tests in the same expression. For example:

//p/text()

This expression selects the text nodes from inside p elements. In the HTML snippet shown above, it would select “This is the first paragraph.”.
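If you would rather experiment from Python than in the playground, parsel (the selector library behind Scrapy’s selectors) can evaluate these expressions directly. A minimal sketch against the sample document above:

from parsel import Selector

html = """
<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath('/html/head/title').get())  # '<title>My page</title>'
print(sel.xpath('//title/text()').get())    # 'My page'
print(sel.xpath('//p/text()').get())        # 'This is the first paragraph.'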

Now, let’s see how we can further filter and specify things. Consider this HTML document:

<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2> ...</li>
    </ul>
  </body>
</html>

Say we want to select only the first li node from the snippet above. We can do this with:

//li[position() = 1]

The expression surrounded by square brackets is called a predicate and it filters the node set returned by //li (that is, all li nodes from the document) using the given condition. In this case it checks each node’s position using the position() function, which returns the position of the current node in the resulting node set (notice that positions in XPath start at 1, not 0). We can abbreviate the expression above to:

//li[1]

Both XPath expressions above would select the following element:

<li>Quote 1</li>

Check out a few more predicate examples:

  • //li[position() mod 2 = 0]: Selects the li elements at even positions.
  • //li[a]: Selects the li elements which enclose an a element.
  • //li[a or h2]: Selects the li elements which enclose either an a or an h2 element.
  • //li[ a [ text() = "link" ] ]: Selects the li elements which enclose an a element whose text is “link”. Can also be written as //li[ a/text()="link" ].
  • //li[last()]: Selects the last li element in the document.
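You can check these predicates with the same parsel approach; a quick sketch against the quotes snippet above:

from parsel import Selector

html = """
<ul>
  <li>Quote 1</li>
  <li>Quote 2 with <a href="...">link</a></li>
  <li>Quote 3 with <a href="...">another link</a></li>
  <li><h2>Quote 4 title</h2> ...</li>
</ul>
"""

sel = Selector(text=html)
print(sel.xpath('//li[1]/text()').get())          # 'Quote 1'
print(sel.xpath('//li[a]/a/text()').getall())     # ['link', 'another link']
print(sel.xpath('//li[last()]/h2/text()').get())  # 'Quote 4 title'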

So, a location path is basically composed of steps, which are separated by / and each step can have an axis, a node test and a predicate. Here we have an expression with two steps, each one with an axis, a node test and a predicate:

//li[ 4 ]/h2[ text() = "Quote 4 title" ]

And here is the same expression, written using the non-abbreviated syntax:

/descendant-or-self::node()
    /child::li[ position() = 4 ]
        /child::h2[ text() = "Quote 4 title" ]

We can also combine multiple XPath expressions in a single one using the union operator |. For example, we can select all a and h2 elements in the document above using this expression:

//a | //h2

Now, consider this HTML document:

<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
    </ul>
  </body>
</html>

Say we want to select only the a elements whose link points to an HTTPS URL. We can do it by checking their href attribute:

//a[starts-with(@href, "https")]

This expression first selects all the a elements from the document and for each of those elements, it checks whether their href attribute starts with “https”. We can access any node attribute using the @attributename syntax.

Here we have a few additional examples using attributes:

  • //a[@href="https://scrapy.org"]: Selects the a elements pointing to https://scrapy.org.
  • //a/@href: Selects the value of the href attribute from all the a elements in the document.
  • //li[@id]: Selects only the li elements which have an id attribute.
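Again, a quick parsel sketch to try the attribute expressions against the links snippet above:

from parsel import Selector

html = """
<ul>
  <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
  <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
  <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
  <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
</ul>
"""

sel = Selector(text=html)
print(sel.xpath('//a[starts-with(@href, "https")]/text()').getall())
# ['Scrapy', 'Scrapinghub', 'Scrapinghub Blog']
print(sel.xpath('//li[@id]/@id').getall())  # ['begin', 'end']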

More on Axes

We’ve seen only two types of axes so far:

  • descendant-or-self
  • child

But there’s plenty more where they came from and we’ll see a few examples. Consider this HTML document:

<html>
  <body>
    <p>Intro paragraph</p>
    <h1>Title #1</h1>
    <p>A random paragraph #1</p>
    <h1>Title #2</h1>
    <p>A random paragraph #2</p>
    <p>Another one #2</p>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>

Now we want to extract only the first paragraph after each of the titles. To do that, we can use the following-sibling axis, which selects all the siblings after the context node. Siblings are nodes that share the same parent; for example, all child nodes of the body tag are siblings. This is the expression:

//h1/following-sibling::p[1]

In this example, the context node where the following-sibling axis is applied to is each of the h1 nodes from the page.

What if we want to select only the text that is right before the footer? We can use the preceding-sibling axis:

//div[@id='footer']/preceding-sibling::text()[1]

In this case, we are selecting the first text node before the div footer (“A single paragraph, with no markup”).

XPath also allows us to select elements based on their text content. We can use such a feature, along with the parent axis, to select the parent of the p element whose text is “Footer text”:

//p[ text()="Footer text" ]/..

The expression above selects <div id="footer"><p>Footer text</p></div>. As you may have noticed, we used .. here as a shortcut to the parent axis.

As an alternative to the expression above, we could use:

//*[p/text()="Footer text"]

It selects, from all elements, the ones that have a p child whose text is “Footer text”, getting the same result as the previous expression.
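And one last parsel sketch to evaluate the sibling and parent axes against the document above:

from parsel import Selector

html = """
<body>
  <p>Intro paragraph</p>
  <h1>Title #1</h1>
  <p>A random paragraph #1</p>
  <h1>Title #2</h1>
  <p>A random paragraph #2</p>
  <p>Another one #2</p>
  A single paragraph, with no markup
  <div id="footer"><p>Footer text</p></div>
</body>
"""

sel = Selector(text=html)
print(sel.xpath('//h1/following-sibling::p[1]/text()').getall())
# ['A random paragraph #1', 'A random paragraph #2']
print(sel.xpath("//div[@id='footer']/preceding-sibling::text()[1]").get())
# the bare text node right before the footer
print(sel.xpath('//p[text()="Footer text"]/..').get())
# '<div id="footer"><p>Footer text</p></div>'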

You can find additional axes in the XPath specification: https://www.w3.org/TR/xpath/#axes

Wrap up

XPath is very powerful and this post is just an introduction to the basic concepts. If you want to learn more about it, check out these resources:

And stay tuned, because we will post a series with more XPath tips from the trenches in the following months.

Why Promoting Open Data Increases Economic Opportunities

During the 2016 Collision Conference held in New Orleans, our Content Strategist Cecilia Haynes interviewed conference speaker Dr. Tyrone Grandison. At the time of the interview, he was the Deputy Chief Data Officer at the U.S. Department of Commerce. Tyrone is currently the Chief Information Officer for the Institute for Health Metrics and Evaluation.


Dr. Tyrone Grandison

With Tyrone coming fresh off his talk on “Data science, apps and civic responsibility”, Cecilia was thrilled to chat with him all about the democratization of data and how open data can help anyone build innovative products and services.

Issues with Data Ownership and Privacy

Cecilia: Thanks for meeting with me! I saw your talk and I thought you would be the perfect person to reach out to. Since you’re in government, you’re approaching data in a different way than the business or tech world. What is your take on open data?

Tyrone: Data within startups and companies is proprietary. I have this big issue with data ownership, data privacy, and data security, and with many companies feeling that because they collected and are stewards of data, they immediately have ownership rights.

For example, who does the data belong to if you are in a hospital and the hospital takes down your information for an evaluation? When a hospital generates data on your condition in the process of delivering care, you likely believe that that data is still yours. However, hospitals don’t assume that.

Cecilia: I actually didn’t know that. That’s troubling.

Tyrone: I mean it’s basically their proprietary intellectual property at that point where they now have the right to sell it based upon the terms and conditions that you actually agreed to.

It’s the same thing that happens when you use something like a Fitbit.

[Note: Cecilia was wearing a Fitbit…]

I looked at your hand and was just like, “That data is not yours.”

The US Government’s Approach to Data

“We want to reduce the barrier of entry for people working on and with data.”

Cecilia: What is the government’s approach to data?

Tyrone: So the government is more focused on the power of open data and how do we actually increase the accessibility and usability of it.

This includes exploring how to enable public-private data partnerships, and, in the process, help government be more data-driven in how it’s run. What I’ve observed is that the Department of Commerce, for example, has highly valuable data sets.

A quick example is NOAA, which is the National Oceanic and Atmospheric Administration. Commerce has twelve bureaus and NOAA is a bureau within Commerce. NOAA provides information for the weather industry globally.

It’s all free, but no one really knows this. It is technically all open, but it’s very difficult to find and it’s very difficult to actually understand.

And there are some companies that have leveraged this information by investing in understanding it and making it clean and accessible. That’s why you have weather.com, that’s why you have The Weather Channel, that’s all NOAA. Even worse, NOAA collects around 20-30 terabytes of data per day. They even have satellites monitoring the sun’s surface. They have sensors monitoring sounds underwater, you name it, they monitor it. 30 terabytes a day, but they only actually release 2 terabytes of that data and it’s only a fraction of those 2 terabytes that funds the world’s weather system.

Cecilia: Oh, so is that why weather predictions are unreliable?

Tyrone: No, no, that’s not the data’s fault. That is on the analytical models on top of the data.

If you had access to more data, and you had a better understanding of the nuances of collection like what you have to filter out and what overlaps, then you can actually get better models. The prediction models are actually better now than they were like three years ago and current three-day predictions are pretty spot on. If we go farther than that, then okay, not so reliable…

Using Open Data to Find Targeted Demographics

Cecilia: What other data sources can benefit companies?

Tyrone: The Census Bureau has this thing called the American Community Survey which basically documents the daily lives of all Americans. So, if you want to know anything, they have tens of thousands of features, which means tens of thousands of descriptors on the lives of Americans.

Every single study that you see that actually talks about how Americans are living, or whatever else, that’s all from the Census Bureau. These studies don’t recognize the Census, they don’t like give attribution back to the Census. But there is nowhere else the data can come from.

Say I wanted to get access to senior citizens over 65 who collected social security benefits and who used to commute 10 miles to their job. Almost any attribute you could actually think of, you could find this demographic right now with open data.

Cecilia: And it’s all completely available?

Tyrone: It’s all open. There is a project called the Commerce Data Usability Project that we’re doing at the department where we produce tutorials that show you:

Here’s a valid data set, here is a story as to why you should care about the data set, here is how to get it, here is how it’s processed, here is how to actually make some visualizations from it, here is how to actually analyze it. Go.

Tools to Support the Democratization of Data

Cecilia: The democratization of data is such a big deal to us as well. It’s why we open source our software and products, and why we made an open source visual scraper, so that anyone can engage with web data.

One of our goals is to enable data journalists, data scientists, everyday people to be able to use our tools to seek out the information they need.

Tyrone: Commerce is really dedicated to this goal as well. That’s why we have a startup now within Commerce called Commerce Data Service whose mission it is to support all the bureaus on their data initiatives.

We want to fundamentally and positively change the way citizens and businesses interact with the data products from Commerce. We recognize the problems are tied to marketing, access, and usability.

The Data Service commits to having everything in the open, everything transparent as much as possible. If you want to see everything we’re working on right now, it’s on github.com/commercedataservice.

Take a look at the Data Usability Project since we have a bunch of tutorials on everything from census data to data from NIST, the standards organization that has everything from internet security standards to time standards, you name it.

We also have satellite information. So there is a satellite that was launched, I think it was October 2011, called the Delta 2. It had on it this device called the Visible Infrared Imaging Radiometer Suite, VIIRS, which actually monitors all human activity as it goes around.

So a bunch of scientists have been looking at this VIIRS data set that no one knows exists and figured out that it’s a really good proxy for a lot of amazing stuff. For example, you could actually use satellite imagery to predict population very simply. You could even use it to predict the number of mental health related arrests in San Francisco. You could also use it to figure out economic activity in a particular place.

Machine Learning for Data Analysis

Cecilia: So do you incorporate machine learning into analyzing the data?

Tyrone: We’ve got the platform and we have examples that show you how to use machine learning with the data sets. If you want to use machine learning algorithms on a data set, you can find everything you need. If you want to use the data sets with something else in a really straightforward way to do straight mapping, for example, then you have it on our platform.

Cecilia: This is actually really helpful to me because we have partnerships with BigML, a company that specializes in predictive analytics, along with MonkeyLearn, a machine learning company that works on text analysis.

We’re always looking for new ways to highlight our collaboration, so we’ll have to check out VIIRS.

Using Open Data to Create Economic Opportunity

Cecilia: What is your role in the Department of Commerce?

Tyrone: I’m the Deputy Chief Data Officer. I’m one of three people that leads this Commerce Data Service and the office itself is the lead for the data pillar across Commerce.

The Secretary has a strategic plan that has five initiatives that everyone has to tie into, data is one of them and we’re responsible for making sure that data is successful.

Cecilia: Have you found it really challenging so far?

Tyrone: The support from the Secretary and the senior staff at Commerce has been amazing. The challenge has actually been that we are not in the private sector, since it is a little bit different delivering products in government than it is in private industry.

In the private industry, you’re focused on clicks, and buys, and elastic problems where it’s all about growing and shrinking some base. Whereas with government, it’s more of the hardest, most difficult problems that can be considered baseline needs, like “I’d like to have health care. I’d like not to be homeless.”

These are problems that, you know, no company will actually tackle because there is no profit motive, but these should be basic intrinsic rights for anyone who lives in the US. These are the problems that the government has to handle, and we have to produce amazing data products to make sure that we approach them in the right way.

Cecilia: So your goal is to create products that allow people to access data more easily?

Tyrone: Our approach is two-fold: One, we’re building the products to help people engage with the services that we’re offering. And two, we’re building a platform that’s an enabler. I hope that the platform is something that citizens can use to help solve local issues.

The Commerce Department’s mission is to create conditions for economic growth and opportunity. We want to empower citizens to take this data and build businesses and create more jobs.

That’s why we want to open as much data as possible and just encourage and engage with people so that they can build great things.

No Such Thing as Bad Data

Cecilia: So open data is a critical part of your strategy?

Tyrone: The more data you have, the more you can shed light on issues. However, you can’t let the data speak for itself because you have to recognize that there is bias in data. If you recognize the bias first, you can try and filter out for it, and if you can’t, chuck it and use a different data source.

It’s important to have a data source that is real, legitimate, and sound so you can find a signal and get meaningful information out of it. It’s helpful if you have a purpose, or a direction, or a question you’re asking. Then you can actually say, “I want to see spending trends. I want to see who’s spending X on Y.” And just do an analysis of this one feature over time.

Cecilia: How do you determine the difference between good data vs bad data?

Tyrone: There is no good or bad since data is a product of the collection process and the people that handle it. It’s more about the people that clean, process, and provide it.

Cecilia: So there is a lot of importance in having a reliable group who gathers the data?

Tyrone: There is a lot of value in having the people responsible for ETL (Extract, Transform, Load) who can create a data set that is a gold standard. They reduce biases as much as possible, and they minimize errors as much as possible.

It’s important that they’re honest with the upstream consumer about the problems with any data sets they provide. If you’re really honest about it, then somebody else can know what are the right techniques to use on the data. Otherwise people might just use it willy-nilly and not know that it shouldn’t be used for that purpose or in a particular way.

So the good and bad thing, there is no dichotomy, it’s all data and the interpretation of it.

Advice for Getting Started with Using Open Data

Cecilia: Do you have any advice to people who are looking to get into open data or data security in this industry?

Tyrone: I’d say just go in with a problem or a question, something that’s burning in your heart that you want to solve. And then figure out what data sets you can use.

You have a backpack of methods and technologies available and it all starts with the question or problem you’re fundamentally trying to solve. You need to understand the user, the problem, and the context in which you have to deliver something.

That determines what tools you need to actually use to solve that problem, not the other way around. Don’t approach this with, “I have a hammer, I’m going to smash everything with it.”

One of my favorite interviews of the entire conference, thank you again for meeting with me, Tyrone!

Learn more about how you can use web data in your business and take a look at how anyone can get started with open data, no coding required.

Interview: How Up Hail uses Scrapy to Increase Transparency

During the 2016 Collision Conference held in New Orleans, Scrapinghub Content Strategist Cecilia Haynes had the opportunity to interview the brains and the brawn behind Up Hail, the rideshare comparison app.



Avi Wilensky is the Founder of Up Hail

Avi sat down with Cecilia and shared how he and his team use Scrapy and web scraping to help users find the best rideshare and taxi deals in real time.

Fun fact, Up Hail was named one of Mashable’s 11 Most Useful Web Tools of 2014.

Meet Team Up Hail

CH: Thanks for meeting with me! Can you share a bit about your background, what your company is, and what you do?

AW: We are team Up Hail and we are a search engine for ground transportation like taxis and ride-hailing services. We are now starting to add public transportation like trains and buses, as well as bike shares. We crawl the web using Scrapy and other tools to gather data about who is giving the best rates for certain destinations.


Scrapy for the win

There’s a lot of data out there, especially public transportation data on different government or public websites. This data is unstructured and a mess and without APIs. Scrapy’s been very useful in gathering it.

CH: How has your rate of growth been so far?

AW: Approximately 100,000 new users a month search our site and app, which is nice and hopefully we will continue to grow. There’s a lot more competition now than when we started, and we’re working really hard to be the leader in this space.

Users come to our site to compare rates and to find the best deals on taxis and ground transportation. They are also interested in finding out if the different service providers are available in their cities. There are many places in the United States and across the world that don’t have these services, so we attract those who want to find out more information.

We also crawl and gather a lot of different product attributes such as economy vs. luxury, shared vs. private, how many people each of these options fit, whether they accept cash, and whether you can book in advance.

Giving users transparency on different car services and transportation options is our mission.

CH: By the way, where are you based?

AW: We’re based in midtown Manhattan in a place called A Space Apart. This is run by a very notable web designer and author named Jeffrey Zeldman who has been gracious enough to host us. He also runs A Book Apart, An Event Apart, and A List Apart, which are some of the most popular communities for web developers and designers.

Why the Team Members at Up Hail are Scrapy Fans

CH: You have really found some creative applications for Scrapy. I have to ask, why Scrapy? What do you appreciate about it?

AW: A lot of the sites that we’re crawling are a mess, especially the government transit ones and local taxi companies. As a framework, Scrapy has a lot of features built in right out of the box that are useful for us.

CH: Is there anything in particular that you’re like, “I’m obsessed with this aspect of Scrapy?”

AW: We’re a Python shop and Scrapy is the Python library for building web crawlers. That’s primarily why we use it. Of course, Scrapy has such a vibrant ecosystem of developers and it’s just easy to use. The documentation is great and it was super simple to get up and running. It just does the job.

We’re grateful that you make such a wonderful tool [Note: We are the original authors and lead maintainers of Scrapy] free and open source for startups like us. There are a lot of companies in your space that charge a lot of money, making their tools cost-prohibitive to use.

CH: That’s really great to hear! We’re all about open source, so keeping Scrapy forever free is a really important aspect of this approach.

On Being a Python Shop

CH: So tell me a bit more about why you’re a Python shop?

AW: Our application runs on the Python Flask framework and we’re using Python libraries to do a lot of the back-end work.

CH: Dare I ask why you’re using Python?

AW: One of the early developers on the project is a Xoogler, and Python is one of Google’s primary languages. He really inspired us to use Python, and we just love the language because of its philosophy of readability and brevity, and because it’s simple yet powerful enough to get the job done.

I think developer time is scarce and Python makes it faster to deploy, especially for a startup that needs to ship fast.

Introducing Scrapy Cloud and the Scrapinghub Platform

CH: May I ask whether you’ve used our Scrapy Cloud platform to deploy Scrapy crawlers?

AW: We haven’t tried it out yet. We just found out about Scrapy Cloud, actually.

CH: Really? Where did you hear about us?

AW: I listen to a Python podcast [Talk Python To Me] which was with Pablo, one of your co-founders. I didn’t know about how Scrapy originated from your co-founders. When I saw your name in the Collision Conference app, I was like, “Oh, I know these guys from the podcast! They’re maintainers of Scrapy.” Now that we know about Scrapy Cloud, we’ll give it a try.

We usually run Scrapy locally or we’ll deploy Scrapy on an EC2 instance on Amazon Web Services.

CH: Yeah, Scrapy Cloud is our forever free production environment that lets you build, deploy, and scale your Scrapy spiders. We’ve actually just included support for Docker. Definitely let me know what you think of Scrapy Cloud when you use it.

AW: Definitely, I’ll have to check it out.

Plans for Up Hail’s Expansion

CH: Where are you hoping to grow within the next five years?

AW: That’s a very good question. We’re hoping to, of course, expand to more regions. Right now, we’re in the United States, Canada, and Europe. There’s a lot of other countries that have a tremendous population that we’re not covering. We’d like to add a lot more transportation options into the mix. There’s all these new things like on-demand helicopters and we want to just show users going from point A to point B all their available options. We’re kind of like the Expedia of ground transportation.

Also, we’re adding a lot of interesting new things like a scoring system. We’re scoring how rideshare-friendly a city is. New York and San Francisco, of course, get 10s, but maybe over in New Jersey, where there are fewer options, some cities will get 6 or 7. It depends on how many options are available. Buffalo, New York, for example, doesn’t have Uber or Lyft and would probably get a 1 because it only has yellow taxis. This may be useful for users who are thinking of moving to a city and want to know how accessible taxis and rideshares are. We want to give users even more information about taxis and transportation options.

Increasing Transparency through Web Scraping

CH: It seems that increasing transparency is a large part of where you want to continue to grow.

AW: The transportation industry is not as transparent as it should be. We’ve heard stories at the Expo Hall [at Collision Conference] of taxi drivers ripping off tourists because they don’t know in advance what it’s going to cost. By us scraping these sites, taking the rate tables, and computing the estimates, they can just fire up our app and have a good idea of what it’s going to cost.

CH: Is your business model based on something like Expedia’s approach?

AW: Similar. We get a few dollars when we sign up new users to the various providers. We’re signing up a few thousand users a month. While it’s been really good so far, we need to grow it tremendously, so we’re looking at other business models. Advertising on the site has been good for us as well, but, of course, it’s limited. We don’t want to be too intrusive to our users by being overly aggressive with ads, so we’re trying to keep it clean there.

Opening Up Up Hail’s API

AW: Within the next few months we hope to launch a public API to outside developers and other sites. We’ve talked with a lot of other vendors here at the expo like travel concierge apps and the like that want to bring in our data.

CH: Oh, that’s great! Seems to be the makings for a lot of cross-platform collaboration.

AW: We’ve gathered a lot of great data, thanks to Scrapy and other crawling tools, and we hope to make it available for others to use.

In fact, I specifically reached out to you to tell you how awesome Scrapy was.

CH: Well I’m thrilled you did! And I’m so glad we also got to talk about Python and how you use it in your stack.

AW: Definitely. We are heavily using Python to get the job done. We think it’s the right tool for the job and for what we’re doing.


Team Up Hail at Collision Conference 2016

Interest piqued? Learn more about what web scraping and web data can do for you.

How to Run Python Scripts in Scrapy Cloud

How to Run Python Scripts in Scrapy Cloud

You can deploy, run, and maintain control over your Scrapy spiders in Scrapy Cloud, our production environment. Keeping control means you need to be able to know what’s going on with your spiders and to find out early if they are in trouble.

No one wants to lose control of a swarm of spiders. No one. (Reuters/Daniel Munoz)

This is one of the reasons why being able to run any Python script in Scrapy Cloud is a nice feature. You can customize to your heart’s content and automate any crawling-related tasks that you may need (like monitoring your spiders’ execution). Plus, there’s no need to scatter your workflow since you can run any Python code in the same place that you run your spiders and store your data.

While it’s just the tip of the iceberg, I’ll demonstrate how to use a custom Python script to notify you about jobs with errors. If this tutorial sparks some creative applications, let me know in the comments below.

Setting up the Project

We’ll start off with a regular Scrapy project that includes a Python script for building a summary of the jobs with errors that have finished in the last 24 hours. The report is emailed to you using Amazon Simple Email Service (SES).

You can check out the sample project code here.

Note: To use this script you will have to modify the settings at the beginning with your own AWS keys so that the email function works.

In addition to the traditional Scrapy project structure, it also contains a check_jobs.py script in the bin folder. This is the script responsible for building and sending the report.

.
├── bin
│   └── check_jobs.py
├── sc_scripts_demo
│   ├── __init__.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── bad_spider.py
│       └── good_spider.py
├── scrapy.cfg
└── setup.py
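
To give you an idea of what such a script might look like, here is a minimal sketch in the spirit of check_jobs.py (the sample project linked above is the reference implementation). It uses the python-scrapinghub client to list recently finished jobs and Amazon SES via boto3 to send the summary. The sender address, SES region, and the job metadata fields used for filtering ('errors' and 'finished_time') are assumptions that you may need to adapt:

#!/usr/bin/env python
"""A rough sketch of a check_jobs.py-style script: email a summary of
finished jobs with errors from the last 24 hours."""
import argparse
import time

import boto3
from scrapinghub import ScrapinghubClient

# Settings you would replace with your own values (see the note above).
AWS_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
AWS_SECRET_KEY = 'YOUR_AWS_SECRET_KEY'
AWS_REGION = 'us-east-1'
FROM_ADDRESS = 'reports@example.com'  # must be a verified SES sender

def jobs_with_errors(api_key, project_id, since_secs=24 * 3600):
    """Yield metadata of finished jobs with errors from the last `since_secs`."""
    project = ScrapinghubClient(api_key).get_project(project_id)
    cutoff_ms = (time.time() - since_secs) * 1000
    for job in project.jobs.iter(state='finished'):
        # 'errors' and 'finished_time' are assumed metadata field names;
        # check them against the job metadata of your own project.
        if job.get('errors') and job.get('finished_time', 0) >= cutoff_ms:
            yield job

def build_report(jobs):
    lines = ['{key} ({spider}): {errors} error(s)'.format(**job) for job in jobs]
    return '\n'.join(lines) or 'No jobs with errors in the last 24 hours.'

def send_report(to_address, body):
    ses = boto3.client('ses', region_name=AWS_REGION,
                       aws_access_key_id=AWS_ACCESS_KEY,
                       aws_secret_access_key=AWS_SECRET_KEY)
    ses.send_email(Source=FROM_ADDRESS,
                   Destination={'ToAddresses': [to_address]},
                   Message={'Subject': {'Data': 'Scrapy Cloud jobs with errors'},
                            'Body': {'Text': {'Data': body}}})

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Email a report of jobs with errors')
    parser.add_argument('api_key', help='your Scrapinghub API key')
    parser.add_argument('email', help='address that receives the report')
    parser.add_argument('project_id', type=int, help='numeric project ID')
    args = parser.parse_args()

    report = build_report(jobs_with_errors(args.api_key, args.project_id))
    send_report(args.email, report)

You can try a script like this locally first (python bin/check_jobs.py <APIKEY> <EMAIL> <PROJECT_ID>) before deploying it.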

The deploy will be done via shub, just like your regular projects.

But first, you have to make sure that your project’s setup.py file lists the script that you want to run (see the scripts entry in the snippet below):

from setuptools import setup, find_packages

setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    scripts      = ['bin/check_jobs.py'],  # the script(s) you want to run on Scrapy Cloud
    entry_points = {'scrapy': ['settings = sc_scripts_demo.settings']},
)

Note: If there’s no setup.py file in your project root yet, you can run shub deploy and the deploy process will generate it for you.

Once you have included the scripts parameter in the setup.py file, you can deploy your spider to Scrapy Cloud with this command:

$ shub deploy

Running the Script on Scrapy Cloud

Running a Python script is very much like running a Scrapy spider in Scrapy Cloud. All you need to do is set the job type as “Scripts” and then select the script you want to execute.

The check_jobs.py script expects three arguments: your Scrapinghub API key, an email address to send the report to, and the project ID (the numeric value in the project URL).


Scheduling Periodic Execution on Scrapy Cloud

Since this script is meant to be executed once a day, you need to schedule it under Periodic Jobs.


Select the script to run, configure when you want it to run, and specify any arguments that may be necessary.


After scheduling the periodic job (I’ve set it up to run once a day at 7 AM UTC), it will appear in your project’s list of Periodic Jobs.


Note: You can run the script immediately by clicking the play button as well.

And you’re done! The script will run every day at 7 AM UTC and send a report of jobs with errors (if any) right into your email inbox.


Helpful Tips

Heads up, here’s what else you should know about Python scripts in Scrapy Cloud:

  • The output of print statements shows up in the log with level INFO, prefixed with [stdout]. It’s generally better to use the standard Python logging module to log messages with proper levels (e.g. to report errors or warnings), as sketched after this list.
  • After about an hour of inactivity, jobs are killed. If you plan to leave a script running for hours, make sure that it logs something in the output every few minutes to avoid this grisly fate.
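
As a rough illustration of both tips, here’s a small self-contained sketch (not part of the sample project) that logs through the standard logging module and emits a periodic heartbeat; the batch loop is just a stand-in for whatever work your own script does:

import logging
import time

# Messages sent through the standard logging module get proper levels in the
# Scrapy Cloud log, while bare print() output only shows up as [stdout] INFO.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('my_script')

def long_running_task(batches=5):
    for batch in range(batches):
        time.sleep(60)  # stand-in for a real chunk of work
        # A periodic heartbeat keeps the job from looking inactive, so it
        # won't be killed after about an hour of silence.
        logger.info('heartbeat: finished batch %d of %d', batch + 1, batches)

if __name__ == '__main__':
    logger.info('script started')
    long_running_task()
    logger.warning('use WARNING/ERROR levels for things that need attention')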

Wrap Up

While this specific example demonstrated how to automate the reporting of jobs with errors, keep in mind that you can use any Python script with Scrapy Cloud. This is helpful for customizing your crawls, monitoring jobs, and also handling post-processing tasks.
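
For example, a post-processing script could read a finished job’s items straight from the platform. The sketch below is just an illustration (the API key and job key are placeholders); it uses the python-scrapinghub client to count how often each field appears in a job’s scraped items:

from collections import Counter

from scrapinghub import ScrapinghubClient

API_KEY = 'YOUR_API_KEY'   # placeholder
JOB_KEY = '12345/1/7'      # placeholder job key: <project>/<spider>/<job>

def field_coverage(job_key, api_key=API_KEY):
    """Count how often each field appears in a job's scraped items."""
    job = ScrapinghubClient(api_key).get_job(job_key)
    total, counter = 0, Counter()
    for item in job.items.iter():
        total += 1
        counter.update(item.keys())
    return total, counter

if __name__ == '__main__':
    total, counter = field_coverage(JOB_KEY)
    for field, count in counter.most_common():
        print('{}: present in {}/{} items'.format(field, count, total))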

Read more about this and other features in the Scrapy Cloud online documentation.

Scrapy Cloud is forever free, so no need to worry about a bait-and-switch. Try it out and let me know what Python scripts you’re using in the comments below.

Sign up for free

Embracing the Future of Work: How To Communicate Remotely

Embracing the Future of Work: How To Communicate Remotely

What does “the Future of Work” mean to you? To us, it describes how we approach life at Scrapinghub. We don’t work in a traditional office (we’re 100% distributed) and we allow folks the freedom to make their own schedules (you know when you work best). By finding ways to break away from the traditional 9-to-5 mode, we ended up creating a framework for the Future of Work.

Maybe you’ve heard of this term and want to learn more, or maybe you’re thinking about implementing aspects of it at your own company. Either way, we can’t stress enough that effective communication is a key part.

The Future of Work: a Definition

According to Jacob Morgan (who literally wrote the book, The Future of Work), this broad term can be broken into three parts: Freedom/Flexibility, Autonomy, and Choice/Customization. Not mentioned in his list is the rise of AI, although we might as well prepare for our inevitable robot overlords.

The Future of Work both describes how the employment landscape will look in the future and how humans will need to adapt to technological leaps and changing expectations of employment. Along these lines, we are remote (always have been, always will be) and we are into machine learning. We’re also open source to the core, and if you think wrangling regular developers is a challenge, wait until you meet scarily intelligent super brains who know how to maneuver their pet projects into benefiting their company.


We’re living, breathing, and occasionally chaotic proof that the Future of Work is already in the present.

Remote Life Culture

Our two co-founders, Shane from Ireland and Pablo from Uruguay, established a company based on talent, not geography. And that’s not even a marketing line.

Full disclosure, I’m a millennial who was drawn to Scrapinghub because of its remote stance. I wanted the flexibility to be a digital nomad while not needing to rely on the uncertainty of freelance work. This opportunity was the best of both worlds and I’ve since come to understand how parents can also benefit from remote life (and introverts, and folks who like to work at odd hours).

Flexibility

Remote work is not for everyone. While obviously there are co-working spaces and coffee shops, you need to be comfortable with establishing your own schedule along with having the discipline to work by your lonesome.

On the flip side, you have the flexibility and freedom to sort out your responsibilities so that they fit within your life. And this is a pretty important point for companies looking to transition into the remote space. You need to understand how time zones impact the way that teams operate and how to trust your team members to get their work done on time.

Holidays

When hiring team members from a variety of countries and backgrounds, it’s important to keep local holidays in mind. Adopting an open holiday policy both respects the diversity of your company (we’re based in 48 countries) and also recognizes the importance of having time off.

Autonomy

Cultural fit is especially important in a remote team. Autonomy and trust are a huge part of how we operate because there is no one looking over your shoulder (quite literally). Finding motivated people who can finish their work and stay on top of their responsibilities without needing oversight is crucial to running a successful remote operation.

Remote Work

We make sure that our teammates feel comfortable voicing concerns and sharing ideas for improving the company by promoting an open Slack policy. No matter our level or seniority, we remain accessible to all members of Scrapinghub. This policy facilitates collaboration and helps create a sense of community.

Managing Miscommunication

Tone is incredibly difficult to convey through writing. Think of every misunderstanding that you’ve ever had through written communication (text messages, tweets, comments on Facebook, etc.), and then add in work-related stress. This is not a great situation unless you develop straightforward channels of communication.

Our Workflow

This is our system, so feel free to steal the workflow. Honestly, different methods work for different teams. As long as you have one unifying communication stream, go with whatever feels right:

Communication

Slack: Used for day-to-day communication; Slack is mainly how we stay in touch. The Snooze control is great, as is the Tomatobot Pomodoro timer. We’re also using Leo the Slack bot to gather upward feedback and get a handle on the pulse of our colleagues.

Google Hangouts: This is for team meetings and it can also be used for impromptu watercooler moments. In Growth specifically, we plan our sprints and have eclectic meetings about non-work related topics like aphantasia.

GoToMeeting: We use this platform for our larger gatherings so that the majority of the company can join in. We have Town Halls where our founders share company-wide information in order to increase transparency. GoToMeeting is also used for our Friday lightning talks (affectionately known as Shub talks) where team members present on interesting topics ranging from recipes to machine learning techniques.

Email: Optimal for handling outside communications and for sharing external chats with the rest of our teams.

Intercom: Intercom is how we stay in touch with users and customers.

Sprint Planning and Assigning Tasks

Redmine: Some teams use this as an issue tracker and as a way to assign necessary tasks.

Jira: Used by other teams for support ticket management and sprint planning.

Trello: Used by our Growth team for sprint planning and to keep track of daily activities. I use Trello for my editorial calendar since it’s easy for me to organize categories, assign writers, and set due dates.

Work

GitHub: We’re open source folks and so GitHub is a huge part of how we work. Even vacations are managed through pull requests.

BaseCRM: The sales team uses this for lead management.

Lever.co: Used by the HR team for managing job applications.

Updates and Improvements

Scrapinghub Enhancement Proposals (SHEP): SHEPs (a play on Python Enhancement Proposals) are plans created by employees with suggestions for how Scrapinghub can be improved. SHEPs include everything from HR perks to business planning. SHEPs can be created and submitted by anyone in the company.

Confluence: We recently adopted Confluence as a knowledge base in addition to Google Drive. We want to reduce the silo mentality and Confluence has been especially beneficial in accomplishing this goal. Team updates, meeting notes, and ongoing projects are easily reviewed and shared within the company using this program.

Newsletter: Our weekly newsletter (sent through MailChimp) shares information on more personal topics like vacations, new hires, and birthdays. The newsletters help us to keep up-to-date on the quirky and HR-related aspects of Scrapinghub. We highlight exemplary employees, team bios, and conference activities as a way to keep everyone in the loop and to remain connected.

Wrap Up

We’re on the front lines of exploring the Future of Work. Between our technological advances (and we welcome open source collaborators) and our remote organization, we’re experimenting with how best to move forward with maximum flexibility while not turning into robots. We stress effective and clear communication because no one wants to play a game of telephone across international waters.


What are your thoughts on the Future of Work and on remote companies? What applications do you use that we don’t? What workflows are you implementing that you would like to share? Please let us know in the comments or reach out on Twitter.