Improved Frontera: Web Crawling at Scale with Python 3 Support

Python is our go-to language, and Python 2 is losing traction. To survive, older programs need to be Python 3 compatible. So we’re pleased to announce that Frontera will remain alive and kicking: it now fully supports Python 3! Joining the ranks of Scrapy and Scrapy Cloud, you can officially continue to quickly create and scale fully formed crawlers without any issues in your Python 3-ready stack. As a key web crawling toolbox…

Read More

How to Crawl the Web Politely with Scrapy

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of website owners. In this post we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers. Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxes,…
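As a sketch of what “polite” can mean in practice, here is a minimal, illustrative Scrapy settings.py (the user-agent string and delay values are assumptions to tune for your target site) that identifies the bot, respects robots.txt, and throttles request rates:

```python
# settings.py -- a polite baseline for a Scrapy project (illustrative values)

# Identify your bot honestly so site owners can contact you.
USER_AGENT = "examplebot (+http://www.example.com/bot-info)"

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Slow down: pause between requests and cap per-domain concurrency.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let AutoThrottle adapt the delay to the server's observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```

These are standard Scrapy settings; the AutoThrottle extension in particular backs off automatically when the server slows down, which is usually kinder than any fixed delay.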

Read More

Introducing Scrapy Cloud with Python 3 Support

It’s the end of an era. Python 2 is on its way out with only a few security and bug fixes forthcoming from now until its official retirement in 2020. Given this withdrawal of support and the fact that Python 3 has snazzier features, we are thrilled to announce that Scrapy Cloud now officially supports Python 3. If you are new to Scrapinghub, Scrapy Cloud is our production platform that allows you to deploy, monitor, and scale your web scraping…

Read More

What the Suicide Squad Tells Us About Web Data

Web data is a bit like the Matrix. It’s all around us, but not everyone knows how to use it meaningfully. So here’s a brief overview of the many ways that web data can benefit you as a researcher, marketer, entrepreneur, or even multinational business owner. Since web scraping and web data extraction are sometimes viewed a bit like antiheroes, I’m introducing each of the use cases through characters from the Suicide Squad film. I did my best to pair…

Read More

This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera. If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub. Scrapy This past May, Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. And 2 weeks ago, Scrapy reached 15k stars…

Read More

Meet Parsel: the Selector Library behind Scrapy

We eat our own spider food, since Scrapy is our daily workhorse. However, there are situations where Scrapy is overkill, and that’s when we use Parsel. Parsel is a Python library for extracting data from XML/HTML text using CSS or XPath selectors, and it powers the scraping API of the Scrapy framework. Not to be confused with Parseltongue/Parselmouth. We extracted Parsel from Scrapy during EuroPython 2015 as part of porting Scrapy to Python 3….

Read More

Incremental Crawls with Scrapy and DeltaFetch

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine, so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics. Scrapy is designed to be extensible, with loosely coupled components. You can easily extend Scrapy’s functionality…
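As a sketch of that extensibility, wiring the scrapy-deltafetch spider middleware into a project is a small settings change (middleware path and priority as documented by the plugin; treat the priority value as a reasonable default, not gospel). Once enabled, requests for pages that already produced items in a previous run are skipped, giving you incremental crawls:

```python
# settings.py -- skip requests for pages whose items were already scraped
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
```

The middleware keeps a small on-disk database of fingerprints of requests that yielded items, which is what makes re-runs cheap.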

Read More

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth on the web, so that the general public can monitor what their representatives are doing on their behalf. However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable. Congress Bills in Peru: for the sake of transparency, the Peruvian Congress provides a website…

Read More

Scrapely: The Brains Behind Portia Spiders

Unlike Portia labiata, the hunting spider that feeds on other spiders, our Portia feeds on data. The spider is considered the Einstein of the arachnid world, and we modeled our own creation after the intelligence and visual abilities of its namesake. Portia is our visual web scraping tool, pushing the boundaries of automated data extraction. Portia is completely open source, and we welcome all contributors interested in collaborating. Now is the perfect time, since we’re on the beta launch of Portia…

Read More

Introducing Portia2Code: Portia Projects into Scrapy Spiders

We’re thrilled to announce the release of our latest tool, Portia2Code! With it, you can convert your Portia 2.0 projects into Scrapy spiders. This means you can use Portia’s friendly UI to quickly prototype your spiders and then add your own functionality, giving you much more control and flexibility. A perfect example of where this new feature comes in handy is when you need to interact with the web page. You can convert your Portia project to Scrapy, and then…

Read More