Embracing the Future of Work: How To Communicate Remotely

What does “the Future of Work” mean to you? To us, it describes how we approach life at Scrapinghub. We don’t work in a traditional office (we’re 100% distributed) and we give folks the freedom to set their own schedules (you know when you work best). By finding ways to break away from the traditional 9-to-5 mold, we ended up creating a framework for the Future of Work. Maybe you’ve heard of this term and want to learn more or maybe you’re…

Read More

How to Deploy Custom Docker Images for Your Web Crawlers

What if you could have complete control over your environment? Your crawling environment, that is… One of the many benefits of our upgraded production environment, Scrapy Cloud 2.0, is that you can customize your crawler runtime environment via Docker images. It’s like a superpower that allows you to use specific versions of Python, Scrapy and the rest of your stack, deciding if and when to upgrade. With this new feature, you can tailor a Docker image to include any dependency…
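
To make that concrete, here’s a minimal sketch of what such a Dockerfile might look like, assuming one of the public scrapinghub-stack-scrapy base images; the base image tag and the extra dependencies below are illustrative, not a recipe:

    # Start from a Scrapy Cloud compatible stack (illustrative tag)
    FROM scrapinghub/scrapinghub-stack-scrapy:1.1-py3

    # Add a system-level dependency your spiders might need
    RUN apt-get update && apt-get install -y libpq-dev

    # Layer your project's Python requirements on top of the stack
    COPY requirements.txt /app/requirements.txt
    RUN pip install -r /app/requirements.txt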

Read More

Improved Frontera: Web Crawling at Scale with Python 3 Support

Python is our go-to language, and Python 2 is losing traction. To survive, older programs need to become Python 3 compatible. And so we’re pleased to announce that Frontera will remain alive and kicking, because it now supports Python 3 in full! With Frontera joining the ranks of Scrapy and Scrapy Cloud, you can officially continue to quickly create and scale fully formed crawlers without any issues in your Python 3-ready stack. As a key web crawling toolbox…
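
If you haven’t tried Frontera with Scrapy yet, wiring it in is mostly a matter of settings. Here’s a rough sketch based on the Frontera docs; component paths can vary between versions, and the FRONTERA_SETTINGS module name is a placeholder:

    # settings.py -- hand the crawl frontier over to Frontera
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
    }

    # Frontera-specific options live in their own module (placeholder name)
    FRONTERA_SETTINGS = 'myproject.frontera_settings'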

Read More

How to Crawl the Web Politely with Scrapy

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners. In this post we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers. Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxs,…
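
As a taste of what “polite” looks like in practice, here are a few standard Scrapy settings that slow a crawler down and identify it honestly; the values are conservative examples to tune per site, and the contact URL is a placeholder:

    # settings.py -- a politeness starter kit
    USER_AGENT = 'mycompany-crawler (+http://www.example.com/contact)'  # placeholder contact URL

    # Respect the site's robots.txt rules
    ROBOTSTXT_OBEY = True

    # Wait between requests and cap per-domain concurrency
    DOWNLOAD_DELAY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 2

    # Let AutoThrottle adapt the crawl rate to server load
    AUTOTHROTTLE_ENABLED = True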

Read More

Introducing Scrapy Cloud with Python 3 Support

It’s the end of an era. Python 2 is on its way out with only a few security and bug fixes forthcoming from now until its official retirement in 2020. Given this withdrawal of support and the fact that Python 3 has snazzier features, we are thrilled to announce that Scrapy Cloud now officially supports Python 3. If you are new to Scrapinghub, Scrapy Cloud is our production platform that allows you to deploy, monitor, and scale your web scraping…

Read More

What the Suicide Squad Tells Us About Web Data

Web data is a bit like the Matrix. It’s all around us, but not everyone knows how to use it meaningfully. So here’s a brief overview of the many ways that web data can benefit you as a researcher, marketer, entrepreneur, or even multinational business owner. Since web scraping and web data extraction are sometimes viewed a bit like antiheroes, I’m introducing each of the use cases through characters from the Suicide Squad film. I did my best to pair…

Read More

This Month in Open Source at Scrapinghub August 2016

Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects, including Scrapy, Splash, Portia, and Frontera. If you’re interested in learning more or even becoming a contributor, reach out to us by emailing opensource@scrapinghub.com or on Twitter @scrapinghub. Scrapy: This past May, Scrapy 1.1 (with Python 3 support) was a big milestone for our Python web scraping community. And 2 weeks ago, Scrapy reached 15k stars…

Read More

Meet Parsel: the Selector Library behind Scrapy

We eat our own spider food, since Scrapy is our go-to workhorse on a daily basis. However, there are certain situations where Scrapy can be overkill, and that’s when we use Parsel. Parsel is a Python library for extracting data from XML/HTML text using CSS or XPath selectors. It powers the scraping API of the Scrapy framework. Not to be confused with Parseltongue/Parselmouth! We extracted Parsel from Scrapy during EuroPython 2015 as part of porting Scrapy to Python 3…
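
For a quick taste, here’s a tiny, self-contained example of Parsel at work; the HTML snippet is made up for illustration:

    from parsel import Selector

    html = '<html><body><h1>Title</h1><p class="lead">Hello</p></body></html>'
    sel = Selector(text=html)

    # CSS and XPath selectors work interchangeably on the same document
    print(sel.css('h1::text').extract_first())                      # Title
    print(sel.xpath('//p[@class="lead"]/text()').extract_first())   # Hello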

Read More

Incremental Crawls with Scrapy and DeltaFetch

Welcome to Scrapy Tips from the Pros! In this monthly column, we share a few tricks and hacks to help speed up your web scraping activities. As the lead Scrapy maintainers, we’ve run into every obstacle you can imagine, so don’t worry, you’re in great hands. Feel free to reach out to us on Twitter or Facebook with any suggestions for future topics. Scrapy is designed to be extensible, with loosely coupled components. You can easily extend Scrapy’s functionality…
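
For instance, enabling DeltaFetch is a two-line change in your project settings, assuming the scrapy-deltafetch package is installed; the middleware path and setting name below follow its README:

    # settings.py -- skip pages whose items were already scraped
    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }
    DELTAFETCH_ENABLED = True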

Read More

Improving Access to Peruvian Congress Bills with Scrapy

Many governments worldwide have laws requiring them to publish their expenses, contracts, decisions, and so forth, on the web. This is so the general public can monitor what their representatives are doing on their behalf. However, government data is usually only available in a hard-to-digest format. In this post, we’ll show how you can use web scraping to overcome this and make government data more actionable. Congress Bills in Peru: For the sake of transparency, the Peruvian Congress provides a website…
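
To give a flavor of the approach, here’s a bare-bones spider sketch for tabular government data; the URL and selectors are hypothetical placeholders, not the actual Congress site’s markup:

    import scrapy


    class BillsSpider(scrapy.Spider):
        name = 'bills'
        start_urls = ['http://www.example.gob.pe/bills']  # hypothetical URL

        def parse(self, response):
            # Assume each table row holds one bill
            for row in response.css('table tr'):
                yield {
                    'number': row.css('td:nth-child(1)::text').extract_first(),
                    'title': row.css('td:nth-child(2)::text').extract_first(),
                }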

Read More