Browsed by
Category: web-crawling

How to use a proxy in Puppeteer

Puppeteer is a high-level API for headless chrome. It’s one of the most popular tools to use for web automation or web scraping in Node.js. In web scraping, many developers use it to handle javascript rendering and web data extraction. In this article, we are going to cover how to set up a proxy in Puppeteer and what your options are if you want to rotate proxies.

...

Building Blocks of an Unstoppable Web Scraping Infrastructure

More and more businesses leverage the power of web scraping. Extracting data from the web is becoming popular. But it doesn't mean that the technical challenges are gone. Building a sustainable web scraping infrastructure takes expertise and experience. Here, at Scrapinghub we scrape 9 billion pages per month. In this article, we are going to summarize what the essential elements of web...

GDPR Update: Scraping Public Personal Data

One common misconception about scraping personal data is that public personal data does not fall under the GDPR. Many businesses assume that because the data has already been made public on another website that it is fair game to scrape. In actuality, GDPR makes no blanket exceptions for public personal data and the same analysis for any other personal data must be conducted prior to scraping...

Visual Web Scraping Tools: What to Do When They Are No Longer Fit For Purpose?

Visual web scraping tools are great. They allow people with little to no technical know-how to extract data from websites with only a couple hours of upskilling, making them great for simple lead generation, market intelligence and competitor monitoring projects. Removing countless hours of manual entry work for sales and marketing teams, researchers, and business intelligence team in the...

A Sneak Peek Inside Crawlera: The World’s Smartest Web Scraping Proxy Network

“How does Scrapinghub Crawlera work?” is the most common question we get asked from customers who after struggling for months (or years) with constant proxy issues, only to have them disappear completely when they switch to Crawlera. 

Today we’re going to give you a behind the scenes look at Crawlera so you can see for yourself why it is the world’s smartest web scraping proxy network and the...

Why We Created Crawlera? The World’s Smartest Web Scraping Proxy Network

Let’s face it, managing your proxy pool can be an absolute pain and the biggest bottleneck to the reliability of your web scraping! 

Nothing annoys developers more than crawlers failing because their proxies are continuously being banned.

The Rise of Web Data in Hedge Fund Decision Making & The Importance of Data Quality

Over the past few years, there has been an explosion in the use of alternative data sources in investment decision making in hedge funds, investment banks and private equity firms.

These new data sources, collectively known as “alternative data”, have the potential to give firms a crucial informational edge in the market, enabling them to generate alpha.

Scraping the Steam Game Store with Scrapy

This is a guest post from the folks over at Intoli, one of the awesome companies providing Scrapy commercial support and longtime Scrapy fans.

Improved Frontera: Web Crawling at Scale with Python 3 Support

Python is our go-to language of choice and Python 2 is losing traction. In order to survive, older programs need to be Python 3 compatible.

How to Crawl the Web Politely with Scrapy

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners.