Browsed by
Category: data-extraction

Extracting clean article HTML with News API

The Internet offers a vast amount of written content in the form of articles, news, blog posts, stories, essays, tutorials that can be leveraged by many useful applications:

Job Postings Beta API: Extract Job Postings at Scale

We’re excited to announce our newest data extraction API, Job Postings API. From now on, you can use AutoExtract to extract Job Postings data from many job boards and recruitment sites. Without writing any custom data extraction code!

News Web Data Extraction to Predict Irish Election Results

Can pre-election news coverage of political parties predict the trend of the elections?

On February 9th, 2020, Ireland elected a new parliament. Prior to the elections, the political parties invested a lot of time, money and energy to get their political message to the people. A lot of research goes into selecting the right platform and the right medium.

Looking Back at 2019

2019 was an exciting year for Scrapinghub. We created things we have never created before and did things nobody in our industry had ever done before. Let’s revisit what happened in 2019!

How to use a proxy in Puppeteer

Puppeteer is a high-level API for headless chrome. It’s one of the most popular tools to use for web automation or web scraping in Node.js. In web scraping, many developers use it to handle javascript rendering and web data extraction. In this article, we are going to cover how to set up a proxy in Puppeteer and what your options are if you want to rotate proxies.

...

Building Blocks of an Unstoppable Web Scraping Infrastructure

More and more businesses leverage the power of web scraping. Extracting data from the web is becoming popular. But it doesn't mean that the technical challenges are gone. Building a sustainable web scraping infrastructure takes expertise and experience. Here, at Scrapinghub we scrape 9 billion pages per month. In this article, we are going to summarize what the essential elements of web...

The Web Data Extraction Summit 2019

The Web Data Extraction Summit was held last week, on 17th September, in Dublin, Ireland. This was the first-ever event dedicated to web scraping and data extraction. We had over 140 curious attendees, 16 great speakers from technical deep dives to business use cases, 12 amazing presentations, a customer panel discussion and unlimited Guinness.

News Data Extraction at Scale with AI powered AutoExtract

A huge portion of the internet is news. It’s a very important type of content because there are always things happening either in our local area or globally that we want to know about. The amount of news published everyday on different sites is ridiculous. Sometimes it’s good news and sometimes it’s bad news but one thing’s for sure: it’s humanly impossible to read all of it everyday.

Gain a Competitive Edge with Product Data

Product data - whether from e-commerce sites, auto listings or product reviews, offers a treasure trove of insights that can give your business an immense competitive edge in your market. Getting access to this data in a structured format can unleash new potential for not only business intelligence teams, but also their counterparts in marketing, sales, and management that rely on accurate...

Meet Spidermon: Scrapinghub’s Battle Tested Spider Monitoring Library [Now Open Sourced]

Your spider is developed and we are getting our structured data daily, so our job is done, right?

Absolutely not! Website changes (sometimes very subtly), anti-bot countermeasures and temporary problems often reduce the quality and reliability of our data.