Imagine a long crawling process, like extracting data from a website for a whole month. We can start it and leave it running until we get the results. Though, we can agree that a whole month is plenty of time for something to go wrong. The target website can go down for a few minutes/hours, there can be some sort of power outage in your crawling server or even some other internet connection...
The web is complex and constantly changing. It is one of the reasons why web data extraction can be difficult, especially in the long term. It’s necessary to understand how a website works really well, before you try to extract data. Luckily, there are lots of inspection and code tools available for this and in this article we will show you some of our favorites.
The Internet offers a vast amount of written content in the form of articles, news, blog posts, stories, essays, tutorials that can be leveraged by many useful applications:
Can pre-election news coverage of political parties predict the trend of the elections?
On February 9th, 2020, Ireland elected a new parliament. Prior to the elections, the political parties invested a lot of time, money and energy to get their political message to the people. A lot of research goes into selecting the right platform and the right medium.
2019 was an exciting year for Scrapinghub. We created things we have never created before and did things nobody in our industry had ever done before. Let’s revisit what happened in 2019!
More and more businesses leverage the power of web scraping. Extracting data from the web is becoming popular. But it doesn't mean that the technical challenges are gone. Building a sustainable web scraping infrastructure takes expertise and experience. Here, at Scrapinghub we scrape 9 billion pages per month. In this article, we are going to summarize what the essential elements of web...
The Web Data Extraction Summit was held last week, on 17th September, in Dublin, Ireland. This was the first-ever event dedicated to web scraping and data extraction. We had over 140 curious attendees, 16 great speakers from technical deep dives to business use cases, 12 amazing presentations, a customer panel discussion and unlimited Guinness.
A huge portion of the internet is news. It’s a very important type of content because there are always things happening either in our local area or globally that we want to know about. The amount of news published everyday on different sites is ridiculous. Sometimes it’s good news and sometimes it’s bad news but one thing’s for sure: it’s humanly impossible to read all of it everyday.