Author: Richard Dowinton

Splash 2.0 Is Here with Qt 5 and Python 3

We’re pleased to announce that Splash 2.0 is officially live after many months of hard work. For those unfamiliar with Splash, it’s a headless browser we developed specifically for web crawling. Splash executes and renders JavaScript so you can deal with dynamic content. It also supports scripting so you can perform actions on the page. Splash is open source and fully integrated with Scrapy and Portia. You can also use its API to integrate with any project that needs to…

Read More

Python 3 is Coming to Scrapy

Scrapy with beta Python 3 support is finally here! Released through Scrapy 1.1.0rc1, this is the result of several months of hard work on the part of the Scrapy community and Scrapinghub engineers. This is a huge milestone for all you Scrapy users (and those who haven’t used Scrapy due to the lack of Python 3 support). Scrapy veterans and new adopters will be able to move their entire stack to Python 3 once the release becomes stable. Keep in…

Read More

Parse Natural Language Dates with Dateparser

We recently released Dateparser 0.3.1 with support for Belarusian and Indonesian, as well as the Jalali calendar used in Iran and Afghanistan. With this in mind, we’re taking the opportunity to introduce and demonstrate the features of Dateparser. Dateparser is an open source library we created to parse dates written using natural language into Python. It translates specific dates like ‘5:47pm 29th of December, 2015’ or ‘9:22am on 15/05/2015’, and even relative times like ‘10 minutes ago’, into Python datetime…
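To make the idea of relative times concrete, here is a minimal sketch using only the Python standard library. It is not Dateparser's API or implementation: the function name `parse_relative`, the supported units, and the `now` parameter are all assumptions made for illustration, and it handles only the simple "N units ago" pattern.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helper for illustration only; this is NOT Dateparser's API
# and says nothing about how Dateparser works internally.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}
_PATTERN = re.compile(
    r"^\s*(\d+)\s+(minute|hour|day|week)s?\s+ago\s*$", re.IGNORECASE
)


def parse_relative(text, now=None):
    """Return a datetime for strings like '10 minutes ago', else None."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    amount = int(match.group(1))
    unit = _UNITS[match.group(2).lower()]
    # Allow the caller to pin "now" so results are reproducible.
    base = now if now is not None else datetime.now()
    return base - timedelta(**{unit: amount})


reference = datetime(2015, 12, 29, 17, 47)
print(parse_relative("10 minutes ago", now=reference))  # 2015-12-29 17:37:00
```

Passing an explicit `now` keeps the result deterministic; a real natural-language parser has to cope with many more formats, languages and calendars than this toy pattern does.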

Read More

EuroPython 2015

EuroPython 2015 is happening this week and we’re holding our largest company meetup so far as part of it, with more than 30 members of our fully remote team attending. The event, which is being held in Bilbao, started on Monday and is offering high-quality talks and sessions, along with plenty of tasty Spanish dishes. As sponsors, we’ve been given a nice booth where we are connecting with Pythonistas, discussing scraping practices and giving away lots of swag such as…

Read More

StartupChats Remote Working Q&A

Earlier this week, Scrapinghub was invited, along with several other fully distributed companies, to participate in a remote working Q&A hosted by Startups Canada. The companies invited, as well as their guests, shared their insights on a number of questions about building a remote company and being a remote worker. We’re sharing our answers and our favourite answers from other companies. Q1 What are some of the major benefits to using remote workers? Q2 What are some of…

Read More

PyCon Philippines 2015

Earlier this month we attended PyCon Philippines as a gold sponsor, presenting on the second day. This was particularly exciting as it was the first time the whole Philippines team was together in one place, and it was nice to meet each other in person! Check out the slides below: The talk started with how people would scrape manually in the past and the pain of handling timeouts, retries, HTTP errors and so forth. We presented Scrapy as a solution to…

Read More

Link Analysis Algorithms Explained

When scraping content from the web, you often crawl websites you have no prior knowledge of. Link analysis algorithms are incredibly useful in these scenarios to guide the crawler to relevant pages. This post aims to provide a lightweight introduction to page ranking algorithms so you have a better understanding of how to implement and use them in your spiders. There will be a follow-up post soon detailing how to use these algorithms in your crawl. PageRank PageRank is perhaps…
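As a rough illustration of the kind of page ranking algorithm the excerpt mentions, the following is a minimal power-iteration sketch of PageRank in plain Python. The graph format, the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the treatment of pages without outgoing links are all assumptions for this sketch; the full post's own presentation is not shown in this excerpt.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores by simple power iteration.

    links: dict mapping each page to a list of pages it links to
    (an assumed input format for this sketch).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```

In a crawler, scores like these could be used to prioritise which discovered links to visit first, although a production crawl would need an incremental or distributed variant rather than a full in-memory recomputation.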

Read More

Gender Inequality Across Programming Languages

Gender inequality is a hot topic in the tech industry. Over the last several years we’ve gathered business profiles for our clients, and we realised this data would prove useful in identifying trends in how gender and employment relate to one another. The following study is based on UK profiles. We determined the gender of each profile from the given name, which covered approximately 80% of the users. We had collected data from 2010 through to 2015, so we were able to identify changes…

Read More

A Career in Remote Working

This year I reached a major milestone in my life: getting my bachelor’s degree in mathematics. When I made the decision to go back to college, it was solely because of my experience working at Scrapinghub, where I figured out that having a math background would be a great foundation for getting into ML-related stuff. Now my life journey has a new beginning and I wouldn’t be here if it weren’t for the opportunity you gave me. I will be…

Read More

Frontera: The Brain Behind the Crawls

At Scrapinghub we’re always building and running large crawls; last year we had 11 billion requests made on Scrapy Cloud alone. Crawling millions of pages from the internet requires more sophistication than getting a few contacts off a list, as we need to make sure that we get reliable data and up-to-date lists of item pages, and that we are able to optimise our crawl as much as possible. From these complex projects emerge technologies that can be used across all of…

Read More