How to Architect a Web Scraping Solution: The Step-by-Step Guide
For many people (especially non-techies), architecting a web scraping solution for their needs and estimating the resources required to develop it can be a tricky process.
Oftentimes, this is their first web scraping project, and as a result they have little reference experience to draw upon when investigating the feasibility of a data extraction project.
In this series of articles we’re going to break down each step of Scrapinghub’s four-step solution architecture process so you can better scope and plan your own web scraping projects.
- Step 1: Define Your Data Requirements
- Step 2: Conduct a Legal Review
- Step 3: Evaluate the Technical Feasibility
- Step 4: Architect a Solution & Estimate Resources
At Scrapinghub we have a full-time team of solution architects who architect over 90 custom web scraping projects each week for everything from e-commerce and media monitoring to lead generation and alternative finance use cases. So odds are that if you are thinking of investigating a web scraping project, our team has already architected a solution for something very similar.
As a result, throughout this series we will be sharing with you the exact checklists and processes we use, along with some insider tips and rules of thumb that will make investigating the feasibility of your projects much easier.
In this article, the first in the series, we’re going to give you a high-level overview of our solution architecture process so you can replicate it for your own projects.
Step 1: Define Your Data Requirements
The ultimate goal of the requirement gathering phase is to minimize the number of unknowns and, if possible, to make zero assumptions about any variable, so the development team can build the optimal solution for the business need.
This is a critical process for us at Scrapinghub when working on customer projects so we can ensure we are developing a solution that meets their business need for web data, manage expectations and reduce risk in the project.
However, the same is true if you are developing a web scraping infrastructure for your projects. Accurately capturing the project requirements will allow your development team to ensure your web scraping project precisely meets your overall business goals.
Here you want to capture two things:
- Your user needs - what business or personal objective do you want to achieve? How will web data help you achieve this objective?
- Your data requirements - precisely what data do you need to achieve your business or personal objective? From which websites and how often? etc.
It is critical for you and/or your business that your data requirements accurately match your underlying business goals and your need for the data. A constant supply of high quality data can give your business a huge competitive edge in the market, but what is important is having the right data.
It is very easy to extract data from the web. What’s difficult is extracting the right data at a frequency and data quality that makes it useful for your business processes.
With every customer we speak with, we dive deep into their underlying business goal to better understand not only their specific data requirements, but also why they want the data and how it fits into the bigger picture. Oftentimes, our solution architects are then able to work with them to find alternative or additional data sources that better suit their business goals.
However, when investigating the feasibility of any web scraping project you should always be trying to clarify:
- What data do you require? i.e. the precise data you want to obtain during the scraping process.
- From which websites would you like to obtain this data?
- How often would you like to extract this data? Daily, weekly, monthly, once off, etc?
- How do you want to consume the data?
- How will you verify that the extracted data is accurate? i.e. matches exactly the data on the target websites?
- How would you like to interact with the solution? i.e. would you just like to receive data at a predefined frequency, or would you like to have control over the entire web scraping infrastructure and the associated source code?
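One way to keep these questions from slipping through the cracks is to record the answers in a small structure and list whatever is still unanswered. The sketch below is only an illustration: the `DataRequirements` class, its field names and the example values are hypothetical and are not part of Scrapinghub's own checklists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataRequirements:
    """Hypothetical record of the answers to the scoping questions."""
    fields: List[str]          # what data is required, e.g. ["name", "price"]
    websites: List[str]        # which websites the data should come from
    frequency: str             # "daily", "weekly", "monthly", "once off", ...
    delivery_format: str       # how the data will be consumed, e.g. "csv"
    accuracy_check: str = ""   # how extracted data will be verified
    interaction: str = ""      # "data feed" or "full control of the code"

    def unknowns(self) -> List[str]:
        """Return the names of the questions that have no answer yet."""
        return [name for name, value in vars(self).items() if not value]


# Example values are made up for illustration.
requirements = DataRequirements(
    fields=["name", "price"],
    websites=["https://example.com"],
    frequency="daily",
    delivery_format="",
)
print(requirements.unknowns())
```

An empty result from `unknowns()` would suggest that every question has at least a first answer, which is the "minimize the number of unknowns" goal in a very rough form.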
In this article (coming soon), we will walk you through the exact steps our team uses to gather project requirements and scope the best possible solution to meet them.
Step 2: Conduct a Legal Review
The second step of any solution architecture process is to check if there are any legal barriers to extracting this data.
With the increased level of awareness about data privacy and web scraping in the last number of years, ensuring your web scraping project is legally compliant is now a must. Otherwise, you could land yourself or your company in a lot of bother.
In this article (coming soon), we will share with you our exact legal assessment checklist that our solution architecture team uses to review every project request we receive. Our legal team has created a best practice guide for the solution architecture team to utilise so they know when to flag a project with legal for a review. Once flagged with legal, our legal team will review the project based on the criteria below, as well as others, to determine if we are able to proceed with the project.
However, in general you need to be assessing your project against the following criteria:
- Personal Data - will your web scraping project require you to extract personal data? If yes, where do these people reside? Are you complying with the local regulations? GDPR comes to mind.
- Copyrighted Data - is the data being extracted subject to copyright? If so, are there any exceptions to copyright that you may avail of?
- Database Data - a subset of copyright, does the website the data is being extracted from have database rights?
- Data Behind A Login - to extract the data do you need to scrape behind a login? What do the website’s terms and conditions state regarding web scraping?
- Sensitive Data - are you extracting any sensitive data (financial, health, demographic data)?
If your answers to any of the above questions raise concerns, you should be ensuring that a thorough legal review of the issue is conducted prior to scraping. Once you've completed this review then you are in a good position to move forward to assessing the technical feasibility and architecting your web scraping solution.
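The criteria above can be recorded as simple yes/no answers so that any project raising a concern is flagged for review. This is a minimal sketch, not Scrapinghub's actual checklist: the `LegalChecklist` class and its field names are hypothetical, and it only records answers and flags them. The review itself still has to be carried out by someone qualified to do so.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LegalChecklist:
    """Hypothetical yes/no answers to the five criteria listed above."""
    personal_data: bool = False     # does the project extract personal data?
    copyrighted_data: bool = False  # is the data subject to copyright?
    database_rights: bool = False   # does the website have database rights?
    behind_login: bool = False      # is the data behind a login?
    sensitive_data: bool = False    # financial, health, demographic data?

    def concerns(self) -> List[str]:
        """Return the criteria that were answered with 'yes'."""
        return [name for name, flagged in vars(self).items() if flagged]

    def needs_legal_review(self) -> bool:
        """Flag the project if any single criterion raises a concern."""
        return bool(self.concerns())


checklist = LegalChecklist(personal_data=True, behind_login=True)
if checklist.needs_legal_review():
    print("Flag for legal review:", ", ".join(checklist.concerns()))
```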
Step 3: Evaluate the Technical Feasibility
Assuming your data collection project passed the legal review, the next step in the solution architecture process is to assess the technical feasibility of executing the project successfully.
This is a critical step in our solution architecture process and a step most independent developers working on their own or for their company’s projects overlook.
There is a strong tendency amongst developers and business leaders to start developing their solution straight away. For simple projects this often isn’t an issue. For more complex projects, however, developers can quickly run into a brick wall that they can’t overcome.
We’ve found that a bit of upfront testing and planning can save the countless man hours that are wasted when you start developing a fully featured solution only to hit a technical brick wall.
During the technical review phase, one of our solution architects will examine the website and run a series of small scale test crawls to evaluate the technical feasibility of developing a solution that meets the customer’s requirements (crawl speed, coverage and budgetary requirements).
These tests are primarily designed to determine how difficult it is to extract data from the site, whether there will be any limitations on crawl speed and frequency, whether the data is easily discoverable, whether there are any post-processing or data science requirements, whether the project requires any additional technologies, and so on.
Once complete, this technical feasibility review gives the solution architect the information they need to determine, first, whether the project is technically feasible and, second, what the optimal architecture for the solution is.
In this article (coming soon), we’ll give you a behind the scenes look at some of the tests we carry out when assessing the technical feasibility of our customers’ projects.
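To make the idea of a small scale test crawl more concrete, the sketch below summarizes the results of a handful of test requests into a few feasibility indicators. It is an illustration under stated assumptions, not the tests Scrapinghub runs: the `Sample` record and `summarize_test_crawl` function are hypothetical, the sample values are made up, and treating HTTP 403 and 429 responses as signs of blocking or rate limiting is a simplification.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Sample:
    """One request from a hypothetical small scale test crawl."""
    status: int      # HTTP status code of the response
    seconds: float   # how long the request took
    has_data: bool   # whether the target fields were found in the page


def summarize_test_crawl(samples: List[Sample]) -> dict:
    """Turn raw test-crawl samples into rough feasibility indicators."""
    if not samples:
        raise ValueError("at least one sample is required")
    total = len(samples)
    ok = [s for s in samples if s.status == 200]
    # Assumption: 403 and 429 indicate blocking or rate limiting.
    blocked = [s for s in samples if s.status in (403, 429)]
    extracted = [s for s in ok if s.has_data]
    return {
        "success_rate": len(ok) / total,
        "blocked_rate": len(blocked) / total,
        "extraction_rate": len(extracted) / total,
        "median_seconds": median(s.seconds for s in ok) if ok else None,
    }


samples = [
    Sample(200, 0.5, True),
    Sample(200, 1.5, False),
    Sample(429, 0.25, False),
    Sample(403, 0.25, False),
]
print(summarize_test_crawl(samples))
```

A high blocked rate hints at limits on crawl speed and frequency, while a low extraction rate on successful responses hints that the data is hard to discover or may need extra technology such as a headless browser.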
Step 4: Architect a Solution & Estimate Resource Requirements
The final step in the process is architecting the solution and estimating the technical and human resources required to deliver the project.
Oftentimes, the solution needs to be approached and scoped in phases to balance the tradeoff between timeline, budget, and technical feasibility. Our team will propose the best first step to tackle your project while keeping the bigger goal in mind.
Here you need to architect a web scraping infrastructure using the following building blocks:
- Crawler architecture - the architecture of the data discovery and extraction spiders.
- Spider deployment
- Proxy management
- Headless browser requirements
- Data quality assurance
- Maintenance requirements
- Data post-processing
- Any non-standard technologies that might be required.
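A first rough resource estimate can often be derived from the data requirements with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not Scrapinghub's estimation method: the `estimate_crawl_load` function, its parameters and the example numbers are hypothetical, and it assumes each worker handles one request at a time and ignores retries, proxy overhead and headless browser rendering.

```python
import math


def estimate_crawl_load(pages_per_run: int, runs_per_day: float,
                        seconds_per_request: float,
                        crawl_window_hours: float) -> dict:
    """Estimate daily request volume and the concurrency one run needs."""
    if seconds_per_request <= 0 or crawl_window_hours <= 0:
        raise ValueError("request time and crawl window must be positive")
    window_seconds = crawl_window_hours * 3600
    # One worker issuing requests back to back completes this many per run.
    requests_per_worker = window_seconds / seconds_per_request
    return {
        "requests_per_day": pages_per_run * runs_per_day,
        "concurrency": math.ceil(pages_per_run / requests_per_worker),
    }


# Example: 100,000 pages once a day, 2 seconds per request, 4 hour window.
print(estimate_crawl_load(100_000, 1, 2, 4))
```

With these made-up numbers one worker completes 7,200 requests in the window, so about 14 concurrent workers would be needed. Figures like this feed directly into the spider deployment and proxy management decisions listed above.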
In this article (coming soon), we will share the process we use to architect a solution and give examples of real solutions we have architected, along with the resources required to execute them.
Once a solution has been architected and the resources estimated, our team has all the information they need to present the solution to the customer, estimate the cost of the project and draft a statement of work capturing all their requirements and the proposed solution.
Your Web Scraping Project
At Scrapinghub we have extensive experience architecting and developing data extraction solutions for every possible use case.
Our legal and engineering teams work with clients to evaluate the technical and legal feasibility of every project and develop data extraction solutions that enable them to reliably extract the data they need.
If you have a need to start or scale your web scraping project then our Solution Architecture team are available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.
At Scrapinghub we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.