Solution architecture part 3: Conducting a web scraping legal review

In this third post in our solution architecture series, we will share with you our step-by-step process for conducting a legal review of every web scraping project we work on.

At Zyte, it’s absolutely critical that our services respect the rights of the websites and companies whose data we scrape. Scraping, as a process, is not illegal. However, the data you extract, the manner in which you extract it, and what exactly you’re scraping all need to be held to rigorous legal standards to ensure compliance.

In ensuring that your solution architecture follows both legal guidelines as well as industry best practices, we’ve established a checklist for your ease and to protect the reputation and integrity of web scraping as a practice. Personal and commercial data regulations are in flux across the world, and given the inherently international nature of the internet, establishing clearly legal practices within your solutions should be considered an executive priority.

In this article, we will discuss the three critical legal checks you need to make when reviewing the legal feasibility of any web scraping project and the exact questions you should be asking yourself when planning your data extraction needs.

Disclaimer: I am a lawyer, but I'm not your lawyer, so none of the opinions or recommendations in this guide constitute legal advice from me to you. The commentary and recommendations outlined below are based on Zyte's experience helping our clients (startups to Fortune 100s) maintain compliance while scraping billions of web pages each month. If you want assistance with your specific situation then you should consult with your lawyer.

Pre-check: Define the use case

Data comes in all shapes and sizes. However, before we start extracting it, we need to determine the legal status of the data for each project.

There are three forms of data that can present a legal risk if extracted:

Personal data
Copyrighted data
Data behind a login

However, the first step of the legal review process is to identify the use case for the data - i.e. what will the data be used for, and do you have the data owner's explicit consent to extract, store and use their data?

The ultimate use case of the data can have a large bearing on the legal status of scraping the data from a website, particularly in the case of personal data, which we will discuss later.

So the first step of any legal review process is to define:

What will you be using this data for?
Who owns the data? The site, an individual person, nobody, etc.
Do you have the permission of the data owner to extract the data?

Once this has been defined, you will be in a position to carry out your legal checks.
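
One way to make sure the pre-check is actually completed is to record the three answers in a small structure before any scraping work starts. The sketch below is only an illustration: the UseCase class, its field names and the open_questions helper are our own hypothetical names, not part of any Zyte tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UseCase:
    """Answers to the three pre-check questions for one project (hypothetical structure)."""
    purpose: str                              # What will you be using this data for?
    data_owner: str                           # e.g. "site", "individual", "nobody"
    owner_permission: Optional[bool] = None   # None means "not yet determined"

    def open_questions(self) -> list[str]:
        """Return the pre-check questions that still lack an answer."""
        missing = []
        if not self.purpose.strip():
            missing.append("What will you be using this data for?")
        if not self.data_owner.strip():
            missing.append("Who owns the data?")
        if self.owner_permission is None:
            missing.append("Do you have the permission of the data owner to extract the data?")
        return missing


use_case = UseCase(purpose="price monitoring", data_owner="site")
print(use_case.open_questions())
```

An empty list means all three questions have an answer on record. It does not mean the answers are legally sufficient; that judgment still belongs to your legal review.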

Check #1 - Personal data

Personal data, or personally identifiable information (PII) as it is technically known, is any data that could be used to directly or indirectly identify a specific individual. With growing awareness of how personal data is used, increasingly stringent data protection regulations have come into force - the General Data Protection Regulation, or GDPR, is a prime example.

First, you need to check whether you plan to extract any form of personal data. Common examples include:

Name
Email
Phone Number
Address
User Name
IP Address
Date of Birth
Employment Info
Bank or Credit Card Info
Medical Data
Biometric Data

If you’re not extracting any personal data, then you can move onto the next step of the legal review. However, if you are extracting any of the personal data types listed above then you need to investigate the data protection regulations associated with this data.
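
As a first pass, the field names in a planned extraction schema can be screened against the examples listed above. This is a minimal sketch under our own assumptions: the token list and the flag_personal_fields function are hypothetical and not exhaustive, and a name-based match can only flag fields for human review, never clear them.

```python
import re

# Illustrative tokens derived from the examples listed above; not exhaustive.
PERSONAL_DATA_TOKENS = {
    "name", "email", "phone", "address", "username", "ip", "birth", "dob",
    "employer", "employment", "bank", "card", "medical", "biometric",
}


def flag_personal_fields(field_names):
    """Return the planned fields whose names suggest personal data."""
    flagged = []
    for field in field_names:
        # Split "reviewer_email" or "Date of Birth" into lowercase word tokens.
        tokens = set(re.split(r"[^a-z0-9]+", field.lower()))
        if tokens & PERSONAL_DATA_TOKENS:
            flagged.append(field)
    return flagged


planned_fields = ["title", "price", "reviewer_name", "reviewer_email"]
print(flag_personal_fields(planned_fields))  # ['reviewer_name', 'reviewer_email']
```

Note that a field such as product_name would also be flagged because it contains the token "name". We consider that kind of false positive acceptable here, since every hit should be looked at by a person anyway.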

Every legal jurisdiction (US, EU, etc.) has different regulations governing personal data. So the next step is to identify which jurisdiction the owners of this personal data reside in: the EU, US, Canada, etc.

For a detailed step-by-step process for evaluating the regulations that apply to the personal data you want to extract, be sure to check out our GDPR compliance guide for web scrapers.

Check #2 - Copyrighted data

Copyrighted data generally describes content owned by businesses and individuals with explicit control over its reproduction and capture. Just because web data is publicly available on the internet doesn’t mean that anyone can extract and store the data.

In some cases, the data itself might be copyrighted, and depending on how/what data you extract you could be found to have infringed the owner’s copyright, creating additional risks for the users of this data.

First, you need to check whether you plan to extract any form of data that is at risk of being subject to copyright. Common examples include:

Articles
Videos
Pictures
Stories
Music
Databases

If you are extracting any of these forms of web data, then you need to determine if you will violate copyright by extracting and using the data in your projects.

Cases like these need to be evaluated on a case-by-case basis, as copyright issues often aren’t as black and white as personal data issues. They are sometimes surmountable if there is a valid exception to copyright within your use case. Some ways to achieve this are:

Fair Use: For example, instead of extracting all the data from an article, you extract short snippets, which might constitute fair use.
Facts: Facts are typically not covered by copyright laws, so if firms limit what is being scraped to just the factual matters -- i.e. names of products, prices, etc. -- then it may be acceptable to scrape without violating copyright.
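
The "facts only" approach can be enforced in the extraction pipeline itself by keeping an explicit allowlist of fields. The sketch below assumes a hypothetical product record; the FACTUAL_FIELDS set and the keep_factual_fields function are our own illustrative names, and which fields count as factual for a given project is a question for your legal team, not for the code.

```python
# Hypothetical allowlist of fields treated as factual for this project.
FACTUAL_FIELDS = {"product_name", "price", "currency", "sku"}


def keep_factual_fields(record, allowed=FACTUAL_FIELDS):
    """Return a copy of the record containing only allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}


scraped = {
    "product_name": "Trail Shoe",
    "price": 89.99,
    "description": "Long marketing copy written by the retailer...",
    "image_url": "https://example.com/shoe.jpg",
}
print(keep_factual_fields(scraped))  # {'product_name': 'Trail Shoe', 'price': 89.99}
```

Using an allowlist rather than a blocklist means any new field added to the scraper is dropped by default until someone has reviewed it.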

Database rights

Database rights are a subset of copyright that need further explanation on their own. A database is an organized collection of materials that permits a user to search for and access individual pieces of information contained within the materials.

Database rights can create additional risks for the use of web data in your projects if the data hasn’t been extracted in a compliant manner.

In the US, a database is protected by copyright when the selection or arrangement is original and creative. Copyright only protects the selection and organization of the data, not the data itself.

In the EU, databases are protected under the Database Directive which offers much broader protection for EU databases. The Directive has two purposes: (1) protect IP, like in the US, and (2) protect the work and risk in creating the database.

If you believe a data source might fall under database rights then decision-makers should always consult with their legal team before scraping the data and ensure they:

only scrape some of the available data;
only scrape the data itself and not replicate the organization of that data; and
try to limit the data scraped to factual or other non-copyrighted data.

Copyright can be a tricky topic, so it is always best to talk to a qualified legal professional prior to scraping potentially copyrightable data for your projects. At Zyte, every web scraping project request we receive is reviewed for copyright issues by our legal team prior to commencing the project, so that our clients know they are extracting data in a legally compliant manner.

Check #3 - Data behind a login

Extracting data from a website that first requires you to log in to access the data can raise potential legal issues. In most situations, logging in requires you to accept the terms and conditions of the website, which might explicitly state that automatic data extraction is prohibited.

If this is the case, you should review the terms and conditions to determine whether you would be in breach of the T&Cs by extracting data from the website. As the terms and conditions of some of these websites can sometimes be quite intricate, it is advisable that you have them reviewed by an experienced legal professional prior to scraping data from behind the login.
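
To help a reviewer find the relevant clauses in a long terms and conditions document, a simple keyword scan can pull out the sentences that mention automated access. This is a rough sketch with assumptions of our own: the patterns below are illustrative and far from complete, find_terms_to_review is a hypothetical helper, and an empty result does not mean that scraping is permitted.

```python
import re

# Illustrative phrases only; real terms and conditions vary widely.
PROHIBITION_PATTERNS = [
    r"\bscrap(e|ing|er|ers)\b",
    r"\bcrawl(ing|er|ers)?\b",
    r"\bautomated (means|access|data collection|extraction)\b",
    r"\b(robot|spider|bot)s?\b",
]


def find_terms_to_review(terms_text):
    """Return sentences that mention automated access, for legal review."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", terms_text)
    hits = []
    for sentence in sentences:
        if any(re.search(p, sentence, re.IGNORECASE) for p in PROHIBITION_PATTERNS):
            hits.append(sentence.strip())
    return hits


terms = (
    "You may use the site for personal purposes. "
    "You may not use any robot, spider or scraper to access the site. "
    "Payment is due monthly."
)
print(find_terms_to_review(terms))
```

The output is a shortlist for your legal professional to read in context, not a verdict on whether the project would breach the terms.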

Our recommendation

In order to maintain compliance with today’s data regulations, it’s incredibly important to keep your legal team up-to-date and to ensure data protection specialists regularly monitor your scraping operation. Legal checks are integral to staying compliant and demonstrating goodwill. Furthermore, by performing consistent legal assessments of your projects, you can streamline the scraping process and help ensure that your scraping remains respectful and productive.

Your project’s compliance requirements

As we have seen, there is more to web scraping than just the technical implementation of a project. There are numerous legal compliance requirements that need to be taken into account when deciding if a web scraping project is viable.

If the guidelines outlined in this article are followed, then there is no reason why you can’t extract data from the web without exposing yourself to undue compliance and regulatory risks.

At Zyte we have extensive experience developing data extraction solutions that overcome these challenges and mitigate the compliance risks associated with using web scraped data in your business.

If you need to start or scale your web scraping projects, then our solution architecture team is available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.

At Zyte we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.