As you know, we held the first-ever Web Data Extraction Summit last month. During the talks, we received a lot of questions from the audience. We have divided them into two parts. In this first part, we cover questions on Web Scraping at Scale - Proxy and Anti-Ban Best Practice, and Legal Compliance, GDPR in the World of Web Scraping. Enjoy! You can also check out the full talks on these topics here.
Please note: The answers outlined below are opinions based on the knowledge and experience of the experts at Scrapinghub. The answers are for informational purposes only and we do not offer any warranties of any kind, either express or implied, regarding the information contained herein.
Web Scraping at Scale - Proxy and Anti-Ban Best Practice
Q: Can you imagine a future where antibot companies find a way to block all scrapers and web data extraction will no longer be feasible?
A: Because anti-bot vendors and bot developers have access to similar tools, there will be a constant ebb and flow rather than a complete stop to web scraping.
Q: Do you have your own proxies? How many proxies do you have?
A: We work with all types of datacenter proxies from multiple providers to ensure a diverse pool for every use case. Our proxy pools are in the order of hundreds of thousands of proxies. While that is an important figure, our constant focus is on delivering successful responses to our customers.
Q: Will Crawlera deal with browser fingerprinting or offer specific solutions for specific anti-bot companies?
A: While Crawlera does provide browser profiles, going forward there will be more features built under the hood which will help with more sophisticated anti-bots.
Q: How do you avoid captcha requirements?
A: By crawling responsibly and using proxies.
Q: Do you use headless Chrome browsers, or are they detected too easily?
A: We use headless browsers, and any browser can be detected as a bot.
Q: How do you know what sites are using to detect you so you can adapt to it?
A: It requires careful inspection of the response body, headers and at times the entire network traffic to understand the behaviour of the underlying anti-bot. We do use some internal tools to identify the type of detection used. There are also open source tools like "don't fingerprint me" which allow you to assess the browser fingerprinting used by the website.
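As a rough illustration of the header-inspection step, the sketch below guesses which anti-bot or CDN vendor might sit in front of a site from its response headers. The hint table is an assumption based on commonly reported header names, and the function name guess_antibot_vendors is hypothetical. A match is only a clue, not proof, and it is not the internal tooling mentioned above.

```python
from typing import Dict, List

# Illustrative hints only (assumptions): commonly reported response headers.
# Vendors change these over time, so treat a match as a clue, not proof.
VENDOR_HINTS = {
    "cloudflare": {"names": {"cf-ray"}, "server": "cloudflare"},
    "akamai": {"names": set(), "server": "akamaighost"},
    "incapsula": {"names": {"x-iinfo"}, "server": ""},
}


def guess_antibot_vendors(headers: Dict[str, str]) -> List[str]:
    """Return the vendors whose hints appear in the response headers."""
    # Compare header names and values case-insensitively.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    server = lowered.get("server", "")
    matches = []
    for vendor, hint in VENDOR_HINTS.items():
        has_name = any(name in lowered for name in hint["names"])
        has_server = bool(hint["server"]) and hint["server"] in server
        if has_name or has_server:
            matches.append(vendor)
    return matches
```

In practice the headers dictionary would come from whatever HTTP client you use, and a header check like this would be combined with inspection of the response body and network traffic.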
Q: Are HTTP headers ordered?
A: Yes, they are, and anti-bot companies have signature directories that can identify inconsistencies in the request headers.
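To make the idea of an ordering inconsistency concrete, here is a minimal sketch that checks whether the headers a client sends appear in the same relative order as a reference profile. REFERENCE_ORDER is a made-up example, not the order of any real browser, and order_is_consistent is a hypothetical helper. Real signature checks are likely far more involved.

```python
from typing import Sequence

# Hypothetical reference order for one browser profile (an assumption).
# A real profile would be captured from actual browser traffic.
REFERENCE_ORDER = [
    "host", "connection", "user-agent",
    "accept", "accept-encoding", "accept-language",
]


def order_is_consistent(sent: Sequence[str],
                        reference: Sequence[str] = REFERENCE_ORDER) -> bool:
    """True if headers present in both lists share the same relative order."""
    sent_lower = [name.lower() for name in sent]
    reference_set = set(reference)
    sent_set = set(sent_lower)
    # Keep only the headers that both sides know about, preserving order.
    sent_common = [name for name in sent_lower if name in reference_set]
    reference_common = [name for name in reference if name in sent_set]
    return sent_common == reference_common
```

Headers that are missing from the reference profile are ignored here. A stricter check could treat unknown or duplicated headers as a signal too.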
Q: Do you know cases of companies feeding fake data instead of blocking requests?
A: While we cannot name them, there are several examples of ecommerce websites faking the data. It is thus important to perform QA and look for these anomalies by checking past data.
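One simple way to run this kind of QA is to compare each newly scraped value with its own history. The sketch below flags a value that deviates from the historical median by more than a chosen fraction. The function name looks_anomalous and the 50% default tolerance are assumptions for illustration, and a real pipeline would tune the threshold per field.

```python
from statistics import median
from typing import List


def looks_anomalous(new_value: float, history: List[float],
                    tolerance: float = 0.5) -> bool:
    """Flag new_value if it deviates from the historical median by more
    than `tolerance`, expressed as a fraction of the median."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline == 0:
        return new_value != 0
    return abs(new_value - baseline) / abs(baseline) > tolerance
```

A flagged value is not necessarily fake. It is a candidate for manual review or a re-crawl through a different proxy.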
Q: Any experiences with global anti scrape/website security services like Cloudflare when doing broad crawls?
A: Cloudflare and Akamai are ubiquitous and are encountered frequently on websites. They come in various flavours, so there is a whole host of approaches to scraping such websites, from proxy rotation to get around geoblocking to using headless browsers.
Q: Does Scrapy handle Cloudflare challenges or integrate with cfscrape well?
A: By default, Scrapy does not handle Cloudflare challenges. You need to write code to do so.
Q: Does uppercase matter in HTTP headers?
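A: Per the HTTP specification, header field names are case-insensitive, so a compliant server treats them the same regardless of case. In practice, casing can still matter for scraping: anti-bot systems may compare the exact casing a client sends with what a real browser sends.
Q: How does requests/second relate to getting banned? Does slower crawling lead to fewer bans?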
A: Yes, it's an important value to consider. Crawlera handles the optimal rate depending on the concurrency input from the client, how the proxies are performing and how the site is responding.
Q: How do you ensure customers behave responsibly - i.e. don't DDOS sites?
A: Crawlera has a proven throttling algorithm that uses the sites' stats and concurrency from the customer to ensure a reasonable rate is used when targeting the site.
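Crawlera's throttling algorithm is not described here, so the following is only a generic sketch of adaptive throttling: the delay between requests grows when the site pushes back and shrinks slowly while responses are healthy. The class name AdaptiveDelay, the status codes treated as pushback (429 and 503) and the multipliers are all assumptions.

```python
class AdaptiveDelay:
    """Tracks a per-site delay that grows on pushback and shrinks on success."""

    def __init__(self, start: float = 1.0, minimum: float = 0.5,
                 maximum: float = 60.0) -> None:
        self.delay = start
        self.minimum = minimum
        self.maximum = maximum

    def record(self, status: int) -> float:
        """Update the delay from one response status; return it in seconds."""
        if status in (429, 503):
            # The site is pushing back: double the wait, capped at maximum.
            self.delay = min(self.delay * 2, self.maximum)
        elif 200 <= status < 300:
            # Healthy response: reduce the wait by 10%, never below minimum.
            self.delay = max(self.delay * 0.9, self.minimum)
        return self.delay
```

A crawler would call record() after every response and sleep for the returned number of seconds before sending the next request to the same site.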
Q: Is there an automated way to identify and act on captcha and blockage?
A: Yes, there are many default ban rules that can be automated, while others, such as captchas, may need more customization. The main mechanisms for obtaining successful responses are rotating proxies, throttling properly and making sure requests are neat.
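As an illustration of what an automated ban rule might look like, the sketch below labels a response as a captcha, a ban or a success, and maps each label to a follow-up action. The status codes, body markers and action names are assumptions chosen for the example, not a list of rules used by any particular product.

```python
BAN_STATUSES = {403, 429, 503}  # assumed defaults; tune per site
CAPTCHA_MARKERS = ("captcha", "are you a robot")  # hypothetical body markers


def classify_response(status: int, body: str) -> str:
    """Return 'captcha', 'banned' or 'ok' for a single response."""
    text = body.lower()
    # Check the body first: captcha pages are sometimes served with a 200.
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"
    if status in BAN_STATUSES:
        return "banned"
    return "ok"


def next_action(label: str) -> str:
    """Map a label to a hypothetical follow-up action for the crawler."""
    actions = {
        "captcha": "rotate_proxy_and_slow_down",
        "banned": "rotate_proxy",
        "ok": "continue",
    }
    return actions[label]
```

Site-specific rules, such as a particular redirect or an empty product list, would be added alongside these defaults.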
Q: How do you handle Incapsula and reCAPTCHA v3?
A: How to handle different CDNs or anti-bot mechanisms depends mainly on the individual site. Using proxies is one big part of it, and ensuring that the client follows user-like crawling patterns is important as well.
Q: What happens with burned IPs, is there any way to "recycle" them?
A: Yes, recycling is an important part of the process. Usually web servers unban proxies after a period of time.
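A minimal sketch of recycling, assuming a fixed cooldown after which a banned proxy is tried again: the pool below hides a proxy for `cooldown` seconds once it is marked as banned. The class name ProxyPool, the 10-minute default and the injectable clock are assumptions. Real unban times vary by site and are usually learned from experience.

```python
import time
from typing import Callable, Dict, List


class ProxyPool:
    """Keeps banned proxies out of rotation until a cooldown has passed."""

    def __init__(self, proxies: List[str], cooldown: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._proxies = list(proxies)
        self._cooldown = cooldown
        self._clock = clock  # injectable so the pool can be tested without waiting
        self._banned_until: Dict[str, float] = {}

    def mark_banned(self, proxy: str) -> None:
        """Take a proxy out of rotation for the cooldown period."""
        self._banned_until[proxy] = self._clock() + self._cooldown

    def available(self) -> List[str]:
        """Return proxies that were never banned or whose cooldown expired."""
        now = self._clock()
        return [
            proxy for proxy in self._proxies
            if proxy not in self._banned_until
            or self._banned_until[proxy] <= now
        ]
```

A per-site cooldown, or one that grows with repeated bans, would be a natural extension of this idea.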
Legal Compliance, GDPR in the World of Web Scraping
Disclaimer: None of the opinions or information below constitute legal advice to you. If you want assistance with your specific situation then you should consult a lawyer.
Q: Will UK companies have to comply with GDPR post Brexit?
A: The UK Information Commissioner's Office provides the following guidance: The government intends to incorporate the GDPR into UK data protection law when the UK exit the EU – so in practice there will be little change to the core data protection principles, rights and obligations found in the GDPR. The EU version of the GDPR may also still apply directly to a UK company if it operates in Europe, offers goods or services to individuals in Europe, or monitors the behaviour of individuals in Europe.
Q: Does GDPR apply if my company is based in the US and is scraping personal data of EU citizens living in the US?
A: If the US company is not established in the EU and is not targeting the EU market then the processing will not come within the GDPR.
Q: Is it legal to scrape information from a company, like email or phone number?
A: A company email address or phone number will come within GDPR if it is considered to be personal data. GDPR defines personal data as “any information relating to an identified or identifiable natural person.” A generic company contact email such as email@example.com is unlikely to be considered personal data under GDPR. The email address of an individual working at that company will be considered personal data if that individual is identifiable, for example, if the email address contains that individual's name. If you are scraping personal data, you need to have a lawful basis to do so under GDPR.
Q: If the personal information is not enough to be able to reach the subject (e.g. just their name but no contact details) what are the obligations?
A: When personal data is collected from other sources (i.e. not collected directly from the individual by you), there are certain exceptions to your obligation to provide privacy information which may apply, for example, if providing the information is impossible or involves a disproportionate effort. If you intend to rely on one of these exceptions you must carry out a Data Protection Impact Assessment.
Q: If collecting data on behalf of another organization, who has GDPR requirements?
A: A data controller determines why and how personal data is processed. If you are processing personal data on behalf of a data controller then you are a data processor under GDPR. Both data controllers and data processors must comply with GDPR and have obligations under GDPR. However, a data processor has fewer obligations than a data controller.
Q: Does data minimization apply when no personal data is involved?
A: The principle of data minimization under GDPR will not apply if no personal data is being processed. However the data collection may be subject to other non-GDPR related restrictions.
Q: Are usernames considered personal information?
A: Yes, usernames may be personal data. If an individual is identified or identifiable from a username, then it will be considered to be personal data under GDPR.
If you have any more questions on the topics discussed above, feel free to leave a comment below and we will try our best to answer them. In our next post, we will cover questions on The Next Generation of Web Scraping and How Machine Learning can be used in Web Scraping. Stay tuned! You can also access the recordings of the Extract Summit talks.