St Patrick’s Day Special: Finding Dublin’s Best Pint of Guinness With Web Scraping
At Scrapinghub we are known for our ability to help companies make mission critical business decisions through the use of web scraped data.
But for anyone who enjoys a freshly poured pint of stout, there is one mission critical question that creates a debate like no other…
“Who serves the best pint of Guinness?”
So with St Patrick’s Day quickly approaching, we decided to turn our expertise in large-scale data extraction to answering this mission critical question.
Although this is a somewhat humorous question, the data extraction and analysis methods used are applicable to numerous high value business use cases and are used by the world’s leading companies to gain a competitive edge in their respective markets.
In this article, we’re going to explore how to architect a web scraping and data science solution to find the best pint of Guinness in Dublin. And, most importantly, we’ll answer the question: which Dublin pub serves the best pint of Guinness?
Step #1 - Identify Rich Data
Anyone who has ever enjoyed a pint of the black stuff knows that the taste of a pint of Guinness is highly influenced by the skill of the person pouring it and the quality of the equipment they use.
With that in mind, our first task is to identify where we can find web data that contains rich insights into the quality of a pub’s Guinness and where the coverage levels are sufficient for all pubs in Dublin.
After a careful analysis of our options (pub websites, social media, reviews, articles, etc.), we decided customer reviews would be our best option. They provide the best combination of relevant, high-granularity data and coverage to answer this question.
Step #2 - Extract Review Data
The next step would be to develop a web scraping infrastructure to extract this review data at scale using Scrapy. To do so we’d need to create two separate types of spiders:
- Pub discovery spiders - designed to find and index pub listings for the data extraction spiders.
- Data extraction spiders - designed to extract the details of each pub once it has been discovered by a discovery spider. These spiders would extract data such as pub name, location, description, customer rating and customer reviews.
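The two-stage pattern described above can be sketched as follows. This is only an illustration written with Python's standard-library `html.parser` rather than Scrapy, so that it runs without dependencies. The HTML snippets, the class names (`pub-link`, `pub-name`, `rating`, `review`) and the URLs are all invented; a real project would target the review site's actual markup and would typically implement each stage as a Scrapy spider.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a listing page and a single pub page.
LISTING_HTML = """
<ul>
  <li><a class="pub-link" href="/pubs/example-pub-a">Example Pub A</a></li>
  <li><a class="pub-link" href="/pubs/example-pub-b">Example Pub B</a></li>
</ul>
"""

DETAIL_HTML = """
<h1 class="pub-name">Example Pub A</h1>
<span class="rating">4.5</span>
<p class="review">Lovely pint of Guinness.</p>
<p class="review">Friendly staff and a warm snug.</p>
"""


class DiscoveryParser(HTMLParser):
    """Discovery stage: collect the URLs of pub listings."""

    def __init__(self):
        super().__init__()
        self.pub_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "pub-link":
            self.pub_urls.append(attrs["href"])


class ExtractionParser(HTMLParser):
    """Extraction stage: pull name, rating and reviews from one pub page."""

    # Maps the (hypothetical) class attribute to the output field.
    FIELDS = {"pub-name": "name", "rating": "rating", "review": "reviews"}

    def __init__(self):
        super().__init__()
        self.item = {"name": None, "rating": None, "reviews": []}
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        self._field = self.FIELDS.get(dict(attrs).get("class"))

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "reviews":
            self.item["reviews"].append(text)
        elif self._field == "rating":
            self.item["rating"] = float(text)
        else:
            self.item[self._field] = text

    def handle_endtag(self, tag):
        self._field = None


discovery = DiscoveryParser()
discovery.feed(LISTING_HTML)

extraction = ExtractionParser()
extraction.feed(DETAIL_HTML)
```

Keeping discovery and extraction separate means the list of pub URLs can be refreshed on its own schedule, independently of how often each pub page is re-scraped.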
We’d also need to run these spiders on a web scraping infrastructure that can reliably extract the review data with no data quality issues. To do so, we’d configure the web scraping infrastructure as follows:
- Spider Hosting & Scheduling - to enable our spiders to run at scale in the cloud we’d use Scrapy Cloud.
- Proxies - critical to reliably extracting raw data is the ability to make successful requests to the target website. To do this we’d use Crawlera, which manages your proxies so you don’t have to.
- Spider Monitoring & Quality Assurance - we’d apply Scrapinghub’s 4-layer QA system to the project, which would monitor the data extraction 24/7 and alert our QA team to any malfunctioning spiders.
Due to data protection regulations such as GDPR, it is important that the extraction spiders don’t extract any personal information of the customers who submitted the reviews. As a result, the data extraction spiders need to anonymise the customer reviews.
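One simple way to approach this is to drop reviewer-identifying fields before an item is stored. The sketch below assumes hypothetical field names (`author_name`, `author_profile_url`, `author_avatar`); the fields that actually count as personal data depend on the source site and on legal advice.

```python
# Fields assumed (for illustration only) to identify the reviewer.
PERSONAL_FIELDS = {"author_name", "author_profile_url", "author_avatar"}


def anonymise_review(raw_review):
    """Return a copy of the review without reviewer-identifying fields."""
    return {
        key: value
        for key, value in raw_review.items()
        if key not in PERSONAL_FIELDS
    }


# Made-up example record.
raw_review = {
    "pub": "Example Pub A",
    "rating": 5,
    "text": "Lovely pint of Guinness.",
    "author_name": "Jane Example",
    "author_profile_url": "/users/jane-example",
}

clean_review = anonymise_review(raw_review)
```

Note that this only removes structured fields. Names or other personal details written inside the review text itself would need separate handling.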
Step #3 - Text Pre-processing
Once the unstructured review data has been extracted from the site, the next step is to convert the text data into a collection of text documents, or “corpus”, and pre-process the review data in advance of analysis.
Natural Language Processing (NLP) techniques have difficulty modelling unstructured and messy text, preferring instead well-defined, fixed-length inputs and outputs. As a result, the raw text typically needs to be converted into numbers, specifically vectors of numbers. In embedding-based representations, the more similar two words are, the closer together their vectors are.
The simplest way to convert a corpus to a vector format is the bag-of-words approach, where each unique word is represented by a unique number.
To use this approach the review data first needs to be cleaned up and structured. Here are some of the common pre-processing steps that can be implemented using a library such as Python’s NLTK, one of the most widely used Python libraries for text processing:
- Convert all text to lower case - to ensure the most accurate data mining, we need to ensure there is only a single format for any word. Example: if there were two words, “Guinness” and “guinness”, all instances of “Guinness” would be converted to “guinness”.
- Remove stopwords - reviews generally contain a large number of prepositions, pronouns, conjunctions, etc., so these filler words need to be removed (common stop words include “the”, “a”, “an”, etc.).
- Remove punctuation - remove punctuation such as full stops, commas, etc.
- Stem words - to avoid having multiple versions of similar words in the text, inflected (or derived) words need to be reduced to their word stem, base or root form.
The goal of this pre-processing step is to ensure the text corpus is clean and contains only the core words required for text mining.
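The four steps above can be sketched in plain Python. This is a deliberately simplified stand-in for what a library such as NLTK provides: the stopword set is a tiny hand-picked sample and `crude_stem` is a naive suffix stripper invented for this example, not a real stemming algorithm.

```python
import string

# Tiny illustrative stopword sample; a real list is much longer.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "it", "were", "in", "to"}

# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    if word.endswith("ss"):  # leave words like "guinness" untouched
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, strip punctuation, drop stopwords, then stem."""
    text = text.lower()
    # Punctuation is removed before stopword filtering so that tokens
    # such as "the," still match the stopword set.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return [crude_stem(token) for token in tokens]


tokens = preprocess("The pints were poured perfectly, and the Guinness is terrific!")
# tokens == ["pint", "pour", "perfect", "guinness", "terrific"]
```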
Once cleaned, the review data needs to be vectorised to enable analysis. Here is an example review prior to it being vectorised:
review_21 = X
Output: "One of the greatest of Dublin's great bars, the Guinness here is always terrific, the
atmosphere is friendly and it is perfect especially around Christmas -- snug warm and welcoming."
Here is how the review is represented once it has been vectorised using the bag-of-words approach. Each unique word is assigned a unique number, and the frequency of each word’s appearance is recorded.
bow_21 = bow_transformer.transform([review_21])
(0, 2079) 1
(0, 2006) 1
(0, 6295) 1
(0, 8609) 1
(0, 9152) 1
(0, 13620) 1
(0, 14781) 1
(0, 12165) 1
(0, 16179) 1
(0, 17816) 1
(0, 22077) 1
(0, 24797) 1
(0, 26102) 1
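Each line of the output above reads as a (row, column) position followed by a count: the column is the number assigned to a word and the count is how often that word appears in the review. The article does not show how `bow_transformer` was built, so the dependency-free sketch below only illustrates the same idea; the function names and the two-document corpus are invented for this example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each unique token to an integer index (sorted for reproducibility)."""
    tokens = sorted({token for document in documents for token in document})
    return {token: index for index, token in enumerate(tokens)}


def to_bag_of_words(document, vocabulary):
    """Return sorted (index, count) pairs for the tokens in one document."""
    counts = Counter(token for token in document if token in vocabulary)
    return sorted((vocabulary[token], count) for token, count in counts.items())


# Made-up, already pre-processed reviews.
corpus = [
    ["guinness", "terrific", "friendly", "atmosphere"],
    ["guinness", "flat", "guinness", "warm"],
]

vocabulary = build_vocabulary(corpus)
bow = to_bag_of_words(corpus[1], vocabulary)
# vocabulary["guinness"] == 3, so bow == [(1, 1), (3, 2), (5, 1)]
```

Word order is discarded in this representation; only which words occur, and how often, is kept.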
Step #4 - Exploratory Text Analysis
Once the text corpus has been cleaned, structured and vectorised, the next step is to analyse the review data to determine which pubs have the best Guinness reviews.
Although there is no definitive method of achieving this goal, for the purposes of this project we decided not to overcomplicate things and instead do a simple analysis of the data to see what insights it yields.
One approach would be to filter the review data looking for the word “guinness”. This would enable us to identify all the reviews that specifically mention “guinness”, an essential requirement when trying to determine who pours the best pint of the black stuff.
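A minimal sketch of such a filter, assuming the reviews are available as plain strings (the sample reviews are made up). It matches the whole word case-insensitively, so it works on raw as well as lower-cased text.

```python
import re

# Whole-word, case-insensitive match for "guinness".
GUINNESS_PATTERN = re.compile(r"\bguinness\b", re.IGNORECASE)


def mentions_guinness(review_text):
    """Return True if the review mentions Guinness at least once."""
    return bool(GUINNESS_PATTERN.search(review_text))


# Made-up reviews for illustration.
reviews = [
    "Best pint of Guinness in town.",
    "Great toasties and friendly staff.",
    "The guinness was flat and warm.",
]

guinness_reviews = [review for review in reviews if mentions_guinness(review)]
```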
Next we need a way to determine whether Guinness is mentioned in a positive or negative context.
One powerful method would be to build a sentiment classifier using the Multinomial Naive Bayes classifier from scikit-learn (a variant of Naive Bayes suited to text documents). The model would be trained on a labelled training dataset (30% of the overall dataset, with reviews labelled as having positive or negative sentiment) and then applied to the entire dataset, categorising every review as either positive or negative.
To ensure the accuracy of these sentiment predictions, the results need to be analysed and compared to the actual reviews. Our aim is to have an accuracy of 90% or above.
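To make the idea concrete, here is a small pure-Python sketch of multinomial Naive Bayes with add-one smoothing, plus an accuracy check against held-out labels. It is not the scikit-learn implementation, and the class name, the handful of tokenised reviews and their labels are all invented; a real run would use the labelled portion of the scraped dataset.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {
            token for counts in self.word_counts.values() for token in counts
        }
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                if token in self.vocabulary:  # skip words unseen in training
                    score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, documents, labels):
    """Share of documents whose predicted label matches the true label."""
    correct = sum(
        model.predict(document) == label
        for document, label in zip(documents, labels)
    )
    return correct / len(labels)


# Made-up, already pre-processed training and evaluation reviews.
train_docs = [
    ["great", "pint", "guinness"],
    ["terrific", "creamy", "guinness"],
    ["flat", "sour", "guinness"],
    ["awful", "warm", "pint"],
]
train_labels = ["positive", "positive", "negative", "negative"]

eval_docs = [["creamy", "great", "pint"], ["sour", "flat", "pint"]]
eval_labels = ["positive", "negative"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
score = accuracy(model, eval_docs, eval_labels)
```

On a toy set this small the accuracy figure says nothing about real-world performance; the point is only to show where the 90% target would be measured.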
Step #5 - Who Serves The Best Pint of Guinness?
Finally, with a fully classified database of Guinness reviews we should now be in a position to analyse this data and determine which pub serves the best Guinness in Dublin.
In this simple analysis project, we carried out the analysis using the following assumptions and weighting criteria:
- There is a strong correlation between overall review sentiment (high star rating and positive sentiment) and the sentiment in the context of a pint of Guinness. That is, if the overall review is very positive and it mentions Guinness, then the pub likely has good Guinness, and vice versa for negative sentiment.
- There is a strong correlation between the number of times Guinness is mentioned in a pub’s reviews and the quality of Guinness the pub serves.
- The ratio of overall reviews to the number of reviews mentioning Guinness is indicative of how well known the pub is for serving great pints of Guinness.
Using this methodology we were able to get an interesting insight into the quality of Guinness in every bar in Dublin and find the best place to get a pint of the black stuff.
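One possible reading of these criteria, sketched in Python: count the positive reviews that mention Guinness for each pub, divide by that pub's total number of reviews, and rank by that share, using the raw count as a tie-breaker. The record layout, the pub names and the tie-breaking rule are assumptions made for this example; the article does not spell out the exact weighting.

```python
from collections import defaultdict


def rank_pubs(classified_reviews):
    """Rank pubs by the share of reviews that are positive Guinness mentions."""
    totals = defaultdict(int)
    positive_guinness = defaultdict(int)
    for review in classified_reviews:
        totals[review["pub"]] += 1
        if review["mentions_guinness"] and review["sentiment"] == "positive":
            positive_guinness[review["pub"]] += 1

    ranking = [
        (pub, positive_guinness[pub], total, positive_guinness[pub] / total)
        for pub, total in totals.items()
    ]
    # Highest share first; the raw count breaks ties.
    return sorted(ranking, key=lambda row: (row[3], row[1]), reverse=True)


# Made-up classified reviews for two hypothetical pubs.
classified_reviews = [
    {"pub": "Example Pub A", "mentions_guinness": True, "sentiment": "positive"},
    {"pub": "Example Pub A", "mentions_guinness": True, "sentiment": "positive"},
    {"pub": "Example Pub A", "mentions_guinness": True, "sentiment": "positive"},
    {"pub": "Example Pub A", "mentions_guinness": False, "sentiment": "positive"},
    {"pub": "Example Pub B", "mentions_guinness": True, "sentiment": "positive"},
    {"pub": "Example Pub B", "mentions_guinness": True, "sentiment": "negative"},
    {"pub": "Example Pub B", "mentions_guinness": False, "sentiment": "positive"},
    {"pub": "Example Pub B", "mentions_guinness": False, "sentiment": "negative"},
]

ranking = rank_pubs(classified_reviews)
```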
So enough with the data science mumbo jumbo, what do our results say?
Winner: Kehoes Pub - 9 South Anne Street
Of the 74 reviews analysed, 36 display positive sentiment for pints of Guinness, or 48.6% of all reviews. That is the highest ratio of reviews mentioning Guinness in a positive light and the highest total number of reviews mentioning Guinness: a great sign that they serve the best Guinness in Dublin.
To validate our results, the Scrapinghub team did its due diligence and sampled Kehoes’ Guinness. We can safely say that those reviews weren’t lying: great pint of stout!
Worthy runners up…
Runner-Up #1: John Kavanagh The Gravediggers - 1 Prospect Square
Of the 54 reviews analysed, 25 display positive sentiment for pints of Guinness, or 46.3% of all reviews.
Runner-Up #2: Mulligan’s Pub - 8 Poolbeg St
Of the 49 reviews analysed, 21 display positive sentiment for pints of Guinness, or 42.9% of all reviews.
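For anyone who wants to check the arithmetic, the percentages quoted for the three pubs can be recomputed from the review counts given above. The dictionary layout is just a convenience for this example.

```python
# (positive Guinness reviews, total reviews analysed), as quoted in the article.
results = {
    "Kehoes Pub": (36, 74),
    "John Kavanagh The Gravediggers": (25, 54),
    "Mulligan's Pub": (21, 49),
}

# Share of all reviews that are positive about the Guinness, in percent.
shares = {
    pub: round(100 * positive / total, 1)
    for pub, (positive, total) in results.items()
}
# shares == {"Kehoes Pub": 48.6,
#            "John Kavanagh The Gravediggers": 46.3,
#            "Mulligan's Pub": 42.9}
```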
So if you’re looking for the best place to find a great pint of Guinness this Saint Patrick’s Day, be sure to check out these great options.
At Scrapinghub we specialize in turning unstructured web data into structured data. If you would like to learn more about how you can use web scraped data in your business then feel free to contact our Solution Architecture team, who will talk you through the services we offer startups right through to Fortune 100 companies.
We always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on right now.
Until next time…
Happy St Patrick's Day! ☘️