News Web Data Extraction to Predict Irish Election Results
Can pre-election news coverage of political parties predict the trend of the elections?
On February 9th, 2020, Ireland elected a new parliament. Prior to the elections, the political parties invested a lot of time, money and energy to get their political message to the people. A lot of research goes into selecting the right platform and the right medium.
In recent years, social media has gained importance. However, traditional newspaper coverage is still crucial for political parties to get their messages across. Which political party achieves the widest newspaper coverage? Does the coverage correspond with the parties' vote shares? These questions matter to the political parties, which strive for more newspaper coverage, and to voters, who want to know whether their preferred newspaper is biased.
In this post, we investigate the impact of news media coverage on the results of the Irish general election. To that end, we analyze more than 200,000 newspaper articles to find mentions of political parties. We treat ‘mentions’ as an important indicator because they estimate the overall space that media houses give to the political parties.
Building on this, we first investigate the correlation between the media coverage and the vote share. Next, we examine coverage bias by analyzing the political coverage across different media houses in Ireland. Finally, based on our analysis, we determine whether specific media houses are better suited to predict the trends of the Irish elections.
Political Coverage Analysis
Steps we followed for our analysis:
- Data collection: Major news websites in Ireland are crawled over a four-month period ending on February 9th, 2020. Data is collected from a total of 209,778 articles available to us. The data collection is done using the open-source AutoExtract-spiders deployed in Scrapy-Cloud.
- Extraction: For each article, data from the headline, the body, and the summary are extracted using our Automatic Extraction API - Scrapinghub's AutoExtract.
- Analysis: For each article, we search for the names of all the political parties in the parsed body text and summary. Publication houses are also determined from the link.
- Observations: We investigate the correlation between the mentions and the actual number of seats and present the results.
During the data collection, articles are extracted from the popular category links where new articles appear. However, not all of these articles cover politics, so we first keep only the political articles by selecting those that mention at least one political party. This results in a total of 9,878 political articles out of the 209,778 articles available to us.
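The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the alias table, the field names `body`, `summary` and `url`, and the helper names are all assumptions made for the example, and the post does not list the exact search strings that were used.

```python
import re
from urllib.parse import urlparse

# Hypothetical alias table; extend it with every party and spelling of interest.
PARTY_ALIASES = {
    "Fianna Fail": ["Fianna Fail", "Fianna Fáil"],
    "Fine Gael": ["Fine Gael"],
    "Sinn Fein": ["Sinn Fein", "Sinn Féin"],
}

# One case-insensitive pattern per party, matching any alias as whole words.
PARTY_PATTERNS = {
    party: re.compile(
        r"\b(?:" + "|".join(re.escape(alias) for alias in aliases) + r")\b",
        re.IGNORECASE,
    )
    for party, aliases in PARTY_ALIASES.items()
}


def parties_mentioned(article):
    """Return the set of parties named in the article body or summary."""
    text = " ".join([article.get("body") or "", article.get("summary") or ""])
    return {party for party, pattern in PARTY_PATTERNS.items() if pattern.search(text)}


def publisher_from_url(url):
    """Derive the publication house from the article link (host without 'www.')."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def political_articles(articles):
    """Keep only the articles that mention at least one party."""
    return [article for article in articles if parties_mentioned(article)]
```

Because `parties_mentioned` returns a set, a party that appears several times in one article is still recorded only once for that article.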
For our analysis, the vote share numbers are taken from this link.
Independent candidates are not included in our analysis because monitoring their coverage is a complex task, so their vote share is discarded as well. To reduce bias, we count at most one mention per party per article, because multiple mentions of a party in the same article do not provide extra coverage. It should also be noted that a mention does not indicate whether the coverage is positive or negative; it can be either.
Does media coverage correlate to party vote share?
To analyze the correlation between the media coverage and the vote share, we first compare the media coverage computed from our data with the current vote share of the different political parties.
The media coverage for each party is defined as the ratio of that party's mentions to the total number of political party mentions.
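As a small sketch of this definition, the function below turns per-article party mentions into coverage shares. The function name and the input shape (one collection of party names per political article) are assumptions for illustration only.

```python
from collections import Counter


def coverage_share(mentions_per_article):
    """Compute each party's share of all party mentions.

    mentions_per_article: iterable with one collection of party names per
    political article. Each party counts at most once per article.
    """
    counts = Counter()
    for parties in mentions_per_article:
        counts.update(set(parties))  # set() drops repeated mentions in one article
    total = sum(counts.values())
    if total == 0:
        return {}
    return {party: n / total for party, n in counts.items()}
```

The shares sum to 1 across parties, which makes them directly comparable to vote shares expressed as fractions.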
Fig 1. Media coverage and vote share for different parties
In general, the vote share of each party is roughly proportional to its media coverage. The two main parties, Fianna Fail and Fine Gael, receive somewhat disproportionate coverage: they are the only parties whose media coverage exceeds their vote share. For all other parties, the vote share is higher than the media coverage.
To quantify our analysis, we calculate the correlation coefficient between the media coverage and the vote share. The Pearson coefficient is a number between -1 and 1, which in this case tells us how closely the vote share is correlated with the media coverage. A value close to 1 indicates a very high positive correlation, a value close to 0 indicates no linear correlation, and a value close to -1 indicates a strong inverse correlation. For our data, we get a Pearson coefficient of 0.97526, which indicates a very high correlation between the media coverage and the vote share.
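For readers who want to reproduce this kind of calculation, here is a minimal pure-Python sketch of the Pearson coefficient. The function name is our own, and the numbers in the usage note are placeholders rather than the figures from this analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(coverage, votes)` with two lists ordered by the same parties, for example `pearson([0.30, 0.28, 0.20], [0.22, 0.21, 0.25])` with placeholder values, returns a single number in the range -1 to 1.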
Are media houses biased towards certain political parties?
From the overall coverage, it can be seen that most of the coverage is dominated by three main political parties. Is it the case for all the media houses or do some media houses have a preference towards some specific parties? To understand this we check the coverage/vote-share charts for different media houses. The results are shown for the popular newspapers in Ireland along with the number of articles analyzed from each.
Fig 2. Media coverage and vote share for irishtimes.com (Total articles 4738)
Fig 3. Media coverage and vote share for independent.ie. (Total articles 1919)
Fig 4. Media coverage and vote share for thejournal.ie. (Total articles 1521)
Fig 5. Media coverage and vote share for irishexaminer.com (Total articles 635)
Fig 6. Media coverage and vote share for rte.ie. (Total articles 374)
Fig 7. Media coverage and vote share for donegallive.ie. (Total articles: 62)
It is interesting to note that independent.ie gives relatively high coverage to Fine Gael, while donegallive.ie gives Sinn Fein more coverage than its vote share would suggest.
Is coverage of specific media houses better than overall media coverage for prediction?
We have seen that there is a high correlation between media coverage and the vote share of the parties. If we are interested in predicting the trends for the next election, should we consider the coverage from all the media houses together, or is coverage from specific media houses more accurate for prediction? To determine this, we again use the Pearson coefficient, but this time we calculate it for each newspaper separately.
|Newspaper|Pearson coefficient (correlation indicator)|
Fig 8. Pearson Coefficient calculated for each newspaper.
There are a few media houses whose coverage is very highly correlated with the vote share. For instance, RTE.ie has a correlation coefficient of 0.99410, which is higher than the 0.97526 obtained for the combined coverage. Therefore, when predicting the next election, it would make sense to also consider the coverage of individual media houses and not just the combined coverage.
Even though in this post we only consider the mentions of the political parties, more interesting conclusions could be drawn by doing sentiment analysis of the article body and article comments. Furthermore, machine learning models could be trained to perform predictive analysis for the elections using information such as the article body, comments, date of publication, sentiment score, etc. Our initial analysis over large-scale data shows intriguing insights. In the future, we plan to extend it by applying more advanced machine learning models on top of the extracted data.
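The per-newspaper calculation can be sketched as below. It is an illustrative outline under assumed input shapes: `records` holds one `(publisher, parties)` pair per political article, and `vote_share` maps party names to vote shares. Both names, and the function names, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_publisher(records, vote_share):
    """Correlate each publisher's party coverage with the parties' vote share."""
    counts = defaultdict(Counter)
    for publisher, parties in records:
        counts[publisher].update(set(parties))  # one mention per party per article

    parties = sorted(vote_share)  # fixed party order for both sequences
    votes = [vote_share[party] for party in parties]

    result = {}
    for publisher, party_counts in counts.items():
        total = sum(party_counts.values())
        if total == 0:
            continue  # publisher without any party mentions
        coverage = [party_counts[party] / total for party in parties]
        result[publisher] = pearson(coverage, votes)
    return result
```

Publishers with very few articles will produce unstable coefficients, so it may be worth reading each value alongside the article count for that publisher.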
Although the elections in Ireland are over, there are a number of cases where relevant information can be retrieved from articles. Extraction, when done at scale, has the potential to reveal interesting insights from a plethora of data available out there.
If you need to extract news or other article data, but you don't want to deal with the technical challenges, try AutoExtract for free, and extract web data at scale.