Extracting clean article HTML with News API

The Internet offers a vast amount of written content in the form of articles, news, blog posts, stories, essays, and tutorials that can be leveraged by many useful applications:

  • News and articles monitoring and analytics
  • “Reading modes” for browsers and applications
  • Brand monitoring, mentions and sentiment analysis
  • Competitive intelligence, product launches, mergers and acquisitions, financial results, patent activity
  • Contextual marketing, mentions, sentiment
  • Generating datasets to train machine learning models for NLP
  • Media personalization, summarization, topic extraction, curation
  • Market and industry signals for predictive modeling (e.g. for hedge funds)
  • Monitoring regulatory/governance news for risk management

But anyone interested in using all this available data will face some challenges.

Web pages are built from many components (menus, sidebars, ads, etc.), and only a few of them represent the true article content, the actual valuable information. Being able to extract only the relevant content from the page is the first challenge, especially when you want to obtain this information from a diverse set of sources that can have different structures and styling, or even be in different languages.

This challenge is not a minor one, as it is very common to find irrelevant content not only outside the article body but also within the body itself: ads, links to content not directly related to the article, call-to-action boxes, social media buttons, and the like.

Extracting text, images, videos, tables, and more

Article content is not just plain text, but rich content: images, figure captions, videos, tables, quotes, tweets, etc. A lot of meaning is lost if we focus only on plain text. Leveraging all this rich content requires converting it to a standardized, simplified format that is independent of the source page. Such a standard format would open the door to applying the same styling rules to any content, independently of the source. What is more, it would provide the flexibility to enable or disable particular components of the articles, or even rearrange them. But converting such diverse content into this format is a big challenge.

Relying on HTML is a good starting point, but HTML is so flexible, and used in so many different ways, that simplifying it to a standard set of content elements is a titanic effort. One example: the figcaption tag is the right way of annotating figure captions according to the HTML spec, but only a fraction of pages with figure captions actually use it. Instead, they might use div tags marked with some class (each page uses its own classes), or a table structure holding both the image and the caption. So identifying figure captions within a page is a difficult problem, and the same difficulties arise with other elements, like block quotes.

Our News API deals with all these challenges, offering an extraction service for articles where all the content is cleaned up (no irrelevant content in it) and served in a standard format for any source page. Several attributes are extracted (headline, author, publication date, content in text format, etc.), but the rest of this post will focus on the advantages of the articleBodyHtml attribute, which offers the article content in a rich, standardized format as a subset of HTML.

Rich content

The full description of the AutoExtract articleBodyHtml format can be found in the documentation but let’s introduce here the main concepts:

  • It is just HTML code (compliant with the HTML standard), so it is ready to be embedded in other pages or applications.
  • Only a subset of HTML tags is supported. In general, it comprises the HTML elements with semantic meaning (like headers, tables, links, images, lists, etc.) and very simple styling elements (like strong, em, br, etc.). Visit this page to see the full list of supported elements.
  • Most styling and formatting elements (like div or span) are removed, while the article structure is preserved. The resulting article is a “flattened” structure where most of the content can be found at the first level.
  • Machine learning is applied in many cases to “fix” the HTML and apply the right tag (e.g. detecting that some div is really a figcaption).
  • Embeds from other pages and social networks (Twitter, Facebook, YouTube, Instagram, Google Maps, etc.) are supported, and integrated in such a way that they render properly.
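To make the idea of a “flattened” structure concrete, here is a small sketch using only the standard library. The sample fragment is hypothetical (illustrative only, not real API output); the parser simply lists the tags that sit directly under the article root:

```python
from html.parser import HTMLParser

# A hypothetical articleBodyHtml fragment (illustrative, not real API output):
sample = (
    '<article>'
    '<p>First paragraph with a <a href="https://example.com">link</a>.</p>'
    '<figure><img src="photo.jpg"><figcaption>A caption</figcaption></figure>'
    '<blockquote><p>A quote</p></blockquote>'
    '</article>'
)

class TopLevelTags(HTMLParser):
    """Collect the tags that sit directly under the <article> root."""
    VOID = {"img", "br", "hr", "source", "embed", "wbr"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 1:          # direct child of <article>
            self.tags.append(tag)
        if tag not in self.VOID:     # void elements have no closing tag
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1

parser = TopLevelTags()
parser.feed(sample)
print(parser.tags)  # → ['p', 'figure', 'blockquote']
```

Note how paragraphs, figures and quotes all sit at the first level, which is what makes the cherry-picking shown later in this post so simple.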

As a result, the articleBodyHtml attribute returns a clean version of the article content, with all the irrelevant stuff removed (framing, ads, links to content not directly related to the article, call-to-action elements, etc.) and with the resulting HTML simplified and normalized so that it is consistent across content from different sites.

The resultant HTML offers great flexibility to:

  • Apply custom and consistent styling to content from different sites
  • Pick which content elements to show or hide, or even rearrange the elements in the article

But that’s enough theory for now. The following sections will show some examples. You can also use this online notebook if you want to experiment and play with them by yourself.

Making requests

News API is a simple REST API that can be accessed from any programming language. In this case, we are going to interact with it using the AutoExtract client library for Python. Once it is installed, one way of performing requests to extract articles from URLs is by invoking the function request_batch. The following helper takes care of making requests:

# Import path as provided by the scrapinghub-autoextract Python client
from autoextract.sync import request_batch

def autoextract_article(url):
    return request_batch([url], page_type='article')[0]['article']

Let’s perform one request over this page:

utd_article = autoextract_article(
    "https://www.independent.ie/sport/soccer/premier-league/manchester-united"
    "/rotten-to-the-core-craig-burley-goes-on-huge-rant-at-ole-gunnar-solskjaer-"
    "and-manchester-uniteds-owners-38845446.html")
print(utd_article['articleBodyHtml'])

utd_article now contains a dictionary with all the extracted attributes (headline, author, etc.). Here we are going to show the content of the attribute articleBodyHtml:

See the Pen First extraction by Iván de Prado (@ivanprado) on CodePen.


Note that only the relevant content of the article was extracted, avoiding elements like ads, unrelated content, etc. AutoExtract relies on advanced machine learning models that can discriminate between what is relevant and what is not.

Note also that figures were extracted along with their captions. Many other elements can be present as well.

CSS styling

Having normalized HTML code has some cool advantages. One is that the content can be formatted independently of the original style with simple CSS rules. That means that the same consistent formatting can be applied even if the content is coming from very different pages with different formats.
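As a sketch of how this can be wired up, the helper below (a hypothetical function, not part of the API) wraps the extracted articleBodyHtml and a CSS string into a standalone page; the CSS rules shown are illustrative only:

```python
def render_page(article_body_html, css):
    """Wrap normalized article HTML and a stylesheet into a full page."""
    return (
        "<!DOCTYPE html>"
        "<html><head><meta charset='utf-8'>"
        f"<style>{css}</style></head>"
        f"<body>{article_body_html}</body></html>"
    )

# A couple of illustrative rules: one stylesheet, any source site.
css = "article { max-width: 40em; margin: auto; } figcaption { color: #666; }"
page = render_page("<article><p>Hello</p></article>", css)
```

Because the markup is normalized, the same stylesheet applies cleanly no matter which site the article came from.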

Now let's see how the extracted article looks after some CSS style rules are applied:

See the Pen First styling by Iván de Prado (@ivanprado) on CodePen.

It looks better, doesn't it? And the best part is that this style (with a little more work) would work consistently across content from different websites. These are the CSS rules applied:

See the Pen First styling by Iván de Prado (@ivanprado) on CodePen.

The very same CSS stylesheet is used for all the following examples. Note how this single styling works consistently across content coming from diverse websites.

Tweets and other embeddings

AutoExtract is friendly with embedded content: social network posts, videos, audio, interactive widgets, etc. In general, any content that was embedded using an iframe tag will also be included in articleBodyHtml. This covers many cases, like YouTube, Google Maps or Vimeo. Other cases, like Instagram, Facebook and Twitter, are integrated in such a way that the content is ready to be rendered using the corresponding JavaScript library provided by the vendor. In other words, if you want content from Twitter, Facebook and Instagram to look pretty, you only have to add the proper JavaScript library to your page or app.
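As a minimal sketch of the Twitter case, appending Twitter's official widgets.js loader to the page that hosts the extracted HTML is enough to turn the embedded blockquotes into rendered tweets. The wrapping function is hypothetical; only the script URL is Twitter's real loader:

```python
# Official widgets.js loader from Twitter; the wrapping helper itself is
# hypothetical, just string concatenation for illustration.
TWITTER_WIDGETS = (
    '<script async src="https://platform.twitter.com/widgets.js" '
    'charset="utf-8"></script>'
)

def with_twitter_embeds(article_body_html):
    """Append the script tag that turns .twitter-tweet blockquotes
    into fully rendered tweets when the page is displayed."""
    return article_body_html + TWITTER_WIDGETS

embedded = with_twitter_embeds(
    '<article><blockquote class="twitter-tweet">A tweet</blockquote></article>')
```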

In the following example, we are applying Twitter's JavaScript library, widgets.js, to the article extracted from this page:

See the Pen Show embedding tweets by Iván de Prado (@ivanprado) on CodePen. 

Cherry-picking

Another advantage of having a normalized structure is that we can pick only the parts we are interested in.

In the following example, we are going to pick just the images from this article, with their corresponding captions, to compose an image array.

import html_text
from parsel import Selector

def extract_images(article):
    sel = Selector(article['articleBodyHtml'])
    return [{'img_url': fig.xpath(".//img/@src").get(),
             'caption': html_text.selector_to_text(fig.xpath(".//figcaption"))}
            for fig in sel.xpath("//figure")]

Let's see it working for this article:

import json

queen_article = autoextract_article(
    "https://www.theguardian.com/uk-news/2019/aug/23/prince-albert-passions-digitised-"
    "website-photos-200th-anniversary")
print(json.dumps(extract_images(queen_article), indent=4))

The result is:

[
    {
        "img_url": "https://i.guim.co.uk/img/media/711078a89d3e006870648d184c94c9b63d992180/191_365_2468_2945/master/2468.jpg?width=300&quality=85&auto=format&fit=max&s=73c48362628a8aab24a26fe5b9aec803",
        "caption": "A framed photograph of Queen Victoria and Prince Albert, 1860, by John Jabez Edwin Mayall. Photograph: Royal Collection Trust"
    },
    {
        "img_url": "https://i.guim.co.uk/img/media/92dccc5e4dec7ededeba385ad9eefd94a9b8b8b1/97_53_2451_3376/master/2451.jpg?width=300&quality=85&auto=format&fit=max&s=fb9310280223555b0e52750e4d792222",
        "caption": "Victoria and Albert\u2019s children Prince Alfred and Princess Beatrice c1859. Photograph: Royal Collection Trust"
    },
    {
        "img_url": "https://i.guim.co.uk/img/media/49920d12d5be17ec98b19194bf629ee3a51ffad5/25_10_1957_1433/master/1957.jpg?width=300&quality=85&auto=format&fit=max&s=b52cdf2ed497e0eed609155123f18243",
        "caption": "A daguerreotype of the Chartist meeting at Kennington Common. Photograph: Royal Collection Trust"
    },
    {
        "img_url": "https://i.guim.co.uk/img/media/7587ffd485ad50246c3f6b6c0b1d2613c4eb3bbd/9_0_1568_2164/master/1568.jpg?width=300&quality=85&auto=format&fit=max&s=ba809ce07554be491ce56997b98abbe5",
        "caption": "Princess Helena and Princess Louise, April 1859. Photograph: Royal Collection Trust"
    },
    {
        "img_url": "https://i.guim.co.uk/img/media/0857d6e5dcb83ae29dbde830279894fc50712544/30_34_2940_3285/master/2940.jpg?width=300&quality=85&auto=format&fit=max&s=71b474c4383b4886d6f6dfc4a2804fbb",
        "caption": "Queen Victoria with her four eldest children, 1854, c.1880 copy of original by Roger Fenton. Photograph: Royal Collection Trust"
    },
    {
        "img_url": "https://i.guim.co.uk/img/media/09df4d3885443b469fd5f81fc6988f46c6b5d9dc/0_84_7284_4373/master/7284.jpg?width=300&quality=85&auto=format&fit=max&s=cb2eeaf8fc007b3612b7abc94ebe5b8b",
        "caption": "Queen Victoria kept volumes of reminiscences between 1840 and 1861. Photograph: Royal Collection Trust"
    },
    {
        "img_url": "https://i.guim.co.uk/img/media/eb0425c5d8683784c2ae96bd9964ee76470248b5/0_0_4896_2938/master/4896.jpg?width=300&quality=85&auto=format&fit=max&s=bdaebdfb072ce811aa4fa8304bab3553",
        "caption": "Osborne House was built between for Victoria and Prince Albert as a summer home and rural retreat. Photograph: Eamonn McCabe/The Guardian"
    }
]

The parsel and html-text libraries were used as helpers for the task: parsel makes it possible to query the content using XPath and CSS expressions, and html-text converts HTML content to raw text.

Note that there is no figcaption tag in the source code of the page in question: AutoExtract's machine learning capabilities can detect that a particular section of the page is really a figure caption even if it was not annotated with the right HTML tag. The same intelligence is applied to other elements, like blockquote.

Let's go further. We are now going to compose a summary page that also includes separate sections for figures and tweets. It is really easy to cherry-pick such elements from articleBodyHtml:

def summary(article):
    sel = Selector(article['articleBodyHtml'])
    only_tweets = sel.css(".twitter-tweet")
    only_figures = sel.css("figure")
    return f"""
        <article>
            <h2>{article['headline']}</h2>
            <dl>
                <dt>Author</dt>       <dd>{article['author']}</dd>
                <dt>Published</dt>    <dd>{article['datePublished'][:10]}</dd>
                <dt>Time to read</dt> <dd>{len(article['articleBody'].split()) / 130:.1f}
                                           minutes
                                      </dd>
            </dl>
            <h3>First paragraph</h3>
            {sel.css("article > p").get()}
            <h3>Tweets ({len(only_tweets)})</h3>
            {"".join(only_tweets.getall())}
            <h3>Figures ({len(only_figures)})</h3>
            {"".join(only_figures.getall())}
        </article>
    """

Let's apply it to the Musk page:

summary(musk_article)

This is the result:

See the Pen Article summary by Iván de Prado (@ivanprado) on CodePen.

The normalized HTML brings the flexibility to adapt the article content to your own purposes: you might decide to exclude figure captions, exclude multimedia content from iframes, or show figures in a separate carousel, for example.

Including figure captions in the text body

The textual attribute articleBody does not include any text from figure elements (i.e. figure captions) by default. This is generally desired, because images cannot be included in raw text, and showing a caption without its figure is confusing for the reader.

But sometimes the body text is used as the input to some analysis algorithm. For example, you could be grouping articles by similarity using a simple technique like k-nearest neighbors, or feeding advanced deep learning models for NLP.

In all these cases you might want to have the text of figure captions included, and it is very easy to do:

def text_with_captions(article):
    """ Converting `articleBodyHtml` into text is enough to have figure captions included """
    return html_text.selector_to_text(Selector(article['articleBodyHtml']))

Applying it to the United article:

text_with_captions(utd_article)

Removing pull quotes

Pull quotes are used very often in articles nowadays. A pull quote is an excerpt of the article content that is repeated within the article, highlighted with a different format (e.g. appearing in its own box with a bigger font). A couple of examples can be seen on this page.

Pull quotes are a nice formatting element, but it might be better to strip them out when converting the document to plain text: formatting is lost in raw text, so the repeated content is not useful for the reader, just distracting. The articleBody attribute already contains a text version of the article, but pull quotes are not removed there. In the following example, we are going to convert the article to raw text, excluding all pull quotes.
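The detection hinges on comparing normalized text: if a quote's text already appears in the body once the quotes are removed, it is a pull quote. A minimal standalone version of that normalization (lowercase, collapse whitespace, drop quotation marks), with made-up example strings:

```python
import re

def normalized(text):
    """Lowercase, collapse whitespace and drop quotation marks,
    so that reformatted excerpts still match the body text."""
    return re.sub(r'["“”]', '', ' '.join(text.split()).lower())

# A pull quote matches its source sentence despite casing and added quotes:
body = "He said he hadn't heard from Mark in a while."
quote = '“He said he hadn\'t   heard from Mark in a while.”'
print(normalized(quote) in normalized(body))  # → True
```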

Note that AutoExtract detects quotes using machine learning techniques and returns them in articleBodyHtml under blockquote tags.

import re
from itertools import chain

import html_text
from parsel import Selector

chris_article = autoextract_article("https://www.vox.com/the-highlight/2020/1/15/20863236/chris-hughes-break-up-facebook-economic-security-basic-income-new-republic")

def drop_elements(selectors):
    """ Drops HTML subtrees for the given selectors """
    for element in selectors:
        tree = element.root
        if tree.getparent() is not None:
            tree.drop_tree()

# First let's get the text of the article without any quotes.
# We'll search over it to detect which quotes are pull quotes.
sel = Selector(chris_article['articleBodyHtml'])
drop_elements(sel.css("blockquote"))
text_without_quotes = html_text.selector_to_text(sel)

# Pull quotes can change the case or add quotation marks,
# so some normalization helps with the matching.
normalized = lambda text: re.sub(r'["“”]', '', ' '.join(text.split()).lower())

# Now let's iterate over all `blockquote` tags
sel = Selector(chris_article['articleBodyHtml'])
pull_quotes = []
for quote in sel.css("blockquote"):
    # bq_text contains the quote text
    bq_text = html_text.selector_to_text(quote)
    # The quote is a pull quote if the quote text was already in the text without quotes
    if normalized(bq_text) in normalized(text_without_quotes):        
        pull_quotes.append(quote)
        
# Let's show the pull quotes found
print(f"Found {len(pull_quotes)} pull quotes from {len(sel.css('blockquote'))} "
       "source quotes:\n")
for idx, quote in enumerate(pull_quotes):
    print(f"Pull quote {idx}:")
    print("------------------")
    print(html_text.selector_to_text(quote))
    print()

The result is:

Found 2 pull quotes from 2 source quotes:

Pull quote 0:
------------------
“I haven’t heard from Mark. That’s what everybody asks.”

Pull quote 1:
------------------
“I guess I thought I was gonna go up into some ivory tower and read enough books, and then I was gonna come down [with the answer].”

Finally, we can obtain the full text but with pull quotes stripped out:

# Also removing figures, as you will probably want them removed too
drop_elements(chain(pull_quotes, sel.css("figure")))
cleaned_text = html_text.selector_to_text(sel)

# Printing first 500 characters of the clean text
print(cleaned_text[:500])

Let's verify that we have removed the duplicated text:

def count(needle, haystack):
    return len(re.findall(needle, haystack))

pquote_excerpt = "haven’t heard from Mark"
cases_before = count(pquote_excerpt, chris_article['articleBodyHtml'])
cases_after = count(pquote_excerpt, cleaned_text)
print(f"Occurrences before: {cases_before} and after the clean up: {cases_after}")

The output is:

Occurrences before: 2 and after the clean up: 1

The code above works regardless of the website used. Applying it to a different page is just a matter of changing the URL.

Extract News and Articles at Scale!

Throughout this post, we have tried to show that News API is a powerful web data extraction tool for articles and news. But there is nothing as convincing as trying it yourself. If you want to experience how easy it is to extract news and article content, sign up here for free and try News API! After that, you can also go to this online notebook and run the examples.

The source code used in this article can be found in this GitHub repository.
