We recently released Dateparser 0.3.1 with support for Belarusian and Indonesian, as well as the Jalali calendar used in Iran and Afghanistan. With this in mind, we’re taking the opportunity to introduce and demonstrate the features of Dateparser.
Dateparser is an open source library we created to parse dates written in natural language into Python datetime objects. It handles specific dates like ‘5:47pm 29th of December, 2015’ or ‘9:22am on 15/05/2015’, and even relative times like ‘10 minutes ago’. From there, it’s simple to convert the datetime object into any format you like.
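As a small illustration of that last conversion step, here is a sketch that uses only the standard library. The datetime is built by hand and merely stands in for the value a parser would return for ‘5:47pm 29th of December, 2015’; the variable names are our own.

```python
from datetime import datetime

# Assumption: this hand-built value stands in for a parsed date.
parsed = datetime(2015, 12, 29, 17, 47)

# Once you hold a datetime, any output format is one call away.
iso_text = parsed.isoformat()                    # machine-friendly
day_first = parsed.strftime("%d/%m/%Y %H:%M")    # day-first, 24-hour clock
date_only = parsed.strftime("%Y-%m-%d")          # sortable date string

print(iso_text)   # 2015-12-29T17:47:00
print(day_first)  # 29/12/2015 17:47
print(date_only)  # 2015-12-29
```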
Who benefits from Dateparser
When scraping dates from the web, you need them in a structured format so you can easily search, sort, and compare them.
Dates written in natural language aren’t suitable for this. For example, the 24th of December shows up first if you sort the 25th of November and the 24th of December alphanumerically. Dateparser solves this by taking the natural language date and parsing it into a datetime object.
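The sorting problem is easy to reproduce. The sketch below is a minimal illustration, not Dateparser itself: it assumes English month names and uses the standard library’s strptime with a fixed format as a stand-in for the parsing step, so it runs without any extra packages.

```python
from datetime import datetime

raw_dates = ["25 November 2015", "24 December 2015"]

# Sorting the raw strings compares characters, so "24 ..." comes before "25 ...".
sorted_as_text = sorted(raw_dates)

# Assumption: strptime with a fixed English format stands in for the parser.
parsed_dates = [datetime.strptime(text, "%d %B %Y") for text in raw_dates]

# datetime objects compare chronologically.
sorted_as_dates = sorted(parsed_dates)

print(sorted_as_text)                                   # December listed first
print([d.date().isoformat() for d in sorted_as_dates])  # November listed first
```

Once the values are datetime objects, searching and comparing work the same way, because the comparison operators follow the calendar rather than the spelling.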
A bonus perk of Dateparser is that you don’t need to worry about translation. It supports a range of languages including English, French, Spanish, Russian, and Chinese. Better yet, Dateparser autodetects languages, so you don’t need to write any additional code.
This makes Dateparser especially useful when you want to scrape data from websites in multiple languages, such as international job boards or real estate listings, without necessarily knowing what language the data you’re scraping is in. Think Indeed.
Why we didn’t use similar libraries
Dateparser was developed while we were working on a broad crawl project that involved scraping many forums and blogs. The websites were written in numerous languages and we needed to parse their dates into a consistent format.
None of the existing solutions met our needs. So we created Dateparser as a simple set of functions that sanitised the input and passed it to the dateutil library, using parserinfo objects to work with other languages.
This process worked well at first. But as the crawling project matured we ran into problems with short words, strings containing several languages, and a host of other issues. We decided to move away from parserinfo objects and handle language translation on our own. With the help of contributors from the Scrapy community, we significantly improved Dateparser’s language detection feature and made it easier to add languages by using YAML to store the translations.
Dateparser is simple to use and highly extensible. We have successfully used it to extract dates from over 100 million web pages. It’s well tested and robust.
Peeking under the hood
You can install Dateparser via pip. Import the library and call the dateparser.parse function to try it out:
    $ pip install dateparser
    …
    $ python
    >>> import dateparser
    >>> dateparser.parse('1 week and one day ago')
    datetime.datetime(2015, 9, 27, 0, 17, 59, 738361)
Contributing new languages
Supporting new languages is simple. If yours is missing and you’d like to contribute, send us a pull request after updating the languages.yaml file. Here is what the definitions for French look like:
    name: French
    skip: ["le", "environ", "et", "à", "er"]
    monday:
        - Lundi
    ...
    november:
        - Novembre
        - Nov
    december:
        - Décembre
        - Déc
    ...
    year:
        - an
        - année
        - années
    ...
    simplifications:
        - avant-hier: 2 jour
        - hier: 1 jour
        - aujourd'hui: 0 jours
        - d'une: 1
        - un: 1
        - une: 1
        - (\d+)\sh\s(\d+)\smin: \1h\2m
        - (\d+)h(\d+)m?: \1:\2
        - moins\s(?:de\s)?(\d+)(\s?(?:[smh]|minute|seconde|heure)): \1\2
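To give a feel for what the simplifications entries do, here is a rough sketch that applies a few of them as ordered regular-expression substitutions. This is an assumption about how such rules could be applied, not Dateparser’s actual implementation; the SIMPLIFICATIONS list and the simplify helper are hypothetical names used only for illustration.

```python
import re

# A few of the French rules from the definition above, as
# (pattern, replacement) pairs. Order matters: "avant-hier" must be
# handled before "hier", or it would become "avant-1 jour".
SIMPLIFICATIONS = [
    (r"avant-hier", "2 jour"),
    (r"hier", "1 jour"),
    (r"(\d+)\sh\s(\d+)\smin", r"\1h\2m"),
    (r"(\d+)h(\d+)m?", r"\1:\2"),
]


def simplify(text):
    """Apply each rule in turn (hypothetical helper, not Dateparser's API)."""
    for pattern, replacement in SIMPLIFICATIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


print(simplify("avant-hier"))   # 2 jour
print(simplify("hier"))         # 1 jour
print(simplify("10 h 30 min"))  # 10:30
```

The last call shows two rules chaining: the first rewrites ‘10 h 30 min’ to ‘10h30m’, and the next turns that into ‘10:30’.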
When parsing dates, you don’t need to set the language explicitly. Dateparser will detect it for you:
    $ python
    >>> import dateparser
    >>> dateparser.parse('aujourd\'hui')  # French for 'today'
    datetime.datetime(2015, 10, 13, 12, 3, 19, 262752)
    >>> dateparser.parse('il ya 2 jours')  # French for '2 days ago'
    datetime.datetime(2015, 10, 11, 12, 3, 19, 262752)
See the documentation for more examples.
How we measure up
Dateutil is the most popular Python library to parse dates. Dateparser actually uses dateutil.parser as a base, and builds its features on top of it. However, Dateutil was designed for formatted dates, e.g. ‘22-10-15’, rather than natural language dates such as ‘10pm yesterday’.
Parsedatetime is closer to Dateparser in that it also parses natural language dates. One advantage of Parsedatetime is that it supports future relative dates like ‘tomorrow’. However, while Parsedatetime also supports non-English languages, you must specify the locale manually, whereas Dateparser detects the language for you.
Parsedatetime also requires more boilerplate code. Compare:
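    # Parsedatetime
    import parsedatetime
    from datetime import datetime
    from time import mktime

    cal = parsedatetime.Calendar()  # the calendar object that cal.parse is called on
    time_struct, parse_status = cal.parse('today')
    datetime.fromtimestamp(mktime(time_struct))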
    # Dateparser
    import dateparser
    dateparser.parse('today')
If you are dealing with multiple languages and want a simple API with no unnecessary boilerplate, then Dateparser is likely a good fit for your needs.
Give Dateparser a go
Dateparser is an extensible, easy-to-use, and effective tool for parsing international dates from websites. Its unique features arose from the specific problem we needed to address, namely parsing dates from websites whose language we did not know in advance.
The library has been well tested against a large number of sites in over 20 languages and we continue to refine and improve it. Contributors are most welcome, so if you’re interested, please don’t hesitate to get involved!