Introducing the Datasets Catalog

Folks using Portia and Scrapy are engaged in a variety of fascinating web crawling projects, so we wanted to provide you with a way to share your data extraction prowess with the world.

With this need in mind, we’re pleased to introduce the latest addition to our Scrapinghub platform: the Datasets Catalog!

This new feature allows you to immediately share the results of your Scrapinghub projects as publicly searchable datasets. Not only is this a great way to collaborate with others, but you can also save time by using other people’s datasets in your projects.

As fans of the open data movement, we hope that this new feature will ease the process of disseminating data. Open data has been used to help foster transparency in governmental and corporate systems worldwide. Researchers and developers have also benefited from the mutual sharing of information. A couple of our own engineers have even used open data to power transportation apps and to help journalists expose corruption.

Read on to get some ideas on how to use the Datasets Catalog in your workflow.

The Datasets Catalog at a Glance

We are launching the Datasets Catalog with the following features:

  • Publish the data collected by your Portia or Scrapy spiders/web crawlers as easily accessible datasets
  • Highlight your scraped data and help others locate the information they need by giving each dataset a name and a description
  • Let others discover your datasets through search engines like Google
  • Browse publicly available datasets that other people are sharing: https://app.scrapinghub.com/datasets
  • Choose how to share your dataset using three different privacy settings:
    • Public datasets are accessible by anyone (even those without a Scrapinghub account) and are indexed by search engines
    • Restricted datasets are accessible only to the users to whom you explicitly grant access (they need to have a Scrapinghub account)
    • Private datasets are accessible only by the members of your organization

How Does it Work?

You can find the new “Datasets” option in the menu in the top navigation bar. On the main Datasets Catalog page, you can browse available datasets along with those that you have recently visited.

Publishing your scraped data into complete datasets takes just one click. This tutorial will get you started on publishing and sharing your extracted data.

Wrap Up

And there you have it: a way not only to showcase your web crawling and data extraction skills, but also to help others with the information that you provide.

We invite you to contribute your datasets and play your part in helping drive the open data movement forward. Reach out to us on Twitter and let us know which datasets you would like to see featured and whether you have any recommendations for improving the Datasets experience.

We’re excited to see what you come up with!
