Announcing Portia, the Open Source Visual Web Scraper!

We’re proud to announce the developer release of Portia, our new open source visual scraping tool based on Scrapy. Check out this video:

As you can see, Portia allows you to visually configure what’s crawled and extracted in a very natural way. It provides immediate feedback, making the process of creating web scrapers quicker and easier than ever before!

Portia is available to developers on GitHub. We plan to offer a hosted version on Scrapinghub soon, which will be compatible with Autoscraping and fully integrated with our platform.

Please send us your feedback!

47 thoughts on “Announcing Portia, the Open Source Visual Web Scraper!”

  1. This looks promising. Only a couple days back I was thinking about what needs to catch on for such a scraping service to become available and could not reason as to why this was not released earlier. Great to see that it is finally here and can’t wait to get my hands dirty.

  2. Just came from Reddit and a user replied, more or less, “People have to seriously think about the date before they release the project”

    Now that is something to think about.. 🙂

  3. looking great, easy to get started… but where is my json output stored? I can’t find it, except in the log mixed with all the other log output.

    thanks!

  4. Installing this is a hair ripping nightmare. Doesn’t anyone think that installation instructions might be required? Those I’ve found elsewhere leave MUCH to be desired. Bottom line – this tool is completely and utterly useless since it can’t be installed. Prove me wrong with detailed installation instructions – in my case, for ubuntu 12.04

    1. There are some installation instructions in the GitHub README. The Vagrant VM might be a good option if you are having difficulty. Take a look at the provision.sh script, which should also be useful for your platform if you don’t want to use Vagrant.

      Please keep in mind that this is an early developer release of an open source project. We wanted to share it and get feedback and contributions. Documentation is one of the many things we plan to improve.

  5. When I run the portiacrawl script, it keeps looping and I have to stop the script manually. How can I make portiacrawl stop automatically once all items are scraped?

  6. For a single page, it is useful. But I want to crawl all the similar pages on a website. I don’t know how to get all the URLs, and I can’t find this in the docs.

  7. Not a very good demo. What about multiple pages that are lists of similar elements, and you’d like each element separated in a *.csv file, and then applied to a series of pages? There’s nothing here you can’t just as easily do with any of the other scraper plugins for browsers. It seems to eliminate the need to “inspect element” which is a good start, but if this thing is capable of scraping thousands of data records off dozens of similar pages, you can’t tell from this video.

    1. Hi Zinc,
      This demo shows how to create a sample that can be applied across many pages and how to follow links.

      Once you have created your spider, which consists of rules for following links, start urls and samples to be used for extracting data, you can run it. When you run the spider it will follow links and extract data until it runs out of links or you tell it to stop. All of the data extracted while the spider is running can be output to a file in CSV, JSON or XML format.

      Portia is a tool for scraping anything from a single page up to a whole site of thousands or millions of items. If you would like to give it a try you can sign up for an account at dash.scrapinghub.com and use it for free.

      I hope this clears up any questions you have.
      Ruairi.
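
The crawl loop described in the reply above (start URLs, a rule for following links, extraction on each page, stopping when the links run out or a limit is reached, then export to CSV) can be illustrated with a small sketch. This is not Portia's or Scrapy's code: the in-memory `SITE` dictionary, the `crawl` and `to_csv` functions, and the `max_items` parameter are all hypothetical names made up for the example, and the "extraction" is simply a lookup of pre-made items.

```python
import csv
import io
import re
from collections import deque

# Hypothetical in-memory "site": URL -> (links found on the page, item or None).
# A real crawler would download the page and apply a sample to extract the item.
SITE = {
    "/": (["/items/1", "/items/2", "/about"], None),
    "/items/1": (["/items/2", "/"], {"name": "Widget", "price": "9.99"}),
    "/items/2": (["/items/3"], {"name": "Gadget", "price": "19.50"}),
    "/items/3": ([], {"name": "Doohickey", "price": "4.25"}),
    "/about": (["/"], None),
}


def crawl(start_urls, follow_pattern, max_items=None):
    """Visit pages breadth-first, following only links that match follow_pattern."""
    follow = re.compile(follow_pattern)
    queue = deque(start_urls)
    seen = set(start_urls)
    items = []
    while queue:
        # Stop early if the caller asked for a limit ("you tell it to stop").
        if max_items is not None and len(items) >= max_items:
            break
        url = queue.popleft()
        links, item = SITE.get(url, ([], None))
        if item is not None:
            items.append(dict(item, url=url))
        for link in links:
            # Skip pages already queued and links the rule does not allow.
            if link not in seen and follow.search(link):
                seen.add(link)
                queue.append(link)
    # The loop also ends on its own once there are no links left to follow.
    return items


def to_csv(items):
    """Serialize the extracted items as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "name", "price"])
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()


items = crawl(["/"], r"^/items/")
print(to_csv(items))
```

Because visited URLs are tracked in `seen`, the loop cannot revisit a page, so it terminates once every reachable matching link has been processed.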

  8. Is the hosted version live already? I have been digging around my Scrapinghub dashboard for an hour now and have still not been able to figure out how to get to it without downloading from GitHub.

    1. Hey, Anand!

      You have to create a Portia project in your dashboard: click “Create project” and choose Portia. Then, you just have to click on the project to get into Portia.

      If you are a new user on our platform, you first have to create an organization to be able to create a project.

      By the way, we just released a beta version of Portia 2.0 with a lot of interesting features, so give it a try! 🙂
