
Looking Back at 2016

We started 2016 with an eye on blowing 2015 out of the water. Mission accomplished.

Together with our users, we crawled more in 2016 than the rest of Scrapinghub’s history combined: a whopping 43.7 billion web pages, resulting in 70.3 billion scraped records! Great work everyone!

In what follows, we'll give you a whirlwind tour of what we've been up to in 2016, along with a quick peek at what you can expect in 2017.

Platform

Scrapy Cloud

It's been a great year for Scrapy Cloud, and we're wrapping it up with massive growth in yearly platform signups.

We proudly announced our biggest platform upgrade to date with the launch of Scrapy Cloud 2.0. Alongside technical improvements like Docker and Python 3 support, this upgrade introduced an improved pricing model that is less expensive while letting you customize Scrapy Cloud to your resource needs.

As we move into 2017, we will continue to focus on expanding the technical capabilities of our platform such as:

  • Support for non-Scrapy spiders: We’re well aware that there are alternatives to Scrapy in the wild. If you’ve got a crawler created using another framework, be it in Python or another language, you’ll soon be able to run it in our cloud-based platform.
  • GitHub integration: You’ll soon be able to sign up and easily deploy from GitHub. We’ll support automatic deploys shortly after: push an update to your GitHub repo and it will automatically be reflected within Scrapinghub.

Heads up, Scrapy Cloud is in for a massive change this year, so stay tuned!

Crawlera

Crawlera, our other flagship product, is a smart downloader that keeps your crawls running uninterrupted. We were thrilled to launch the Crawlera Dashboard this year, which lets you visualize how you are using the product, examine which sites and specific URLs you are targeting, and manage multiple accounts.

Portia

The main goal of Portia is to lower the barrier of entry to web data extraction and to increase the democratization of data (it’s open source!).

Portia got a lot of love this year with the beta release of Portia 2.0. This update includes new features like simple extraction of repeated data, loading start URLs from a feed, the option to download Portia projects as Python code, and the use of CSS selectors to extract specific data.

Next year we’re going to be bringing a host of new features that will make Portia an even more valuable tool for developers and non-developers alike.

Data Science

While we have been doing data science work since our earliest days, 2016 saw us formalize a proper Data Science team. We're continuing to push the envelope in machine learning for data extraction, so get pumped for some really exciting developments in 2017!

Open Source

Scrapinghub has been as committed to open source as ever in 2016. Running Scrapinghub relies on a lot of open source software so we do our best to pay it forward by providing high quality and useful software to the world. Since nearly 40 open source projects maintained by Scrapinghub staff saw new releases this year, we’ll just give you the key highlights.

Scrapy

Scrapy is the most well-known project that we maintain, and 2016 saw our first Python 3 compatible release (version 1.1) back in May (running it on Windows is still a challenge, we know, but bear with us). Scrapy is now the 11th most starred Python project on GitHub! 2017 should bring it many new features to keep it the best tool you have (we think) for tackling any web scraping project, so keep sending it some GitHub star love and feature requests!
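
If you haven't written a spider yet, here is a minimal sketch of what one looks like; the target site and CSS selectors are illustrative only, and the same code runs unchanged on Python 2 and Python 3:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider: fetch a page and yield structured records."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote block on the page becomes one scraped item.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
```

Save it as quotes_spider.py and run "scrapy runspider quotes_spider.py -o quotes.json" to get the scraped records as JSON.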

Splash

Splash, our headless browser with an HTTP interface, hit a major milestone a few weeks ago with the addition of the long-awaited web scraping helpers: CSS selectors, form filling, interacting with DOM nodes… This 2.3 release came after a steady series of improvements and a successful Google Summer of Code (GSoC) project this summer by our student Michael Manukyan.
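
If you haven't tried Splash before, the HTTP interface means any language can ask it for a fully rendered page. Here's a minimal sketch in Python, assuming a Splash instance running locally on the default port (for example via "docker run -p 8050:8050 scrapinghub/splash"); the target URL is illustrative:

```python
import requests

# Ask the local Splash instance to render a JavaScript-heavy page.
# "wait" gives the page's scripts a moment to run before the DOM is returned.
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 0.5},
)
rendered_html = response.text  # fully rendered HTML, ready for normal parsing
```

From there you can feed the rendered HTML to your usual parsing code, or use the new scripting helpers for form filling and DOM interaction.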

Dateparser

Our "little" library for parsing dates in natural language got a bit of attention on GitHub when it was picked up as a dependency for Kenneth Reitz's latest project, Maya. We're quite proud of this little library :). Keep the bug reports coming if you find any, and if you can, please help us support even more languages.
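
If you haven't met dateparser yet, the whole API is essentially a single call; here's a quick taste (the output of the relative examples depends on when you run them):

```python
import dateparser

# Absolute, relative, and non-English dates all go through the same call.
print(dateparser.parse("12/12/12"))          # 2012-12-12 00:00:00
print(dateparser.parse("2 weeks ago"))       # relative to the current date
print(dateparser.parse("hace 2 semanas"))    # Spanish for "2 weeks ago"
```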

Frontera

Frontera is the framework we built to allow you to implement distributed crawlers in Python. It provides scaling primitives and crawl frontier capabilities. 2016 brought us 11 releases including support for Python 3! A huge thank you to Preetwinder Bath, one of our GSoC students, who helped us to improve test coverage and made sure that all of the parts of Frontera support Python 3.

Google Summer of Code

As in 2014 and 2015, Scrapinghub participated in GSoC 2016 under the Python Software Foundation umbrella. We had four students complete their projects and two of them got their contribution merged into the respective code base (see the “Frontera” and “Splash” sections above). Another completed project was related to Scrapy performance improvements and is close to being integrated. The last one is a standalone set of helpers to use Scrapy with other programming languages. To our students, Aron, Preet, Michael, and Avishkar, thank you all very much for your contributions!

Conferences

Conferences are always a great opportunity to learn new skills, showcase our projects, and, of course, hang out with our clients, users, and coworkers. As a remote staff, we don’t have the opportunity to meet each other in person often, so tech conferences are always a great way to strengthen ties. The traveling Scrapinghubbers thoroughly enjoyed sharing their knowledge and web scraping experiences through presentations, tutorials, and workshops.

Check out some of the talks that were given this year:

Scrapinghubber Community

Being a fully remote company, we’re thrilled to confirm that we have Scrapinghubbers in almost every continent (we’re *just* missing Antarctica, any Scrapy hackers out there?).

Aside from the conferences we attended this year, we had a few localized team retreats. The Professional Service managers got together in Bangkok, Thailand; the Sales team had a retreat in Buenos Aires, Argentina, where asado dominated the show; and the Uruguayan Scrapinghubbers got together for an end-of-year meetup in La Paloma, Uruguay, hosted by Daniel, our resident hacker/surfer.

Currently our Crawlera team is having a meetup in Poznan, Poland, paving the way for what will become the next version of our flagship smart downloader product.

Wrap Up

And that’s it for 2016! From the whole team at Scrapinghub, I’d like to wish you happy holidays and the best wishes for the start of 2017. We’ve got a lot of exciting events lined up next year, so get ready!

How to Increase Sales with Online Reputation Management

One negative review can cost your business up to 22% of its prospects. This was one of the sobering findings in a study highlighted on Moz last year. With over half of shoppers rating reviews as important in their buying decision, no company, large or small, can afford to ignore stats like these, let alone the reviews themselves. In what follows I'll let you in on how web scraping can help you stay on top.

What is online reputation management

Online reputation management is the careful maintenance and curation of your brand's image through monitoring social media, reviews, and articles about your company. When it comes to online reputation management, you can't have too much information. This is a critical part of your business strategy that impacts pretty much every level of your organization, from customer service to marketing to sales. BrightLocal found that "84% of people trust online reviews as much as a personal recommendation." The relationship between brands and customers has become a two-way street because of the multitude of channels for interaction, hence the rise of influencer and guerrilla marketing tactics.

A key part of online reputation management is highlighting positive reviews to send the message that you are a responsive company that rewards loyal and happy customers. Online reputation management is likewise critical to putting out any potential customer fires. The attrition rate of consumers shoots up to 70% when they stumble across four or more negative articles. You need to be able to act fast to address criticisms and to mitigate escalating issues. Ideally you should not delete negative feedback, but instead show the steps that you are taking to rectify the situation. Besides sparing you an occasional taste of the Streisand effect, this shows that you are responsible, transparent, and not afraid to own up to errors.

How to manage your reputation online

While you could manually monitor social media and review aggregators, in addition to Googling your company for unexpected articles, it's much more effective to automate the process. A number of companies and services specialize in this, including:

  1. Sprout Social
  2. Brandwatch
  3. Klear
  4. Sprinklr

If you want complete control over your data and the type of information that you’d like to monitor, web scraping is the most comprehensive and flexible choice.

Web scraping, the man behind the curtain

Web scraping provides reliable and up-to-date web data

There is an inconceivably vast amount of content on the web, built for human consumption. Its unstructured nature, however, presents an obstacle for software. The general idea behind web scraping is to turn this unstructured web content into a structured format for easy analysis.
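
To make that concrete, here is a tiny sketch using parsel, the open source selector library that powers Scrapy's selectors; the HTML fragment and field names are made up for illustration:

```python
from parsel import Selector

# A fragment of human-readable markup, as you might find on a review site.
html = '<div class="review"><span class="stars">4</span><p>Fast shipping, great quality.</p></div>'

sel = Selector(text=html)
record = {
    "stars": int(sel.css(".stars::text").extract_first()),
    "text": sel.css("p::text").extract_first(),
}
print(record)  # {'stars': 4, 'text': 'Fast shipping, great quality.'}
```

A crawler does the same thing at scale: fetch pages, apply selectors, and emit structured records.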

Automated data extraction removes the tedious manual side of research and lets you focus on finding actionable insights and implementing them. This is especially critical when it comes to online reputation management. According to The Social Habit study, when customers contact companies through social media for customer support issues, 32% expect a response within 30 minutes and 42% expect a response within 60 minutes. Using web scraping, you could easily maintain constantly updating data feeds that alert you to comments, help queries, and complaints about your brand on any website, allowing you to take instant action.

You also need to be sure that nothing falls through the cracks. With web scraping you can monitor thousands, if not millions, of websites for changes and updates that will impact your company.

Sentiment analysis and review monitoring

Now, a key part of online reputation management is monitoring reviews for positive and negative feedback. Once the extracted web data is in, you can use machine learning to perform sentiment analysis. This form of text analysis categorizes messages as positive or negative, and the more data you use to train the model, the more accurate it becomes. It's a great way to respond quickly to negative reviews while keeping track of positive reviews so you can reward customers and highlight their loyalty.
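
As a rough sketch of what that pipeline can look like in code (a toy model with a handful of hand-labelled examples; a real classifier needs far more training data, and scikit-learn is just one common choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A few hand-labelled reviews; in practice you would label a sample of
# the reviews you actually scraped.
reviews = [
    "Great service, very responsive support",
    "Terrible experience, my order never arrived",
    "Love this product, will buy again",
    "Worst customer service I have ever dealt with",
]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

# Classify a new, unseen mention scraped from the web.
print(model.predict(["The support team resolved my issue quickly"]))
```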

Straight From the Horse’s Mouth

Here are two entrepreneurs providing real-world examples of how they use online reputation management and review monitoring to grow their businesses.

The Importance of Review Monitoring

Kent Lewis
President and Founder of Anvil Media, Inc.
http://www.anvilmediainc.com/about/team/kent-lewis

As a career agency professional who has owned my own agency for the past 16 years, I have a few thoughts regarding monitoring reviews and assessing sentiment analysis to move businesses forward:

Monitoring reviews (including sentiment) is essential to your business. Ignoring (negative) reviews can cause undue and unnecessary harm. Since 90% of customers read online reviews before visiting a business, negative reviews can directly affect sales. Conversely, a one-star increase on Yelp leads to a 5-9% increase in business revenue.

Online reviews can be monitored manually (bookmarking and visiting sites like Google My Business, Yelp, and others daily or as needed). However, there are a host of tools available that automate the process. Use a mix of free (socialmention.com) and paid tools (Revinate.com) to regularly monitor reviews, so you can address negative reviews and celebrate positive ones.

While the primary objective for monitoring reviews is identifying and mitigating negative reviews, there are a host of other benefits to capturing and analyzing the data. Harvesting and analyzing the data will provide insights that will improve your products and services. For starters, you can measure and trend sentiment for the brand overall. With additional insights, you can track and monitor reviews for specific products, services or locations. Social media and review sites are the largest (free) focus group in the world. Additionally, you can look at competitors and create benchmarks to track trends over time. Lastly, you can identify superfans that can be nurtured into brand ambassadors.

The sources of data vary by company and industry. Most businesses can be reviewed on Google My Business, Yelp, BBB and Glassdoor (for employees). Each industry has specific sites that also must be monitored, including Expedia and Travelocity for travel & hospitality.

To get maximum value from your monitoring efforts, always look at competitor reviews. Customers are telling you what business you should be in based on their feedback and suggestions for improvement… learn from the entire industry, not just your current or past customers.

Online reputation management, social media, and competitor monitoring

Max Robinson
Owner of Ace Work Gear
http://www.aceworkgear.com

We use tools like Sprout Social, which help us track mentions on social media for our clients, as this is where the majority of the discussion about their businesses happens. The main reason our clients want to track these mentions is that people tend to speak more openly and honestly about their experiences with a business on social media than anywhere else online. It also gives our clients the chance to join conversations with their customers in a casual manner, whereas interactions on review sites can be far more formal.

We report on the number of mentions, whether our client is being discussed in a positive or negative manner, and what the discussion is specifically related to. We look at three main social media platforms – Facebook, Twitter, and Reddit. We also monitor mentions of competitors across all of these platforms at the request of our clients.

Monitoring the online reputation of competitors

Do not neglect your competitors when monitoring reviews and social media. Keeping track of the online reputation of competitors allows you to:

  1. Correctly position and price your product or service offerings
  2. Snatch customers who are posting dissatisfied reviews and comments about your competition
  3. Launch more effective marketing campaigns that address pain points experienced by customers of your competition
  4. Determine what your competitors are doing right so that you can innovate off of their ideas

And that's just the tip of the iceberg. Competitive intelligence and an accurate overview of your industry only serve to help you sell your products more effectively. And to bring it back to online reputation management: a negative perception of your brand is like shooting yourself in the foot. You're already at a severe disadvantage, especially compared to positively reviewed competitors.

How to use online reputation management to increase your sales

In an interview with Don Sorensen, president of Big Blue Robot, he shared that one company he worked with was losing an estimated $2 million or more in sales due to a poor online reputation. Don't let this be you.

  1. The first step is to level the playing field by locating and responding to all of the negative reviews. With a damaged reputation, you should be in crisis mode and monitoring brand mentions around-the-clock so that you are never caught by surprise.
  2. Dominate your search results so that there is little room for people with vendettas to swoop in. This means posting regularly on social media, getting press coverage, and answering questions in forums related to your business or your industry.
  3. Curate your brand’s reputation by having an active blog that carefully frames the benefits of your business, tailored to your audience.

If you are proactive and have a positive reputation or have managed to repair your reputation, then enthusiastic reviews and word of mouth will increase and improve your lead generation prospects. Your sales team should also be fully aware of your online reputation so they can soothe potential concerns or draw attention to success stories.

Wrap up

They say that a good reputation is more valuable than money. Guard yours closely with web data and ensure that you are taking every precaution necessary to retain customers and win over new leads.

Explore ways that you can use web data or chat with one of our representatives to learn more.

How You Can Use Web Data to Accelerate Your Startup

In the US alone, 27 million individuals were running or starting a new business in 2015. In this fiercely competitive startup scene, business owners need to take advantage of every resource available, especially given the high probability of failure. Enter web data. Web data is abundant, and those who harness it can do everything from keeping an eye on competitors to ensuring customer satisfaction.


Web Data and Web Scraping

You get web data through a process called web scraping. Since websites are created in a human-readable format, software can't meaningfully analyze this information as-is. While you could manually input this data into a format more palatable to programs (read: the time-consuming route), web scraping automates the process and greatly reduces the chance of human error.

How You Can Use Web Data

If you’re new to the world of web data or looking for creative ways to channel this resource, here are three real world examples of entrepreneurs who use scraped data to accelerate their startups.

Web Data for Price Monitoring


Max Robinson
Owner of Ace Work Gear
http://www.aceworkgear.com

The key to staying ahead of your competitors online is to have excellent online visibility, which is why we invest so much in paid advertising (Google Adwords). But it occurred to me that if you aren’t offering competitive prices, then you’re essentially throwing money down the drain. Even if you have good visibility, users will look elsewhere to buy once they’ve seen your prices.

Although I used to spend hours scrolling through competitor sites to make sure that I was matching all of their prices, it took far too long and probably wasn’t the best use of my time. So instead, I started scraping websites and exporting the pricing information into easily readable spreadsheets.

This not only saves me huge amounts of time, but also saves my copywriter time, as they don't have to do as much research. We usually outsource the scraping, as we don't really trust ourselves to do it properly! The most important aspect of this process is having the data in an easily readable format. Spreadsheets are great, but even they can get muddled with unnecessary information.

Enriched Web Data for Lead Generation


Chris McCarron
Founder of GoGoChimp
https://www.gogochimp.com

We use a variety of different sources and data to get our clients more leads and sales. This is really beneficial to our clients, which include national and international brands, who all use this information to target a specific audience, boost conversions, increase engagement, and/or reduce customer acquisition costs.

Web data can help you know which age groups, genders, locations, and devices convert the best. If you have existing analytics already in place, you can enrich this data with data from around the web, like reviews and social media profiles, to get a more complete picture. You’ll be able to use this enriched web data to tailor your website and your brand’s message so that it instantly connects to who your target customer is.

For example, by using these techniques, we estimate that our client Super Area Rugs will increase their annual revenue by $450,000.

Web Data for Competitor Monitoring


Mike Catania
CTO of PromotionCode.org
http://www.promotioncode.org

The coupon business probably seems docile from the outside, but the reality is that many sites are backed by tens of millions of dollars in venture capital and there are only so many offers to go around. That means exclusive deals can easily get poached by competitors, so we use scraping to monitor our competition and ensure they're not stealing coupons from our community and reposting them elsewhere.

Both the IT and Legal departments use this data. In IT, we use it more functionally, of course; Legal uses it as research before moving ahead with cease-and-desist orders.

Wrap Up

And there you have it: real use cases of web data helping companies with competitive pricing, competitor monitoring, and increasing sales conversions. Keep in mind that it's not just about having the web data; it's also about quality and about using a reputable company to provide you with the information you need to increase your revenue.

Please share any other ways that web data has helped you in the comments below.

Explore more ways that you can use web data or chat with one of our representatives to learn more.

Why Promoting Open Data Increases Economic Opportunities

During the 2016 Collision Conference held in New Orleans, our Content Strategist Cecilia Haynes interviewed conference speaker Dr. Tyrone Grandison. At the time of the interview, he was the Deputy Chief Data Officer at the U.S. Department of Commerce. Tyrone is currently the Chief Information Officer for the Institute for Health Metrics and Evaluation.


Dr. Tyrone Grandison

Coming fresh off his talk on "Data science, apps and civic responsibility", Cecilia was thrilled to chat with Tyrone all about the democratization of data and how open data can help anyone build innovative products and services.

Issues with Data Ownership and Privacy

Cecilia: Thanks for meeting with me! I saw your talk and I thought you would be the perfect person to reach out to. Since you're in government, you're approaching data in a different way than the business or tech world. What is your take on open data?

Tyrone: Data within startups and companies is proprietary. I have a big issue with data ownership, data privacy, and data security, and with the many companies feeling that because they collected data and are stewards of it, they immediately have ownership rights.

For example, who does the data belong to if you're in a hospital and the hospital takes down your information for an evaluation? When a hospital generates data on your condition in the process of delivering care, you likely believe that that data is still yours. However, hospitals don't assume that.

Cecilia: I actually didn’t know that. That’s troubling.

Tyrone: I mean it’s basically their proprietary intellectual property at that point where they now have the right to sell it based upon the terms and conditions that you actually agreed to.

It’s the same thing that happens when you use something like a Fitbit.

*Note that Cecilia was wearing a Fitbit…

I looked at your hand and was just like, “That data is not yours.”

The US Government’s Approach to Data

“We want to reduce the barrier of entry for people working on and with data.”

Cecilia: What is the government’s approach to data?

Tyrone: So the government is more focused on the power of open data and how we can actually increase its accessibility and usability.

This includes exploring how to enable public-private data partnerships, and, in the process, help government be more data-driven in how it’s run. What I’ve observed is that the Department of Commerce, for example, has highly valuable data sets.

A quick example is NOAA, which is the National Oceanic and Atmospheric Administration. Commerce has twelve bureaus and NOAA is a bureau within Commerce. NOAA provides information for the weather industry globally.

It’s all free, but no one really knows this. It is technically all open, but it’s very difficult to find and it’s very difficult to actually understand.

And there are some companies that have leveraged this information by investing in understanding it and making it clean and accessible. That's why you have theweather.com, that's why you have the Weather Channel – that's all NOAA. Even more striking, NOAA collects around 20-30 terabytes of data per day. They even have satellites monitoring the sun's surface. They have sensors monitoring sounds underwater; you name it, they monitor it. 30 terabytes a day, but they only actually release 2 terabytes of that data, and it's only a fraction of those 2 terabytes that funds the world's weather system.

Cecilia: Oh, so is that why weather predictions are unreliable?

Tyrone: No, no, that’s not the data’s fault. That is on the analytical models on top of the data.

If you had access to more data, and you had a better understanding of the nuances of collection like what you have to filter out and what overlaps, then you can actually get better models. The prediction models are actually better now than they were like three years ago and current three-day predictions are pretty spot on. If we go farther than that, then okay, not so reliable…

Using Open Data to Find Targeted Demographics

Cecilia: What other data sources can benefit companies?

Tyrone: The Census Bureau has this thing called the American Community Survey which basically documents the daily lives of all Americans. So, if you want to know anything, they have tens of thousands of features, which means tens of thousands of descriptors on the lives of Americans.

Every single study that you see about how Americans are living, or whatever else, that's all from the Census Bureau. These studies don't acknowledge the Census; they don't give attribution back to the Census. But there is nowhere else the data could come from.

Say I wanted to get access to senior citizens over 65 who collected social security benefits and who used to commute 10 miles to their job. Almost any attribute you could actually think of, you could find this demographic right now with open data.

Cecilia: And it’s all completely available?

Tyrone: It’s all open. There is a project called the Commerce Data Usability Project that we’re doing at the department where we produce tutorials that show you:

Here’s a valid data set, here is a story as to why you should care about the data set, here is how to get it, here is how it’s processed, here is how to actually make some visualizations from it, here is how to actually analyze it. Go.

Tools to Support the Democratization of Data

Cecilia: The democratization of data is such a big deal to us as well. It’s why we open source our software and products, and why we made an open source visual scraper, so that anyone can engage with web data.

One of our goals is to enable data journalists, data scientists, everyday people to be able to use our tools to seek out the information they need.

Tyrone: Commerce is really dedicated to this goal as well. That’s why we have a startup now within Commerce called Commerce Data Service whose mission it is to support all the bureaus on their data initiatives.

We want to fundamentally and positively change the way citizens and businesses interact with the data products from Commerce. We recognize the problems are tied to marketing, access, and usability.

The Data Service commits to having everything in the open, everything transparent as much as possible. If you want to see everything we’re working on right now, it’s on github.com/commercedataservice.

Take a look at the Data Usability Project, since we have a bunch of tutorials on everything from census data to data from NIST, which is the standards organization that covers everything from internet security standards to time standards, you name it.

We also have satellite information. So there is a satellite that was launched, I think it was October 2011, called the Delta 2. It had on it this device called the Visible Infrared Imaging Radiometer Suite, VIIRS, which actually monitors all human activity as it goes around.

So a bunch of scientists have been looking at this VIIRS data set that no one knows exists and figured out that it’s a really good proxy for a lot of amazing stuff. For example, you could actually use satellite imagery to predict population very simply. You could even use it to predict the number of mental health related arrests in San Francisco. You could also use it to figure out economic activity in a particular place.

Machine Learning for Data Analysis

Cecilia: So do you incorporate machine learning into analyzing the data?

Tyrone: We've got the platform and we have examples that show you how to use machine learning with the data sets. If you want to use machine learning algorithms on a data set, you can find everything you need. If you want to use the data sets with something else in a really straightforward way, to do straight mapping for example, then you have that on our platform too.

Cecilia: This is actually really helpful to me because we have partnerships with BigML, a company that specializes in predictive analytics, along with MonkeyLearn, a machine learning company that works on text analysis.

We’re always looking for new ways to highlight our collaboration, so we’ll have to check out VIIRS.

Using Open Data to Create Economic Opportunity

Cecilia: What is your role in the Department of Commerce?

Tyrone: I’m the Deputy Chief Data Officer. I’m one of three people that leads this Commerce Data Service and the office itself is the lead for the data pillar across Commerce.

The Secretary has a strategic plan with five initiatives that everyone has to tie into; data is one of them, and we're responsible for making sure that the data initiative is successful.

Cecilia: Have you found it really challenging so far?

Tyrone: The support from the Secretary and the senior staff at Commerce has been amazing. The challenge has actually been that we are not in the private sector, since delivering products in government is a little bit different than in private industry.

In private industry, you're focused on clicks, and buys, and elastic problems where it's all about growing and shrinking some base. With government, it's more about the hardest, most difficult problems, the ones that can be considered baseline needs, like, "I'd like to have health care. I'd like not to be homeless."

These are problems that, you know, no company will actually tackle because there is no profit motive, but these should be basic intrinsic rights for anyone who lives in the US. These are the problems that the government has to handle, and we have to produce amazing data products to make sure that we approach them in the right way.

Cecilia: So your goal is to create products that allow people to access data more easily?

Tyrone: Our approach is two-fold: One, we’re building the products to help people engage with the services that we’re offering. And two, we’re building a platform that’s an enabler. I hope that the platform is something that citizens can use to help solve local issues.

The Commerce Department’s mission is to create conditions for economic growth and opportunity. We want to empower citizens to take this data and build businesses and create more jobs.

That’s why we want to open as much data as possible and just encourage and engage with people so that they can build great things.

No Such Thing as Bad Data

Cecilia: So open data is a critical part of your strategy?

Tyrone: The more data you have, the more you can shed light on issues. However, you can't let the data speak for itself because you have to recognize that there is bias in data. If you recognize the bias first, you can try to filter it out, and if you can't, chuck it and use a different data source.

It’s important to have a data source that is real, legitimate, and sound so you can find a signal and get meaningful information out of it. It’s helpful if you have a purpose, or a direction, or a question you’re asking. Then you can actually say, “I want to see spending trends. I want to see who’s spending X on Y.” And just do an analysis of this one feature over time.

Cecilia: How do you determine the difference between good data vs bad data?

Tyrone: There is no good or bad since data is a product of the collection process and the people that handle it. It’s more about the people that clean, process, and provide it.

Cecilia: So there is a lot of importance in having a reliable group who gathers the data?

Tyrone: There is a lot of value in having the people responsible for ETL (Extract, Transform, Load) who can create a data set that is a gold standard. They reduce biases as much as possible, and they minimize errors as much as possible.

It’s important that they’re honest with the upstream consumer about the problems with any data sets they provide. If you’re really honest about it, then somebody else can know what are the right techniques to use on the data. Otherwise people might just use it willy-nilly and not know that it shouldn’t be used for that purpose or in a particular way.

So the good and bad thing, there is no dichotomy, it’s all data and the interpretation of it.

Advice for Getting Started with Using Open Data

Cecilia: Do you have any advice to people who are looking to get into open data or data security in this industry?

Tyrone: I’d say just go in with a problem or a question, something that’s burning in your heart that you want to solve. And then figure out what data sets you can use.

You have a backpack of methods and technologies available and it all starts with the question or problem you’re fundamentally trying to solve. You need to understand the user, the problem, and the context in which you have to deliver something.

That determines what tools you need to actually use to solve that problem, not the other way around. Don’t approach this with, “I have a hammer, I’m going to smash everything with it.”

One of my favorite interviews of the entire conference, thank you again for meeting with me, Tyrone!

Learn more about how you can use web data in your business and take a look at how anyone can get started with open data, no coding required.

Interview: How Up Hail uses Scrapy to Increase Transparency

During the 2016 Collision Conference held in New Orleans, Scrapinghub Content Strategist Cecilia Haynes had the opportunity to interview the brains and the brawn behind Up Hail, the rideshare comparison app.


Avi Wilensky is the Founder of Up Hail

Avi sat down with Cecilia and shared how he and his team use Scrapy and web scraping to help users find the best rideshare and taxi deals in real time.

Fun fact, Up Hail was named one of Mashable’s 11 Most Useful Web Tools of 2014.

Meet Team Up Hail

CH: Thanks for meeting with me! Can you share a bit about your background, what your company is, and what you do?

AW: We are team Up Hail and we are a search engine for ground transportation like taxis and ride-hailing services. We are now starting to add public transportation like trains and buses, as well as bike shares. We crawl the web using Scrapy and other tools to gather data about who is giving the best rates for certain destinations.


Scrapy for the win

There's a lot of data out there, especially public transportation data scattered across different government and public websites. This data is unstructured, messy, and has no APIs. Scrapy's been very useful in gathering it.

CH: How has your rate of growth been so far?

AW: Approximately 100,000 new users a month search our site and app, which is nice and hopefully we will continue to grow. There’s a lot more competition now than when we started, and we’re working really hard to be the leader in this space.

Users come to our site to compare rates and to find the best deals on taxis and ground transportation. They are also interested in finding out whether the different service providers are available in their cities. There are many places in the United States and across the world that don't have these services, so we attract people who want to find out more.

We also crawl and gather a lot of different product attributes such as economy vs. luxury, shared vs. private, how many people each of these options fit, whether they accept cash, and whether you can book in advance.

Giving users transparency on different car services and transportation options is our mission.

CH: By the way, where are you based?

AW: We’re based in midtown Manhattan in a place called A Space Apart. This is run by a very notable web designer and author named Jeffrey Zeldman who has been gracious enough to host us. He also runs A Book Apart, An Event Apart, and A List Apart, which are some of the most popular communities for web developers and designers.

Why the Team Members at Up Hail are Scrapy Fans

CH: You have really found some creative applications for Scrapy. I have to ask, why Scrapy? What do you appreciate about it?

AW: A lot of the sites that we're crawling are a mess, especially the government transit ones and the local taxi companies. As a framework, Scrapy has a lot of features built in right out of the box that are useful for us.

CH: Is there anything in particular that you’re like, “I’m obsessed with this aspect of Scrapy?”

AW: We're a Python shop and Scrapy is the Python library for building web crawlers. That's primarily why we use it. Of course, Scrapy also has a vibrant ecosystem of developers and it's just easy to use. The documentation is great and it was super simple to get up and running. It just does the job.

We're grateful that you make such a wonderful tool [Note: We are the original authors and lead maintainers of Scrapy] free and open source for startups like us. There are a lot of companies in your space that charge a lot of money, which makes their tools cost prohibitive.

CH: That’s really great to hear! We’re all about open source, so keeping Scrapy forever free is a really important aspect of this approach.

On Being a Python Shop

CH: So tell me a bit more about why you’re a Python shop?

AW: Our application runs on the Python Flask framework and we’re using Python libraries to do a lot of the back-end work.

CH: Dare I ask why you’re using Python?

AW: One of the early developers on the project is a Xoogler, and Python is one of Google’s primary languages. He really inspired us to use Python and we just love the language because it’s the philosophy of readability, brevity, and making it simple and powerful enough to get the job done.

I think developer time is scarce and Python makes it faster to deploy, especially for a startup that needs to ship fast.

Introducing Scrapy Cloud and the Scrapinghub Platform

CH: May I ask whether you've used our Scrapy Cloud platform to deploy Scrapy crawlers?

AW: We haven’t tried it out yet. We just found out about Scrapy Cloud, actually.

CH: Really? Where did you hear about us?

AW: I listen to a Python podcast [Talk Python To Me] which was with Pablo, one of your co-founders. I didn’t know about how Scrapy originated from your co-founders. When I saw your name in the Collision Conference app, I was like, “Oh, I know these guys from the podcast! They’re maintainers of Scrapy.” Now that we know about Scrapy Cloud, we’ll give it a try.

We usually run Scrapy locally or we’ll deploy Scrapy on an EC2 instance on Amazon Web Services.

CH: Yeah, Scrapy Cloud is our forever free production environment that lets you build, deploy, and scale your Scrapy spiders. We’ve actually just included support for Docker. Definitely let me know what you think of Scrapy Cloud when you use it.

AW: Definitely, I’ll have to check it out.

Plans for Up Hail’s Expansion

CH: Where are you hoping to grow within the next five years?

AW: That's a very good question. We're hoping to, of course, expand to more regions. Right now, we're in the United States, Canada, and Europe. There are a lot of other countries with tremendous populations that we're not covering. We'd also like to add a lot more transportation options into the mix. There are all these new things like on-demand helicopters, and we want to show users going from point A to point B all their available options. We're kind of like the Expedia of ground transportation.

Also, we're adding a lot of interesting new things like a scoring system. We're scoring how rideshare-friendly a city is. New York and San Francisco, of course, get 10s, but maybe over in New Jersey, where there are fewer options, some cities will get a 6 or 7. It depends on how many options are available. Buffalo, New York, for example, doesn't have Uber or Lyft, so it would probably get a 1 because it only has yellow taxis. This may be useful for users who are thinking of moving to a city and want to know how accessible taxis and rideshares are. We want to give users even more information about taxis and transportation options.

Increasing Transparency through Web Scraping

CH: It seems that increasing transparency is a large part of where you want to continue to grow.

AW: The transportation industry is not as transparent as it should be. We've heard stories at the Expo Hall [at Collision Conference] of taxi drivers ripping off tourists because they don't know in advance what a ride is going to cost. Because we scrape these sites, take the rate tables, and compute the estimates, travelers can just fire up our app and have a good idea of what it's going to cost.

CH: Is your business model based on something like Expedia’s approach?

AW: Similar. We get a few dollars when we sign up new users to the various providers. We're signing up a few thousand users a month. While it's been really good so far, we need to grow it tremendously and we're looking at other business models. Advertising on the site has been good for us as well, but, of course, it's limited. We don't want to be intrusive to our users by being overly aggressive with ads, so we're trying to keep it clean there.

Opening Up Up Hail’s API

AW: Within the next few months we hope to launch a public API to outside developers and other sites. We’ve talked with a lot of other vendors here at the expo like travel concierge apps and the like that want to bring in our data.

CH: Oh, that’s great! Seems to be the makings for a lot of cross-platform collaboration.

AW: We’ve gathered a lot of great data, thanks to Scrapy and other crawling tools, and we hope to make it available for others to use.

In fact, I specifically reached out to you to tell you how awesome Scrapy was.

CH: Well I’m thrilled you did! And I’m so glad we also got to talk about Python and how you use it in your stack.

AW: Definitely. We are heavily using Python to get the job done. We think it’s the right tool for the job and for what we’re doing.


Team Up Hail at Collision Conference 2016

Interest piqued? Learn more about what web scraping and web data can do for you.

Embracing the Future of Work: How To Communicate Remotely

What does "the Future of Work" mean to you? To us, it describes how we approach life at Scrapinghub. We don't work in a traditional office (we're 100% distributed) and we give folks the freedom to make their own schedules (you know when you work best). By finding ways to break away from the traditional 9-to-5 model, we ended up creating a framework for the Future of Work.

Maybe you've heard of this term and want to learn more, or maybe you're thinking about implementing aspects of it at your own company. Either way, we can't stress enough that effective communication is a key part of making it work.

The Future of Work: a Definition

According to Jacob Morgan (who literally wrote the book, The Future of Work), this broad term can be broken into three parts: Freedom/Flexibility, Autonomy, and Choice/Customization. Not mentioned in his list is the rise of AI, although we might as well prepare for our inevitable robot overlords.

The Future of Work describes both how the employment landscape will look in the future and how humans will need to adapt to technological leaps and changing expectations of employment. Along these lines, we are remote (always have been, always will be) and we are into machine learning. We're also open source to the core, and if you think wrangling regular developers is a challenge, wait until you meet scarily intelligent super brains who know how to maneuver their pet projects into benefiting their company.


We’re living, breathing, and occasionally chaotic proof that the Future of Work is already in the present.

Remote Life Culture

Our two co-founders, Shane from Ireland and Pablo from Uruguay, established a company based on talent, not geography. And that’s not even a marketing line.

Full disclosure, I’m a millennial who was drawn to Scrapinghub because of its remote stance. I wanted the flexibility to be a digital nomad while not needing to rely on the uncertainty of freelance work. This opportunity was the best of both worlds and I’ve since come to understand how parents can also benefit from remote life (and introverts, and folks who like to work at odd hours).

Flexibility

Remote work is not for everyone. While there are obviously co-working spaces and coffee shops, you need to be comfortable setting your own schedule and have the discipline to work by your lonesome.

On the flip side, you have the flexibility and freedom to sort out your responsibilities so that they fit within your life. And this is a pretty important point for companies looking to transition into the remote space. You need to understand how time zones impact the way that teams operate and how to trust your team members to get their work done on time.

Holidays

When hiring team members from a variety of countries and backgrounds, it’s important to keep local holidays in mind. Adopting an open holiday policy both respects the diversity of your company (we’re based in 48 countries) and also recognizes the importance of having time off.

Autonomy

Cultural fit is especially important in a remote team. Autonomy and trust are a huge part of how we operate because there is no one looking over your shoulder (quite literally). Finding motivated people who can finish their work and stay on top of their responsibilities without needing oversight is crucial to running a successful remote operation.

Remote Work

We make sure that our teammates feel comfortable voicing concerns and sharing ideas for improving the company by promoting an open Slack policy. No matter our level or seniority, we remain accessible to all members of Scrapinghub. This policy facilitates collaboration and helps create a sense of community.

Managing Miscommunication

Tone is incredibly difficult to convey through writing. Think of every misunderstanding that you’ve ever had through written communication (text messages, tweets, comments on Facebook, etc.), and then add in work-related stress. This is not a great situation unless you develop straightforward channels of communication.

Our Workflow

This is our system, so feel free to steal the workflow. Honestly, different methods work for different teams. As long as you have one unifying communication stream, go with whatever feels right:

Communication

Slack: Used for day-to-day communication. Slack is mainly how we stay in touch. The Snooze control is great, as is the Tomatobot Pomodoro timer. We also use Leo the Slack bot to gather upward feedback and get a handle on the pulse of our colleagues.

Google Hangouts: This is for team meetings and it can also be used for impromptu watercooler moments. In Growth specifically, we plan our sprints and have eclectic meetings about non-work related topics like aphantasia.

GoToMeeting: We use this platform for our larger gatherings so that the majority of the company can join in. We have Town Halls where our founders share company-wide information in order to increase transparency. GoToMeeting is also used for our Friday lightning talks (affectionately known as Shub talks) where team members present on interesting topics ranging from recipes to machine learning techniques.

Email: Optimal for handling outside communications and for sharing external chats with the rest of our teams.

Intercom: Intercom is how we stay in touch with users and customers.

Sprint Planning and Assigning Tasks

Redmine: Some teams use this as an issue tracker and as a way to assign necessary tasks.

Jira: Used by other teams for support ticket management and sprint planning.

Trello: Used by our Growth team for sprint planning and to keep track of daily activities. I use Trello for my editorial calendar since it’s easy for me to organize categories, assign writers, and set due dates.

Work

GitHub: We’re open source folks and so GitHub is a huge part of how we work. Even vacations are managed through pull requests.

BaseCRM: The sales team uses this for lead management.

Lever.co: Used by the HR team for managing job applications.

Updates and Improvements

Scrapinghub Enhancement Proposals (SHEP): SHEPs (a play on Python Enhancement Proposals) are plans created by employees with suggestions for how Scrapinghub can be improved. SHEPs include everything from HR perks to business planning. SHEPs can be created and submitted by anyone in the company.

Confluence: We recently adopted Confluence as a knowledge base in addition to Google Drive. We want to reduce the silo mentality and Confluence has been especially beneficial in accomplishing this goal. Team updates, meeting notes, and ongoing projects are easily reviewed and shared within the company using this program.

Newsletter: Our weekly newsletter (sent through MailChimp) shares information on more personal topics like vacations, new hires, and birthdays. The newsletters help us to keep up-to-date on the quirky and HR-related aspects of Scrapinghub. We highlight exemplary employees, team bios, and conference activities as a way to keep everyone in the loop and to remain connected.

Wrap Up

We're on the front lines of exploring the Future of Work. Between our technological advances (we welcome open source collaborators) and our remote organization, we're experimenting with how best to move forward with maximum flexibility while not turning into robots. We stress effective and clear communication because no one wants to play a game of telephone across international waters.


What are your thoughts on the Future of Work and on remote companies? What applications do you use that we don’t? What workflows are you implementing that you would like to share? Please let us know in the comments or reach out on Twitter.

What the Suicide Squad Tells Us About Web Data

Web data is a bit like the Matrix. It’s all around us, but not everyone knows how to use it meaningfully. So here’s a brief overview of the many ways that web data can benefit you as a researcher, marketer, entrepreneur, or even multinational business owner.


Since web scraping and web data extraction are sometimes viewed a bit like antiheroes, I’m introducing each of the use cases through characters from the Suicide Squad film. I did my best to pair according to character traits and real world web data uses, so hopefully this isn’t too much of a stretch.

This should be spoiler free, with nothing revealed that you can’t get from the trailers! Fair warning, you’re going to have Ballroom Blitz stuck in your head all day. And if you haven’t seen Suicide Squad yet, hopefully we get you pumped up for this popcorn movie.

Market Research and Predictions: Deadshot

Deadshot’s claim to fame is accurate aim. He can predict bullet trajectories and he never misses a shot. So I paired him with using web data for market research and trend prediction. You can scrape multiple websites for price fluctuation, new products, reviews, and consumer trends. This is an automated process that allows you to quickly and accurately analyze data without needing to manually monitor websites.

Social Media Monitoring: Harley Quinn

Harley Quinn has a sunny personality that remains chipper even when faced with death, destruction, torture, and mayhem. She also always has a witty comeback no matter the situation. These traits go hand-in-hand with how brands should approach social media channels. Extracting web data from social media interactions helps you understand consumer opinions. You can monitor ongoing chatter about your company or your competition and respond in the most positive way possible.

Lead Generation and HR Recruitment: Amanda Waller

This is probably the most obvious pairing since Amanda Waller (played by the wonderful Viola Davis) is the one responsible for assembling the Suicide Squad. She carefully researched and compiled intimate details on all the criminals-turned-reluctant-heroes. This is an aspect of web data that benefits all sales, marketing, recruitment, and HR. With a pre-vetted pool, you’ll have access to qualified leads and decision-makers without needing to wade through the worst of the worst.

Tracking Criminal Activity in the Dark Web: Killer Croc

This sewer-dwelling villain thrives in dark and hidden spaces. He’s used to working underground and in places most people don’t even know exist. This makes Killer Croc the perfect backdrop for the type of web data located in the deep/dark web. The dark web is the part of the internet that is not indexed by search engines (Google, Bing, etc.) and is often a haven for criminal activity. Data scraped from this part of the web is commonly used by law enforcement agencies.

Competitive Pricing: Captain Boomerang

This jewelry thief goes around the world stealing from banks and committing acts of burglary – with a boomerang… Captain Boomerang knows all about pricing and the comparative value of products so he can get the largest bang for his buck. Similarly, web data is a great resource for new companies looking to research their industry and how their prices match up to the competition. And if you are an established company, this is a great way for you to keep track of newcomers and potential market disruptors.

Machine Learning Models: Enchantress

In her 6313 years of existence, the Enchantress has had to cope with changing times, customs, and civilizations. The ability to learn quickly and adapt to new situations is definitely an important part of her continued survival. Likewise, machine learning is a form of artificial intelligence that can learn when given new information. Train your machine learning models using datasets for conducting sentiment analysis, making predictions, and even automating web scraping. Whether you are a SaaS company specializing in developing machine learning technology or someone who needs machine learning analysis, you need to ensure you have up-to-date datasets.

Monitoring Resellers: Colonel Rick Flag

Colonel Rick Flag is a “good guy” whose job is to keep track of the Suicide Squad and kill them if they get out of line. Now obviously your relationship with resellers is not a life-and-death situation, but it can be good to know how your brand is being represented across the internet. Web scraping can help you keep track of reseller customer reviews and any contract violations that might be occurring.

Monitoring Legal Matters and Government Corruption: Katana

Katana the samurai is the enforcer of the Suicide Squad. She is there as an additional check to keep the criminal members in line. Similarly, web data allows reporters, lawyers, and concerned citizens to keep track of government officials, potential corruption charges, and changing legal matters. You can scrape obscure or poorly presented public records and then use that information to create accessible interfaces for easy reference and research.

Web Scraping for Fun: the Joker

I believe the Joker needs no introduction, whether you know this character from Jack Nicholson, Heath Ledger, or the new Jared Leto incarnation. He is unpredictable, has eclectic tastes, and is capable of doing anything. And honestly, this is what web scraping is all about. Whether you want to build a bike sharing app or monitor government corruption, web data provides the backbone for all of your creative endeavors.

Wrap Up

I hope you enjoyed this unorthodox tour of the world of web data! If you’re looking for some mindless fun, Suicide Squad ain’t half bad (it ain’t half good either). If you’re looking to explore how web data fits within your business or personal projects, feel free to reach out to us. And if you’re looking to hate on or defend Suicide Squad, comment below.

P.S. There is no way this movie is worse than Batman v Superman: Dawn of Justice

Introducing the Datasets Catalog

Folks using Portia and Scrapy are engaged in a variety of fascinating web crawling projects, so we wanted to provide you with a way to share your data extraction prowess with the world.

With this need in mind, we’re pleased to introduce the latest addition to our Scrapinghub platform: the Datasets Catalog!

This new feature allows you to immediately share the results of your Scrapinghub projects as publicly searchable datasets. Not only is this a great way to collaborate with others, but you can also save time by using other people’s datasets in your projects.

As fans of the open data movement, we hope that this new feature will ease the process of disseminating data. Open data has been used to help foster transparency in governmental and corporate systems worldwide. Researchers and developers have also benefited from the mutual sharing of information. A couple of our own engineers have even used open data to power transportation apps and to help journalists expose corruption.

Read on to get some ideas on how to use the Datasets Catalog in your workflow.

The Datasets Catalog at a Glance

We are launching the Datasets Catalog with the following features:

  • Publish the data collected by your Portia or Scrapy spiders/web crawlers as easily accessible datasets
  • Highlight your scraped data and help others locate the information they need by giving each dataset a name and a description
  • Let others discover your datasets through search engines like Google
  • Browse publicly available datasets that other people are sharing: https://app.scrapinghub.com/datasets
  • Choose how to share your dataset using three different privacy settings:
    • Public datasets are accessible by anyone (even those without a Scrapinghub account) and are indexed by search engines
    • Restricted datasets are accessible only to the users you explicitly grant access to (they need to have a Scrapinghub account)
    • Private datasets are accessible only by the members of your organization

How Does it Work?

You can find the new “Datasets” option in the top navigation bar. On the main Datasets Catalog page, you can browse available datasets along with those that you have recently visited.

Publishing your scraped data into complete datasets takes just one click. This tutorial will get you started on publishing and sharing your extracted data.

Wrap Up

And there you have it: a way not only to showcase your web crawling and data extraction skills, but also to help others with the information that you provide.

We invite you to contribute your datasets and play your part in helping drive the open data movement forward. Reach out to us on Twitter and let us know what datasets you would like to see featured, and if you have any recommendations for improving the whole Datasets experience.

We’re excited to see what you come up with!

Introducing the Crawlera Dashboard

We’ve been rolling out a lot of updates, upgrades, and new features lately, and we’re continuing this trend by announcing the very first Crawlera Dashboard!

Crawlera is a smart downloader that allows you to crawl and scrape websites responsibly. It rotates IP addresses and keeps track of which ones have been blocked by websites, ensuring that your crawls continue uninterrupted. Since Crawlera has always been a mainstay of Scrapinghub, we wanted to revamp its presentation to help you crawl the web and extract data more effectively.
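
If you want to point an existing Scrapy project at Crawlera while you explore the dashboard, the typical setup is just a few settings. Here is a minimal sketch assuming the scrapy-crawlera downloader middleware is installed (pip install scrapy-crawlera); the API key is a placeholder, and the exact setting names may vary with the middleware version you use:

```python
# settings.py -- a minimal sketch for routing a Scrapy project through Crawlera.
# Assumes the scrapy-crawlera downloader middleware is installed;
# replace the placeholder API key with the one from your Crawlera account.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'
```

Once enabled, requests from your spiders are routed through Crawlera’s rotating IPs, and that traffic is what shows up in the dashboard described below.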

With that in mind, say hello to the dashboard that allows you to visualize how you are using Crawlera, to examine what sites and specific URLs you are targeting, and to manage multiple accounts.

Read on as we walk you through the various features of the dashboard and show you how best to use it.

Crawlera Dashboard

The main benefits of the dashboard include seeing which websites are being crawled the most and understanding the responses you’re getting via Crawlera overall and for each individual account. You can also create and manage different Crawlera accounts from your dashboard, a feature which was not previously available.

Having all of your accounts in one place allows you to more accurately manage and monitor your Crawlera use.

In the past you could only see two global usage graphs: the number of requests you’ve made per day for the last 30 days and per month for the last 12 months. Now you can also view the most recent requests performed by Crawlera in any of your accounts, and filter requests by state (succeeded, failed, banned) per website. The graphs are also more comprehensive and representative of your Crawlera activity.

Plans for the future

In the future, we plan to add support for managing the regions used by your Crawlera account(s), as well as tools to understand how Crawlera impacts your crawls. We’ll also provide additional resources for our Enterprise clients, such as fine-tuned management of a dedicated pool of IP addresses.

Resources to Get Started

Here is the official Crawlera documentation to help you get started. We will also be rolling out a tutorial video, so keep your eyes peeled in the coming weeks.

Wrap Up

To our longtime Crawlera users, we hope that you enjoy the enhanced features of the Crawlera Dashboard. To users new to Crawlera, we’re thrilled that you’re joining us during the release of this new interface.

Please let us know what you think of the dashboard and feel free to reach out with further suggestions on Twitter. Happy scraping!

Machine Learning with Web Scraping: New MonkeyLearn Addon

Say Hello to the MonkeyLearn Addon

We deal in data. Vast amounts of it. But while we’ve been traditionally involved in providing you with the data that you need, we are now taking it a step further by helping you analyze it as well.

To this end, we’d like to officially announce the MonkeyLearn integration for Scrapy Cloud. This feature will bring machine learning technology to the data that you extract through Scrapy Cloud. We also offer a MonkeyLearn Scrapy Middleware so you can use it on your own platform.

MonkeyLearn is a classifier service that lets you analyze text. It provides machine learning capabilities such as categorizing products by name or running sentiment analysis to figure out whether a customer review is positive or negative.

You can use MonkeyLearn as an addon for Scrapy Cloud. It only takes a minute to enable and once this is done, your items will flow through a pipeline directly into MonkeyLearn’s service. You specify which field you want to analyze, which MonkeyLearn classifier to apply to it, and which field it should output the result to.

Say you were involved in the production of Batman v Superman and you’re interested in how people reacted to your high-budget movie. You could use Scrapy to track mentions of the film across the web and then use MonkeyLearn to perform sentiment analysis on the samples that you collect. But don’t get too excited, because you might not like the results of your search…

There are so many ways that you can use this addon with our platform, so we’ll be featuring a series of tutorials that will help you make the most out of this partnership.

Getting Started with MonkeyLearn

MonkeyLearn provides public modules that are ready to go, or you can create your own text analysis module by training a custom machine learning model.

For example, using traditional sentiment analysis on comments filled with trolls would return a 100% negative rating. To develop a “Troll Finder” you would need to create a custom model with a higher tolerance for the extreme negativity. You could create categories like “troll”, “ubertroll”, and “trollmaster” for further categorization. Check out MonkeyLearn’s tutorial to help you through this task.

Before you get started with the MonkeyLearn addon on Scrapy Cloud, you first need to sign up for the MonkeyLearn service. They offer a free account, so you don’t need to worry about the cash monies. Once you’ve signed up, you’ll be taken to your dashboard.

Click the “Explore” option in the top menu to check out the whole range of ready-made classifiers that you can apply to the scraped data. There are a ton of different options to choose from including sentiment analysis for product reviews, language detectors, and extractors for useful data such as phone numbers and addresses.

Choose the classifier that you’re interested in and make a note of its ID. You can find the ID in the URL.

And now that you’re all set on the MonkeyLearn side, it’s time to head back over to Scrapy Cloud.

Addon Walkthrough

You can access the MonkeyLearn addon through your dashboard. Navigate to Addons Setup.

Enable the addon and click Configure.

Head down to Settings.

To configure the addon, you need to set your MonkeyLearn API key, specify the classifier you want to use, and name the field where you want the result stored. You’ll need the classifier ID you noted earlier on the MonkeyLearn platform.

MonkeyLearn reads the content of the field you’ve specified, runs the classification on it, and returns the result in the field you defined as the categories field.

For example, to detect the category of a movie based on its title, you would enter the ID of the module you want to use in the first text box, your authorization token in the second, the item field you want to analyze (title, in our case) in the third, and the name of the field that will store the results from MonkeyLearn in the fourth.
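
To make that flow concrete, here is a rough standalone sketch of the classification step the addon performs for you. This is not the addon’s internal code: it calls MonkeyLearn’s public REST API directly with the requests library, the endpoint follows the current (v3) API and may differ from older versions, and the token and classifier ID are placeholders you would replace with your own.

```python
# A standalone sketch of the classification step the addon performs for you.
# NOT the addon's internal code: it calls MonkeyLearn's public REST API directly.
# The endpoint follows the current (v3) API; the token and classifier ID are placeholders.
import requests

MONKEYLEARN_TOKEN = '<your MonkeyLearn API token>'
CLASSIFIER_ID = 'cl_xxxxxxxx'  # placeholder -- taken from the classifier URL


def classify(texts):
    """Send a batch of texts to a MonkeyLearn classifier and return the results."""
    response = requests.post(
        'https://api.monkeylearn.com/v3/classifiers/%s/classify/' % CLASSIFIER_ID,
        headers={'Authorization': 'Token %s' % MONKEYLEARN_TOKEN},
        json={'data': texts},
    )
    response.raise_for_status()
    return response.json()  # one result object per input text


if __name__ == '__main__':
    for result in classify(['The Dark Knight', 'Suicide Squad']):
        print(result.get('classifications'))
```

Each element of the returned list carries the classifications for the corresponding input text, which is roughly what the addon writes into the output field you configured above.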

And you’re all done! Locked and loaded and ready to go with MonkeyLearn.

Using MonkeyLearn with Scrapy

The MonkeyLearn addon is a part of Scrapy Cloud, so you can use it with your Scrapy spiders. Scrapy is also open source, so you can easily run it on your own system.

The addon means you don’t need to worry about learning MonkeyLearn’s API and how to route requests manually. If you need to use MonkeyLearn outside of Scrapy Cloud, you can use the middleware for the same purpose.
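
If you would rather wire this up yourself outside Scrapy Cloud, a hand-rolled item pipeline can play the same role. The sketch below is hypothetical and is not the official middleware: the MYPROJECT_ML_* setting names are made up for illustration, and the response fields follow MonkeyLearn’s current (v3) API.

```python
# pipelines.py -- a hypothetical, hand-rolled item pipeline illustrating the same
# idea outside Scrapy Cloud. NOT the official middleware; the MYPROJECT_ML_*
# setting names are made up for illustration.
import requests


class MonkeyLearnPipeline(object):
    """Classify one item field with MonkeyLearn and store the top label."""

    def __init__(self, token, classifier_id, source_field, target_field):
        self.token = token
        self.classifier_id = classifier_id
        self.source_field = source_field
        self.target_field = target_field

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            token=s.get('MYPROJECT_ML_TOKEN'),
            classifier_id=s.get('MYPROJECT_ML_CLASSIFIER'),
            source_field=s.get('MYPROJECT_ML_SOURCE', 'title'),
            target_field=s.get('MYPROJECT_ML_TARGET', 'category'),
        )

    def process_item(self, item, spider):
        text = item.get(self.source_field)
        if text:
            response = requests.post(
                'https://api.monkeylearn.com/v3/classifiers/%s/classify/'
                % self.classifier_id,
                headers={'Authorization': 'Token %s' % self.token},
                json={'data': [text]},
            )
            response.raise_for_status()
            classifications = response.json()[0].get('classifications') or []
            if classifications:
                # The target field must be declared on your Item (or use plain dicts).
                item[self.target_field] = classifications[0].get('tag_name')
        return item
```

You would enable it like any other pipeline, for example ITEM_PIPELINES = {'myproject.pipelines.MonkeyLearnPipeline': 800}, and define the illustrative settings in your project’s settings.py.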

When to use MonkeyLearn

We’re really excited about this integration because it is a huge step in closing the gap between data acquisition and analysis.

MonkeyLearn offers a range of text analysis services including:

  • Product categorization based on product names
  • Language detection
  • Sentiment analysis
  • Keyword extraction
  • Taxonomy classification
  • News categorization
  • Entity extraction

We’ll delve into what you can do with each of these tools in future tutorials. For now, feel free to experiment and explore this integration in your web scraping projects.

Wrap Up

Data and text analysis becomes more efficient when you combine MonkeyLearn’s machine learning capabilities with our data extraction platform. Whether you are using this for personal projects (tracking and monitoring advance reviews for Captain America: Civil War [Team Cap]) or for professional tasks, we’re excited to see what you come up with.

Keep your eyes peeled: the first tutorial will walk you through using the Retail Classifier with the MonkeyLearn addon. Sign up for free for Scrapy Cloud and MonkeyLearn and give this addon a whirl.