Scraping in Python (Best Tutorial 2019)


Scrapy and Scraping in Python Tutorial 2019

Scrapy is a comprehensive and popular Python library for crawling websites and extracting structured data from them. This tutorial explains scraping in Python with worked examples.

 

It deals with both the HTTP and HTML sides of things and provides a command-line tool to quickly set up, debug, and deploy web crawlers.

 

Scrapy is a powerful tool worth learning, even though its programming interface differs somewhat from requests, Beautiful Soup, and Selenium. Based on what you've learned so far, this should not pose too many problems.

 

Especially in cases where you have to write a robust crawler, it can be useful to take a look at Scrapy as it provides many sane defaults for restarting scripts, crawling in parallel, data collection, and so on.
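To give a feel for what this looks like in practice, here is a minimal sketch of a Scrapy spider, using the quotes.toscrape.com practice site that also appears in the examples later on. Treat it as an illustration rather than a full project setup; it can be run with "scrapy runspider quotes_spider.py -o quotes.json".

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item per quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow the "Next" link, if any, and parse that page too
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)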

 

The main advantage of using Scrapy is that it is very easy to deploy scrapers in “Scrapy Cloud”, a cloud platform for running web crawlers.

 

This is helpful in cases where you want to quickly run your scraper on a bunch of servers without hosting these yourself, though note that this service comes at a cost. An alternative would be to set up your scrapers on Amazon AWS or Google’s Cloud Platform.

 

A notable drawback of Scrapy is that it does not emulate a full browser stack, so dealing with JavaScript will be troublesome with this library. There does exist a plug-in that couples "Splash" (a JavaScript rendering service) with Scrapy, though this approach is a little cumbersome to set up and maintain.

 

Caching


Caching is another aspect worth discussing. We haven't said much about caching so far, but it is a good idea to implement some client-side caching solution while building your web scrapers, so that fetched web pages are kept in a local "memory."

 

This avoids hammering web servers with the same requests over and over, which is especially helpful during development of your scripts (where you'll often restart a script to check whether a bug has been fixed, whether you get the results you expect, and so on).

 

A very interesting library to take a look at in this context is CacheControl, which can simply be installed through pip and directly used together with requests as follows:

import requests
from cachecontrol import CacheControl

session = requests.Session()
cached_session = CacheControl(session)
# You can now use cached_session like a normal session
# All GET requests will be cached
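As a quick illustration (the URL is just a placeholder, and whether a response is actually reused also depends on the caching headers the server sends back):

# The first call goes over the network; the second may be answered from the cache
r1 = cached_session.get('http://quotes.toscrape.com/')
r2 = cached_session.get('http://quotes.toscrape.com/')
print(r1.status_code, r2.status_code)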

 

Proxy Servers


You can also take a look at setting up a local HTTP proxy server on your development machine.

An HTTP proxy server acts as an intermediary for HTTP requests: a client (a web browser or Python program) sends an HTTP request, but instead of contacting the web server on the Internet directly, it sends the request to the proxy server first.

 

Depending on the configuration of the proxy server, it may decide to modify the request before sending it along to the true destination.

There are various reasons why the usage of an HTTP proxy server can come in handy.


First, most proxy servers include options to inspect HTTP requests and replies, so that they offer a solid add-on on top of your browser’s developer tools.

 

Second, most proxy servers can be configured to enable caching, keeping the HTTP replies in their memory so that the end destination does not have to be contacted multiple times for subsequent similar requests. Finally, proxy servers are oftentimes used for anonymity reasons as well.

 

Note that the destination web server will see an HTTP request coming in that originated from the HTTP proxy server, which does not necessarily need to run on your local development machine.

 

As such, they’re also used as a means to circumvent web scraping mitigation techniques where you might be blocked in case the web server sees too many requests coming from the same machine.

 

In this case, you can either pay for a service providing a pool of HTTP proxy servers (see, e.g., https://proxymesh.com/) or use an anonymity network such as Tor, which is free but primarily meant to provide anonymity, and hence might be less suitable for web scrapers.
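As a small sketch of how this looks with requests (the proxy address below is a placeholder for whatever local or remote proxy you are using):

import requests

proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}
# All traffic for this request is routed through the proxy server
r = requests.get('http://quotes.toscrape.com/', proxies=proxies)
print(r.status_code)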

 

 Scraping in Other Programming Languages


Moving away from Python to other languages, it is good to know that Selenium also provides libraries for a number of other programming languages, with Java being one of the other main targets of the project.

 

In case you’re working with R — another popular language for data science — definitely take a look at the “rvest” library which is inspired by libraries like Beautiful Soup and makes it easy to scrape data from HTML pages from R.

 

The recent rise in attention given to JavaScript, and the language becoming more and more viable for server-side scripting, has also spawned a number of powerful scraping libraries for JavaScript.

 

For a long time, PhantomJS, a headless and scriptable WebKit browser, was the go-to choice here. Since PhantomJS code can be a bit verbose, other libraries such as Nightmare have been proposed, which offer a more user-friendly, high-level API on top of PhantomJS.

 

Another interesting project in this space is SlimerJS, which is similar to PhantomJS except that it runs on top of Gecko, the browser engine of Mozilla Firefox, instead of WebKit.

 

CasperJS is another high-level library that can be used on top of either PhantomJS or SlimerJS. Another interesting recent project is Puppeteer, a library that provides a high-level API to control a headless Chrome web browser.

 

Driven by the popularity of PhantomJS, Chrome’s developers are spending a lot of effort in providing a headless version of their browser.

 

Up until now, most deployments of web scrapers relying on a full browser would either use PhantomJS, which is already headless but differs slightly from a real browser; or would use a Firefox or Chrome driver together with a virtual display, using “Xvfb” on Linux, for instance.

 

Now that a true headless version of Chrome is becoming more stable, it is becoming a strong alternative to PhantomJS, especially as it is also possible to use this headless Chrome setup as a driver for Selenium. Vitaly Slobodin, the former maintainer of PhantomJS, has already stated that "people will switch to it, eventually.

 

Chrome is faster and more stable than PhantomJS. And it doesn’t eat memory like crazy.” In March 2018, the maintainers of PhantomJS announced that the project would cease to receive further updates, urging users to switch to Puppeteer instead.
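As a rough sketch of the headless Chrome setup mentioned above, driven through Selenium (this assumes a reasonably recent Selenium release and a matching chromedriver on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')   # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('http://quotes.toscrape.com/')
print(driver.title)
driver.quit()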

 

Command-Line Tools


It’s also worth mentioning a number of helpful command-line tools that can come in handy when debugging web scrapers and interactions with HTTP servers. HTTPie is a command-line tool with beautiful output and support for form data, sessions, and JSON, making the tool also very helpful when debugging web APIs.

 

Another helpful tool is curl. When following along with a request in Chrome's developer tools, you can right-click it in the Network tab and select "Copy as cURL." This will put a command on your clipboard that you can paste into a command-line window, which will perform the exact same request as Chrome did and should provide you with the same result.

 

If you are stuck in a difficult debugging session, inspecting this command might give an indication about which header or cookie you’re not including in your Python script.
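As an illustration, a command copied this way typically looks something like the following, with the exact headers and cookies depending on the site and your browser session:

curl 'https://news.ycombinator.com/news' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...' \
  -H 'Accept: text/html,application/xhtml+xml' \
  -H 'Cookie: ...' \
  --compressed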

 

Graphical Scraping Tools


Finally, we also need to talk a little bit about graphical scraping tools. These can be offered either as stand-alone programs or as browser plug-ins.

 

Some of these have a free offering, like Portia (https://portia.scrapinghub.com) or Parsehub (https://www.parsehub.com) whereas others, like Kapow (https://www.kofax.com/data-integration-extraction), Fminer (http://www.fminer.com/), and Dexi (https://dexi.io/) are commercial offerings.

 

The feature set of such tools differs somewhat. Some will focus on the user-friendliness aspect of their offering. “Just point us to a URL and we’ll get out the interesting data,” as if done by magic.

 

This is fine, but based on what you’ve seen so far, you’ll be able to replicate the same behavior in a much more robust (and possibly even quicker) way. As we’ve seen, getting out tables and a list of links is easy.

 

Oftentimes, these tools will fail to work once the data contained in a page is structured in a less straightforward way or when the page relies on JavaScript. Higher-priced offerings will be somewhat more robust.

 

These tools will typically also emulate a full browser instance and will allow you to write a scraper using a workflow-oriented approach by dragging and dropping and configuring various steps. This is already better, though it also comes with a number of drawbacks.

 

First, many of these offerings are quite expensive, making them less viable for experiments, proof of concepts, or smaller projects. Second, elements will often be retrieved from a page based on a user simply clicking on the item they wish to retrieve.

 

What happens in the background is that a CSS selector or XPath rule will be constructed, matching the selected element. This is fine in itself (we’ve also done this in our scripts), though note that a program is not as smart as a human programmer in terms of fine-tuning this rule.

 

In many cases, a very granular, specific rule is constructed, which will break once the site returns results that are structured a bit differently or an update is made to the web page.

 

Just as with scrapers written in Python, you’ll hence have to keep in mind that you’ll still need to maintain your collection of scrapers in case you wish to use them over longer periods of time.

 

Graphical tools do not fix this issue for you, and can even cause scrapers to fail more quickly as the underlying “rules” that are constructed can be very specific.

 

Many tools will allow you to change the selector rules manually, but to do so, you will have to learn how they work, and taking a programming-oriented approach might then quickly become more appealing anyway.

 

Finally, in our experience with these tools, we’ve also noted that the browser stacks they include are not always that robust or recent. We’ve seen various cases where the internal browser simply crashes when confronted with a JavaScript-heavy page.

 

Best Practices and Tips


With the following overview, we provide a list of best practices to summarize what you should keep in mind when building web scrapers:

 

Go for an API first: Always check first whether the site you wish to scrape offers an API. If it doesn’t, or it doesn’t provide the information you want, or it applies rate limiting, then you can decide to go for a web scraper instead.

 

Don’t parse HTML manually: Use a parser such as Beautiful Soup instead of trying to untangle the soup manually or using regular expressions.

 

Play nice: Don’t hammer a website with hundreds of HTTP requests, as this will end up with a high chance of you getting blocked anyway. Consider contacting the webmaster of the site and work out a way to work together.

 

Consider the user agent and referrer: Remember the “User-Agent” and “Referer” headers. Many sites will check these to prevent scraping or unauthorized access.

 

Web servers are picky: Whether it’s URL parameters, headers, or form data, some web servers come with very picky and strange requirements regarding their ordering, presence, and values. Some might even deviate from the HTTP standard.

 

Check your browser:

If you can’t figure out what’s going wrong, start from a fresh browser session and use your browser’s developer tools to follow along through a normal web session — preferably opened as an “Incognito” or “private browsing” window (to make sure you start from an empty set of cookies).

 

If everything goes well there, you should be able to simulate the same behavior as well. Remember that you can use “curl” and other command-line tools to debug difficult cases.

 

Before going for a full JavaScript engine, consider internal APIs: Check your browser's network requests to see whether you can access a data source used by JavaScript directly before going for a more advanced solution like Selenium.

 

Assume it will crash: The web is a dynamic place. Make sure to write your scrapers in such a way that they provide early and detailed warnings when something goes wrong.

 

Crawling is hard: When writing an advanced crawler, you'll quickly need to incorporate a database, deal with restarting scripts, monitoring, queue management, timestamps, and so on to create a robust crawler.

 

Some tools are helpful, some are not:

There are various companies offering "cloud scraping" solutions, like, for example, Scrapy Cloud. The main benefit of using these is that you can utilize their fleet of servers to quickly parallelize a scraper.

 

Don’t put too much trust inexpensive graphical scraping tools, however. In most cases, they’ll only work with basic pages, cannot deal with JavaScript, or will lead to the construction of a scraping pipeline that might work but uses very fine-grained and specific selector rules that will break the moment the site changes its HTML a little bit.

 

Scraping is a cat-and-mouse game:

Some websites go very far in order to prevent scraping. Some researchers have investigated the various ways in which, for example, Selenium or browsers such as PhantomJS differ from normal ones (by inspecting their headers or JavaScript capabilities).

 

It’s possible to work around these checks, but there will be a point where it will become very hard to scrape a particular site.

 

Even when using Chrome through Selenium, specialized solutions exist that will try to identify nonhuman patterns, such as scrolling or navigating or clicking too fast, always clicking at the middle position of an element, and so on.

 

It’s unlikely that you’ll encounter many such cases in your projects, but keep this in mind nonetheless.

 

Keep in mind the managerial and legal concerns, and where web scraping fits in your data science process: As discussed, consider the data quality, robustness, and deployment challenges that come with web scraping.

 

Similarly, keep in mind the potential legal issues that might arise when you start depending on web scraping a lot or start to misuse it.

 

Examples

The following examples are included in this blog:


Scraping Hacker News: This example uses requests and Beautiful Soup to scrape the Hacker News front page.

 

Using the Hacker News API: This example provides an alternative by showing how you can use APIs with requests.

 

Quotes to Scrape: This example uses requests and Beautiful Soup and introduces the “dataset” library as an easy means to store data.

 

Blogs to Scrape: This example uses requests and Beautiful Soup, as well as the dataset library, illustrating how you can run a scraper again without storing duplicate results.

 

Scraping GitHub Stars: This example uses requests and Beautiful Soup to scrape GitHub repositories and show how you can perform a login using requests, reiterating our warnings regarding legal concerns.

 

Scraping Mortgage Rates: This example uses requests to scrape mortgage rates from a particularly tricky site.

 

Scraping and Visualizing IMDB Ratings:

This example uses requests and Beautiful Soup to get a list of IMDB ratings for TV series episodes. We also introduce the “matplotlib” library to create plots in Python.

 

Scraping IATA Airline Information:

This example uses requests and Beautiful Soup to scrape airline information from a site that employs a difficult web form. An alternative approach using Selenium is also provided. Scraped results are converted to a tabular format using the “pandas” library, also introduced in this example.

 

Scraping and Analyzing Web Forum Interactions:

This example uses requests and Beautiful Soup to scrape web forum posts and stores them using the dataset library. From the collected results, we use pandas and matplotlib to create heat map plots showing user activity.

 

Collecting and Clustering a Fashion Data Set:

This example uses requests and Beautiful Soup to download a set of fashion images. The images are then clustered using the “scikit-learn” library.

 

Sentiment Analysis of Scraped Amazon Reviews:

This example uses requests and Beautiful Soup to scrape a list of user reviews from Amazon, stored using the dataset library. We then analyze these using the "vaderSentiment" library in Python and plot the results using matplotlib.

 

Scraping and Analyzing News Articles:

This example uses Selenium to scrape a list of news articles, stored using the dataset library. We then associate these to a list of topics by constructing a topic model using nltk.

 

Scraping and Visualizing a Board Members Graph:

This example uses requests and Beautiful Soup to scrape board members for S&P 500 companies. A graph is created using NetworkX and visualized using “Gephi.”

 

Scraping Hacker News


We’re going to scrape the https://news.ycombinator.com/news front page, using requests and Beautiful Soup. Take some time to explore the page if you haven’t heard about it already. Hacker News is a popular aggregator of news articles that “hackers” (computer scientists, entrepreneurs, data scientists) find interesting.

 

We’ll store the scraped information in a simple Python list of dictionary objects for this example. The code to scrape this page looks as follows:

import requests
import re
from bs4 import BeautifulSoup

articles = []
url = 'https://news.ycombinator.com/news'
r = requests.get(url)
html_soup = BeautifulSoup(r.text, 'html.parser')
for item in html_soup.find_all('tr', class_='athing'):
    item_a = item.find('a', class_='storylink')
    item_link = item_a.get('href') if item_a else None
    item_text = item_a.get_text(strip=True) if item_a else None
    next_row = item.find_next_sibling('tr')
    item_score = next_row.find('span', class_='score')
    item_score = item_score.get_text(strip=True) if item_score else '0 points'
    # We use a regular expression here to find the correct element
    item_comments = next_row.find('a', string=re.compile('\d+( |\s)comment(s?)'))
    item_comments = item_comments.get_text(strip=True).replace('\xa0', ' ') \
        if item_comments else '0 comments'
    articles.append({
        'link': item_link,
        'title': item_text,
        'score': item_score,
        'comments': item_comments})
for article in articles:
    print(article)
This will output the following:
{'link': 'http://moolenaar.net/habits.html', 'title': 'Seven habits of effective text editing (2000)', 'score': '44 points', 'comments': '9 comments'}
{'link': 'https://www.repository.cam.ac.uk/handle/1810/251038', 'title': 'Properties of expanding universes (1966)', 'score': '52 points', 'comments': '8 comments'}
[...]

Try expanding this code to scrape a link to the comments page as well. Think about potential use cases that would be possible when you also scrape the comments themselves (for example, in the context of text mining).
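As a starting point for that exercise, here is a minimal sketch, under the assumption that the front-page structure used above stays the same, that collects the link to each story's comments page:

import requests
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://news.ycombinator.com/news'
html_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for item in html_soup.find_all('tr', class_='athing'):
    next_row = item.find_next_sibling('tr')
    comments_a = next_row.find('a', string=re.compile('\d+( |\s)comment(s?)'))
    if comments_a:
        # Comment links are relative ('item?id=...'), so make them absolute
        print(urljoin(url, comments_a.get('href')))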

 

 Using the Hacker News API


Note that Hacker News also offers an API providing structured, JSON-formatted results (see https://github.com/HackerNews/API). Let’s rework our Python code to now serve as an API client without relying on Beautiful Soup for HTML parsing:

import requests

articles = []
url = 'https://hacker-news.firebaseio.com/v0'
top_stories = requests.get(url + '/topstories.json').json()
for story_id in top_stories:
    story_url = url + '/item/{}.json'.format(story_id)
    print('Fetching:', story_url)
    r = requests.get(story_url)
    story_dict = r.json()
    articles.append(story_dict)
for article in articles:
    print(article)
This will output the following:
Fetching: https://hacker-news.firebaseio.com/v0/item/15532457.json
Fetching: https://hacker-news.firebaseio.com/v0/item/15531973.json
Fetching: https://hacker-news.firebaseio.com/v0/item/15532049.json
[...]
{'by': 'laktak', 'descendants': 30, 'id': 15532457, 'kids': [15532761,
15532768, 15532635, 15532727, 15532776, 15532626, 15532700, 15532634],
'score': 60, 'time': 1508759764, 'title': 'Seven habits of effective text editing (2000)', 'type': 'story', 'url': 'http://moolenaar.net/habits.html'}
[...]

 

Quotes to Scrape

We’re going to scrape http://quotes.toscrape.com, user requests, and Beautiful Soup. This page is provided by Scrapinghub as a more realistic scraping playground. Take some time to explore the page. We’ll scrape out all the information, that is:

  • The quotes, with their author and tags;
  • And the author information, that is, date and place of birth and a description.

 

We’ll store this information in an SQLite database. Instead of using the “records” library and writing manual SQL statements, we’re going to use the “dataset” library (see https://dataset.readthedocs.io/en/latest/).

 

This library provides a simple abstraction layer that removes the need for most direct SQL statements without requiring a full ORM model, so that we can use a database just like we would a CSV or JSON file to quickly store some information. Installing dataset can be done easily through pip:

 

pip install -U dataset

Not a Full ORM: Note that dataset does not aim to replace a full-blown ORM (object-relational mapping) library like SQLAlchemy (even though it uses SQLAlchemy behind the scenes).

 

It’s meant simply to quickly store a bunch of data in a database without having to define a schema or write SQL. For more advanced use cases, it’s a good idea to consider using a true Orm library or to define a database schema by hand and query it manually.

 

The code to scrape this site looks as follows:

import requests
import dataset
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

db = dataset.connect('sqlite:///quotes.db')
authors_seen = set()
base_url = 'http://quotes.toscrape.com/'

def clean_url(url):
    # Clean '/author/Steve-Martin' to 'Steve-Martin'
    # Use urljoin to make an absolute URL
    url = urljoin(base_url, url)
    # Use urlparse to get out the path part
    path = urlparse(url).path
    # Now split the path by '/' and get the second part
    # E.g. '/author/Steve-Martin' -> ['', 'author', 'Steve-Martin']
    return path.split('/')[2]

def scrape_quotes(html_soup):
    for quote in html_soup.select('div.quote'):
        quote_text = quote.find(class_='text').get_text(strip=True)
        quote_author_url = clean_url(quote.find(class_='author') \
            .find_next_sibling('a').get('href'))
        quote_tag_urls = [clean_url(a.get('href'))
                          for a in quote.find_all('a', class_='tag')]
        authors_seen.add(quote_author_url)
        # Store this quote and its tags
        quote_id = db['quotes'].insert({
            'text': quote_text,
            'author': quote_author_url})
        db['quote_tags'].insert_many(
            [{'quote_id': quote_id, 'tag_id': tag} for tag in quote_tag_urls])

def scrape_author(html_soup, author_id):
    author_name = html_soup.find(class_='author-title').get_text(strip=True)
    author_born_date = html_soup.find(class_='author-born-date').get_text(strip=True)
    author_born_loc = html_soup.find(class_='author-born-location').get_text(strip=True)
    author_desc = html_soup.find(class_='author-description').get_text(strip=True)
    db['authors'].insert({
        'author_id': author_id,
        'name': author_name,
        'born_date': author_born_date,
        'born_location': author_born_loc,
        'description': author_desc})

# Start by scraping all the quote pages
url = base_url
while True:
    print('Now scraping page:', url)
    r = requests.get(url)
    html_soup = BeautifulSoup(r.text, 'html.parser')
    # Scrape the quotes
    scrape_quotes(html_soup)
    # Is there a next page?
    next_a = html_soup.select('li.next > a')
    if not next_a or not next_a[0].get('href'):
        break
    url = urljoin(url, next_a[0].get('href'))

# Now fetch out the author information
for author_id in authors_seen:
    url = urljoin(base_url, '/author/' + author_id)
    print('Now scraping author:', url)
    r = requests.get(url)
    html_soup = BeautifulSoup(r.text, 'html.parser')
    # Scrape the author information
    scrape_author(html_soup, author_id)

 

This will output the following:

Now scraping page: http://quotes.toscrape.com/
Now scraping page: http://quotes.toscrape.com/page/2/
Now scraping page: http://quotes.toscrape.com/page/3/
Now scraping page: http://quotes.toscrape.com/page/4/
Now scraping page: http://quotes.toscrape.com/page/5/
Now scraping page: http://quotes.toscrape.com/page/6/
Now scraping page: http://quotes.toscrape.com/page/7/
Now scraping page: http://quotes.toscrape.com/page/8/
Now scraping page: http://quotes.toscrape.com/page/9/
Now scraping page: http://quotes.toscrape.com/page/10/
Now scraping author: http://quotes.toscrape.com/author/Ayn-Rand
Now scraping author: http://quotes.toscrape.com/author/E-E-Cummings
[...]

Note that there are still a number of ways to make this code more robust. We're not checking for None results when scraping the quote or author pages. In addition, we're using dataset here to simply insert rows in three tables; in this case, dataset will automatically increment a primary "id" key.

 

If you want to run this script again, you’ll hence first have to clean up the database to start fresh or modify the script to allow for resuming its work or updating the results properly.

 

In later examples, we’ll use dataset’s upsert method to do so. Once the script has finished, you can take a look at the database (“quotes.db”) using an SQLite client such as “DB Browser for SQLite,” which can be obtained from http:// sqlitebrowser.org/. 

 


 

 Blogs to Scrape


We’re going to scrape http://blogs.toscrape.com, user requests, and Beautiful Soup. This page is provided by Scrapinghub as a more realistic scraping playground. Take some time to explore the page. We’ll scrape out all the information, that is, for every blog, we’ll obtain:

  • Its title;
  • Its image;
  • Its price and stock availability;
  • Its rating;
  • Its product description;
  • Other product information.

We’re going to store this information in an SQLite database, again using the “dataset” library. However, this time we’re going to write our program in such a way that it takes into account updates — so that we can run it multiple times without inserting duplicate records in the database.

 

The code to scrape this site looks as follows:

import requests
import dataset
import re
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

db = dataset.connect('sqlite:///blogs.db')
base_url = 'http://blogs.toscrape.com/'

def scrape_blogs(html_soup, url):
    for blog in html_soup.select('article.product_pod'):
        # For now, we'll only store the blog's url
        blog_url = blog.find('h3').find('a').get('href')
        blog_url = urljoin(url, blog_url)
        path = urlparse(blog_url).path
        blog_id = path.split('/')[2]
        # Upsert tries to update first and then insert instead
        db['blogs'].upsert({
            'blog_id': blog_id,
            'last_seen': datetime.now()
        }, ['blog_id'])

def scrape_blog(html_soup, blog_id):
    main = html_soup.find(class_='product_main')
    blog = {}
    blog['blog_id'] = blog_id
    blog['title'] = main.find('h1').get_text(strip=True)
    blog['price'] = main.find(class_='price_color').get_text(strip=True)
    blog['stock'] = main.find(class_='availability').get_text(strip=True)
    blog['rating'] = ' '.join(main.find(class_='star-rating') \
        .get('class')).replace('star-rating', '').strip()
    blog['img'] = html_soup.find(class_='thumbnail').find('img').get('src')
    desc = html_soup.find(id='product_description')
    blog['description'] = ''
    if desc:
        blog['description'] = desc.find_next_sibling('p') \
            .get_text(strip=True)
    info_table = html_soup.find(string='Product Information').find_next('table')
    for row in info_table.find_all('tr'):
        header = row.find('th').get_text(strip=True)
        # Since we'll use the header as a column, clean it a bit
        # to make sure SQLite will accept it
        header = re.sub('[^a-zA-Z]+', '_', header)
        value = row.find('td').get_text(strip=True)
        blog[header] = value
    db['blog_info'].upsert(blog, ['blog_id'])

# Scrape the pages in the catalogue
url = base_url
inp = input('Do you wish to re-scrape the catalogue (y/n)? ')
while inp == 'y':
    print('Now scraping page:', url)
    r = requests.get(url)
    html_soup = BeautifulSoup(r.text, 'html.parser')
    scrape_blogs(html_soup, url)
    # Is there a next page?
    next_a = html_soup.select('li.next > a')
    if not next_a or not next_a[0].get('href'):
        break
    url = urljoin(url, next_a[0].get('href'))

# Now scrape blog by blog, oldest first
blogs = db['blogs'].find(order_by=['last_seen'])
for blog in blogs:
    blog_id = blog['blog_id']
    blog_url = base_url + 'catalogue/{}'.format(blog_id)
    print('Now scraping blog:', blog_url)
    r = requests.get(blog_url)
    r.encoding = 'utf-8'
    html_soup = BeautifulSoup(r.text, 'html.parser')
    scrape_blog(html_soup, blog_id)
    # Update the last seen timestamp
    db['blogs'].upsert({
        'blog_id': blog_id,
        'last_seen': datetime.now()
    }, ['blog_id'])

Once the script has finished, remember that you can take a look at the database ("blogs.db") using, for example, "DB Browser for SQLite." Note the use of dataset's upsert method in this example.

 

This method will try to update a record if it exists already (by matching existing records with a list of given field names), or insert a new record otherwise.
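A tiny illustration of this behavior (the table and values are just examples): the second call matches the existing row on blog_id and updates it rather than inserting a duplicate.

import dataset

db = dataset.connect('sqlite:///upsert_example.db')
db['blogs'].upsert({'blog_id': 'abc', 'last_seen': '2019-01-01'}, ['blog_id'])
db['blogs'].upsert({'blog_id': 'abc', 'last_seen': '2019-02-01'}, ['blog_id'])
# Still only one row, now with the updated timestamp
print(list(db['blogs'].find(blog_id='abc')))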

 

Scraping GitHub Stars


We’re going to scrape https://github.com, user requests, and Beautiful Soup. Our goal is to get, for a given GitHub username, like, for example, https://github.com/google, a list of repositories with their GitHub-assigned programming language as well as the number of stars a repository has.

 

The basic structure of this scraper is quite simple:

import requests
from bs4 import BeautifulSoup
import re

session = requests.Session()
url = 'https://github.com/{}'
username = 'google'
r = session.get(url.format(username), params={'page': 1, 'tab': 'repositories'})
html_soup = BeautifulSoup(r.text, 'html.parser')
repos = html_soup.find(class_='repo-list').find_all('li')
for repo in repos:
    name = repo.find('h3').find('a').get_text(strip=True)
    language = repo.find(attrs={'itemprop': 'programmingLanguage'})
    language = language.get_text(strip=True) if language else 'unknown'
    stars = repo.find('a', attrs={'href': re.compile('\/stargazers')})
    stars = int(stars.get_text(strip=True).replace(',', '')) if stars else 0
    print(name, language, stars)
Running this will output:
sagetv Java 192
ggrc-core Python 233
gapid Go 445
certificate-transparency-rfcs Python 55
mtail Go 936
[...]

 

However, this will fail if we would try to scrape a normal user’s page. Google’s GitHub account is an enterprise account, which is displayed slightly differently from normal user accounts.

 

You can try this out by setting the “username” variable to “Macuyiko” (one of the authors of this blog). We hence need to adjust our code to handle both cases:

import requests
from bs4 import BeautifulSoup
import re

session = requests.Session()
url = 'https://github.com/{}'
username = 'Macuyiko'
r = session.get(url.format(username), params={'page': 1, 'tab': 'repositories'})
html_soup = BeautifulSoup(r.text, 'html.parser')
is_normal_user = False
repos_element = html_soup.find(class_='repo-list')
if not repos_element:
    is_normal_user = True
    repos_element = html_soup.find(id='user-repositories-list')
repos = repos_element.find_all('li')
for repo in repos:
    name = repo.find('h3').find('a').get_text(strip=True)
    language = repo.find(attrs={'itemprop': 'programmingLanguage'})
    language = language.get_text(strip=True) if language else 'unknown'
    stars = repo.find('a', attrs={'href': re.compile('\/stargazers')})
    stars = int(stars.get_text(strip=True).replace(',', '')) if stars else 0
    print(name, language, stars)
Running this will output:
macuyiko.github.io HTML 0
blog JavaScript 1
minecraft-python JavaScript 14
[...]

 

As an extra exercise, try adapting this code to scrape out all pages in case the repositories page is paginated (as is the case for Google’s account).
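One possible approach to that exercise is sketched below: keep requesting pages until no more repositories come back. It assumes the normal-user layout used above (the 'user-repositories-list' id); for enterprise accounts you would reuse the 'repo-list' branch instead.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
username = 'Macuyiko'
page = 1
while True:
    r = session.get('https://github.com/{}'.format(username),
                    params={'page': page, 'tab': 'repositories'})
    html_soup = BeautifulSoup(r.text, 'html.parser')
    repos_element = html_soup.find(id='user-repositories-list')
    repos = repos_element.find_all('li') if repos_element else []
    if not repos:
        break
    for repo in repos:
        print(repo.find('h3').find('a').get_text(strip=True))
    page += 1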

 

As a final add-on, you'll note that user pages like https://github.com/Macuyiko?tab=repositories also come with a short bio, including (in some cases) an e-mail address. However, this e-mail address is only visible once we log in to GitHub. In what follows, we'll try to get out this information as well.

 

Warning: this practice of hunting for a highly starred GitHub profile and extracting the contact information is frequently applied by recruitment firms. That being said, do note that we're now going to log in to GitHub and that we're crossing the boundary between public and private information.

 

Consider this a practice exercise illustrating how you can do so in Python.

 

Keeping the legal aspects in mind, you're advised to only scrape your own profile information and not to set up this kind of scraper on a large scale before knowing what you're getting into. Refer back to the blog post on legal concerns for the details regarding the legality of scraping.

 

You will need to create a GitHub profile in case you haven’t done so already. Let us start by getting out the login form from the login page:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = 'https://github.com/{}'
username = 'Macuyiko'
# Visit the login page
r = session.get(url.format('login'))
html_soup = BeautifulSoup(r.text, 'html.parser')
form = html_soup.find(id='login')
print(form)
Running this will output:
<div class="auth-form px-3" id="login">
</div>
This is not exactly what we expected. If we take a look at the page source, we see that the page is formatted somewhat strangely:
<div class="auth-form px-3" id="login">
</option></form>
<form accept-charset="UTF-8" action="/session" method="post">
<div style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="&#x2713;" />
<input name="authenticity_token" type="hidden" value="AtuMda[...]zw==" />
</div>
<div class="auth-form-header p-0">
<h1>Sign in to GitHub</h1>
</div>
<div id="js-flash-container">
</div> [...]
</form>
The following modification makes sure we get out the forms in the page:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = 'https://github.com/{}'
username = 'Macuyiko'
# Visit the login page
r = session.get(url.format('login'))
html_soup = BeautifulSoup(r.text, 'html.parser')
data = {}
for form in html_soup.find_all('form'):
    # Get out the hidden form fields
    for inp in form.select('input[type=hidden]'):
        data[inp.get('name')] = inp.get('value')
# SET YOUR LOGIN DETAILS:
data.update({'login': '', 'password': ''})
print('Going to login with the following POST data:')
print(data)
if input('Do you want to login (y/n): ') == 'y':
    # Perform the login
    r = session.post(url.format('session'), data=data)
    # Get the profile page
    r = session.get(url.format(username))
    html_soup = BeautifulSoup(r.text, 'html.parser')
    user_info = html_soup.find(class_='vcard-details')
    print(user_info.text)

 

Even Browsers Have Bugs


If you’ve been using Chrome, you might wonder why you’re not seeing the form data when following along with the login process using Chrome’s Developer tools.

 

The reason is that Chrome contains a bug that prevents data from appearing in the Developer Tools when the status code of the POST corresponds with a redirect. The POST data is still being sent; you just won't see it in the Developer Tools tab.

 

This bug will probably be fixed by the time you’re reading this, but it just goes to show that bugs appear in browsers as well.

 

Running this will output:

Going to login with the following POST data:
{'utf8': '✓',
'authenticity_token': 'zgndmzes [...]', 'login': 'YOUR_USER_NAME',
'password': 'YOUR_PASSWORD'}
Do you want to login (y/n): y
KU Leuven Belgium
macuyiko@gmail.com http://blog.macuyiko.com

 

Plain Text Passwords  

It goes without saying that hard-coding your password in plain text in Python files (and other programs, for that matter) is not advisable for real-life scripts.

 

In a real deployment setting, where your code might get shared with others, make sure to modify your script so that it retrieves stored credentials from a secure data store (e.g., from operating system environment variables, a file, or a database, preferably encrypted). Take a look at the "secure-config" library available on pip, for example, on how to do so.
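A minimal sketch of that idea, reading the credentials from environment variables instead of hard-coding them (the variable names here are just examples):

import os

# Read the credentials from environment variables set outside the script
github_user = os.environ.get('GITHUB_USER', '')
github_password = os.environ.get('GITHUB_PASSWORD', '')
# In the login script above, you would then use:
#   data.update({'login': github_user, 'password': github_password})
print('Will log in as:', github_user if github_user else '(GITHUB_USER not set)')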

 

Scraping Mortgage Rates

We’re going to scrape Barclays’ mortgage simulator available at https://www.barclays. co.uk/mortgages/mortgage-calculator/.

 

There isn’t a particular reason why we pick this financial services provider, other than the fact that it applies some interesting techniques that serve as a nice illustration.

 

Take some time to explore the site a bit (using “What would it cost?”). We’re asked to fill in a few parameters, after which we get an overview of possible products that we’d like to scrape out.

 

If you follow along with your browser's developer tools, you'll note that a POST request is being made to https://www.barclays.co.uk/dss/service/co.uk/mortgages/costcalculator/productservice, with an interesting property:

 

the JavaScript on the page performing the POST is using an “application/json” value for the “Content-Type” header and is including the POST data as plain JSON;

 

Depending on requests’ data argument will not work in this case as it will encode the POST data. Instead, we need to use the json argument, which will basically instruct requests to format the POST data as JSON.

 

Additionally, you’ll note that the result page is formatted as a relatively complex- looking table (with “Show more” links for every entry), though the response returned by the POST request looks like a nicely formatted JSON object; 

 

Let’s see which response we get by implementing this in Python:

import requests

url = 'https://www.barclays.co.uk/dss/service/co.uk/mortgages/' + \
      'costcalculator/productservice'
session = requests.Session()
estimatedPropertyValue = 200000
repaymentAmount = 150000
months = 240
data = {"header": {"flowId": "0"},
        "body": {"wantTo": "FTBP",
                 "estimatedPropertyValue": estimatedPropertyValue,
                 "borrowAmount": repaymentAmount,
                 "interestOnlyAmount": 0,
                 "repaymentAmount": repaymentAmount,
                 "ltv": round(repaymentAmount/estimatedPropertyValue*100),
                 "totalTerm": months,
                 "purchaseType": "Repayment"}}
r = session.post(url, json=data)
print(r.json())
Running this will output:
{'header':
{'result': 'error', 'systemError':
{'errorCode': 'DSS_SEF001', 'type': 'E', 'severity': 'FRAMEWORK',
'errorMessage': 'State details not found in database', 'validationErrors': [],
'contentType': 'application/json', 'channel': '6'}
}}

 

That doesn’t look too good. Remember that, when we don’t get back the results we expect, there are various things we can do:

  • Check whether we’ve forgotten to include some cookies. For example, we might need to visit the entry page first, or there might be cookies set by JavaScript. If you inspect the request in your browser, you’ll note that there are a lot of cookies present.

 

  • Check whether we’ve forgotten to include some headers, or whether we need to spoof some.
  • If all else fails, resort to Selenium to implement a full browser.

 

In this particular situation, there are a lot of cookies being included in the request, some of which are set through normal “Set-Cookie” headers, though many are also set through a vast collection of JavaScript files included by the page. These would certainly be hard to figure out, as the JavaScript is obfuscated.

 

There are, however, some interesting headers that are being set and included by JavaScript in the POST request, which do seem to be connected to the error message. Let’s try including these, as well as spoofing the “User-Agent” and “Referer” headers:

import requests

url = 'https://www.barclays.co.uk/dss/service/co.uk/mortgages/' + \
      'costcalculator/productservice'
session = requests.Session()
session.headers.update({
    # These are non-typical headers, let's include them
    'currentState': 'default_current_state',
    'action': 'default',
    'Origin': 'https://www.barclays.co.uk',
    # Spoof referer, user agent, and X-Requested-With
    'Referer': 'https://www.barclays.co.uk/mortgages/mortgage-calculator/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                  '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
})
estimatedPropertyValue = 200000
repaymentAmount = 150000
months = 240
data = {"header": {"flowId": "0"},
        "body": {"wantTo": "FTBP",
                 "estimatedPropertyValue": estimatedPropertyValue,
                 "borrowAmount": repaymentAmount,
                 "interestOnlyAmount": 0,
                 "repaymentAmount": repaymentAmount,
                 "ltv": round(repaymentAmount/estimatedPropertyValue*100),
                 "totalTerm": months,
                 "purchaseType": "Repayment"}}
r = session.post(url, json=data)
# Only print the header to avoid text overload
print(r.json()['header'])
This seems to work! In this case, it in fact turns out we didn’t have to include any cookies at all. We can now clean up this code:
import requests

def get_mortgages(estimatedPropertyValue, repaymentAmount, months):
    url = 'https://www.barclays.co.uk/dss/service/' + \
          'co.uk/mortgages/costcalculator/productservice'
    headers = {
        # These are non-typical headers, let's include them
        'currentState': 'default_current_state',
        'action': 'default',
        'Origin': 'https://www.barclays.co.uk',
        # Spoof referer, user agent, and X-Requested-With
        'Referer': 'https://www.barclays.co.uk/mortgages/mortgage-calculator/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                      '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }
    data = {"header": {"flowId": "0"},
            "body": {"wantTo": "FTBP",
                     "estimatedPropertyValue": estimatedPropertyValue,
                     "borrowAmount": repaymentAmount,
                     "interestOnlyAmount": 0,
                     "repaymentAmount": repaymentAmount,
                     "ltv": round(repaymentAmount/estimatedPropertyValue*100),
                     "totalTerm": months,
                     "purchaseType": "Repayment"}}
    r = requests.post(url, json=data, headers=headers)
    results = r.json()
    return results['body']['mortgages']

mortgages = get_mortgages(200000, 150000, 240)
# Print the first mortgage info
print(mortgages[0])
Running this will output:
{'mortgageName': '5 Year Fixed', 'mortgageId': '1321127853346', 'ctaType': None, 'uniqueId': '590b357e295b0377d0fb607b', 'mortgageType': 'FIXED',
'howMuchCanBeBorrowedNote': '95% (max) of the value of your home', 'initialRate': 4.99, 'initialRateTitle': '4.99%', 'initialRateNote': 'until 31st January 2023',
[...]

 

Scraping and Visualizing IMDB Ratings


The next series of examples moves on toward some more data science-oriented use cases. We're going to start simply by scraping a list of ratings for episodes of a TV series, using IMDB (the Internet Movie Database).

 

We’ll use Game of Thrones as an example, the episode list for which can be found at http://www.imdb.com/title/ tt0944947/episodes.

Note that IMDB’s overview is spread out across multiple pages (per season or per year), so we iterate over the seasons we want to retrieve using an extra loop:

import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/title/tt0944947/episodes'
episodes = []
ratings = []
# Go over seasons 1 to 7
for season in range(1, 8):
    r = requests.get(url, params={'season': season})
    soup = BeautifulSoup(r.text, 'html.parser')
    listing = soup.find('div', class_='eplist')
    for epnr, div in enumerate(listing.find_all('div', recursive=False)):
        episode = "{}.{}".format(season, epnr + 1)
        rating_el = div.find(class_='ipl-rating-star rating')
        rating = float(rating_el.get_text(strip=True))
        print('Episode:', episode, '-- rating:', rating)
        episodes.append(episode)
        ratings.append(rating)

We can then plot the scraped ratings using “matplotlib,” a well-known plotting library for Python that can be easily installed using pip:

pip install -U matplotlib

 

Plotting with Python 


Of course, you could also reproduce the plot below using, for example, Excel, but this example serves as a gentle introduction, as we'll continue to use matplotlib in some later examples as well.

 

Note that this is certainly not the only, or even the most user-friendly, plotting library for Python, though it remains one of the most prevalent ones. Take a look at seaborn (https://seaborn.pydata.org/), altair (https://altair-viz.github.io/), and ggplot (http://ggplot.yhathq.com/) for some other excellent libraries.

 

Adding in the following lines to our script plots the results in a simple bar chart.

import matplotlib.pyplot as plt

episodes = ['S' + e.split('.')[0] if int(e.split('.')[1]) == 1 else '' \
            for e in episodes]
plt.figure()
positions = [a*2 for a in range(len(ratings))]
plt.bar(positions, ratings, align='center')
plt.xticks(positions, episodes)
plt.show()
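If you want to keep the chart around, matplotlib can also write it to an image file; a one-line addition after the plotting code above (the file name is just an example):

# Save the bar chart to disk in addition to (or instead of) showing it
plt.savefig('got_ratings.png', bbox_inches='tight')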

 

Scraping IATA Airline Information

We’re going to scrape airline information using the search form available at http:// www.iata.org/publications/Pages/code-search.aspx. This is an interesting case to illustrate the “nastiness” of some websites, even though the form we want to use looks incredibly simple (there’s only one drop-down and one text field visible on the page).

 

As the URL already shows, the web server driving this page is built on ASP.NET (“.aspx”), which has very peculiar opinions about how it handles form data.

 

It is a good idea to try submitting this form using your browser and taking a look at what happens using its developer tools. It seems that a lot of form data get included in the POST request — much more than our two fields.

 

Certainly, it does not look feasible to manually include all these fields in our Python script. The "__VIEWSTATE" field, for instance, holds session information that changes with every request. Even some of the field names include parts of which we can't really be sure that they won't change in the future, causing our script to break.

 

In addition, it seems that we also need to keep track of cookies as well. Finally, take a look at the response content that comes back from the POST request. This looks like a partial response (which will be parsed and shown by JavaScript) instead of a full HTML page:

1|#||4|1330|updatePanel|ctl00_SPWebPartManager1_g_e3b09024_878e [...]

MSOSPWebPartManager_StartWebPartEditingName|false|5|hiddenField| MSOSPWebPartManager_EndWebPartEditing|false|

 

To handle these issues, we’re going to try to make our code as robust as possible.

First, we’ll start by performing a GET request to the search page, using requests’ sessions mechanism. Next, we’ll use Beautiful Soup to get out all the form elements with their names and values:

import requests
from bs4 import BeautifulSoup

url = 'http://www.iata.org/publications/Pages/code-search.aspx'
session = requests.Session()
# Spoof the user agent as a precaution
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                  '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
})
# Get the search page
r = session.get(url)
html_soup = BeautifulSoup(r.text, 'html.parser')
form = html_soup.find(id='aspnetForm')
# Get the form fields
data = {}
for inp in form.find_all(['input', 'select']):
    name = inp.get('name')
    value = inp.get('value')
    if not name:
        continue
    data[name] = value if value else ''
print(data, end='\n\n\n')
This will output the following:
{'_wpcmWpid': '', 'wpcmVal': '', 'MSOWebPartPage_PostbackSource': '',
'MSOTlPn_SelectedWpId': '', 'MSOTlPn_View': '0', 'MSOTlPn_ShowSettings': 'False',
'MSOGallery_SelectedLibrary': '', 'MSOGallery_FilterString': '', 'MSOTlPn_Button': 'none',
'__EVENTTARGET': '',
'__EVENTARGUMENT': '',
[...]

 

Next, we’ll use the collected form data to perform a POST request. We do have to make sure to set the correct values for the drop-down and text box, however. We add the following lines to our script:

# Set our desired search query
for name in data.keys():
    # Search by
    if 'ddlImLookingFor' in name:
        data[name] = 'ByAirlineName'
    # Airline name
    if 'txtSearchCriteria' in name:
        data[name] = 'Lufthansa'
# Perform a POST
r = session.post(url, data=data)
print(r.text)

Strangely enough, contrary to what’s happening in the browser, the POST request does return a full HTML page here, instead of a partial result. This is not too bad, as we can now use Beautiful Soup to fetch the table of results.

 

Instead of parsing this table manually, we’ll use a popular data science library for tabular data wrangling called “pandas,” which comes with a helpful “HTML table to data frame” method built in. The library is easy to install using pip:

pip install -U pandas

 

To parse out HTML, pandas relies on “lxml” by default and falls back to Beautiful Soup with “html5lib” in case “lxml” cannot be found. To make sure “lxml” is available, install it with:
pip install -U lxml
The full script can now be organized to look as follows:
import requests
from bs4 import BeautifulSoup
import pandas

url = 'http://www.iata.org/publications/Pages/code-search.aspx'

def get_results(airline_name):
    session = requests.Session()
    # Spoof the user agent as a precaution
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                      '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
    })
    r = session.get(url)
    html_soup = BeautifulSoup(r.text, 'html.parser')
    form = html_soup.find(id='aspnetForm')
    data = {}
    for inp in form.find_all(['input', 'select']):
        name = inp.get('name')
        value = inp.get('value')
        if not name:
            continue
        if 'ddlImLookingFor' in name:
            value = 'ByAirlineName'
        if 'txtSearchCriteria' in name:
            value = airline_name
        data[name] = value if value else ''
    r = session.post(url, data=data)
    html_soup = BeautifulSoup(r.text, 'html.parser')
    table = html_soup.find('table', class_='datatable')
    df = pandas.read_html(str(table))
    return df

df = get_results('Lufthansa')
print(df)
The equivalent Selenium code looks as follows:
import pandas
from selenium import webdriver
from selenium.webdriver.support.ui import Select

url = 'http://www.iata.org/publications/Pages/code-search.aspx'
driver = webdriver.Chrome()
driver.implicitly_wait(10)

def get_results(airline_name):
    driver.get(url)
    # Make sure to select the right part of the form
    # This will make finding the elements easier
    # as #aspnetForm wraps the whole page, including the search box
    form_div = driver.find_element_by_css_selector('#aspnetForm .iataStandardForm')
    select = Select(form_div.find_element_by_css_selector('select'))
    select.select_by_value('ByAirlineName')
    text = form_div.find_element_by_css_selector('input[type=text]')
    text.send_keys(airline_name)
    submit = form_div.find_element_by_css_selector('input[type=submit]')
    submit.click()
    table = driver.find_element_by_css_selector('table.datatable')
    table_html = table.get_attribute('outerHTML')
    df = pandas.read_html(str(table_html))
    return df

df = get_results('Lufthansa')
print(df)
driver.quit()

 

There’s still one mystery we have to solve: remember that the POST request as made by requests returns a full HTML page, instead of a partial result as we observed in the browser. How does the server figure out how to differentiate between both types of results? The answer lies in the way the search form is submitted.

 

In requests, we perform a simple POST request with a minimal amount of headers. On the live page, however, the form submission is handled by JavaScript, which will perform the actual POST request and will parse out the partial results to show them.

 

To indicate to the server that it is JavaScript making the request, two headers are included in the request, which we can spoof in requests as well. If we modify our code as follows, you will indeed also obtain the same partial result:

# Include headers to indicate that we want a partial result
session.headers.update({
    'X-MicrosoftAjax': 'Delta=true',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                  '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
})

 

Scraping and Analyzing Web Forum Interactions

In this example, we’re going to scrape web forum posts available at http://bpbasecamp. freeforums.net/board/27/gear-closet (a forum for backpackers and hikers) to get an idea about who the most active users are and who is frequently interacting with whom.

We’re going to keep a tally of interactions that will be constructed as follows:

 

  • The first post in a “thread” is not “replying” to anyone, so we won’t consider this as an interaction,
  • The next posts in a thread can optionally include one or more quote blocks, which indicate that the poster is directly replying to another user.

 

  • If a post does not include any quote blocks, we’ll just assume the post to be a reply to the original poster.

 

  • This might not necessarily be the case, and users will oftentimes use little pieces of text such as “^^” to indicate they’re replying to the direct previous poster, but we’re going to keep it simple in this example (feel free to modify the scripts accordingly to your definition of “interaction,” however).

Let’s get started. First, we’re going to extract a list of threads given a forum URL:

import requests
import re
from bs4 import BeautifulSoup

def get_forum_threads(url, max_pages=None):
    page = 1
    threads = []
    while not max_pages or page <= max_pages:
        print('Scraping forum page:', page)
        r = requests.get(url, params={'page': page})
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.find(class_='content')
        links = content.find_all('a', attrs={'href': re.compile('^\/thread\/')})
        threads_on_page = [a.get('href') for a in links \
                           if a.get('href') and not 'page=' in a.get('href')]
        threads += threads_on_page
        page += 1
        next_page = soup.find('li', class_='next')
        if 'state-disabled' in next_page.get('class'):
            break
    return threads

url = 'http://bpbasecamp.freeforums.net/board/27/gear-closet'
threads = get_forum_threads(url, max_pages=5)
print(threads)

 

Note that we have to be a bit clever here regarding pagination. This forum will continue to return the last page, even when supplying higher than maximum page numbers as the URL parameter, so that we can check whether an item with the class “next” also has the class “state-disabled” to determine whether we’ve reached the end of the thread list.

 

Since we only want thread links corresponding with the first page, we remove all links that have “page=” in their URL as well. In the example, we also decide to limit ourselves to five pages only. Running this will output:

Scraping forum page: 1
Scraping forum page: 2
Scraping forum page: 3
Scraping forum page: 4
Scraping forum page: 5
['/thread/2131/before-asking-which-pack-boot', [...] ]

 

For every thread, we now want to get out a list of posts. We can try this out with one thread first:

import requests
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_thread_posts(url, max_pages=None):
    page = 1
    posts = []
    while not max_pages or page <= max_pages:
        print('Scraping thread url/page:', url, page)
        r = requests.get(url, params={'page': page})
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.find(class_='content')
        for post in content.find_all('tr', class_='item'):
            user = post.find('a', class_='user-link')
            if not user:
                # User might be deleted, skip...
                continue
            user = user.get_text(strip=True)
            quotes = []
            for quote in post.find_all(class_='quote_header'):
                quoted_user = quote.find('a', class_='user-link')
                if quoted_user:
                    quotes.append(quoted_user.get_text(strip=True))
            posts.append((user, quotes))
        page += 1
        next_page = soup.find('li', class_='next')
        if 'state-disabled' in next_page.get('class'):
            break
    return posts

url = 'http://bpbasecamp.freeforums.net/board/27/gear-closet'
thread = '/thread/2131/before-asking-which-pack-boot'
thread_url = urljoin(url, thread)
posts = get_thread_posts(thread_url)
print(posts)
Running this will output a list with every element being a tuple containing the poster’s name and a list of users that are quoted in the post:
Scraping thread url/page: http://bpbasecamp.freeforums.net/thread/2131/before-asking-which-pack-boot 1
Scraping thread url/page: http://bpbasecamp.freeforums.net/thread/2131/before-asking-which-pack-boot 2
[('almostthere', []), ('trinity', []), ('paula53', []),
('toejam', ['almostthere']), ('stickman', []), ('tamtrails', []),
('almostthere', ['tamtrails']), ('kayman', []), ('almostthere', ['kayman']),
('lanceman', []), ('trinity', ['trinity']), ('Christian', ['almostthere']),
('pollock', []), ('mitsmit', []), ('intothewild', []), ('Christian', []),
('softskull', []), ('argus', []), ('lyssa7', []), ('kevin', []),
('greenwoodsuncharted', [])]

 

By putting both of these functions together, we get the script below. We’ll use Python’s “pickle” module to store our scraped results so that we don’t have to scrape the forum over and over again:

import requests
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pickle

def get_forum_threads(url, max_pages=None):
    page = 1
    threads = []
    while not max_pages or page <= max_pages:
        print('Scraping forum page:', page)
        r = requests.get(url, params={'page': page})
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.find(class_='content')
        links = content.find_all('a', attrs={'href': re.compile('^\/thread\/')})
        threads_on_page = [a.get('href') for a in links
                           if a.get('href') and not 'page=' in a.get('href')]
        threads += threads_on_page
        page += 1
        next_page = soup.find('li', class_='next')
        if 'state-disabled' in next_page.get('class'):
            break
    return threads

def get_thread_posts(url, max_pages=None):
    page = 1
    posts = []
    while not max_pages or page <= max_pages:
        print('Scraping thread url/page:', url, page)
        r = requests.get(url, params={'page': page})
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.find(class_='content')
        for post in content.find_all('tr', class_='item'):
            user = post.find('a', class_='user-link')
            if not user:
                # User might be deleted, skip...
                continue
            user = user.get_text(strip=True)
            quotes = []
            for quote in post.find_all(class_='quote_header'):
                quoted_user = quote.find('a', class_='user-link')
                if quoted_user:
                    quotes.append(quoted_user.get_text(strip=True))
            posts.append((user, quotes))
        page += 1
        next_page = soup.find('li', class_='next')
        if 'state-disabled' in next_page.get('class'):
            break
    return posts

url = 'http://bpbasecamp.freeforums.net/board/27/gear-closet'
threads = get_forum_threads(url, max_pages=5)
all_posts = []
for thread in threads:
    thread_url = urljoin(url, thread)
    posts = get_thread_posts(thread_url)
    all_posts.append(posts)
with open('forum_posts.pkl', "wb") as output_file:
    pickle.dump(all_posts, output_file)

 

Next, we can load the results and visualize them on a heat map. We’re going to use “pandas,” “numpy,” and “matplotlib” to do so, all of which can be installed through pip (if you’ve already installed pandas and matplotlib by following the previous examples, there’s nothing else you have to install):

pip install -U pandas
pip install -U numpy
pip install -U matplotlib
Let’s start by visualizing the first thread only (shown in the output fragment of the scraper above):
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load our stored results
with open('forum_posts.pkl', "rb") as input_file:
    posts = pickle.load(input_file)

def add_interaction(users, fu, tu):
    if fu not in users:
        users[fu] = {}
    if tu not in users[fu]:
        users[fu][tu] = 0
    users[fu][tu] += 1

# Create interactions dictionary
users = {}
for thread in posts:
    first_one = None
    for post in thread:
        user = post[0]
        quoted = post[1]
        if not first_one:
            first_one = user
        elif not quoted:
            add_interaction(users, user, first_one)
        else:
            for qu in quoted:
                add_interaction(users, user, qu)
    # Stop after the first thread
    break

df = pd.DataFrame.from_dict(users, orient='index').fillna(0)
heatmap = plt.pcolor(df, cmap='Blues')
y_vals = np.arange(0.5, len(df.index), 1)
x_vals = np.arange(0.5, len(df.columns), 1)
plt.yticks(y_vals, df.index)
plt.xticks(x_vals, df.columns, rotation='vertical')
for y in range(len(df.index)):
    for x in range(len(df.columns)):
        if df.iloc[y, x] == 0:
            continue
        plt.text(x + 0.5, y + 0.5, '%.0f' % df.iloc[y, x],
                 horizontalalignment='center',
                 verticalalignment='center')
plt.show()

There are various ways to play around with this visualization.
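As one idea, here is a minimal sketch that reuses the posts list and the add_interaction helper from the script above, drops the “stop after the first thread” break so that interactions are aggregated across all scraped threads, and keeps only the most active users so the heat map stays readable (the cutoff of 20 users is an arbitrary choice):

# Sketch: aggregate interactions over all threads instead of just the first one
# (assumes posts, add_interaction, pd, np and plt from the script above)
users = {}
for thread in posts:
    first_one = None
    for post in thread:
        user, quoted = post
        if not first_one:
            first_one = user
        elif not quoted:
            add_interaction(users, user, first_one)
        else:
            for qu in quoted:
                add_interaction(users, user, qu)

df = pd.DataFrame.from_dict(users, orient='index').fillna(0)
# Keep only the 20 users with the most outgoing interactions to reduce clutter
top_users = df.sum(axis=1).nlargest(20).index
df = df.loc[top_users, df.columns.intersection(top_users)]
plt.pcolor(df, cmap='Blues')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns, rotation='vertical')
plt.show()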

 

Collecting and Clustering a Fashion Data Set

In this example, we’re going to use Zalando (a popular German web shop) to fetch a collection of images of fashion products and cluster them using t-SNE.

 

Check the API: note that Zalando also exposes an easy-to-use API (see https://github.com/zalando/shop-api-documentation/wiki/Api-introduction for the documentation). At the time of writing, the API does not require authentication, though this is scheduled to change in the near future, requiring users to register to get an API access token.

 

Since we’ll only fetch images here, we’ll not bother to register, though in a proper “app,” using the API option would certainly be recommended.

 

Our first script downloads the images and stores them in a directory:

import requests
import os, os.path
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

store = 'images'
if not os.path.exists(store):
    os.makedirs(store)

url = 'https://www.zalando.co.uk/womens-clothing-dresses/'
pages_to_crawl = 15

def download(url):
    r = requests.get(url, stream=True)
    filename = urlparse(url).path.split('/')[-1]
    print('Downloading to:', filename)
    with open(os.path.join(store, filename), 'wb') as the_image:
        for byte_chunk in r.iter_content(chunk_size=4096*4):
            the_image.write(byte_chunk)

for p in range(1, pages_to_crawl+1):
    print('Scraping page:', p)
    r = requests.get(url, params={'p': p})
    html_soup = BeautifulSoup(r.text, 'html.parser')
    for img in html_soup.select('#z-nvg-cognac-root z-grid-item img'):
        img_src = img.get('src')
        if not img_src:
            continue
        img_url = urljoin(url, img_src)
        download(img_url)

 

Next, we’ll use the t-SNE clustering algorithm to cluster the photos. t-SNE is a relatively recent dimensionality reduction technique that is particularly well-suited for the visualization of high-dimensional data sets, like images. You can read about the technique at https://lvdmaaten.github.io/tsne/.

 

We’re going to use “scikit-learn” together with “matplotlib,” “scipy,” and “numpy,” all of which are libraries that are familiar to data scientists and can be installed through pip:
pip install -U matplotlib
pip install -U scikit-learn
pip install -U numpy
pip install -U scipy
Our clustering script looks as follows:
import os.path
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import manifold
# Note: scipy.misc.imread has been removed in newer SciPy versions;
# imageio.imread can be used as a drop-in alternative
from scipy.misc import imread
from glob import iglob

store = 'images'
image_data = []
for filename in iglob(os.path.join(store, '*.jpg')):
    image_data.append(imread(filename))

image_np_orig = np.array(image_data)
image_np = image_np_orig.reshape(image_np_orig.shape[0], -1)

def plot_embedding(X, image_np_orig):
    # Rescale
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    # Plot images according to t-SNE position
    plt.figure()
    ax = plt.subplot(111)
    for i in range(image_np.shape[0]):
        imagebox = offsetbox.AnnotationBbox(
            offsetbox=offsetbox.OffsetImage(image_np_orig[i], zoom=.1),
            xy=X[i],
            frameon=False)
        ax.add_artist(imagebox)

print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca')
X_tsne = tsne.fit_transform(image_np)
plot_embedding(X_tsne, image_np_orig)
plt.show()

 

This code works as follows. First, we load all the images (using imread) and convert them to a numpy array.

 

The reshape function makes sure that we get an n x 3m matrix, with n the number of images and m the number of pixels per image, instead of an n x h x w x 3 tensor, with h and w the image height and width and 3 the number of color channels (red, green, and blue).
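As a small illustration of what this reshape does, here is a toy example with made-up dimensions rather than our actual images:

import numpy as np

# Toy example: 4 "images" of 2x3 pixels with 3 color channels each
images = np.zeros((4, 2, 3, 3))
flattened = images.reshape(images.shape[0], -1)
print(images.shape)     # (4, 2, 3, 3)
print(flattened.shape)  # (4, 18) -> one row of 2*3*3 = 18 values per image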

 

After constructing the t-SNE embedding, we plot the images with their calculated x and y coordinates using matplotlib.

 

Image Sizes


We’re lucky that all of the images we’ve scraped have the same width and height. If that were not the case, we’d first have to resize the images so that every image leads to a vector of equal length in the data set.
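A minimal sketch of how such a resizing step might look, using the Pillow library (installable through pip as “pillow”; the 128-by-128 target size here is an arbitrary choice):

from PIL import Image
from glob import iglob
import os.path
import numpy as np

store = 'images'
target_size = (128, 128)  # arbitrary common size, pick what suits your data
image_data = []

for filename in iglob(os.path.join(store, '*.jpg')):
    # Force RGB and a common size so every image yields a vector of equal length
    img = Image.open(filename).convert('RGB').resize(target_size)
    image_data.append(np.array(img))

image_np_orig = np.array(image_data)  # now all images share the same shape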

 

Sentiment Analysis of Scraped Amazon Reviews


We’re going to scrape a list of Amazon reviews with their ratings for a particular product. 

Note that this product has an id of "1449355730," and even using the URL https://www.amazon.com/product-reviews/1449355730/, without the product name, will work.

 

Simple URLs: playing around with URLs as we do here is always a good idea before writing your web scraper. Based on the above, we know that the product identifier alone is enough to fetch the reviews page, without needing to figure out the exact URL including the product name.

 

Why, then, does Amazon allow both and default to including the product name? The reason is most likely search engine optimization (SEO): search engines like Google prefer URLs with human-readable components included.

 

If you explore the reviews page, you’ll note that the reviews are paginated. By browsing to other pages and following along in your browser’s developer tools, we can see what happens behind the scenes.

 

We see that POST requests are being made (by JavaScript) to URLs looking like https://www.amazon.com/ss/customer-reviews/ajax/reviews/get/ref, with the product ID included in the form data, as well as some other form fields that look relatively easy to spoof.

Let’s see what we get in requests:

import requests
from bs4 import BeautifulSoup

review_url = 'https://www.amazon.com/ss/customer-reviews/ajax/reviews/get/'
product_id = '1449355730'

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                  '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))

def get_reviews(product_id, page):
    data = {
        'sortBy': '', 'reviewerType': 'all_reviews', 'formatType': '',
        'mediaType': '', 'filterByStar': 'all_stars', 'pageNumber': page,
        'filterByKeyword': '', 'shouldAppend': 'undefined', 'deviceType': 'desktop',
        'reftag': 'cm_cr_getr_d_paging_btm_{}'.format(page), 'pageSize': 10,
        'asin': product_id, 'scope': 'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data=data)
    return r.text

print(get_reviews(product_id, 1))

Note that we spoof the “User-Agent” header here. If we don’t, Amazon will reply with a message requesting us to verify whether we’re a human (you can copy the value for this header from your browser’s developer tools). In addition, note the “scope” form field that we set to “reviewsAjax1.”

 

If you explore the reviews page in the browser, you’ll see that the value of this field is in fact increased for each request, that is, “reviewsAjax1,” “reviewsAjax2,” and so on.

 

We could decide to replicate this behavior as well — which we’d have to do in case Amazon would pick up on our tactics, though it does not seem to be necessary for the results to come back correctly.
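If we did want to mimic that behavior, a minimal sketch could look as follows; the scope_counter variable is our own addition, and the rest of the form fields stay the same as in the script above:

# Hypothetical sketch: increment the "scope" value per request, as the browser
# does, in case Amazon starts checking for it (reuses session and review_url)
scope_counter = 1

def get_reviews(product_id, page):
    global scope_counter
    data = {
        # ... same form fields as before ...
        'pageNumber': page,
        'asin': product_id,
        'reftag': 'cm_cr_getr_d_paging_btm_{}'.format(page),
        'scope': 'reviewsAjax{}'.format(scope_counter),
    }
    scope_counter += 1
    r = session.post(review_url + 'ref=' + data['reftag'], data=data)
    return r.text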

 

Finally, note that the POST request does not return a full HTML page, but some kind of hand-encoded result that will be parsed (normally) by JavaScript:

["script",
"if(window.ue) { ues('id','reviewsAjax1','FE738GN7GRDZK6Q09S9G'); ues('t0','reviewsAjax1',new Date());
ues('ctb','reviewsAjax1','1');
uet('bb','reviewsAjax1'); }"
]
&&&
["update","#cm_cr-review_list",""] &&&
["loaded"] &&&
["append","#cm_cr-review_list","<div id=\"R3JQXR4EMWJ7AD\" data- hook=\"review\"class=\"a-section review\"><div id=\ "customer_review-R3JQXR4EMWJ7AD\"class=\"a-section celwidget\">
<div class=\"a-row\"><a class=\"a-link-normal\"title=\"5.0 out
of 5 stars\" [...]

 

Luckily, after exploring the reply a bit (feel free to copy-paste the full reply in a text editor and read through it), the structure seems easy enough to figure out:

  • The reply is composed of several “instructions,” formatted as a JSON list;
  • The instructions themselves are separated by three ampersands, “&&&”;
  • The instructions containing the reviews start with an “append” string;
  •  The actual contents of the review are formatted as an HTML element and found on the third position of the list.

 

Let’s adjust our code to parse the reviews in a structured format. We’ll loop through all the instructions; convert them using the “json” module; check for “append” entries; and then use Beautiful Soup to parse the HTML fragment and get the review id, rating, title, and text.

 

We’ll also need a small regular expression to get out the rating, which is set as a class with a value like “a-star-1” to “a-star-5”. We could use these as is, but simply getting “1” to “5” might be easier to work with later on, so we already perform a bit of cleaning here:

import requests
import json
import re
from bs4 import BeautifulSoup

review_url = 'https://www.amazon.com/ss/customer-reviews/ajax/reviews/get/'
product_id = '1449355730'

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                  '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))

def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = BeautifulSoup(json_fragment[2], 'html.parser')
        div = html_soup.find('div', class_='review')
        if not div:
            continue
        review_id = div.get('id')
        title = html_soup.find(class_='review-title').get_text(strip=True)
        review = html_soup.find(class_='review-text').get_text(strip=True)
        # Find and clean the rating:
        review_cls = ' '.join(html_soup.find(class_='review-rating').get('class'))
        rating = re.search('a-star-(\d+)', review_cls).group(1)
        reviews.append({'review_id': review_id, 'rating': rating,
                        'title': title, 'review': review})
    return reviews

def get_reviews(product_id, page):
    data = {
        'sortBy': '', 'reviewerType': 'all_reviews', 'formatType': '',
        'mediaType': '', 'filterByStar': 'all_stars', 'pageNumber': page,
        'filterByKeyword': '', 'shouldAppend': 'undefined', 'deviceType': 'desktop',
        'reftag': 'cm_cr_getr_d_paging_btm_{}'.format(page), 'pageSize': 10,
        'asin': product_id, 'scope': 'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data=data)
    reviews = parse_reviews(r.text)
    return reviews

print(get_reviews(product_id, 1))

 

This works! The only thing left to do is to loop through all the pages and store the reviews in a database using the “dataset” library. Luckily, figuring out when to stop looping is easy: once we do not get any reviews for a particular page, we can stop:

import requests
import json
import re
from bs4 import BeautifulSoup
import dataset

db = dataset.connect('sqlite:///reviews.db')

review_url = 'https://www.amazon.com/ss/customer-reviews/ajax/reviews/get/'
product_id = '1449355730'

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
                  '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))

def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = BeautifulSoup(json_fragment[2], 'html.parser')
        div = html_soup.find('div', class_='review')
        if not div:
            continue
        review_id = div.get('id')
        review_cls = ' '.join(html_soup.find(class_='review-rating').get('class'))
        rating = re.search('a-star-(\d+)', review_cls).group(1)
        title = html_soup.find(class_='review-title').get_text(strip=True)
        review = html_soup.find(class_='review-text').get_text(strip=True)
        reviews.append({'review_id': review_id, 'rating': rating,
                        'title': title, 'review': review})
    return reviews

def get_reviews(product_id, page):
    data = {
        'sortBy': '', 'reviewerType': 'all_reviews', 'formatType': '',
        'mediaType': '', 'filterByStar': 'all_stars', 'pageNumber': page,
        'filterByKeyword': '', 'shouldAppend': 'undefined', 'deviceType': 'desktop',
        'reftag': 'cm_cr_getr_d_paging_btm_{}'.format(page), 'pageSize': 10,
        'asin': product_id, 'scope': 'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data=data)
    reviews = parse_reviews(r.text)
    return reviews

page = 1
while True:
    print('Scraping page', page)
    reviews = get_reviews(product_id, page)
    if not reviews:
        break
    for review in reviews:
        print(' -', review['rating'], review['title'])
        db['reviews'].upsert(review, ['review_id'])
    page += 1

 

This will output the following:

Scraping page 1
- 5 let me try to explain why this 1600 page book may actually end
up saving you a lot of time and making you a better Python progra
- 5 Great start, and written for the novice
- 5 Best teacher of software development
- 5 Very thorough
- 5 If you like big thick books that deal with a lot of ...
- 5 Great book, even for the experienced python programmer
- 5 Good Tutorial; you'll learn a lot.
- 2 Takes too many pages to explain even the most simpliest ...
- 3 If I had a quarter for each time he says something like "here's
an intro to X
- 4 it almost seems better suited for a college class [...]

 

Now that we have a database containing the reviews, let’s do something fun with these. We’ll run a sentiment analysis algorithm over the reviews (providing a sentiment score per review), which we can then plot over the different ratings given to inspect the correlation between a rating and the sentiment in the text.

 

To do so, we’ll use the “vaderSentiment” library, which can simply be installed using pip. We’ll also need to install the “nltk” (Natural Language Toolkit) library:

 

pip install -U vaderSentiment
pip install -U nltk

Using the vaderSentiment library is pretty simple for a single sentence:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
sentence = "I'm really happy with my purchase"
vs = analyzer.polarity_scores(sentence)
print(vs)
# Shows: {'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.6115}

 

To get the sentiment for a longer piece of text, a simple approach is to calculate the sentiment score per sentence and average this over all the sentences in the text, like so:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk import tokenize

analyzer = SentimentIntensityAnalyzer()
paragraph = """
I'm really happy with my purchase.
I've been using the product for two weeks now.
It does exactly as described in the product description. The only problem is that it takes a long time to charge.
However, since I recharge during nights, this is something I can live with.
"""
sentence_list = tokenize.sent_tokenize(paragraph)
cumulative_sentiment = 0.0
for sentence in sentence_list:
    vs = analyzer.polarity_scores(sentence)
    cumulative_sentiment += vs["compound"]
    print(sentence, ' : ', vs["compound"])

average_score = cumulative_sentiment / len(sentence_list)
print('Average score:', average_score)
If you run this code, nltk will most likely complain about the fact that a resource is missing:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
[...]

To fix this, execute the recommended commands in a Python shell:

>>> import nltk
>>> nltk.download('punkt')

 

After the resource has been downloaded and installed, the code above should work fine and will output:

I'm really happy with my purchase. : 0.6115

I've been using the product for two weeks now. : 0.0

It does exactly as described in the product description. : 0.0

The only problem is that it takes a long time to charge. : -0.4019

However, since I recharge during nights, this is something I can live with. : 0.0

Average score: 0.04192000000000001

 

Let’s apply this to our list of Amazon reviews. We’ll calculate a sentiment score for each review, organize the scores by rating, and then use the “matplotlib” library to draw violin plots of the sentiment scores per rating:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk import tokenize
import dataset
import matplotlib.pyplot as plt

db = dataset.connect('sqlite:///reviews.db')
reviews = db['reviews'].all()

analyzer = SentimentIntensityAnalyzer()
sentiment_by_stars = [[] for r in range(1, 6)]
for review in reviews:
    full_review = review['title'] + '. ' + review['review']
    sentence_list = tokenize.sent_tokenize(full_review)
    cumulative_sentiment = 0.0
    for sentence in sentence_list:
        vs = analyzer.polarity_scores(sentence)
        cumulative_sentiment += vs["compound"]
    average_score = cumulative_sentiment / len(sentence_list)
    sentiment_by_stars[int(review['rating'])-1].append(average_score)

plt.violinplot(sentiment_by_stars,
               range(1, 6),
               vert=False, widths=0.9,
               showmeans=False, showextrema=True, showmedians=True,
               bw_method='silverman')
plt.axvline(x=0, linewidth=1, color='black')
plt.show()

In this case, we can indeed observe a strong correlation between the rating and the sentiments of the texts, though it’s interesting to note that even for lower ratings (two and three stars), the majority of reviews are still somewhat positive.
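To back that observation up with some numbers, here is a small sketch that assumes the sentiment_by_stars lists from the script above are still in memory and prints summary statistics per rating:

import numpy as np

# Median sentiment score per star rating (assumes sentiment_by_stars
# from the previous script)
for stars, scores in enumerate(sentiment_by_stars, start=1):
    if scores:
        print(stars, 'stars:', len(scores), 'reviews, median sentiment:',
              round(float(np.median(scores)), 3))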

 

Of course, there is a lot more that can be done with this data set. Think, for instance, about a predictive model to detect fake reviews.
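As a very rough starting point, the following hypothetical sketch predicts the star rating from the review text with scikit-learn; it is not a fake-review detector as such (our database contains no labels indicating which reviews are fake), but it shows how one could begin building a text-based predictive model on this data set:

import dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

db = dataset.connect('sqlite:///reviews.db')
rows = list(db['reviews'].all())
texts = [row['title'] + '. ' + row['review'] for row in rows]
ratings = [row['rating'] for row in rows]

X_train, X_test, y_train, y_test = train_test_split(texts, ratings,
                                                    test_size=0.2, random_state=42)
# Bag-of-words features (TF-IDF) fed into a simple logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print('Accuracy on held-out reviews:', model.score(X_test, y_test))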

 
