
Scraping and Analyzing News Articles of a Website Using Python (2019)

This blog post explains how to scrape and analyze the news articles of a website using Python. Our goal is to visit every article on the site and extract the title and main content of each article.

 

How to Scrape a Website Using Python?

Extracting the “main content” from a web page is trickier than it might seem at first sight. You might try to iterate over all the lowest-level HTML elements and keep the one with the most text embedded in it.

 

However, this approach will break if the text of an article is split up over multiple sibling elements, such as a series of “<p>” tags inside a larger “<div>”.

 

Considering all elements does not resolve this issue either, as you would end up simply selecting the top element (e.g., “<html>” or “<body>”) of the page, which always contains the largest amount of text (i.e., all of it). The same holds if you rely on the rect attribute Selenium provides to apply a visual approach (i.e., finding the element taking up the most space on the page). A naive sketch of the text-based approach is shown below.
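To make the pitfall concrete, here is a minimal sketch of that naive text-based approach, assuming requests and BeautifulSoup are installed (the URL is just a placeholder). It picks the single element whose own text is longest, which is exactly what goes wrong once an article is spread over several sibling “<p>” tags:

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.example.com').text  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')

best_element, best_length = None, 0
for element in soup.find_all(True):
    # Only count the text directly inside this element, not inside its children
    own_text = ''.join(element.find_all(string=True, recursive=False)).strip()
    if len(own_text) > best_length:
        best_element, best_length = element, len(own_text)

print(best_element.name if best_element else None, best_length)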

 

A large number of libraries and tools have been written to solve this issue. Take a look, for example, at https://github.com/masukomi/arc90-readability, https://github.com/misja/python-boilerpipe, https://github.com/codelucas/newspaper, and https://github.com/fhamborg/news-please for some interesting libraries targeting the specific purpose of news extraction, or at specialized APIs such as https://newsapi.org/ and https://webhose.io/news-api.

 

In this example, we’ll use Mozilla's implementation of Readability; see https://github.com/mozilla/readability. This is a JavaScript-based library, but we’ll figure out a way to use it with Python and Selenium nonetheless.

 

Finally, although it has sadly fallen a bit out of use in recent years, it is interesting to know that there already exists a nice format that sites can use to offer their content updates in a structured way: RSS (Rich Site Summary),

 

a web feed that allows users to access updates to online content in a standardized, XML-based format. Keep an eye out for “<link>” tags with their “type” attribute set to ”application/rss+xml”; the “href” attribute will then announce the URL where the RSS feed can be found.
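As a quick illustration (separate from the main example), you could detect an advertised RSS feed as follows; the URL below is only an example and any news site can be used:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.nytimes.com').text  # any news site will do
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('link', attrs={'type': 'application/rss+xml'}):
    print('RSS feed announced at:', link.get('href'))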

 

Let’s start by extracting a list of “Top Stories” links from Google News using Selenium. The first iteration of our script looks as follows:

from selenium import webdriver

base_url = 'https://news.google.com/news/?ned=us&hl=en'

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(base_url)

for link in driver.find_elements_by_css_selector('main a[role="heading"]'):
    news_url = link.get_attribute('href')
    print(news_url)

driver.quit()
This will output the following (of course, your links might vary):
http://news.xinhuanet.com/english/2017-10/24/c_136702615.htm
http://www.cnn.com/2017/10/24/asia/china-xi-jinping-thought/index.html
[...]

 

Navigate to http://edition.cnn.com/2017/10/24/asia/china-xi-jinping-thought/index.html in your browser and open your browser’s console in its developer tools.

 

Our goal is now to extract the content from this website, using Mozilla’s Readability implementation in JavaScript, a tool which is normally used to display articles in a more readable format. 

 

That is, we would like to “inject” the JavaScript code available at https://raw.githubusercontent.com/mozilla/readability/master/Readability.js into the page.

 

Since we can instruct the browser to execute JavaScript through Selenium, we need to come up with an appropriate piece of JavaScript code to perform this injection. Using your browser’s console, try executing the following block of code:

(function(d, script) {
    script = d.createElement('script');
    script.type = 'text/javascript';
    script.async = true;
    script.onload = function() {
        console.log('The script was successfully injected!');
    };
    script.src = 'https://raw.githubusercontent.com/' +
                 'mozilla/readability/master/Readability.js';
    d.getElementsByTagName('head')[0].appendChild(script);
}(document));

 

This script works as follows: a new “<script>” element is constructed with its “src” attribute set to https://raw.githubusercontent.com/mozilla/readability/master/Readability.js and appended to the “<head>” of the document. Once the script has loaded, we show a message on the console.

 

This does not work as we had expected, as Chrome refuses to execute this script:

Refused to execute script from 'https://raw.githubusercontent.com/mozilla/readability/master/Readability.js' because its MIME type ('text/plain') is not executable, and strict MIME type checking is enabled.

 

The problem here is that GitHub indicates in its headers that the content type of this document is “text/plain,” and Chrome prevents us from using it as a script. To work around this issue, we’ll host a copy of the script ourselves at http://www.webscrapingfordatascience.com/readability/Readability.js and try again:

(function(d, script) {
    script = d.createElement('script');
    script.type = 'text/javascript';
    script.async = true;
    script.onload = function() {
        console.log('The script was successfully injected!');
    };
    script.src = 'http://www.webscrapingfordatascience.com/readability/Readability.js';
    d.getElementsByTagName('head')[0].appendChild(script);
}(document));
Which should give the correct result:
The script was successfully injected!
Now that the “<script>” tag has been injected and executed, we need to figure out how to use it. Mozilla’s documentation at https://github.com/mozilla/readability provides us with some instructions, based on which we can try executing the following (still in the console window):
var documentClone = document.cloneNode(true);
var loc = document.location;
var uri = {
    spec: loc.href,
    host: loc.host,
    prePath: loc.protocol + "//" + loc.host,
    scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
    pathBase: loc.protocol + "//" + loc.host +
              loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
};
var article = new Readability(uri, documentClone).parse();
console.log(article);

 

The question is now how we can return this information to Selenium. Remember that we can execute JavaScript commands from Selenium through the execute_script method.

 

One possible approach to get the information we want is to use JavaScript to replace the whole page’s contents with the information we want, and then use Selenium to retrieve that information:

from selenium import webdriver

base_url = 'http://edition.cnn.com/2017/10/24/asia/china-xi-jinping-thought/index.html'

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(base_url)

js_cmd = '''
(function(d, script) {
    script = d.createElement('script');
    script.type = 'text/javascript';
    script.async = true;
    script.onload = function() {
        var documentClone = document.cloneNode(true);
        var loc = document.location;
        var uri = {
            spec: loc.href,
            host: loc.host,
            prePath: loc.protocol + "//" + loc.host,
            scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
            pathBase: loc.protocol + "//" + loc.host +
                      loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
        };
        var article = new Readability(uri, documentClone).parse();
        document.body.innerHTML = '<h1 id="title">' + article.title + '</h1>' +
                                  '<div id="content">' + article.content + '</div>';
    };
    script.src = 'http://www.webscrapingfordatascience.com/readability/Readability.js';
    d.getElementsByTagName('head')[0].appendChild(script);
}(document));
'''

driver.execute_script(js_cmd)

title = driver.find_element_by_id('title').text.strip()
content = driver.find_element_by_id('content').text.strip()

print('Title was:', title)

driver.quit()

The “document.body.innerHTML” line in the JavaScript command will replace the contents of the “<body>” tag with a header and a “<div>” tag, from which we can then simply retrieve our desired information.

 

However, the execute_script method also allows us to pass back JavaScript objects to Python, so the following approach also works:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
base_url = 'http://edition.cnn.com/2017/10/24/asia/china-xi-jinping-thought/index.html'

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(base_url)

js_cmd = '''
(function(d, script) {
    script = d.createElement('script');
    script.type = 'text/javascript';
    script.async = true;
    script.onload = function() {
        script.id = 'readability-script';
    };
    script.src = 'http://www.webscrapingfordatascience.com/readability/Readability.js';
    d.getElementsByTagName('head')[0].appendChild(script);
}(document));
'''

js_cmd2 = '''
var documentClone = document.cloneNode(true);
var loc = document.location;
var uri = {
    spec: loc.href,
    host: loc.host,
    prePath: loc.protocol + "//" + loc.host,
    scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
    pathBase: loc.protocol + "//" + loc.host +
              loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
};
var article = new Readability(uri, documentClone).parse();
return JSON.stringify(article);
'''

driver.execute_script(js_cmd)

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "readability-script")))

returned_result = driver.execute_script(js_cmd2)
print(returned_result)

driver.quit()

 

There are several intricacies here that warrant some extra information. First, note that we’re using the execute_script method twice: once to inject the “<script>” tag, and then again to retrieve our “article” object.

 

However, since executing the script might take some time, and Selenium’s implicit wait does not take this into account when using execute_script, we use an explicit wait to check for the presence of an element with an “id” of “readability-script,” which is set by the “script.onload” function.

 

Once such an element is found, we know that the script has finished loading and we can execute the second JavaScript command.

 

Here, we do need to use “JSON.stringify” to make sure we return a JSON-formatted string rather than a raw JavaScript object to Python, as Python would not be able to make sense of the raw return value and would convert it to a list of None values (simple types, such as integers and strings, are fine, however).
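As a small illustrative sketch of this pattern (with Wikipedia standing in as an arbitrary example page), stringifying on the JavaScript side and parsing with json.loads on the Python side moves structured data across reliably:

from json import loads
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.wikipedia.org')

# Build a simple object in JavaScript and hand it back as a JSON string
result = driver.execute_script(
    'return JSON.stringify({title: document.title, url: document.location.href});')
page_info = loads(result)  # now a regular Python dictionary
print(page_info['title'])

driver.quit()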

 

Let’s clean up our script a little and merge it with our basic framework:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
base_url = 'https://news.google.com/news/?ned=us&hl=en'
inject_readability_cmd = '''
(function(d, script) {
    script = d.createElement('script');
    script.type = 'text/javascript';
    script.async = true;
    script.onload = function() {
        script.id = 'readability-script';
    };
    script.src = 'http://www.webscrapingfordatascience.com/readability/Readability.js';
    d.getElementsByTagName('head')[0].appendChild(script);
}(document));
'''

get_article_cmd = '''
var documentClone = document.cloneNode(true);
var loc = document.location;
var uri = {
    spec: loc.href,
    host: loc.host,
    prePath: loc.protocol + "//" + loc.host,
    scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
    pathBase: loc.protocol + "//" + loc.host +
              loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
};
var article = new Readability(uri, documentClone).parse();
return JSON.stringify(article);
'''

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(base_url)

news_urls = []
for link in driver.find_elements_by_css_selector('main a[role="heading"]'):
    news_url = link.get_attribute('href')
    news_urls.append(news_url)

for news_url in news_urls:
    print('Now scraping:', news_url)
    driver.get(news_url)
    print('Injecting scripts')
    driver.execute_script(inject_readability_cmd)
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.ID, "readability-script")))
    returned_result = driver.execute_script(get_article_cmd)
    # Do something with returned_result

driver.quit()

Note that we’re using two “for” loops: one to extract the links we wish to scrape, which we’ll store in a list; and another one to iterate over that list. Using a single loop wouldn’t work in this case: since we’re navigating to other pages inside the loop,

 

Selenium would complain about “stale elements” when trying to find the next link with find_elements_by_css_selector.

 

This is basically saying: “I’m trying to find the next element for you, but the website has changed in the meantime, so I can’t be sure anymore what you want to retrieve.”
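The safe pattern boils down to collecting plain URL strings before navigating anywhere, since strings (unlike WebElement objects) cannot go stale. A minimal sketch, reusing the same Google News selector as above:

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get('https://news.google.com/news/?ned=us&hl=en')

# First loop: collect plain URL strings while we are still on the overview page
news_urls = [link.get_attribute('href')
             for link in driver.find_elements_by_css_selector('main a[role="heading"]')]

# Second loop: now it is safe to navigate away, as we no longer touch the old elements
for news_url in news_urls:
    driver.get(news_url)
    # ... scrape the article page here ...

driver.quit()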

 

If you try to execute this script, you’ll note that it quickly fails anyway. What is happening here?

 

To figure out what is going wrong, try opening another link in your browser, say https://www.washingtonpost.com/world/chinas-leader-elevated-to-the-level-of-mao-in-communist-pantheon (a site using HTTPS), and executing the first JavaScript command manually in your browser’s console, that is, by copy-pasting and executing:

 

(function(d, script) {
    script = d.createElement('script');
    script.type = 'text/javascript';
    script.async = true;
    script.onload = function() {
        script.id = 'readability-script';
    };
    script.src = 'http://www.webscrapingfordatascience.com/readability/Readability.js';
    d.getElementsByTagName('head')[0].appendChild(script);
}(document));
You’ll probably get a result like what follows:
GET https://www.webscrapingfordatascience.com/readability/Readability.js 
net::ERR_CONNECTION_CLOSED

 

On other websites, you might get:

Mixed Content: The page at [...] was loaded over HTTPS, but requested an insecure script 'http://www.webscrapingfordatascience.com/readability/Readability.js'.

This request has been blocked; the content must be served over HTTPS.

 

It’s clear what’s going on here: if we load a website through HTTPS and try to inject a script through HTTP, Chrome will block this request as it deems it insecure (which is true).

 

Other sites might apply other measures to prevent script injection, for example, by using a “Content-Security-Policy” header, which would result in an error like this:

 

Refused to load the script 'http://www.webscrapingfordatascience.com/readability/Readability.js' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-eval' 'unsafe-inline'".

 

There are extensions available for Chrome that will disable such checks, but we’re going to take a different approach here, which will work on the majority of websites except those with the most strict Content Security Policies: instead of trying to inject a “<script>” tag, we’re going to simply take the contents of our JavaScript file and execute these directly using Selenium.

 

We can do so by loading the contents from a local file, but since we’ve already hosted the file online, we’re going to use requests to fetch the contents instead:

from selenium import webdriver
import requests
base_url = 'https://news.google.com/news/?ned=us&hl=en'
script_url = 'http://www.webscrapingfordatascience.com/readability/Readability.js'

get_article_cmd = requests.get(script_url).text
get_article_cmd += '''
var documentClone = document.cloneNode(true);
var loc = document.location;
var uri = {
    spec: loc.href,
    host: loc.host,
    prePath: loc.protocol + "//" + loc.host,
    scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
    pathBase: loc.protocol + "//" + loc.host +
              loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
};
var article = new Readability(uri, documentClone).parse();
return JSON.stringify(article);
'''

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(base_url)

news_urls = []
for link in driver.find_elements_by_css_selector('main a[role="heading"]'):
    news_url = link.get_attribute('href')
    news_urls.append(news_url)

for news_url in news_urls:
    print('Now scraping:', news_url)
    driver.get(news_url)
    print('Injecting script')
    returned_result = driver.execute_script(get_article_cmd)
    # Do something with returned_result

driver.quit()

 

This approach also has the benefit that we can execute our whole JavaScript command in one go and do not need to rely on an explicit wait anymore to check whether the script has finished loading.

 

The only thing remaining now is to convert the retrieved result to a Python dictionary and store our results in a database, once more using the “dataset” library:

from selenium import webdriver
import requests
import dataset
from json import loads
db = dataset.connect('sqlite:///news.db')
base_url = 'https://news.google.com/news/?ned=us&hl=en'
script_url = 'http://www.webscrapingfordatascience.com/readability/Readability.js'

get_article_cmd = requests.get(script_url).text
get_article_cmd += '''
var documentClone = document.cloneNode(true);
var loc = document.location;
var uri = {
    spec: loc.href,
    host: loc.host,
    prePath: loc.protocol + "//" + loc.host,
    scheme: loc.protocol.substr(0, loc.protocol.indexOf(":")),
    pathBase: loc.protocol + "//" + loc.host +
              loc.pathname.substr(0, loc.pathname.lastIndexOf("/") + 1)
};
var article = new Readability(uri, documentClone).parse();
return JSON.stringify(article);
'''

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(base_url)

news_urls = []
for link in driver.find_elements_by_css_selector('main a[role="heading"]'):
    news_url = link.get_attribute('href')
    news_urls.append(news_url)

for news_url in news_urls:
    print('Now scraping:', news_url)
    driver.get(news_url)
    print('Injecting script')
    returned_result = driver.execute_script(get_article_cmd)
    # Convert the JSON string to a Python dictionary
    article = loads(returned_result)
    if not article:
        # Failed to extract an article, just continue
        continue
    # Add in the url
    article['url'] = news_url
    # Remove 'uri' as this is a dictionary on its own
    del article['uri']
    # Add to the database
    db['articles'].upsert(article, ['url'])
    print('Title was:', article['title'])

driver.quit()

 

The output looks as follows:

Now scraping: https://www.usnews.com/news/world/articles/2017-10-24/china-southeast-asia-aim-to-build-trust-with-sea-drills-singapore-says
Injecting script
Title was: China, Southeast Asia Aim to Build Trust With Sea Drills, Singapore Says | World News
Now scraping: http://www.philstar.com/headlines/2017/10/24/1751999/pentagon-chief-seeks-continued-maritime-cooperation-asean
Injecting script
Title was: Pentagon chief seeks continued maritime cooperation with ASEAN | Headlines News, The Philippine Star
[...]

 

Remember that you can take a look at the database (“news.db”) using an SQLite client such as “DB Browser for SQLite”.
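If you prefer to stay in Python, the same “dataset” library also offers a quick way to peek at what was collected (a small sketch; the table and column names follow the scraping script above):

import dataset

db = dataset.connect('sqlite:///news.db')

for row in db['articles'].all():
    print(row['url'], '-', row['title'])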

We can now analyze our collected articles using Python. We’re going to construct a topic model using Latent Dirichlet Allocation (LDA), which will help us categorize our articles into a number of topics.

 

To do so, we’ll use the “nltk,” “stop-words,” and “gensim” libraries, which can simply be installed using pip:
pip install -U nltk
pip install -U stop-words
pip install -U gensim

 

First, we’re going to loop through all our articles in order to tokenize them (convert the text into a list of word elements) using a simple regular expression, remove stop words, and apply stemming:

import dataset
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words

db = dataset.connect('sqlite:///news.db')

articles = []
tokenizer = RegexpTokenizer(r'\w+')
stop_words = get_stop_words('en')
p_stemmer = PorterStemmer()

for article in db['articles'].all():
    text = article['title'].lower().strip()
    text += " " + article['textContent'].lower().strip()
    if not text:
        continue
    # Tokenize
    tokens = tokenizer.tokenize(text)
    # Remove stop words and small words
    clean_tokens = [i for i in tokens if not i in stop_words]
    clean_tokens = [i for i in clean_tokens if len(i) > 2]
    # Stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in clean_tokens]
    # Add to list
    articles.append((article['title'], stemmed_tokens))

print(articles[0])

 

Our first article now looks as follows (we keep the title for later reporting):

('Paul Manafort, former business partner to surrender in Mueller investigation', ['presid', 'trump', 'former', 'campaign', 'chairman', ...])

 

To generate an LDA model, we need to calculate how frequently each term occurs within each document. To do that, we can construct a document-term matrix with gensim:

from gensim import corpora

dictionary = corpora.Dictionary([a[1] for a in articles])
corpus = [dictionary.doc2bow(a[1]) for a in articles]

print(corpus[0])

 

The Dictionary class traverses texts and assigns a unique integer identifier to each unique token while also collecting word counts and relevant statistics.

 

Next, each document is converted to a bag-of-words representation through the dictionary, which results in a list of vectors, one per document. Each document vector is a series of “(id, count)” tuples:

[(0, 10), (1, 17), (2, 7), (3, 11), [...]]
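To make these tuples readable, you can map the integer identifiers back to their tokens through the dictionary; a small sketch continuing from the snippet above:

# Map each (id, count) pair of the first document back to (token, count)
readable = [(dictionary[token_id], count) for token_id, count in corpus[0]]
print(readable)  # e.g. [('presid', 10), ('trump', 17), ...]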
We’re now ready to construct an LDA model:
from gensim.models.ldamodel import LdaModel

nr_topics = 30
ldamodel = LdaModel(corpus, num_topics=nr_topics,
                    id2word=dictionary, passes=20)
print(ldamodel.print_topics())
This will show something like:
[(0, '0.027*"s" + 0.018*"trump" + 0.018*"manafort" + 0.011*"investig"
+ 0.008*"presid" + 0.008*"report" + 0.007*"mueller" + 0.007*"year"
+ 0.007*"campaign" + 0.006*"said"'),
(1, '0.014*"s" + 0.014*"said" + 0.013*"percent" + 0.008*"1" +
0.007*"0" + 0.006*"year" + 0.006*"month" + 0.005*"increas" +
0.005*"3" + 0.005*"spend"'), [...]
]

 

This overview shows one entry per topic. Each topic is represented by a list of words likely to appear in that topic, ordered by their probability of appearance.

 

Note that tuning the number of topics and the number of “passes” is important to get a good result. Once the results look acceptable (we’ve increased the number of topics for our scraped set), we can use our model to assign topics to our documents:

from random import shuffle

# Show topics by top-3 terms
for t in range(nr_topics):
    print(ldamodel.print_topic(t, topn=3))

# Show some random articles
idx = list(range(len(articles)))
shuffle(idx)

for a in idx[:3]:
    article = articles[a]
    print('==========================')
    print(article[0])
    prediction = ldamodel[corpus[a]][0]
    print(ldamodel.print_topic(prediction[0], topn=3))
    print('Probability:', prediction[1])
This will show something like the following:
0.014*"new" + 0.013*"power" + 0.013*"storm" 0.030*"rapp" + 0.020*"spacey" + 0.016*"said" 0.024*"catalan" + 0.020*"independ" + 0.019*"govern" 0.025*"manafort" + 0.020*"trump" + 0.015*"investig" 0.007*"quickli" + 0.007*"complex" + 0.007*"deal" 0.018*"earbud" + 0.016*"iconx" + 0.014*"samsung" 0.012*"halloween" + 0.007*"new" + 0.007*"star" 0.021*"octopus" + 0.014*"carver" + 0.013*"vega" 0.000*"rapp" + 0.000*"spacey" + 0.000*"said" 0.025*"said" + 0.017*"appel" + 0.012*"storm" 0.039*"akzo" + 0.018*"axalta" + 0.017*"billion"
0.024*"rapp" + 0.024*"spacey" + 0.017*"said" 0.000*"boehner" + 0.000*"one" + 0.000*"trump" 0.033*"boehner" + 0.010*"say" + 0.009*"hous" 0.000*"approv" + 0.000*"boehner" + 0.000*"quarter" 0.017*"tax" + 0.013*"republican" + 0.011*"week" 0.012*"trump" + 0.008*"plan" + 0.007*"will" 0.005*"ludwig" + 0.005*"underlin" + 0.005*"sensibl" 0.015*"tax" + 0.011*"trump" + 0.011*"look" 0.043*"minist" + 0.032*"prime" + 0.030*"alleg" 0.058*"harri" + 0.040*"polic" + 0.032*"old" 0.040*"musk" + 0.026*"tunnel" + 0.017*"compani" 0.055*"appl" + 0.038*"video" + 0.027*"peterson" 0.011*"serv" + 0.008*"almost" + 0.007*"insid" 0.041*"percent" + 0.011*"year" + 0.010*"trump" 0.036*"univers" + 0.025*"econom" + 0.012*"special" 0.022*"chees" + 0.021*"patti" + 0.019*"lettuc" 0.000*"boehner" + 0.000*"said" + 0.000*"year" 0.000*"boehner" + 0.000*"new" + 0.000*"say" 0.030*"approv" + 0.025*"quarter" + 0.021*"rate"
==========================
Paul Manafort, Who Once Ran Trump Campaign, Indicted on Money Laundering and Tax Charges
0.025*"manafort" + 0.020*"trump" + 0.015*"investig"
Probability: 0.672658189483
==========================
Apple fires employee after daughter's iPhone X video goes viral
0.055*"appl" + 0.038*"video" + 0.027*"peterson"
Probability: 0.990880503145
==========================
Theresa May won't say when she knew about sexual harassment allegations
0.043*"minist" + 0.032*"prime" + 0.030*"alleg"
Probability: 0.774530402797

 


 

Scraping and Analyzing a Wikipedia Graph

Our goal here is to scrape the titles of Wikipedia pages while keeping track of the links between them, which we’ll use to construct a graph and analyze it using Python. We’ll again use the “dataset” library as a simple means to store results.

 

The following code contains the full crawling setup:

import requests
import dataset
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urldefrag
from joblib import Parallel, delayed

db = dataset.connect('sqlite:///wikipedia.db')
base_url = 'https://en.wikipedia.org/wiki/'

def store_website(url, title):
    print('Visited website:', url)
    print('  title:', title)
    db['websites'].upsert({'url': url, 'title': title}, ['url'])

def store_links(from_url, links):
    db.begin()
    for to_url in links:
        db['links'].upsert({'from_url': from_url, 'to_url': to_url},
                           ['from_url', 'to_url'])
    db.commit()

def get_random_unvisited_websites(amount=10):
    result = db.query('''SELECT * FROM links
        WHERE to_url NOT IN (SELECT url FROM websites)
        ORDER BY RANDOM() LIMIT {}'''.format(amount))
    return [r['to_url'] for r in result]

def should_visit(base_url, url):
    if url is None:
        return None
    full_url = urljoin(base_url, url)
    full_url = urldefrag(full_url)[0]
    if not full_url.startswith(base_url):
        # This is an external URL
        return None
    ignore = ['Wikipedia:', 'Template:', 'File:', 'Talk:', 'Special:',
              'Template talk:', 'Portal:', 'Help:', 'Category:', 'index.php']
    if any([i in full_url for i in ignore]):
        # This is a page to be ignored
        return None
    return full_url

def get_title_and_links(base_url, url):
    html = requests.get(url).text
    html_soup = BeautifulSoup(html, 'html.parser')
    website_title = html_soup.find(id='firstHeading')
    website_title = website_title.text if website_title else ''
    links = []
    for link in html_soup.find_all("a"):
        link_url = should_visit(base_url, link.get('href'))
        if link_url:
            links.append(link_url)
    return url, website_title, links

if __name__ == '__main__':
    urls_to_visit = [base_url]
    while urls_to_visit:
        scraped_results = Parallel(n_jobs=5, backend="threading")(
            delayed(get_title_and_links)(base_url, url) for url in urls_to_visit
        )
        for url, website_title, links in scraped_results:
            store_website(url, website_title)
            store_links(url, links)
        urls_to_visit = get_random_unvisited_websites()

 

There are a lot of things going on here that warrant some extra explanation:

  •  The database is structured as follows: a table “websites” holds a list of visited URLs with their website titles. The method store_website is used to store entries in this table.

 

  • Another table, “links,” simply contains pairs of URLs to represent links between websites. The method store_links is used to update these, and both methods use the “dataset” library.

 

For the latter, we perform multiple upsert operations inside a single explicit database transaction to speed things up.

  • The method get_random_unvisited_websites now returns a list of unvisited URLs, rather than just one, by selecting a random list of linked-to URLs that do not yet appear in the “websites” table (and hence have not been visited yet).

 

  • The should_visit method is used to determine whether a link should be considered for crawling. It returns a properly formatted URL if it should be included, or None otherwise.

          

  • The get_title_and_links method performs the actual scraping of websites, fetching their title and a list of URLs.

          

  • The script itself loops until there are no more unvisited websites (basically forever, as new websites will continue to be discovered). It fetches a list of random websites we haven’t visited yet, gets their titles and links, and stores these in the database.

          

  • Note that we use the “joblib” library here to set up a parallel approach. Simply visiting URLs one by one would be a tad too slow, so we use joblib to set up a multithreaded approach that visits several links at the same time, effectively spawning multiple network requests in parallel.

 

  • It’s important not to hammer our own connection or Wikipedia, so we limit the n_jobs argument to five.

 

The backend argument is used here to indicate that we want to set up a parallel calculation using multiple threads, instead of multiple processes. Both approaches have their pros and cons in Python.

 

A multiprocess approach comes with a bit more setup overhead, but it can be faster, as Python’s internal threading is constrained by the “global interpreter lock” (the GIL). A full discussion of the GIL is out of scope here, but feel free to look up more information online if this is the first time you have heard about it.

 

In our case, the work itself is relatively straightforward: execute a network request and perform some parsing, so a multithreading approach is fine.
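To see the pattern in isolation, here is a minimal, self-contained sketch of the same joblib construct (the URLs are only illustrations): n_jobs sets the number of parallel workers, and backend="threading" keeps everything inside a single process, which is fine for I/O-bound work such as network requests:

import requests
from joblib import Parallel, delayed

def fetch_length(url):
    # An I/O-bound task: most of the time is spent waiting on the network
    return url, len(requests.get(url).text)

urls = ['https://en.wikipedia.org/wiki/Web_scraping',
        'https://en.wikipedia.org/wiki/Python_(programming_language)']

results = Parallel(n_jobs=2, backend="threading")(
    delayed(fetch_length)(url) for url in urls)
print(results)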

 

This is also the reason why we don’t store the results in the database inside the get_title_and_links method itself but wait until the parallel jobs have finished their execution and have returned their results.

 

SQLite doesn’t like to be written to from multiple threads or many processes at once, so we wait until we have collected the results before writing them to the database. An alternative would be to use a client-server database system. Note that we should avoid overloading the database too much with a huge set of results.

 

Not only will the intermediate results have to be stored in memory, but we’ll also incur a waiting time when writing the large set of results. Since the get_random_unvisited_websites method returns a list of ten URLs maximum, we don’t need to worry about this too much in our case.

          

Finally, note that the main entry point of the script is now placed under “if __name__ == '__main__':”.

 

In other examples, we have not done so for the sake of simplicity, although it is good practice to do so nonetheless. The reason for this is as follows: when a Python script imports another module, all the code contained in that module is executed at once.

 

For instance, if we’d like to reuse the should_visit method in another script,

we could import our original script using “import myscript” or “from myscript import should_visit.”

 

In both cases, the full code in “myscript.py” will be executed. If this script contains a block of code, like our “while” loop in this example, it will start executing that block of code, which is not what we want when importing our script;

 

we just want to load the function definitions. We hence want to tell Python to “only execute this block of code when the script is run directly,” which is what the “if __name__ == '__main__':” check does. If we start our script from the command line, the special “__name__” variable will be set to “__main__”.
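As a minimal illustration (using a hypothetical file named “myscript.py”), the guarded block only runs when the file is executed directly, not when it is imported:

def should_visit(base_url, url):
    ...  # function definitions are loaded on import as well

if __name__ == '__main__':
    # Only reached via "python myscript.py", never via "import myscript"
    print('Running as a script')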

 

If our script is imported from another module, “__name__” will be set to that module’s name instead. When using joblib as we do here, the contents of our script will be sent to all “workers” (threads or processes) so that they can perform the correct imports and load the correct function definitions.

 

In our case, for instance, the different workers should know about the get_title_and_links method.

 

However, since the workers also execute the full code contained in the script (just like an import would), we need to prevent them from running the main block of code as well, which is why we provide the “if __name__ == '__main__':” check.

 

You can let the crawler run for as long as you like, though note that it is extremely unlikely to ever finish, and a smaller graph will also be a bit easier to look at in the next step.

 

Once it has run for a bit, simply interrupt it to stop it. Since we use “upsert,” feel free to resume it later on (it will just continue crawling from where it left off).

 

We can now perform some fun graph analysis using the scraped results. In Python, there are two popular libraries available to do so,

NetworkX (the “networkx” library in pip) and iGraph (“python-igraph” in pip).

We’ll use NetworkX here, as well as “matplotlib” to visualize the graph.

 

Graph Visualization Is Hard

As the NetworkX documentation itself notes, proper graph visualization is hard, and the library authors recommend that people visualize their graphs with tools dedicated to that task.

 

For our simple use case, the built-in methods suffice, even though we’ll have to wrangle our way through matplotlib to make things a bit more appealing. Take a look at programs such as Cytoscape, Gephi, and Graphviz if you’re interested in graph visualization. In the next example, we’ll use Gephi to handle the visualization workload.

 

The following code visualizes the graph. We first construct a new NetworkX graph object and add in the websites as visited nodes. Next, we add the edges, though only between websites that were visited.

 

As an extra step, we also remove nodes that are completely unconnected (even though these should not be present at this stage).

 

We then calculate a centrality measure, called betweenness, as a measure of the importance of nodes. This metric is calculated based on the number of shortest paths from all nodes to all other nodes that pass through the node we’re calculating the metric for.

 

The more times a node lies on the shortest path between two other nodes, the more important it is according to this metric.
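A toy illustration of the idea (separate from the Wikipedia script): in a small path graph a-b-c, node "b" lies on the only shortest path between "a" and "c" and therefore gets the highest betweenness score:

import networkx

G = networkx.Graph()
G.add_edges_from([('a', 'b'), ('b', 'c')])

# 'b' sits between 'a' and 'c', so its (normalized) betweenness is 1.0
print(networkx.betweenness_centrality(G))  # {'a': 0.0, 'b': 1.0, 'c': 0.0}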

 

We’ll color the nodes based on this metric by giving them different shades of blue. We apply a quick and dirty sigmoid function to the betweenness metric to “squash” the values in a range that will result in a more appealing visualization. We also add labels to nodes manually here, in order to have them appear above the actual nodes. 

 

Ignore the Warnings    

When running the visualization code, you’ll most likely see warnings from matplotlib complaining about the fact that NetworkX is using deprecated functions.

 

This is fine and can be safely ignored, though future versions of matplotlib might not play nicely with NetworkX anymore. It’s unclear whether the NetworkX authors will continue to focus on visualization in the future.

 

As you’ll note, the “arrows” of the edges in the visualization also don’t look very pretty. This is a long-standing issue with NetworkX. Again: NetworkX is fine for analysis and graph wrangling, though less so for visualization. Take a look at other libraries if visualization is your core concern.

import networkx
import matplotlib.pyplot as plt
import dataset

db = dataset.connect('sqlite:///wikipedia.db')

G = networkx.DiGraph()

print('Building graph...')
for website in db['websites'].all():
    G.add_node(website['url'], title=website['title'])
for link in db['links'].all():
    # Only add an edge if both endpoints have been visited
    if G.has_node(link['from_url']) and G.has_node(link['to_url']):
        G.add_edge(link['from_url'], link['to_url'])

# Unclutter by removing unconnected nodes
G.remove_nodes_from(list(networkx.isolates(G)))

# Calculate node betweenness centrality as a measure of importance
print('Calculating betweenness...')
betweenness = networkx.betweenness_centrality(G, endpoints=False)

print('Drawing graph...')
# Sigmoid function to make the colors (a little) more appealing
squash = lambda x: 1 / (1 + 0.5**(20*(x-0.1)))
colors = [(0, 0, squash(betweenness[n])) for n in G.nodes()]
labels = dict((n, d['title']) for n, d in G.nodes(data=True))
positions = networkx.spring_layout(G)
networkx.draw(G, positions, node_color=colors, edge_color='#AEAEAE')

# Draw the labels manually to make them appear above the nodes
for k, v in positions.items():
    plt.text(v[0], v[1] + 0.025, s=labels[k],
             horizontalalignment='center', size=8)
plt.show()
Scraping and Visualizing a Board Members Graph
In this example, our goal is to construct a social graph of S&P 500 companies and their interconnectedness through their board members. We’ll start from the S&P 500 website at Reuters available at https://www.reuters.com/finance/markets/index/.SPX to obtain a list of stock symbols:
from bs4 import BeautifulSoup
import requests
import re

session = requests.Session()
sp500 = 'https://www.reuters.com/finance/markets/index/.SPX'
page = 1
regex = re.compile(r'\/finance\/stocks\/overview\/.*')
symbols = []

while True:
    print('Scraping page:', page)
    params = {'sortBy': '', 'sortDir': '', 'pn': page}
    html = session.get(sp500, params=params).text
    soup = BeautifulSoup(html, "html.parser")
    pagenav = soup.find(class_='pageNavigation')
    if not pagenav:
        break
    companies = pagenav.find_next('table', class_='dataTable')
    for link in companies.find_all('a', href=regex):
        symbols.append(link.get('href').split('/')[-1])
    page += 1

print(symbols)
Once we have obtained a list of symbols, we can scrape the board member pages for each of them (e.g., https://www.reuters.com/finance/stocks/company-officers/MMM.N), fetch the table of board members, and store it as a pandas data frame, which we’ll save using pandas’ to_pickle method. Don’t forget to install pandas first if you haven’t already:

pip install -U pandas

Add this to the bottom of your script:
import pandas as pd

officers = 'https://www.reuters.com/finance/stocks/company-officers/{symbol}'

dfs = []

for symbol in symbols:
    print('Scraping symbol:', symbol)
    html = session.get(officers.format(symbol=symbol)).text
    soup = BeautifulSoup(html, "html.parser")
    officer_table = soup.find('table', {"class": "dataTable"})
    df = pd.read_html(str(officer_table), header=0)[0]
    df.insert(0, 'symbol', symbol)
    dfs.append(df)

# Store the results
df = pd.concat(dfs)
df.to_pickle('sp500.pkl')

 

This sort of information can lead to a lot of interesting use cases, especially, again, in the realm of graph and social network analytics.

 

We’re going to use NetworkX once more, but simply to parse through our collected information and export a graph in a format that can be read with Gephi, a popular graph visualization tool, which can be downloaded from https://gephi.org/users/download/:

import pandas as pd
import networkx as nx
from networkx.readwrite.gexf import write_gexf
df = pd.read_pickle('sp500.pkl')

G = nx.Graph()

for row in df.itertuples():
    G.add_node(row.symbol, type='company')
    G.add_node(row.Name, type='officer')
    G.add_edge(row.symbol, row.Name)

write_gexf(G, 'graph.gexf')

 

Open the graph file in Gephi, and apply the “ForceAtlas 2” layout technique for a few iterations. 

Take some time to explore Gephi’s visualization and filtering options if you like. All attributes that you have set in NetworkX (“type,” in our case) will be available in Gephi as well. 
