Web Scraping in Python (Best Tutorial 2019)


What Is Web Scraping?

Web “scraping” (also called “web harvesting,” “web data extraction,” or even “web data mining”) can be defined as “the construction of an agent to download, parse, and organize data from the web in an automated manner.”

 

Or, in other words: instead of a human end user clicking away in a web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads this task to a computer program that can execute it much faster, and more accurately, than a human can.

 

Why Web Scraping for Data Science?


When surfing the web using a normal web browser, you’ve probably encountered multiple sites where you considered the possibility of gathering, storing, and analyzing the data presented on the site’s pages. Especially for data scientists, whose “raw material” is data, the web exposes a lot of interesting opportunities:

 

  • There might be an interesting table on a Wikipedia page (or pages) you want to retrieve to perform some statistical analysis.

 

  • Perhaps you want to get a list of reviews from a movie site to perform text mining, create a recommendation engine, or build a predictive model to spot fake reviews.

 

  • You might wish to get a listing of properties on a real-estate site to build an appealing geo-visualization.

 

  • You’d like to gather additional features based on information found on the web to enrich your dataset, say, weather information to help forecast soft drink sales.

 

  • You might be wondering about doing social network analysis using profile data found on a web forum.
  • It might be interesting to monitor a news site for trending news stories on a particular topic of interest.

 

The web contains lots of interesting data sources that provide a treasure trove for all sorts of interesting projects. Sadly, the unstructured nature of much of the web does not make it easy to gather or export this data.

 

Web browsers are very good at showing images, displaying animations, and laying out websites in a way that is visually appealing to humans, but they do not expose a simple way to export their data, at least not in most cases.

 

Instead of viewing the web page by page through your web browser’s window, wouldn’t it be nice to be able to automatically gather a rich dataset? This is exactly where web scraping enters the picture.

 

If you know your way around the web a bit, you’ll probably be wondering: “Isn’t this exactly what application programming interfaces (APIs) are for?”

 

Indeed, many websites nowadays provide such an API, which offers a means for the outside world to access their data repository in a structured way — meant to be consumed and accessed by computer programs, not humans (although the programs are written by humans, of course).

 

Twitter, Facebook, LinkedIn, and Google, for instance, all provide such APIs in order to search and post tweets, get a list of your friends and their likes, see who you’re connected with, and so on.

 

So why, then, would we still need web scraping? The point is that APIs are a great means to access data sources, provided the website at hand provides one to begin with and that the API exposes the functionality you want.

 

The general rule of thumb is to look for an API first and use that if you can, before setting off to build a web scraper to gather the data.

 

For instance, you can easily use Twitter’s API to get a list of recent tweets, instead of reinventing the wheel yourself. Nevertheless, there are still various reasons why web scraping might be preferable over the use of an API:

  • The website you want to extract data from does not provide an API.
  • The API provided is not free (whereas the website is).
  • The API provided is rate limited: meaning you can only access it a certain number of times per second, per day, …
  • The API does not expose all the data you wish to obtain (whereas the website does).

 

In all of these cases, the usage of web scraping might come in handy. The fact remains that if you can view some data in your web browser, you will be able to access and retrieve it through a program. If you can access it through a program, the data can be stored, cleaned, and used in any way.

 

Who Is Using Web Scraping?


There are many practical applications of having access to and gathering data on the web, many of which fall in the realm of data science. The following list outlines some interesting real-life use cases:

 

Many of Google’s products have benefited from Google’s core business of crawling the web. Google Translate, for instance, utilizes text stored on the web to train and improve itself.

 

Scraping is being applied a lot in HR and employee analytics. The San Francisco-based startup hiQ specializes in selling employee analyses by collecting and examining public profile information, for instance from LinkedIn (which was not happy about this but was so far unable to prevent the practice following a court case; see the article “The Brutal Fight to Mine Your Data and Sell It to Your Boss”).

 

Digital marketers and digital artists often use data from the web for all sorts of interesting and creative projects. “We Feel Fine” by Jonathan Harris and Sep Kamvar, for instance, scraped various blog sites for phrases starting with “I feel,” the results of which were then visualized to show how the world was feeling throughout the day.

 

In another study, messages from Twitter, blogs, and other social media were scraped to construct a dataset that was used to build a predictive model for identifying patterns of depression and suicidal thoughts.

 

This might be an invaluable tool for aid providers, though of course it warrants a thorough consideration of privacy-related issues as well (see the article “Using Big Data to Predict Suicide Risk in Canada”).

 

Emmanuel Sales also scraped Twitter, though here with the goal of making sense of his own social circle and timeline of posts (see “Trying to organize my Twitter timeline, using unsupervised learning”).

 

An interesting observation here is that the author first considered using Twitter’s API, but found that “Twitter heavily rate limits doing this: if you want to get a user's following list, then you can only do so 15 times every 15 minutes, which is pretty unwieldy to work with.”

 

In a paper titled “The Billion Prices Project: Using Online Prices for Measurement and Research” (see NBER Working Paper w22111), web scraping was used to collect a dataset of online price information from which a robust daily price index was constructed for multiple countries.

 

Banks and other financial institutions are using web scraping for competitor analysis.

 

For example, banks frequently scrape competitors’ sites to get an idea of where branches are being opened or closed, or to track loan rates offered — all of which is interesting information that can be incorporated in their internal models and forecasting.

 

Investment firms also often use web scraping, for instance, to keep track of news articles regarding assets in their portfolio.

 

Sociopolitical scientists are scraping social websites to track population sentiment and political orientation. A famous article called “Dissecting Trump’s Most Rabid Online Following” analyzes user discussions on Reddit using semantic analysis to characterize the online followers and fans of Donald Trump.

 

One researcher was able to train a deep learning model based on scraped images from Tinder and Instagram together with their “likes” to predict whether an image would be deemed “attractive” 

 

(see What a Deep Neural Network thinks about your #selfie). Smartphone makers are already incorporating such models in their photo apps to help you brush up your pictures.

 

In “The Girl with the Brick Earring,” Lucas Woltmann sets out to scrape Lego brick information from BrickLink (a marketplace to buy and sell LEGO parts, sets, and minifigures) to determine the best selection of Lego pieces to represent an image (one of the co-authors of this blog is an avid Lego fan, so we had to include this example).

 

We’ve supervised a study where web scraping was used to extract information from job sites, to get an idea regarding the popularity of different data science- and analytics-related tools in the workplace (spoiler: Python and R were both rising steadily). 

 

Another study from our research group involved using web scraping to monitor news outlets and web forums to track public sentiment regarding Bitcoin.

 

No matter your field of interest, there’s almost always a use case to improve or enrich your practice based on data. “Data is the new oil,” so the common saying goes, and the web has a lot of it.

 

Setting Up Python for Web Scraping


We’ll be using Python 3 throughout this blog. You can download and install Python 3 for your platform (Windows, Linux, or macOS) from https://www.python.org/downloads/.

 

Why Python 3 and Not 2? According to the creators of Python themselves, “Python 2 is legacy, Python 3 is the present and future of the language.”

 

Since we strive to offer a modern guide, we have deliberately chosen Python 3 as our working language. That said, there’s still a lot of Python 2 code floating around (maybe even in your organization).

 

Most of the concepts and examples provided in this blog should work well in Python 2, too, if you add the following import statement in your Python 2 code:

from __future__ import absolute_import, division, print_function

 

You’ll also need to install “pip,” Python’s package manager. If you’ve installed a recent version of Python 3, it will already come with pip installed. It is a good idea, however, to make sure pip is up to date by executing the following command on the command line:

python -m pip install -U pip

Or (in case you’re using Linux or MacOS):

pip install -U pip

 

Manually Installing pip No pip on your system yet? Refer to the following page to install it on your system (under “Installing with get-pip.py”): https://pip.pypa.io/en/stable/installing/.

 

Finally, you might also wish to install a decent text editor on your system to edit Python code files. Python already comes with a bare-bones editor built in (look for “Idle” in your programs menu), but other text editors such as Notepad++, Sublime Text, VSCode, Atom, and others all work well, too.

 

A Quick Python Primer


We assume you already have some programming experience under your belt, and perhaps are already somewhat familiar with reading and writing Python code. If not, the following overview will get you up to speed quickly.

 

Python code can be written and executed in two ways:

1. By using the Python interpreter REPL (“read-eval-print-loop”), which provides an interactive session where you can enter Python commands line by line (read), which will be evaluated (eval), showing the results (print). These steps are repeated (“loop”) until you close the session.

 

2. By typing out Python source code in “.py” files and then running them.

 

For virtually any use case, it is a good idea to work with proper “.py” files and run these, though the Python REPL comes in handy to test a quick idea or experiment with a few lines of code. To start it, just enter “python” in a command-line window and press enter.

 

The three angle brackets (“>>>”) indicate a prompt, meaning that the Python REPL is waiting for your commands. Try executing the following commands:

>>> 1 + 1
2
>>> 8 - 1
7
>>> 10 * 2
20
>>> 35 / 5
7.0
>>> 5 // 3
1
>>> 7 % 3
1
>>> 2 ** 3
8

Mathematics work as you’d expect; “//” indicates an integer division; “%” is the modulo operator, which returns the remainder after division; and “**” means “raised to the power of.”

 

Python supports the following number types:


Integers (“int”), representing signed integer values (i.e., non-decimal numbers), such as 10, 100, -700, and so on.

Long integers (“long”), representing signed integer values taking up more memory than standard integers, hence allowing for larger numbers. They’re written by putting an “L” at the end, for example, 535633629843L, 10L, -100000000L, and so on. Note that this separate type only exists in Python 2; in Python 3, the standard “int” type can hold arbitrarily large numbers.

Floating-point values (“float”), that is, decimal numbers, such as 0.4, -10.2, and so on.

Complex numbers (“complex”), which are not widely used in non-mathematics code. They’re written by putting a “j” at the end, for example, 3.14j, .876j, and so on.
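
As a quick check, you can inspect the type of a value in the Python 3 REPL using the built-in type function:

>>> type(10)
<class 'int'>
>>> type(-10.2)
<class 'float'>
>>> type(3.14j)
<class 'complex'>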

 

Apart from numbers, Python also supports strings (“str”): textual values that are enclosed by double or single quotes, as well as the Boolean (“bool”) logic values: “True” and “False” (note the capitalization).

 

“None” represents a special value indicating nothingness. Try out the following lines of code to play around with these types:

>>> 'This is a string value'
'This is a string value'
>>> "This is also a string value"
'This is also a string value'
>>> "Strings can be " + 'added'
'Strings can be added'
>>> "But not to a number: " + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not int
>>> "So we convert it first: " + str(5)
'So we convert it first: 5'
>>> False and False
False
>>> False or True
True
>>> not False
True
>>> not True
False
>>> not False and True
True
>>> None + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
>>> 4 < 3  # > also works
False
>>> 100 >= 10  # <= also works
True
>>> 10 == 10
True
>>> None == False
False
>>> False == 0
True
>>> True == 1
True
>>> True == 2
False
>>> 2 != 3
True
>>> 2 != '2'
True

Again, the instructions above should be pretty self-explanatory, except for perhaps the last few lines. In Python, “==” indicates an equality comparison and hence returns True or False as a result. None is neither equal to False nor True itself, but False is considered equal to zero (0) and True is considered equal to one (1) in Python.

 

Note that the equality and inequality operators (“==” and “!=”) do consider the types that are being compared; the number 2 is hence not equal to the string “2.”

 

Is “Is” Equal to “==”? Apart from “==”, Python also provides the “is” keyword, which will return True if two variables point to the same object (their contents will hence always be equal as well). “==” checks whether the contents of two variables are equal, even though they might not point to the same object.

 

In general, “==” is what you’ll want to use, with one notable exception: checking whether a variable is equal to True, False, or None. All variables having these as their value will point to the same object in memory, so that instead of writing my_var == None you can also write my_var is None, which reads a bit better.
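
A quick illustration of the difference (the list values below are just an arbitrary example):

>>> my_var = None
>>> my_var is None
True
>>> list_one = [1, 2, 3]
>>> list_two = [1, 2, 3]
>>> list_one == list_two  # same contents
True
>>> list_one is list_two  # but not the same object in memory
False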

 

In the REPL interactive session, all results of our inputs are immediately shown on the screen. When executing a Python file, however, this is not the case, and hence we need a function to explicitly “print out” information on the screen. In Python, this can be done through the print function:

>>> print("Nice to meet you!") Nice to meet you!
>>> print("Nice", "to", "meet", "you!") Nice to meet you!
>>> print("HE", "LLO", sep="--") HE--LLO
>>> print("HELLO", end="!!!\n") HELLO!!!

 

When working with data, we obviously would like to keep our data around to use in different parts of our program. That is, we’d like to store numbers, strings, … in variables. Python simply uses the “=” operator for variable assignment:

>>> var_a = 3
>>> var_b = 4
>>> var_a + var_b + 2
9
>>> var_str = 'This is a string'
>>> print(var_str)
This is a string

 

Strings in Python can be formatted in a number of different ways. First of all, characters prefixed with a backslash (“\”) inside a string indicate so-called “escape characters” and represent special formatting instructions.

 

In the example above, for instance, “\n” indicates a line break. “\t” on the other hand represents a tab, and “\\” is simply the backslash character itself. Next, it is possible to format strings by means of the format function:

>>> "{} : {}".format("A", "B")
'A : B'
>>> "{0}, {0}, {1}".format("A", "B")
'A, A, B'
>>> "{name} wants to eat {food}".format(name="Seppe", food="lasagna")
'Seppe wants to eat lasagna'
Format Overload If there’s anything that new versions of Python don’t need, it’s more ways to format strings. Apart from using the format function illustrated here, Python also allows us to format strings using the “%” operator:

"%s is %s" % ("Seppe", "happy")

Python 3.6 also added “f-strings” to format strings in a more concise way:

f'Her name is {name} and she is {age} years old.'

We’ll stick to using format, to keep things clear.
Other than numbers, Booleans and strings, Python also comes with a number of helpful data structures built in, which we’ll be using a lot: lists, tuples, dictionaries, and sets.
Lists are used to store ordered sequences of things. The following instructions outline how they work in Python. Note that the code fragment below also includes comments, which will be ignored by Python and start with a “#” character:
>>> li = []
>>> li.append(1)  # li is now [1]
>>> li.append(2)  # li is now [1, 2]
>>> li.pop()      # removes and returns the last element
2
>>> li = ['a', 2, False]           # not all elements need to be the same type
>>> li = [[3], [3, 4], [1, 2, 3]]  # even lists of lists
>>> li = [1, 2, 4, 3]
>>> li[0]
1
>>> li[-1]
3
>>> li[1:3]
[2, 4]
>>> li[2:]
[4, 3]
>>> li[:3]
[1, 2, 4]
>>> li[::2]  # general format is li[start:end:step]
[1, 4]
>>> li[::-1]
[3, 4, 2, 1]
>>> del li[2]           # li is now [1, 2, 3]
>>> li.remove(2)        # li is now [1, 3]
>>> li.insert(1, 1000)  # li is now [1, 1000, 3]
>>> [1, 2, 3] + [10, 20]
[1, 2, 3, 10, 20]
>>> li = [1, 2, 3]
>>> li.extend([1, 2, 3])
>>> li
[1, 2, 3, 1, 2, 3]
>>> len(li)
6
>>> len('This works for strings too')
26
>>> 1 in li
True
>>> li.index(2)
1
>>> li.index(200)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: 200 is not in list
Tuples are similar to lists but are immutable, meaning that elements cannot be added or removed after creation:
>>> tup = (1, 2, 3)
>>> tup[0]
1
>>> type((1))  # a tuple of length one has to have a comma after the last element, but tuples of other lengths, even zero, do not
<class 'int'>
>>> type((1,))
<class 'tuple'>
>>> type(())
<class 'tuple'>
>>> len(tup)
3
>>> tup + (4, 5, 6)
(1, 2, 3, 4, 5, 6)
>>> tup[:2]
(1, 2)
>>> 2 in tup
True
>>> a, b, c = (1, 2, 3)      # a is now 1, b is now 2 and c is now 3
>>> a, *b, c = (1, 2, 3, 4)  # a is now 1, b is now [2, 3] and c is now 4
>>> d, e, f = 4, 5, 6        # you can also leave out the parentheses
>>> e, d = d, e              # d is now 5 and e is now 4
Sets are also similar to lists, but they store a unique and unordered collection of items, just like a set in mathematics:
>>> empty_set = set()
>>> some_set = {1, 1, 2, 2, 3, 4}  # some_set is now {1, 2, 3, 4}
>>> filled_set = some_set
>>> filled_set.add(5)              # filled_set is now {1, 2, 3, 4, 5}
>>> other_set = {3, 4, 5, 6}
>>> filled_set & other_set  # intersection
{3, 4, 5}
>>> filled_set | other_set  # union
{1, 2, 3, 4, 5, 6}
>>> {1, 2, 3, 4} - {2, 3, 5}  # difference
{1, 4}
>>> {1, 2} >= {1, 2, 3}
False
>>> {1, 2} <= {1, 2, 3}
True
>>> 2 in filled_set
True
Dictionaries store a mapping between a series of unique keys and values:
>>> empty_dict = {}
>>> filled_dict = {"one": 1, "two": 2, "three": 3}
>>> filled_dict["one"]
1
>>> list(filled_dict.keys())
['one', 'two', 'three']
>>> list(filled_dict.values())
[1, 2, 3]
>>> "one" in filled_dict  # in checks based on keys
True
>>> 1 in filled_dict
False
>>> filled_dict.get("one")
1
>>> filled_dict.get("four")
None
>>> filled_dict.get("four", 4)  # default value if not found
4
>>> filled_dict.update({"four": 4})
>>> filled_dict["four"] = 4  # also possible to add/update this way
>>> del filled_dict["one"]   # removes the key "one"
Finally, control flow in Python is relatively simple, too:
>>> some_var = 10
>>> if some_var > 1:
...     print('Bigger than 1')
...
Bigger than 1

Note the colon (“:”) after the if statement as well as the three dots (“...”) in the REPL, indicating that more input is expected before the given piece of code can be executed.

 

Code in Python is structured using whitespace, meaning that everything inside of an “if” block, for instance, should be indented using spaces or tabs.

 

Indentation Some programmers find this whitespace indentation frustrating when first working with Python, though it does undeniably lead to more readable and cleanly organized code. Just make sure not to mix tabs and spaces in your source code!

“If”-blocks in Python can also include optional “elif” and “else” blocks:

>>> some_var = 10
>>> if some_var > 10:
...     print('Bigger than 10')
... elif some_var > 5:
...     print('Bigger than 5')
... else:
...     print('Smaller than or equal to 5')
...
Bigger than 5

 

Readable If Blocks Remember that the integer zero (0), as well as zero-valued floats and complex numbers, all evaluate to False in Python. Similarly, empty strings, sets, tuples, lists, and dictionaries also evaluate to False, so instead of writing if len(my_list) > 0:, you can simply use if my_list: as well, which is much easier to read.
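
A few quick examples of this truthiness behavior in the REPL:

>>> bool(0)
False
>>> bool('')
False
>>> bool([])
False
>>> bool([1, 2])
True
>>> my_list = []
>>> if my_list:
...     print('Not empty')
... else:
...     print('Empty')
...
Empty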

 

We’ve already seen the “in”-operator as a way to check for list, tuple, set, and dictionary membership. This operator can be used to write “for”-loops as well:

>>> some_list = [1, 2, 3]
>>> some_string = 'a string'
>>> 1 in some_list
True
>>> 'string' in some_string
True
>>> for num in some_list:
...     print(num)
...
1
2
3
>>> for chr in some_string:
...     print(chr)
...
a
 
s
t
r
i
n
g

 

To loop over number ranges, Python also provides the helpful built-in range function:

  • range(number): returns an iterable of numbers from zero to (not including) the given number.
  • range(lower, upper): returns an iterable of numbers from the lower number to (not including) the upper number.
  • range(lower, upper, step): returns an iterable of numbers from the lower number to the upper number, while incrementing by step.
Integers Only All of these functions sadly require integers as input arguments.

 

If you want to iterate over a range of decimal values, you’ll have to define your own function.
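
For instance, a minimal sketch of such a helper function (“frange” is simply a name we made up here) could look like this:

>>> def frange(start, stop, step):
...     # Build a list of floats from start up to (not including) stop
...     result = []
...     while start < stop:
...         result.append(start)
...         start += step
...     return result
...
>>> frange(0.0, 1.0, 0.25)
[0.0, 0.25, 0.5, 0.75]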

Note the use of the concept “iterable” here. In Python, iterables basically behave like “smart” lists.

 

Instead of immediately filling up your computer’s memory with the complete list, Python will avoid doing so until you actually need to access the elements themselves. This is why using the range function shows the following:

>>> range(3)
range(0, 3)

Converting iterables to a real list is simple; just convert the value to an explicit list:

>>> list(range(3))
[0, 1, 2]

 

While looping over an iterable, however, you don’t need to explicitly convert them first.

You can hence just use the range function directly, as follows:

>>> for num in range(1, 100, 15):
...     print(num)
...
1
16
31
46
61
76
91
And, of course, Python has a “while”-style loop as well:
>>> x = 0
>>> while x < 3:
...     print(x)
...     x = x + 1
...
0
1
2

Infinite Loops Forgot to add the x = x + 1 line? Tried out while True:, or is your “for” loop going in circles? You can press Control+C on your keyboard to execute a “Keyboard Interrupt” and stop the execution of the current code block.

 

Whenever you’re writing code, it is a good idea to encapsulate small reusable pieces of code so they can be called and executed at various places without having to copy-paste lots of code. A basic way to do so is to create a function, which is done using “def”:

>>> def add(x, y):
...     print("x is {} and y is {}".format(x, y))
...     return x + y  # return value
...
>>> add(5, 6)
x is 5 and y is 6
11
>>> add(y=10, x=5)
x is 5 and y is 10
15

 

There are two special constructs worth mentioning here as well: “*” and “**”. Both of these can be used in function signatures to indicate “the rest of the arguments” and “the rest of the named arguments” respectively. Let’s show how they work using an example:

>>> def many_arguments(*args):
...     # args will be a tuple
...     print(args)
...
>>> many_arguments(1, 2, 3)
(1, 2, 3)
>>> def many_named_arguments(**kwargs):
...     # kwargs will be a dictionary
...     print(kwargs)
...
>>> many_named_arguments(a=1, b=2)
{'a': 1, 'b': 2}
>>> def both_together(*args, **kwargs):
...     print(args, kwargs)
...
Apart from using these in method signatures, you can also use them when calling functions to indicate that you want to pass an iterable as arguments or a dictionary as named arguments:
>>> def add(a, b):
...     return a + b
...
>>> l = [1, 2]  # tuples work, too
>>> add(*l)
3
>>> d = {'a': 2, 'b': 1}
>>> add(**d)
3
Finally, instead of using the Python REPL, let's take a look at how you'd write Python code using a source file. Create a file called "test.py" somewhere where you can easily find it and add the following contents:

# test.py

def add(x, y):
    return x + y

for i in range(5):
    print(add(i, 10))

Save the file, and open a new command-line window. You can execute this file by supplying its name to the “python” executable.
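
Assuming you named the file test.py and your command-line window is opened in the directory where you saved it, running it should look something like this:

python test.py
10
11
12
13
14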

 

This concludes our “rapid-fire” Python primer. We’ve skipped over some details here (such as classes, try-except blocks, iterators versus generators, inheritance, and so on), but these are not really necessary to get started with what we’re actually here to do: web scraping.

 

The Magic of Networking


1. You enter “www.google.com” into your web browser, which needs to figure out the IP address for this site. IP stands for “Internet Protocol” and forms a core protocol of the Internet, as it enables networks to route and redirect communication packets between connected computers, which are all given an IP address.

 

To communicate with Google’s web server, you need to know its IP address. Since the IP address is basically a number, it would be kind of annoying to remember all these numbers for every website out there.

 

So, just as you link telephone numbers to names in your phone’s contact book, the web provides a mechanism to translate domain names like “www.google.com” to an IP address.

 

2. And so, your browser sets off to figure out the correct IP address behind “www.google.com”. To do so, your web browser will use another protocol, called DNS (which stands for Domain Name System), as follows: first, the web browser will inspect its own cache (its “short-term memory”) to see whether you’ve recently visited this website in the past.

 

If you have, the browser can reuse the stored address. If not, the browser will ask the underlying operating system (Windows, for example) to see whether it knows the address for www.google.com.

 

3. If the operating system is also unaware of this domain, the browser will send a DNS request to your router, which is the machine that connects you to the Internet and also — typically — keeps its own DNS cache.

 

If your router is also unaware of the correct address, your browser will start sending a number of data packets to known DNS servers, for example, to the DNS server maintained by your Internet Service Provider (ISP) — for which the IP address is known and stored in your router.

 

The DNS server will then reply with a response basically indicating that “www.google.com” is mapped to the IP address “172.217.17.68”.

 

Note that even your ISP’s DNS server might have to ask other DNS servers (located higher in the DNS hierarchy) in case it doesn’t have the record at hand. A small Python sketch of such a name lookup follows right after this walkthrough.

 

4. All of this was done just to figure out the IP address of www.google.com. Your browser can now establish a connection to 172.217.17.68, Google’s web server.

 

A number of protocols — a protocol is a standard agreement regarding what messages between communicating parties should look like — are combined here (wrapped around each other, if you will) to construct a complex message.

 

At the outermost part of this “onion,” we find the IEEE 802.3 (Ethernet) protocol, which is used to communicate with machines on the same network. Since we’re not communicating on the same network, the Internet Protocol, IP, is used to embed another message indicating that we wish to contact the server at address 172.217.17.68.

 

Inside this, we find another protocol, called TCP (Transmission Control Protocol), which provides a general, reliable means to deliver network messages, as it includes functionality for error checking and splitting messages up in smaller packages, thereby ensuring that these packets are delivered in the right order.

 

TCP will also resend packets when they are lost in transmission. Finally, inside the TCP message, we find another message, formatted according to the HTTP protocol (HyperText Transfer Protocol), which is the actual protocol used to request and receive web pages.

 

Basically, the HTTP message here states a request from our web browser: “Can I get your index page, please?”

 

5. Google’s web server now sends back an HTTP reply, containing the contents of the page we want to visit. In most cases, this textual content is formatted using HTML, a markup language we’ll take a closer look at later on.

 

From this (oftentimes large) bunch of text, our web browser can set off to render the actual page, that is, making sure that everything appears neatly on screen as instructed by the HTML content.

 

Note that a web page will oftentimes contain pieces of content for which the web browser will — behind the scenes — initiate new HTTP requests.

 

In case the received page instructs the browser to show an image, for example, the browser will fire off another HTTP request to get the contents of the image (which will then not look like HTML-formatted text but simply be raw, binary data). As such, rendering just one web page might involve a lot of HTTP requests.

 

Luckily, modern browsers are smart and will start rendering the page as soon as information is coming in, showing images and other visuals as they are retrieved. In addition, browsers will try to send out multiple requests in parallel if possible to speed up this process as well.

 

With so many protocols, requests, and talking between machines going on, it is nothing short of amazing that you are able to view a simple web page in less than a second.
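
As promised, here is a small sketch showing how you can perform such a domain name lookup yourself in Python using the built-in socket module (the IP address you get back will most likely differ from the one shown here):

>>> import socket
>>> socket.gethostbyname('www.google.com')  # asks the operating system to resolve the name via DNS
'172.217.17.68'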

 

To standardize the large number of protocols that form the web, the International Organization for Standardization (ISO) maintains the Open Systems Interconnection (OSI) model, which organizes computer communication into seven layers:

 

  • Layer 1: Physical Layer: Includes the Ethernet protocol, but also USB, Bluetooth, and other radio protocols.
  • Layer 2: Data link Layer: Includes the Ethernet protocol.
  • Layer 3: Network Layer: Includes IP (Internet Protocol).

 

  • Layer 4: Transport Layer: TCP, but also protocols such as UDP, which does not offer the advanced error checking and recovery mechanisms of TCP for the sake of speed.
  • Layer 5: Session Layer: Includes protocols for opening/closing and managing sessions.

 

  • Layer 6: Presentation Layer: Includes protocols to format and translate data.
  • Layer 7: Application Layer: HTTP and DNS, for instance.

 

Not all network communications need to use protocols from all these layers. To request a web page, for instance, layers 1 (physical), 2 (Ethernet), 3 (IP), 4 (TCP), and 7 (HTTP) are involved, but the layers are constructed so that each protocol found at a higher level can be contained inside the message of a lower-layer protocol.

 

When you request a secure web page, for instance, the HTTP message (layer 7) will be encoded in an encrypted message (layer 6) (this is what happens if you surf to an “https”-address).

 

The lower the layer you aim for when programming networked applications, the more functionality and complexity you need to deal with. Luckily for us web scrapers, we’re interested in the topmost layer, that is, HTTP, the protocol used to request and receive web pages.

 

That means that we can leave all complexities regarding TCP, IP, Ethernet, and even resolving domain names with DNS up to the Python libraries we use, and the underlying operating system.
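
To appreciate how much work those libraries hide, the following sketch shows roughly what fetching a page looks like if you drop down to the socket level and format the HTTP request message yourself (using example.com purely as an illustration; we won't need any of this in practice):

import socket

# Open a TCP connection to the web server (name resolution happens behind the scenes)
connection = socket.create_connection(('example.com', 80))

# Manually format a minimal HTTP request message, line by line
request = 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'
connection.sendall(request.encode('ascii'))

# Read back the raw HTTP reply: status line, headers, and body, all as bytes
reply = b''
while True:
    chunk = connection.recv(4096)
    if not chunk:
        break
    reply += chunk
connection.close()

print(reply.decode('utf-8', errors='replace'))

With the requests library we'll use below, all of this boils down to a single function call.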

 

The HyperText Transfer Protocol: HTTP


We’ve now seen how your web browser communicates with a server on the World Wide Web. The core component in the exchange of messages consists of a HyperText Transfer Protocol (HTTP) request message to a web server, followed by an HTTP response (also oftentimes called an HTTP reply), which can be rendered by the browser.

 

Since all of our web scraping will build upon HTTP, we do need to take a closer look at HTTP messages to learn what they look like.

 

HTTP is, in fact, a rather simple networking protocol. It is text based, which at least makes its messages somewhat readable to end users (compared to raw binary messages that have no textual structure at all), and it follows a simple request-reply-based communication scheme.

 

That is, contacting a web server and receiving a reply simply involves two HTTP messages: a request and a reply. In case your browser wants to download or fetch additional resources (such as images), this will simply entail additional request-reply messages being sent.

 

Keep Me Alive In the simplest case, every request-reply cycle in HTTP involves setting up a fresh new underlying TCP connection as well.

 

For heavy websites, setting up many TCP connections and tearing them down in quick succession creates a lot of overhead, so HTTP version 1.1 allows us to keep the TCP connection “alive” to be used for consecutive request-reply HTTP messages.

 

HTTP version 2.0 even allows us to “multiplex” (a fancy word for “mixing messages”) in the same connection, for example, to send multiple concurrent requests.

 

Luckily, we don’t need to concern ourselves much with these details while working with Python, as requests, the library we’ll use, takes care of this for us automatically behind the scenes.

 

Let us now take a look at what an HTTP request and reply look like. As we recall, a client (the web browser, in most cases) and web server will communicate by sending plain text messages. The client sends requests to the server and the server sends responses or replies.

A request message consists of the following:


  • A request line;
  • A number of request headers, each on their own line;
  • An empty line;
  • An optional message body, which can also take up multiple lines.

Each line in an HTTP message must end with <CR><LF> (the ASCII characters 0D and 0A).

 

The empty line is simply <CR><LF> with no other additional white space.

 

New Lines <CR> and <LF> are two special characters that indicate a new line should be started. You don’t see them appearing as such, but when you type out a plain text document in, say, Notepad, every time you press enter, these two characters will be put inside the contents of the document to represent “a new line appears here.” An annoying aspect of computing is that operating systems do not always agree on which character to use to indicate a new line.

 

Linux programs tend to use <LF> (the “line feed” character), whereas older versions of MacOS used <CR> (the “carriage return” character).

 

Windows uses both <CR> and <LF> to indicate a new line, which was also adopted by the HTTP standard. Don’t worry too much about this, as the Python requests library will take care of correctly formatting the HTTP messages for us.

 

The following code fragment shows a full HTTP request message as executed by a web browser (we don’t show the “<CR><LF>” after each line, except for the last, blank line):

GET / HTTP/1.1
Host: www.example.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Referer: https://www.google.com/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8,nl;q=0.6
<CR><LF>

 

Let’s take a closer look at this message. “GET / HTTP/1.1” is the request line. It contains the HTTP “verb” or “method” we want to execute (“GET” in the example above), the URL we want to retrieve (“/”), and the HTTP version we understand (“HTTP/1.1”). Don’t worry too much about the “GET” verb.

 

HTTP has a number of verbs (that we’ll discuss later on). For now, it is important to know that “GET” means this: “get the contents of this URL for me.” Every time you enter a URL in your address bar and press enter, your browser will perform a GET request.

 

Next up are the request headers, each on their own line. In this example, we already have quite a few of them. Note that each header includes a name (“Host,” for instance), followed by a colon (“:”) and the actual value of the header (“www.example.com”).

 

Browsers are very chatty in terms of what they like to include in their headers, and Chrome (the web browser used here) is no exception.

 

The HTTP standard includes some headers that are standardized and which will be utilized by proper web browsers, though you are free to include additional headers as well. “Host,” for instance, is a standardized and mandatory header in HTTP 1.1 and higher.

 

The reason why it was not around in HTTP 1.0 (the first version) is simple: in those days, each web server (with its IP address) was responsible for serving one particular website.

 

Hence, if we sent “GET / HTTP/1.1” to a web server responsible for “www.example.com”, the server knew which page to fetch and return. However, it didn’t take long for the following bright idea to appear: Why not serve multiple websites from the same server, with the same IP address?

 

The same server responsible for “www.example.com” might also be the one serving pages belonging to another domain, for instance.

 

However, we then need a way to tell the server which domain name we’d like to retrieve a page from. 

 

Including the domain name in the request line itself, like “GET www.example.com/ HTTP/1.1”, might have been a solid idea, though this would break backward compatibility with earlier web servers, which expect a URL without a domain name in the request line.

 

A solution was then offered in the form of a mandatory “Host” header, indicating from which domain name the server should retrieve the page.

 

The Wrong Host Don’t try to be too clever and send a request to a web server responsible for “www.example.com” while changing the “Host” header to read “Host: something-entirely-different.com”.

 

Proper web servers will complain and simply send back an error page saying: “hey, I’m not the server hosting that domain.” This being said, security issues have been identified on websites where it is possible to confuse and misdirect them by spoofing this header.

 

Apart from the mandatory “Host” header, we also see a number of other headers appearing that form a set of “standardized request headers,” which are not mandatory, though nevertheless included by all modern web browsers.

 

“Connection: keep-alive,” for instance, signposts to the server that it should keep the connection open for subsequent requests if it can. The “User-Agent” header contains a large text value through which the browser happily informs the server what it is (Chrome) and which version it is running.

 

The User-Agent Mess Well… you’ll note that the “User-Agent” header contains “Chrome,” but also a lot of additional, seemingly unrelated text such as “Mozilla,” “AppleWebKit,” and so on. Is Chrome masquerading itself and posing as other browsers? In a way it is, though it is not the only browser that does so.

 

The problem is this: when the “User-Agent” header came along and browsers started sending their names and versions, some website owners thought it was a good idea to check this header and reply with different versions of a page depending on who’s asking, for instance to tell users that “Netscape 4.0” is not supported by this server.

 

The routines responsible for these checks were often implemented in a haphazard way, thereby mistakenly sending users away when they were running some unknown browser, or failing to correctly check the browser’s version.

 

Browser vendors hence had no choice over the years but to get creative and include lots of other text fields in this User-Agent header. Basically, our browser is saying: “I’m Chrome, but I’m also compatible with all these other browsers, so just let me through, please.”

 

“Accept” tells the server which forms of content the browser prefers to get back, and “Accept-Encoding” tells the server that the browser is also able to handle compressed content.

 

The “Referer” header (a deliberate misspelling) tells the server which page the browser is coming from (in this case, a link was clicked on “google.com” sending the browser to “example.com”).

 

A Polite Request Even though your web browser will try to behave politely and, for instance, tell the web server which forms of content it accepts, there is no guarantee whatsoever that a web server will actually look at these headers or follow up on them.

 

A browser might indicate in its “Accept” header that it understands “webp” images, but the web server can just ignore this request and send back images as “jpg” or “png” anyway. Consider these request headers as polite requests, though, nothing more.

 

Finally, our request message ends with a blank <CR><LF> line and has no message body whatsoever. Such a body is not included in GET requests, but we’ll see HTTP messages later on where the message body does come into play.

 

If all goes well, the web server will process our request and send back an HTTP reply. These look very similar to HTTP requests and contain:

  • A status line that includes the status code and a status message;
  • A number of response headers, again each on their own line;
  • An empty line;
  • An optional message body.

 

As such, we might get the following response following our request above:

HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html;charset=utf-8
Date: Mon, 28 Aug 2017 10:57:42 GMT
Server: Apache v1.3
Vary: Accept-Encoding
Transfer-Encoding: chunked
<CR><LF>
<html>
<body>Welcome to My Web Page</body>
</html>

Again, let’s take a look at the HTTP reply line by line. The first line indicates the status result of the request. It opens by listing the HTTP version the server understands (“HTTP/1.1”), followed by a status code (“200”), and a status message (“OK”). If all goes well, the status will be 200.

 

There are a number of agreed-upon HTTP status codes that we’ll take a closer look at later on, but you’re probably also familiar with the 404 status message, indicating that the URL listed in the request could not be retrieved, that is, was “not found” on the server.

 

Next up are again — a number of headers, now coming from the server. Just like web browsers, servers can be quite chatty in terms of what they provide and can include as many headers as they like. Here, the server includes its current date and version (“Apache v1.3”) in its headers.

 

Another important header here is “Content-Type,” as it will provide browsers with information regarding what the content included in the reply looks like. Here, it is HTML text, but it might also be binary image data, movie data, and so on.

 

Following the headers is a blank <CR><LF> line, and an optional message body, containing the actual content of the reply. Here, the content is a bunch of HTML text containing “Welcome to My Web Page.”

 

It is this HTML content that will then be parsed by your web browser and visualized on the screen. Again, the message body is optional, but since we expect most requests to actually come back with some content, a message body will be present in almost all cases.

 

Message Bodies Even when the status code of the reply is 404, for instance, many websites will include a message body to provide the user with a nice-looking page indicating that, sorry, this page could not be found.

 

If the server leaves it out, the web browser will just show its default “page not found” page instead. There are some other cases where an HTTP reply does not include a message body, which we’ll touch upon later on.

 

HTTP in Python: The Requests Library


We’ve now seen the basics regarding HTTP, so it is time we get our hands dirty with some Python code. Recall the main purpose of web scraping: to retrieve data from the web in an automated manner.

 

Basically, we’re throwing out our web browser and we’re going to surf the web using a Python program instead. This means that our Python program will need to be able to speak and understand HTTP.

 

Certainly, we could try to program this ourselves on top of the standard networking functionality already built into Python (or other languages, for that matter), making sure that we neatly format HTTP request messages and are able to parse the incoming responses.

 

However, we’re not interested in reinventing the wheel, and there are many Python libraries out there already that make this task a lot more pleasant so that we can focus on what we’re actually trying to accomplish.

 

In fact, there are quite a few libraries in the Python ecosystem that can take care of HTTP for us. To name a few:

 

Python 3 comes with a built-in module called “urllib,” which can deal with all things HTTP (see https://docs.python.org/3/library/urllib.html).

 

The module got heavily revised compared to its counterpart in Python 2, where HTTP functionality was split up in both “urllib” and “urllib2” and somewhat cumbersome to work with.

 

  • “httplib2” (see https://github.com/httplib2/httplib2): a small, fast HTTP client library. Originally developed by Googler Joe Gregorio, and now community supported.
  • “urllib3” (see https://urllib3.readthedocs.io/): a powerful HTTP client for Python, used by the requests library below.
  • “requests” (see http://docs.python-requests.org/): an elegant and simple HTTP library for Python, built “for human beings.”
  • “grequests” (see https://pypi.python.org/pypi/grequests): which extends requests to deal with asynchronous, concurrent HTTP requests.
  • “aiohttp” (see http://aiohttp.readthedocs.io/): another library focusing on asynchronous HTTP.

 

In this blog, we’ll use the “requests” library to deal with HTTP. The reason why is simple: whereas “urllib” provides solid HTTP functionality (especially compared with the situation in Python 2), using it often involves lots of boilerplate code, making the module less pleasant to use and not very elegant to read.

 

Compared with “urllib,” “urllib3” (not part of the standard Python modules) extends the Python ecosystem regarding HTTP with some advanced features, but it also doesn’t really focus that much on being elegant or concise.

 

That’s where “requests” comes in. This library builds on top of “urllib3,” but it allows you to tackle the majority of HTTP use cases in code that is short, pretty, and easy to use.

 

Both “grequests” and “aiohttp” are more modern libraries that aim to make HTTP with Python more asynchronous.

 

This becomes especially important for very heavy-duty applications where you’d have to make lots of HTTP requests as quickly as possible.

 

We’ll stick with “requests” in what follows, as asynchronous programming is a rather challenging topic on its own, and we’ll discuss more traditional ways of speeding up your web scraping programs in a robust manner.

 

It should not be too hard to move on from “requests” to “grequests” or “aiohttp” (or other libraries) should you wish to do so later on.
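
For a quick taste of what that would look like, here is a rough aiohttp-based equivalent of the first requests example shown below; treat it as a sketch only, since we won't rely on asynchronous code in this blog:

import asyncio
import aiohttp

async def fetch(url):
    # aiohttp works with a session object and asynchronous context managers
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

url = 'http://www.webscrapingfordatascience.com/basichttp/'
print(asyncio.get_event_loop().run_until_complete(fetch(url)))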

 

Installing requests can be done easily through pip. Execute the following in a command-line window (the “-U” argument will make sure to update an existing version of requests should there already be one):

pip install -U requests

Next, create a Python file (“firstexample.py” is a good name), and enter the following:

import requests

url = 'http://www.webscrapingfordatascience.com/basichttp/'
r = requests.get(url)
print(r.text)

If all goes well, you should see the following line appear when executing this script:

Hello from the web!

 

Webscrapingfordatascience.com? This is the companion website for this blog. We’ll use the pages on this site throughout this blog to show off various examples.

 

Since the web is a fast-moving place, we wanted to make sure that the examples we provide continue working as long as possible. Don’t be too upset about staying in the “safe playground” for now, as various real-life examples are included in the final part of this blog as well.

 

Let’s take a look at what’s happening in this short example:


First, we import the requests module. If you’ve installed requests correctly on your system, the import line should simply work without any errors or warnings.

 

We’re going to retrieve the contents of http://www.webscrapingfordatascience.com/basichttp/.

 

Try opening this web page in your browser. You’ll see “Hello from the web!” appear on the page. This is what we want to extract using Python.

 

We use the requests.get method to perform an “HTTP GET” request to the provided URL. In the simplest case, we only need to provide the URL of the page we want to retrieve. Requests will make sure to format a proper HTTP request message in accordance with what we’ve seen before.

 

The requests.get method returns a requests.Response Python object containing lots of information regarding the HTTP reply that was retrieved. Again, requests takes care of parsing the HTTP reply so that you can immediately start working with it.

 

r.text contains the HTTP response content body in a textual form. Here, the HTTP response body simply contained the content “Hello from the web!”

 

A More Generic Request Since we’ll be working with HTTP GET requests only (for now), the requests.get method will form a cornerstone of the upcoming examples. Later on, we’ll also deal with other types of HTTP requests, such as POST. Each of these comes with a corresponding method in requests.

 

For example, requests.post. There’s also a generic request method that looks like this: requests.request('GET', url). This is a bit longer to write, but might come in handy in cases where you don’t know beforehand which type of HTTP request (GET, or something else) you’re going to make.

Let us expand upon this example a bit further to see what’s going on under the hood:

import requests

url = 'http://www.webscrapingfordatascience.com/basichttp/'
r = requests.get(url)

# Which HTTP status code did we get back from the server?
print(r.status_code)
# What is the textual status code?
print(r.reason)
# What were the HTTP response headers?
print(r.headers)
# The request information is saved as a Python object in r.request:
print(r.request)
# What were the HTTP request headers?
print(r.request.headers)
# The HTTP response content:
print(r.text)

If you run this code, you’ll see the following result:

200
OK
{'Date': 'Wed, 04 Oct 2017 08:26:03 GMT',
 'Server': 'Apache/2.4.18 (Ubuntu)', 'Content-Length': '20',
 'Keep-Alive': 'timeout=5, max=99', 'Connection': 'Keep-Alive',
 'Content-Type': 'text/html; charset=UTF-8'}
<PreparedRequest [GET]>
{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate',
 'Accept': '*/*', 'Connection': 'keep-alive'}
Hello from the web!

 

Recall our earlier discussion on HTTP requests and replies. By using the status_code and reason attributes of a requests.Response object, we can retrieve the HTTP status code and associated text message we got back from the server. Here, a status code and message of “200 OK” indicates that everything went well.

 

The headers attribute of the requests.Response object returns a dictionary of the headers the server included in its HTTP reply.

 

Again: servers can be pretty chatty. This server reports its date, server version, and also provides the “Content-Type” header.

 

To get information regarding the HTTP request that was fired off, you can access the request attribute of a requests.Response object. This attribute itself is a requests.PreparedRequest object (as can be seen in the output above), containing information about the HTTP request that was prepared.

 

Since an HTTP request message also includes headers, we can access the headers attribute for this object as well to get a dictionary representing the headers that were included by requests.

 

Note that requests politely reports its “User-Agent” by default. In addition, requests can take care of compressed pages automatically as well, so it also includes an “Accept-Encoding” header to signpost this.

 

Finally, it includes an “Accept” header to indicate that “any format you have can be sent back” and can deal with “keep-alive” connections as well. Later on, however, we’ll see cases where we need to override requests’ default request header behavior.
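
As a quick preview of what that looks like, custom headers can be passed to requests.get through its headers argument (the User-Agent string below is just an arbitrary example):

import requests

url = 'http://www.webscrapingfordatascience.com/basichttp/'

# Override the default User-Agent header with a custom value
my_headers = {'User-Agent': 'my-own-scraper/1.0'}
r = requests.get(url, headers=my_headers)

print(r.request.headers['User-Agent'])
# Will show: my-own-scraper/1.0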

 

URLs with Parameters


There’s one more thing we need to discuss regarding the basic working of HTTP: URL parameters. Try adapting the code example above in order to scrape the URL http://www.webscrapingfordatascience.com/paramhttp/. You should get the following content:

 

Please provide a "query" parameter

Try opening this page in your web browser to verify that you get the same result. Now try navigating to the page http://www.webscrapingfordatascience.com/paramhttp/?query=test. What do you see?

 

The optional “?…” part in URLs is called the “query string,” and it is meant to contain data that does not fit within a URL’s normal hierarchical path structure. You’ve probably encountered this sort of URL many times when surfing the web, for example:

http://www.example.com/product_page.html?product_id=304
https://www.google.com/search?dcr=0&source=hp&q=test&oq=test
 http://example.com/path/to/page/?type=animal&location=asia

 

Web servers are smart pieces of software. When a server receives an HTTP request for such URLs, it may run a program that uses the parameters included in the query string — the “URL parameters” — to render different content. 

 

Compare http://www.webscrapingfordatascience.com/paramhttp/?query=test with http://www.webscrapingfordatascience.com/paramhttp, for instance. Even for this simple page, you see how the response dynamically incorporates the parameter data that you provided in the URL.

 

Query strings in URLs should adhere to the following conventions (a short illustration in Python follows right after this list):


  • A query string comes at the end of a URL, starting with a single question mark, “?”.
  • Parameters are provided as key-value pairs and separated by an ampersand, “&”.
  • The key and value are separated using an equals sign, “=”.
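
To make this concrete, Python’s built-in urllib.parse.urlencode function builds such a query string from a dictionary of parameters (the parameter names below simply reuse the earlier example):

>>> from urllib.parse import urlencode
>>> urlencode({'type': 'animal', 'location': 'asia'})
'type=animal&location=asia'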

 

Since some characters cannot be part of a URL or have a special meaning (the characters “/”, “?”, “&”, and “=” for instance), URL “encoding” needs to be applied to properly format such characters when using them inside of a URL.

 

Try this out using the URL http://www.webscrapingfordatascience.com/paramhttp/?query=another%20test%3F%26, which sends “another test?&” as the value for the “query” parameter to the server in an encoded form.

 

Other exact semantics are not standardized. In general, the order in which the URL parameters are specified is not taken into account by web servers, though some might.

 

Many web servers will also be able to deal with and use pages with URL parameters without a value, for example, http://www.example.com/?noparam=&anotherparam. Since the full URL is included in the request line of an HTTP request, the web server can decide how to parse and deal with these.

 

URL Rewriting This latter remark also highlights another important aspect regarding URL parameters: even though they are somewhat standardized, they’re not treated as being a “special” part of a URL, which is just sent as a plain text line in an HTTP request anyway.

 

Most web servers will take care to parse them on their end in order to use their information while rendering a page, or even ignore them when they’re unused (try the URL http://www.webscrapingfordatascience.com/paramhttp/?query=test&other=ignored, for instance). In recent years, however, the usage of URL parameters is being avoided somewhat.

 

Instead, most web frameworks will allow us to define “nice looking” URLs that just include the parameters in the path of a URL, for example, “/product/302/” instead of “products.html?p=302”.

 

The former looks nicer when looking at the URL as a human, and search engine optimization (SEO) people will also tell you that search engines prefer such URLs as well.

 

On the server-side of things, any incoming URL can hence be parsed at will, taking pieces from it and “rewriting” it, as it is called, so some parts might end up being used as input while preparing a reply.

 

For us web scrapers, this basically means that even though you don't see a query string in a URL, there might still be dynamic parts in the URL to which the server might respond in different ways.

 

Let’s take a look at how to deal with URL parameters in requests. The easiest way to deal with these is to include them simply in the URL itself:

import requests

URL = 'http://www.webscrapingfordatascience.com/paramhttp/?query=test'
r = requests.get(URL)
print(r.text)
# Will show: I don't have any information on "test"
In some circumstances, requests will try to help you out and encode some characters for you:
import requests

URL = 'http://www.webscrapingfordatascience.com/paramhttp/?query=a query with spaces'
r = requests.get(URL)
# The parameter will be encoded as 'a%20query%20with%20spaces'
# You can verify this by looking at the prepared request URL:
print(r.request.url)
# Will show [...]/paramhttp/?query=a%20query%20with%20spaces
print(r.text)
# Will show: I don't have any information on "a query with spaces"
However, sometimes the URL is too ambiguous for requests to make sense of it:
import requests

URL = 'http://www.webscrapingfordatascience.com/paramhttp/?query=complex?&'
# The parameter will not be encoded
r = requests.get(URL)
# You can verify this by looking at the prepared request URL:
print(r.request.url)
# Will show [...]/paramhttp/?query=complex?&
print(r.text)
# Will show: I don't have any information on "complex?"

In this case, requests is unsure whether you meant "?&" to belong to the actual URL as is or whether you wanted to encode it. Hence, requests does nothing and just requests the URL as is.

 

On the server-side, this particular web server is able to derive that the second question mark (“?”) should be part of the URL parameter (and should have been properly encoded, but it won’t complain), though the ampersand “&” is too ambiguous in this case.

 

Here, the web server assumes that it is a normal separator and not part of the URL parameter value.

 

So how, then, can we properly resolve this issue? A first method is to use the "urllib.parse" functions quote and quote_plus.

 

The former is meant to encode special characters in the path section of URLs using percent "%XX" encoding, including spaces. The latter does the same but replaces spaces with plus signs; it is generally used to encode query strings:

import requests
from urllib.parse import quote, quote_plus

raw_string = 'a query with /, spaces and?&'
print(quote(raw_string))
print(quote_plus(raw_string))

This example will print out these two lines:

a%20query%20with%20/%2C%20spaces%20and%3F%26
a+query+with+%2F%2C+spaces+and%3F%26

 

The quote function applies percent encoding but leaves the slash ("/") intact (as its default setting, at least), as this function is meant to be used on URL paths. The quote_plus function applies a similar encoding, but uses a plus sign ("+") to encode spaces and will also encode slashes.

 

As long as we make sure that our query parameter does not use slashes, both encoding approaches are valid for encoding query strings. In case our query string does include a slash, and we do want to use quote, we can simply override its safe argument as done below:

import requests
from urllib.parse import quote, quote_plus

raw_string = 'a query with /, spaces and?&'
URL = 'http://www.webscrapingfordatascience.com/paramhttp/?query='

print('\nUsing quote:')
# Nothing is safe, not even '/' characters, so encode everything
r = requests.get(URL + quote(raw_string, safe=''))
print(r.url)
print(r.text)

print('\nUsing quote_plus:')
r = requests.get(URL + quote_plus(raw_string))
print(r.url)
print(r.text)
This example will print out:
Using quote:
http://[...]/?query=a%20query%20with%20%2F%2C%20spaces%20and%3F%26
I don't have any information on "a query with /, spaces and?&"

Using quote_plus:
http://[...]/?query=a+query+with+%2F%2C+spaces+and%3F%26
I don't have any information on "a query with /, spaces and?&"
All this encoding juggling can quickly lead to a headache. Wasn't requests supposed to make our lives easy and deal with this for us? Not to worry, as we can simply rewrite the example above using requests only, as follows:
import requests

URL = 'http://www.webscrapingfordatascience.com/paramhttp/'
parameters = {
    'query': 'a query with /, spaces and?&'
}
r = requests.get(URL, params=parameters)
print(r.url)
print(r.text)

 

Note the usage of the params argument in the requests.get method: you can simply pass a Python dictionary with your non-encoded URL parameters, and requests will take care of encoding them for you.

 

Empty and Ordered Parameters: Empty parameters, for example, as in "params={'query': ''}", will end up in the URL with an equals sign included, that is, "?query=".

 

If you want, you can also pass a list to params, with every element being a tuple or list itself having two elements representing the key and value of each parameter respectively, in which case the order of the list will be respected.

 

You can also pass an OrderedDict object (a built-in object provided by the “collections” module in Python 3) that will retain the ordering. Finally, you can also pass a string representing your query string part.

 

In this case, requests will prepend the question mark ("?") for you, but will — once again — not be able to provide smart URL encoding, so you are responsible for making sure your query string is encoded properly.

 

This can come in handy in cases where the web server expects a "?param" without an equals sign at the end, for instance — something that rarely occurs in practice, but can happen.
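To make this a bit more tangible, here is a minimal sketch against the same test page we used above (the parameter names are arbitrary):

import requests

URL = 'http://www.webscrapingfordatascience.com/paramhttp/'

# A list of (key, value) tuples keeps the parameter order exactly as given
r = requests.get(URL, params=[('query', 'test'), ('other', 'ignored')])
print(r.url)
# Will show [...]/paramhttp/?query=test&other=ignored

# A raw string: requests prepends the "?" but does not encode anything for you
r = requests.get(URL, params='query=test&other')
print(r.url)
# Will show [...]/paramhttp/?query=test&other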

 

Silencing requests Completely: Even when passing a string to params, or including the full URL in the requests.get method, requests will still try, as we have seen, to help out a little.

 

For instance, writing: requests.get('http://www.example.com/?spaces |pipe') will make you end up with “?spaces%20%7Cpipe” as the query string in the request URL, with the space and pipe (“|”) characters encoded for you.

 

In rare situations, a very picky web server might nevertheless expect URLs to come in unencoded. Again, cases such as these are extremely rare, but we have encountered situations in the wild where this happens. In such a case, you will need to override requests as follows:

import requests
from urllib.parse import unquote

class NonEncodedSession(requests.Session):
    # Override the default send method
    def send(self, *a, **kw):
        # Revert the encoding which was applied
        a[0].url = unquote(a[0].url)
        return requests.Session.send(self, *a, **kw)

my_requests = NonEncodedSession()
URL = 'http://www.example.com/?spaces |pipe'
r = my_requests.get(URL)
print(r.url)
# Will show: http://www.example.com/?spaces |pipe
As a final exercise, head over to http://www.webscrapingfordatascience.com/calchttp/. Play around with the "a," "b," and "op" URL parameters. You should be able to work out what the following code does:
import requests

def calc(a, b, op):
    URL = 'http://www.webscrapingfordatascience.com/calchttp/'
    params = {'a': a, 'b': b, 'op': op}
    r = requests.get(URL, params=params)
    return r.text

print(calc(4, 6, '*'))
print(calc(4, 6, '/'))

 

Based on what we’ve seen above, you’ll probably feel itchy to try out what you’ve learned using a real-life website. However, there is another hurdle we need to pass before being web ready. What happens, for instance, when you run the following:

import requests

URL = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(URL)

print(r.text)

 

Wikipedia Versioning We’re using the “older” URL parameter here such that we obtain a specific version of the “List of Game of Thrones episodes” page, to make sure that our subsequent examples will keep working.

 

By the way, here you can see "URL rewriting" in action: both https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes and https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes lead to the exact same page.

 

The difference is that the latter uses URL parameters and the former does not, though Wikipedia's web server is clever enough to route URLs to their proper "page."

 

Also, you might note that we’re not using the params argument here. We could, though neither the “title” nor “old” parameters require encoding here, so we can just stick them in the URL itself to keep the rest of the code a bit shorter.

 

As you can see, the response body captured by r.text now spits out a slew of confusing-looking text. This is HTML-formatted text, and although the content we’re looking for is buried somewhere inside this soup, we’ll need to learn about a proper way to get out the information we want from there. 

 

The Fragment Identifier: Apart from the query string, there is, in fact, another optional part of the URL that you might have encountered before: the fragment identifier, or "hash," as it is sometimes called.

 

It is preceded by a hash mark ("#") and comes at the very end of a URL, even after the query string, for instance, as in "http://www.example.org/about.htm?p=8#contact". This part of the URL is meant to identify a portion of the document corresponding to the URL. For instance, a web page can include a link with a fragment identifier that, if you click on it, immediately scrolls your view to the corresponding part of the page.
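You can see that the fragment really is a separate component of a URL by splitting the example above with Python's standard urllib.parse module; a minimal sketch:

from urllib.parse import urlsplit

parts = urlsplit('http://www.example.org/about.htm?p=8#contact')
print(parts.path)      # Will show: /about.htm
print(parts.query)     # Will show: p=8
print(parts.fragment)  # Will show: contact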

 

However, the fragment identifier functions differently than the rest of the URL, as it is processed exclusively by the web browser with no participation at all from the web server.

 

In fact, proper web browsers should not even include the fragment identifier in their HTTP requests when they fetch a resource from a web server.

 

Instead, the browser waits until the web server has sent its reply, and it will then use the fragment identifier to scroll to the correct part of the page. Most web servers will simply ignore a fragment identifier if you include it in a request URL, although some might be programmed to take it into account as well. Again: this is rather rare, as the content provided by such a server would not be viewable by most web browsers, as they leave out the fragment identifier part in their requests, though the web is full of interesting edge cases.

 

We’ve now seen the basics of the requests library. Take some time to explore the documentation of the library available at http://docs.python-requests.org/en/ master/.

 

The quality of requests’ documentation is very high and easy to refer to once you start using the library in your projects.

 

HTML and CSS Soup


So far we have discussed the basics of HTTP and how you can perform HTTP requests in Python using the requests library. However, since most web pages are formatted using the Hypertext Markup Language (HTML), we need to understand how to extract information from such pages.

 

As such, this blog introduces you to HTML, as well as another core building block that is used to format and stylize modern web pages: Cascading Style Sheets (CSS). This blog then discusses the Beautiful Soup library, which will help us to make sense of the HTML and CSS “soup.”

 

HTML

In the previous blog, we introduced the basics of HTTP and saw how to perform HTTP requests in Python using the requests library, but now we need to figure out a way to parse HTML contents. Recall the small Wikipedia example we ended with in the previous blog and the soup of HTML we got back from it:

import requests

URL = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(URL)

print(r.text)

 

Perhaps you’ve tried running this example with some other favorite websites of yours... In any case, once you start looking a bit closer to how the web works and start web scraping in practice, you’ll no doubt start to marvel at all the things your web browser does for you.

 

Think of fetching web pages and converting this "soup" into nicely formatted pages, including images, animation, styling, video, and so on. This might feel very intimidating at this point: surely we won't have to replicate all the things a web browser does from scratch?

 

The answer is that, luckily, no, we do not. Just as with HTTP, we'll use a powerful Python library that can help us navigate this textual mess. And, contrary to a web browser, we're not interested in fetching a complete page's contents and rendering it, but only in extracting the pieces we're interested in.

 

If you run the example above, you’ll see the following being printed onscreen:

 style="margin:0;width:995px;height:92px"><!DOCTYPEHTML>
<html class=client-nojs lang=en dir=ltr>
<head>
<meta charset="UTF-8"/>
<title>List of Game of Thrones episodes - Wikipedia</title> [...]
</html>

 

This is Hypertext Markup Language (HTML), the standard markup language for creating web pages. Although some will call HTML a “programming language,” “markup language” is a more appropriate term as it specifies how a document is structured and formatted.

 

There is no strict need to use HTML to format web pages — in fact, all the examples we’ve dealt with in the previous blog just returned simple, textual pages.

 

However, if you want to create visually appealing pages that actually look good in a browser (even if it’s just putting some color on a page), HTML is the way to go.

 

HTML provides the building blocks to provide structure and formatting to documents. 

 

This is provided by means of a series of "tags." HTML tags often come in pairs and are enclosed in angle brackets, with "<tagname>" being the opening tag and "</tagname>" indicating the closing tag. Some tags come in an unpaired form and do not require a closing tag. Some commonly used tags are the following:

 

  • <p>...</p> to enclose a paragraph;
  • <br> to set a line break;
  • <table>...</table> to start a table block; inside, <tr>...</tr> is used for the rows and <td>...</td> for the cells;
  • <img> for images;
  • <h1>...</h1> to <h6>...</h6> for headers;
  • <div>...</div> to indicate a "division" in an HTML document, basically used to group a set of elements;
  • <a>...</a> for hyperlinks;
  • <ul>...</ul>, <ol>...</ol> for unordered and ordered lists respectively; inside of these, <li>...</li> is used for each list item.

 

Tags can be nested inside each other, so "<div><p>Hello</p></div>" is perfectly valid, though overlapping nestings such as "<div><p>Oops</div></p>" are not.

 

Even though this isn’t proper HTML, every web browser will exert a lot of effort to still parse and render an HTML page as well as possible.

 

If web browsers required all web pages to be perfectly formatted according to the HTML standard, you can bet that the majority of websites would fail. HTML is messy.

 

Tags that come in pairs have content. For instance, “<a>click here</a>” will render out “click here” as a hyperlink in your browser. Tags can also have attributes, which are put inside of the opening tag.

 

For instance, “<a href=“http://www.google.com”> click here </a>” will redirect the user to Google’s home page when the link is clicked. The “href” attribute hence indicates the web address of the link.

 

For an image tag, which doesn't come in a pair, the "src" attribute is used to indicate the URL of the image the browser should retrieve, for example, "<img src="http://www.example.com/image.jpg">".

 

Developer Tools


Don’t worry too much if all of this is going a bit fast, as we’ll come to understand HTML in more detail when we work our way through the examples. Before we continue, we want to provide you with a few tips that will come in handy while building web scrapers.

 

Most modern web browsers include a set of powerful developer tools you can use to get an idea of what's going on regarding HTML, and HTTP too.

 

Navigate to the Wikipedia page over at https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687 again in your browser — we assume you're using Google Chrome for what follows.

 

First of all, it is helpful to know how you can take a look at the underlying HTML of this page.

 

To do so, you can right-click on the page and press "View source," or simply press Control+U in Google Chrome. A new page will open containing the raw HTML contents of the current page (the same content as what we got back using r.text).

 

Additionally, you can open up Chrome’s “Developer Tools.” To do so, either select the Chrome Menu at the top right of your browser window, then select “Tools,” “Developer Tools,” or press Control+Shift+I.

 

Alternatively, you can also right-click on any page element and select “Inspect Element.” Other browsers such as Firefox and Microsoft Edge have similar tools built in. 

 

Moving Around: Take some time to explore the Developer Tools pane. Yours might appear at the bottom of your browser window. If you prefer to have it on the right, find the menu with the three-dot icon and pick a different "dock side."

 

The Developer Tools pane is organized by means of a series of tabs, of which “Elements” and “Network” will come in most helpful.

 

Let’s start by taking a look at the Network tab. You should see a red “recording” icon in the toolbar indicating that Chrome is tracking network requests (if the icon is not lit, press it to start tracking).

 

Refresh the Wikipedia page and look at what happens in the Developer Tools pane: Chrome starts logging all requests it is making, starting with an HTTP request for the page itself at the top.

 

Note that your web browser is also making lots of other requests to actually render the page, most of them to fetch image data (“Type: png”). By clicking a request, you can get more information about it.

 

Click the “index.php” request at the top, for instance, to get a screen. Selecting a request opens another pane that provides a wealth of information that should already look pretty familiar to you now that you’ve already worked with HTTP.

 

For instance, making sure the “Headers” tab is selected in the side panel, we see general information such as the request URL, method (verb), and status code that was sent back by the server, as well as a full list of request and response headers.

 

Finally, there are a number of useful checkboxes in the Network tab that are noteworthy to mention. Enabling “Preserve log” will prevent Chrome from “cleaning up” the overview every time a new page request is performed.

 

This can come in handy in case you want to track a series of actions when navigating a website. “Disable cache” will prevent Chrome from using its “short-term memory.”

 

Chrome will try to be smart and prevent performing a request if it still has the contents of a recent page around, though you can override this in case you want to force Chrome to actually perform every request.

 

Moving on to the “Elements” tab, we see a similar view as what we see when viewing the page’s source, though now neatly formatted as a tree-based view, with little arrows that we can expand and collapse.

 

What is particularly helpful here is the fact that you can hover over the HTML tags in the Elements tab, and Chrome will show a transparent box over the corresponding visual representation on the web page itself.

 

This can help you to quickly find the pieces of content you’re looking for. Alternatively, you can right-click any element on a web page and press “Inspect element” to immediately highlight its corresponding HTML code in the Elements tab.

 

Note that the “breadcrumb trail” at the bottom of the Elements tab shows you where you currently are in the HTML “tree.”

 

Inspecting Elements versus View Source: You might wonder why the "View source" option is useful for looking at a page's raw HTML source when we have a much more user-friendly alternative offered by the Elements tab.

 

A warning is in order here: the “View source” option shows the HTML code as it was returned by the web server, and it will contain the same contents as r.text when using requests.

 

The view in the Elements tab, on the other hand, provides a "cleaned up" version after the HTML was parsed by your web browser. Overlapping tags are fixed and extra white space is removed, for instance.

 

There might hence be small differences between these two views. In addition, the Elements tab provides a live and dynamic view. Websites can include scripts that are executed by your web browser and which can alter the contents of the page at will. The Elements tab will hence always reflect the current state of the page.

 

These scripts are written in a programming language called JavaScript and can be found inside <script>... </script> tags in HTML. We’ll take a closer look at JavaScript and why it is important in the context of web scraping a few blogs later.

 

Next, note that any HTML element in the Elements tab can be right-clicked. "Copy, Copy selector" and "Copy XPath" are particularly useful options that we're going to use quite often later on.

 

You’ll even see that you can edit the HTML code in real time (the web page will update itself to reflect your edits), though don’t feel too much like a CSI: Miami style hacker: these changes are of course only local.

 

They don’t do anything on the web server itself and will be gone once you refresh the page, though it can be a fun way to experiment with HTML. In any case, your web browser is going to become your best friend when working on web scraping projects.

 

Cascading Style Sheets: CSS


Before we can get started with actually dealing with HTML in Python, there is another key piece of technology that we need to discuss first: Cascading Style Sheets (CSS).

 

While perusing the HTML elements in your browser, you’ve probably noticed that some HTML attributes are present in lots of tags:

  • “id,” which is used to attach a page-unique identifier to a certain tag;
  • “class,” which lists a space-separated series of CSS class names.

 

Whereas “id” will come in handy to quickly fetch parts of an HTML page we’re interested in, “class” deserves some further introduction and relates to the concept of CSS.

 

CSS and HTML go hand in hand. Recall that, originally, HTML was meant as a way to define both the structure and formatting of a website.

 

In the early days of the web, it was hence normal to find lots of HTML tags that were meant to define what content should look like, for example “<b>...</b>” for bold text; “<i>...</i>” for italics text; and “<font>...</font>” to change the font family, size, color, and other font attributes.

 

After a while, however, web developers began to argue — rightly so — that the structure and formatting of documents basically relate to two different concerns.

 

Compare this to writing a document with a text processor such as Word. You can apply formatting directly to the document, but a better approach is to use styles to indicate headers, lists, tables, and so on, the formatting of which can then easily be changed by modifying the definition of the style. CSS works in a similar way.

 

HTML is still used to define the general structure and semantics of a document, whereas CSS will govern how a document should be styled, or in other words, what it should look like.

 

The CSS language looks somewhat different from HTML. In CSS, style information is written down as a list of colon-separated key-value statements, with each statement separated from the next by a semicolon, as follows:

color: red;
background-color: #ccc;
font-size: 14pt;
border: 2px solid yellow;

 

These style declarations can be included in a document in three different ways:


  • Inside a regular HTML "style" attribute, for instance as in "<p style="color: red;">...</p>".
  • Inside HTML "<style>...</style>" tags, placed inside the "<head>" tag of a page.

 

  • Inside a separate file, which is then referred to by means of a "<link>" tag inside the "<head>" tag of a page. This is the cleanest way of working. When loading a web page, your browser will perform an additional HTTP request to download this CSS file and apply its defined styles to the document.

 

In case style declarations are placed inside a “style” attribute, it is clear to which element the declarations should be applied: the HTML tag itself. In the other two cases, the style definition needs to incorporate information regarding the HTML element or elements a styling should be applied to.

 

This is done by placing the style declarations inside curly brackets to group them, and putting a “CSS selector” at the beginning of each group:

 style="margin:0;width:1007px;height:127px">h1 {
color: red;
}
div.box {
border: 1px solid black;
}
#intro-paragraph { font-weight: bold;
}

 

CSS selectors define the patterns used to "select" the HTML elements you want to style. They are quite comprehensive in terms of syntax. The following list provides an overview of the most important rules:

 

Tagname selects all elements with a particular tag name. For instance, “h1” simply matches with all “<h1>” tags on a page.

 

.classname (note the dot) selects all elements having a particular class defined in the HTML document. This is exactly where the “class” attribute comes in.

 

For instance, .intro will match with both "<p class="intro">" and "<h1 class="intro">". Note that HTML elements can have multiple classes, for example, "<p class="intro highlight">".

 

#idname matches elements based on their "id" attribute. Contrary to classes, proper HTML documents should ensure that each "id" is unique and given to only one element (though don't be surprised if some particularly messy HTML page breaks this convention and uses the same id value multiple times).

 

These selectors can be combined in all sorts of ways. div.box, for instance, selects all "<div class="box">" tags, but not "<div class="circle">" tags.

 

Multiple selector rules can be specified by using a comma, “,”, for example, h1, h2, h3.

selector1 selector2 defines a chaining rule (note the space) and selects all elements matching selector2 inside of elements matching selector1. Note that it is possible to chain more than two selectors together.

 

selector1 > selector2 selects all elements matching selector2 where the parent element matches selector1. Note the subtle difference here with the previous line.

 

A “parent” element refers to the “direct parent.” For instance, div > span will not match with the span element inside “<div> <p> <span> </span> </p> </div>” (as the parent element here is a “<p>” tag), whereas div span will.

 

selector1 + selector2 selects all elements matching selector2 that are placed directly after (i.e., on the same level in the HTML hierarchy) elements matching selector1.

 

selector1 ~ selector2 selects all elements matching selector2 that are placed after (on the same level in the HTML hierarchy) elements matching selector1. Again, there's a subtle difference here with the previous rule: the match here does not need to be "direct": there can be other tags in between.

 

It is also possible to add more fine-tuned selection rules based on attributes of elements. tagname[attributename] selects all tagname elements where an attribute named attributename is present. Note that the tag selector is optional, and simply writing [title] selects all elements with a "title" attribute.

 

The attribute selector can be further refined. [attributename=value] checks the actual value of an attribute as well. If you want to include spaces, wrap the value in double quotes.

 

[attributename~=value] does something similar, but instead of performing an exact value comparison, here all elements are selected whose attributename attribute's value is a space-separated list of words, one of them being equal to value.

 

[attributename|=value] selects all elements whose attributename attribute's value is exactly "value" or starts with "value" immediately followed by a hyphen ("-").

 

[attributename^=value] selects all elements whose attribute value starts with the provided value. 

If you want to include spaces, wrap the value in double quotes.

 

[attributename$=value] selects all elements whose attribute value ends with the provided value.

If you want to include spaces, wrap the value in double quotes.

 

[attributename*=value] selects all elements whose attribute value contains the provided value. 

If you want to include spaces, wrap the value in double quotes.

 

Finally, there are a number of “colon” and “double-colon” “pseudo-classes” that can be used in a selector rule as well. p:first-child selects every “<p>” tag that is the first child of its parent element, and p:last-child and p:nth-child(10) provide similar functionality.

 

Play around with the Wikipedia page using Chrome's Developer Tools (or the equivalent in your browser): try to find instances of the "class" attribute. The CSS resource of the page is referenced through a "<link>" tag (note that pages can load multiple CSS files as well):

 

<link rel="stylesheet" href="/w/load.php?[...];skin=vector">

 

We’re not going to build websites using CSS. Instead, we’re going to scrape them. As such, you might wonder why this discussion regarding CSS is useful for our purposes.

 

The reason is that the same CSS selector syntax can be used to quickly find and retrieve elements from an HTML page using Python.

 

Try right-clicking some HTML elements in the Elements tab of Chrome’s Developer Tools pane and press “Copy, Copy selector.” Note that you obtain a CSS selector. For instance, this is the selector to fetch one of the tables on the page:

 

#mw-content-text > div > table:nth-child(9).

Or: “inside the element with id “mw-content-text,” get the child “div” element, and get the 9th “table” child element.”

We’ll use these selectors quite often once we start working with HTML in our web scraping scripts.

 

The Beautiful Soup Library


We’re now ready to start working with HTML pages using Python. Recall the following lines of code:

 style="margin:0;width:1008px;height:116px">import requests

URL = 'https://en.wikipedia.org/w/index.php' + \ '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(URL)HTML_contents = r.text

How do we deal with the HTML contained in html_contents? To properly parse and tackle this "soup," we'll bring in another library, called "Beautiful Soup."

 

Soup, Rich and Green: Finally, it becomes clear why we've been referring to messy HTML pages as a "soup": the Beautiful Soup library was named after a Lewis Carroll poem bearing the same name from "Alice's Adventures in Wonderland."

 

In the tale, the poem is sung by a character called the "Mock Turtle" and goes as follows: "Beautiful Soup, so rich and green,// Waiting in a hot tureen!// Who for such dainties would not stoop?// Soup of the evening, beautiful Soup!"

 

Just like in the story, Beautiful Soup tries to organize complexity: it helps to parse, structure, and organize the oftentimes very messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure.

 

Just as was the case with requests, installing Beautiful Soup is easy with pip (note the "4" in the package name):

pip install -U beautifulsoup4

 

Using Beautiful Soup starts with the creation of a BeautifulSoup object. If you already have an HTML page contained in a string (as we have), this is straightforward.

 

Don’t forget to add the new import line:

import requests
from bs4 import BeautifulSoup
URL = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(URL)
html_contents = r.text
html_soup = BeautifulSoup(html_contents)
Try running this snippet of code. If everything went well, you should get no errors, though you might see the following warning appear:
Warning (from warnings module):
  File "__init__.py", line 181
    markup_type=markup_type))

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

 

The code that caused this warning is on line 1 of the file <string>.

To get rid of this warning, change the code so that it looks like this:

BeautifulSoup(YOUR_MARKUP}) to this:

BeautifulSoup(YOUR_MARKUP, "html.parser")

 

Uh-oh, what’s going on here? The Beautiful Soup library itself depends on an HTML parser to perform most of the bulk parsing work. In Python, multiple parsers exist to do so:

  • “html.parser”: a built-in Python parser that is decent (especially when using recent versions of Python 3) and requires no extra installation.
  • “lxml”: which is very fast but requires an extra installation.
  • “html5lib”: which aims to parse web page in exactly the same way as a web browser does, but is a bit slower.

 

Since there are small differences between these parsers, Beautiful Soup warns you if you don't explicitly provide one, as this might cause your code to behave slightly differently when executing the same script on different machines.

 

To solve this, we simply specify a parser ourselves — we’ll stick with the default Python parser here:

html_soup = BeautifulSoup(html_contents, 'html.parser')

 

Beautiful Soup’s main task is to take HTML content and transform it into a tree-based representation. Once you’ve created a BeautifulSoup object, there are two methods you’ll be using to fetch data from the page:

find(name, attrs, recursive, string, **keywords);

find_all(name, attrs, recursive, string, limit, **keywords).

 

Underscores if you don’t like writing underscores, Beautiful Soup also exposes most of its methods using “camelCaps” capitalization. So instead of find_all, you can also use findAll if you prefer.

 

Both methods look very similar indeed, with the exception that find_all takes an extra limit argument. To test these methods, add the following lines to your script and run it:

print(html_soup.find('h1'))
print(html_soup.find('', {'id': 'p-logo'}))
for found in html_soup.find_all(['h1', 'h2']):
    print(found)

 

The general idea behind these two methods should be relatively clear: they’re used to find elements inside the HTML tree. Let’s discuss the arguments of these two methods step by step:

 

The name argument defines the tag names you wish to “find” on the page. You can pass a string or a list of tags. Leaving this argument as an empty string simply selects all elements.

 

The attrs argument takes a Python dictionary of attributes and matches HTML elements that match those attributes.

 

And or Or? Some guidelines state that the attributes defined in the attrs dictionary behave in an “or-this-or-that” relationship, where every element that matches at least one of the attributes will be retrieved.

 

This is not true, however: both your filters defined in attrs and in the keywords you use in **keywords should all match in order for an element to be retrieved.

 

The recursive argument is a Boolean and governs the depth of the search. If set to True (the default value), the find and find_all methods will look into children, children's children, and so on, for elements that match your query. If it is False, they will only look at direct child elements.
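As a quick illustration of the recursive argument, consider the following minimal sketch on a made-up HTML fragment:

from bs4 import BeautifulSoup

tiny_soup = BeautifulSoup('<div><p><span>Hello</span></p></div>', 'html.parser')
div_tag = tiny_soup.find('div')
print(div_tag.find('span'))                   # <span>Hello</span>: all descendants are searched
print(div_tag.find('span', recursive=False))  # None: <span> is not a direct child of <div>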

 

The string argument is used to perform matching based on the text content of elements.

Text or String? The string argument is relatively new. In earlier Beautiful Soup versions, this argument was named text instead.

 

You can, in fact, still use text instead of string if you like. If you use both (not recommended), then text takes precedence and string ends up in the list of **keywords below.
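As a small illustration of the string argument, here is a minimal sketch on a made-up HTML fragment:

from bs4 import BeautifulSoup

tiny_soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
print(tiny_soup.find_all(string='world'))   # ['world']: matching text fragments, not tags
print(tiny_soup.find('b', string='world'))  # <b>world</b>: combined with a tag name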

 

The limit argument is only used in the find_all method and can be used to limit the number of elements that are retrieved. Note that find is functionally equivalent to calling find_all with the limit set to 1.

 

The difference is that the former returns the retrieved element directly, whereas the latter will always return a list of items, even if it contains just a single element.

 

Also important to know is that, when find_all cannot find anything, it returns an empty list, whereas if find cannot find anything, it returns None.

 

**keywords is kind of a special case. Basically, this part of the method signature indicates that you can add in as many extra named arguments as you like, which will then simply be used as attribute filters.

 

Writing “find(id='myid')” is hence the same as “find(attrs={'id': 'myid'})”. If you define both the attrs argument and extra keywords, all of these will be used together as filters. This functionality is mainly offered as a convenience in order to write easier-to-read code.

 

Take Care with Keywords: Even though the **keywords argument can come in very helpful in practice, there are some important caveats to mention here.

 

First of all, you cannot use class as a keyword, as it is a reserved Python keyword. This is a pity, as it is one of the most frequently used attributes when hunting for content inside HTML.

 

Luckily, Beautiful Soup provides a workaround: instead of class, just write class_ as follows:

"find(class_='myclass')". Note that name can also not be used as a keyword, since that is already what the first argument of find and find_all is called.

 

Sadly, Beautiful Soup does not provide a name_ alternative here. Instead, you'll need to use attrs in case you want to select based on the "name" HTML attribute.
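The following minimal sketch on a made-up HTML fragment shows both the class_ workaround and the attrs alternative for the "name" attribute:

from bs4 import BeautifulSoup

form_soup = BeautifulSoup(
    '<input class="search wide" name="q"><input class="hidden" name="token">',
    'html.parser')
print(form_soup.find('input', class_='search'))      # Filters on the "class" attribute
print(form_soup.find('input', attrs={'name': 'q'}))  # Filters on the "name" attribute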

 

Both find and find_all return Tag objects. Using these, there are a number of interesting things you can do:


  • Access the name attribute to retrieve the tag name.
  • Access the contents attribute to get a Python list containing the tag’s children (its direct descendant tags) as a list.

 

  • The children attribute does the same but provides an iterator instead; the descendants attribute also returns an iterator, now including all the tag's descendants in a recursive manner. These attributes are used when you call find and find_all.

 

  • Similarly, you can also go “up” the HTML tree by using the parent and parents attributes. To go sideways (i.e., find next and previous elements at the same level in the hierarchy), next_sibling, previous_sibling and next_siblings, and previous_siblings can be used.

 

  • Converting the Tag object to a string shows both the tag and its HTML content as a string. This is what happens if you print out the Tag object, for instance, or wrap such an object in the str function.

 

  • Access the attributes of the element through the attrs attribute of the Tag object. For the sake of convenience, you can also directly use the Tag object itself as a dictionary.

 

  • Use the text attribute to get the contents of the Tag object as clear text (without HTML tags).

 

  • Alternatively, you can use the get_text method as well, to which a strip Boolean argument can be given so that get_text(strip=True) is equivalent to text.strip().

 

  • It’s also possible to specify a string to be used to join the bits of text enclosed in the element together, for example, get_text('--').

 

  • If a tag only has one child, and that child itself is simply text, then you can also use the string attribute to get the textual content. However, in case a tag contains other HTML tags nested within it, string will return None whereas text will recursively fetch all the text (see the short sketch after this list).

 

  • Finally, not all find and find_all searches need to start from your original BeautifulSoup objects. Every Tag object itself can be used as a new root from which new searches can be started.
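To illustrate the difference between string and text mentioned in the list above, consider this minimal sketch on a made-up fragment:

from bs4 import BeautifulSoup

tiny_soup = BeautifulSoup('<div>Some <i>italic</i> text</div>', 'html.parser')
div_tag = tiny_soup.find('div')
print(div_tag.string)    # None, as <div> contains more than just text
print(div_tag.text)      # Some italic text
print(div_tag.i.string)  # italic, as <i> contains nothing but text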

We’ve dealt with a lot of theory. Let’s show off these concepts through some example code:

import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(URL)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')

# Find the first h1 tag
first_h1 = html_soup.find('h1')
print(first_h1.name)     # h1
print(first_h1.contents) # ['List of ', [...], ' episodes']
print(str(first_h1))
# Prints out: <h1 class="firstHeading" id="firstHeading" lang="en">List of
# <i>Game of Thrones</i> episodes</h1>
print(first_h1.text)       # List of Game of Thrones episodes
print(first_h1.get_text()) # Does the same
print(first_h1.attrs)
# Prints out: {'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}
print(first_h1.attrs['id']) # firstHeading
print(first_h1['id'])       # Does the same
print(first_h1.get('id'))   # Does the same

print('------------ CITATIONS ------------')
# Find the first five cite elements with a citation class
cites = html_soup.find_all('cite', class_='citation', limit=5)
for citation in cites:
    print(citation.get_text())
    # Inside of this cite element, find the first a tag
    link = citation.find('a')
    # ... and show its URL
    print(link.get('href'))
    print()

A Note About Robustness: Take a good look at the "citations" part of the example above.

 

What would happen in case no "<a>" tag is present inside a "<cite>" element? In that case, the link variable would be set to None and the line "link.get('href')" would crash our program. Always take care when writing web scrapers and prepare for the worst.

 

For examples in “safe environments” we can permit ourselves to be somewhat sloppy for the sake of brevity, but in a real-life situation, you’d want to put in an extra check to see whether the link is none or not and act accordingly.

 

Before we move on with another example, there are two small remarks left to be made regarding find and find_all. If you find yourself traversing a chain of tag names as follows:

tag.find('div').find('table').find('thead').find('tr')

 

It might be useful to keep in mind that Beautiful Soup also allows us to write this in a shorthand way:

tag.div.table.thead.tr
Similarly, the following line of code:
tag.find_all('h1')
Is the same as calling:
tag('h1')

Although this is — again — offered for the sake of convenience, we’ll nevertheless continue to use find and find_all in full throughout this blog, as we find that being a little bit more explicit helps readability in this case.

 

Let us now try to work out the following use case. You’ll note that our Game of Thrones Wikipedia page has a number of well-maintained tables listing the episodes with their directors, writers, air date, and a number of viewers. Let’s try to fetch all of this data at once using what we have learned:

import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(URL)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')

# We'll use a list to store our episode list
episodes = []

ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th', 'td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i] for i in range(len(values))}
            episodes.append(episode_dict)

# Show the results
for episode in episodes:
    print(episode)

Most of the code should be relatively straightforward at this point, though some things are worth pointing out:

 

We don’t come up with the “find_all('table', class_= 'wikiepisodetable')” line from thin air.

although it might seem that way just by looking at the code.

 

Recall what we said earlier about your browser’s developer tools becoming your best friend. Inspect the episode tables on the page. Note how they’re all defined by means of a “<table>” tag.

 

However, the page also contains tables we do not want to include. Some further investigation leads us to a solution: all the episode tables have "wikiepisodetable" as a class, whereas the other tables do not.

 

You’ll often have to puzzle your way through a page first before coming up with a solid approach. In many cases, you’ll have to perform multiple find and find_all iterations before ending up where you want to be.

 

For every table, we first want to retrieve the headers to use as keys in a Python dictionary. To do so, we first select the first “<tr>” tag, and select all “<th>” tags within it.

 

Next, we loop through all the rows (the “<tr>” tags), except for the first one (the header row). For each row, we loop through the “<th>” and “<td>” tags to extract the column values (the first column is wrapped inside of a “<th>” tag, the others in “<td>” tags, which is why we need to handle both).

 

At the end of each row, we’re ready to add a new entry to the “episodes” variable. To store each entry, we use a normal Python dictionary (episode_dict). The way how this object is constructed might look a bit strange in case you’re not very familiar with Python.

 

That is, Python allows us to construct a complete list or dictionary “in one go” by putting a “for” construct inside the “[...]” or “{...}” brackets. Here, we use this to immediately loop through the headers and values lists to build the dictionary object.

 

Note that this assumes that both of these lists have the same length and that the order for both of these matches so that the header at “headers[2]”, for instance, is the header corresponding with the value over at “values[2]”. Since we’re dealing with rather simple tables here, this is a safe assumption.
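If dictionary comprehensions are new to you, the following stand-alone sketch (with made-up header and value lists) shows what the construction above boils down to:

headers = ['Title', 'Director']
values = ['Winter Is Coming', 'Tim Van Patten']
episode_dict = {headers[i]: values[i] for i in range(len(values))}
print(episode_dict)
# Will show: {'Title': 'Winter Is Coming', 'Director': 'Tim Van Patten'}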

 

Are Tables Worth It? You might not be very impressed with this example so far.

 

Most modern browsers allow you to simply select or right-click tables on web pages and will be able to copy them straight into a spreadsheet program such as Excel anyway. That's true, and if you only have one table to extract, this is definitely the easier route to follow.

 

Once you start dealing with many tables, however, especially if they’re spread over multiple pages, or need to periodically refresh tabular data from a particular web page, the benefit of writing a scraper starts to become more apparent.

 

Experiment a bit more with this code snippet. You should be able to work out the following:


Try extracting all links from the page as well as where they point to (tip: look for the “href” attribute in “<a>” tags).

Try extracting all images from the page.

 

Try extracting the “ratings” table from the page. This one is a bit tricky. You might be tempted to use “find('table', class_="wikitable")”.

But you’ll note that this matches the very first table on the page instead, even though its class attribute is set to “wikitable plainrowheaders.”

Indeed, for HTML attributes that can take multiple, space-separated values (such as “class”), Beautiful Soup will perform a partial match.

 

To get the table we want, you'll either have to loop over all "wikitable" tables on the page and perform a check on their text attribute to make sure you have the one you want, or try to find a unique parent element from which you can drill down, for example, "find('div', align='center').find('table', class_='wikitable')" — at least for now; you'll learn about some more advanced Beautiful Soup features in the upcoming section.
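A minimal sketch of the first approach could look as follows; note that the exact phrase to check for is an assumption on our part, so verify in your browser which text uniquely identifies the ratings table:

ratings_table = None
for table in html_soup.find_all('table', class_='wikitable'):
    # Keep the table whose text contains a phrase we expect to see in it
    if 'Ratings' in table.text:
        ratings_table = table
        break
print(ratings_table is not None)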

 

Classes Are Special: For HTML attributes that can take multiple, space-separated values (such as "class"), Beautiful Soup will perform a partial match. This can be tricky in case you want to perform an exact match, such as "find me elements with this class and only this class," but also in cases where you want to select HTML elements matching more than one class.

 

In this case, you can write something like “find(class_='class-one class-two')”, though this way of working is rather brittle and dangerous (the classes should then appear in the same order and next to each other in the HTML page, which might not always be the case).

 

Another approach is to wrap your filter in a list, that is, "find(class_=['class-one', 'class-two'])", though this will also not obtain the desired result: instead of matching elements having both "class-one" and "class-two" as classes, this statement will match elements having any of these classes! To solve this problem in a robust way, we hence first need to learn a little bit more about Beautiful Soup...

 

More on Beautiful Soup


Now that we understand the basics of Beautiful Soup, we're ready to explore the library a bit further. First of all, although we have already seen the basics of find and find_all, it is important to note that these methods are very versatile indeed. We have already seen how you can filter on a simple tag name or a list of them:

html_soup.find('h1')
html_soup.find(['h1', 'h2'])

 

However, these methods can also take other kinds of objects, such as a regular expression object. The following line of code will match with all tags that start with the letter “h” by constructing a regular expression using Python’s “re” module:

import re

html_soup.find(re.compile('^h'))

 

Regex if you haven’t heard about regular expressions before, a regular expression (regex) defines a sequence of patterns (an expression) defining a search pattern. it is frequently used for string searching and matching code to find (and replace) fragments of strings. although they are very powerful constructs, they can also be misused.

 

For instance, it is a good idea not to go overboard with long or complex regular expressions, as they’re not very readable and it might be hard to figure out what a particular piece of regex is doing later on. By the way, this is also a good point to mention that you should avoid using regex to parse HTML pages.

 

We could have introduced regex earlier as a way to parse pieces of content from your HTML soup, without resorting at all to the use of Beautiful Soup.

 

This is a terrible idea, however. HTML pages are — as we have seen — very messy, and you'll quickly end up with lots of spaghetti code in order to extract content from a page. Always use an HTML parser like Beautiful Soup to perform the grunt work. You can then use small snippets of regex (as shown here) to find or extract pieces of content.

 

Apart from strings, lists, and regular expressions, you can also pass a function. This is helpful in complicated cases where other approaches wouldn’t work:

def has_classa_but_not_classb(tag):
    cls = tag.get('class', [])
    return 'classa' in cls and 'classb' not in cls

html_soup.find(has_classa_but_not_classb)

 

Note that you can also pass lists, regular expressions, and functions to the attrs dictionary values and to the string and **keywords arguments of find and find_all.
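As a small sketch of how these filter types can be combined (which elements these lines return depends, of course, on the page you're working with):

import re

print(html_soup.find_all('a', href=re.compile(r'wikipedia\.org'), limit=3))
print(html_soup.find_all(['h1', 'h2'], attrs={'class': 'firstHeading'}))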

 

Apart from find and find_all, there are also a number of other methods for searching the HTML tree, which are very similar to find and find_all. The difference is that they search different parts of the HTML tree:

 

find_parent and find_parents work their way up the tree, looking at a tag’s parents using its parents attribute. Remember that find and find_all work their way down the tree, looking at a tag’s descendants.

 

  • find_next_sibling and find_next_siblings will iterate and match a tag's siblings using the next_siblings attribute.
  • find_previous_sibling and find_previous_siblings do the same but use the previous_siblings attribute.
  • find_next and find_all_next use the next_elements attribute to iterate and match over whatever comes after a tag in the document.
  • find_previous and find_all_previous will perform the search backward using the previous_elements attribute instead.
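A minimal sketch on a made-up list fragment shows a few of these in action:

from bs4 import BeautifulSoup

list_soup = BeautifulSoup(
    '<ul><li>one</li><li id="two">two</li><li>three</li></ul>', 'html.parser')
second = list_soup.find('li', id='two')
print(second.find_parent('ul'))            # The enclosing <ul> tag
print(second.find_previous_sibling('li'))  # <li>one</li>
print(second.find_next_sibling('li'))      # <li>three</li>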

 

Remember that find and find_all work on the children attribute in case the recursive argument is set to False, and on the descendants attribute in case recursive is set to True.

 

So Many Methods although it’s not really documented, it is also possible to use the findChild and findChildren methods (though not find_child and find_children), which are defined as aliases for find and find_all respectively.

 

There is no findDescendant, however, so keep in mind that using findChild will default to searching throughout all the descendants (just like find does), unless you set the recursive argument to False. This is certainly confusing, so it's best to avoid these methods.

 

Although all of these can come in handy, you’ll find that find and find_all will take up most of the workload when navigating an HTML tree. There is one more method, however, which is extremely useful: select.

 

Finally, the CSS selectors we’ve seen above can be of use. Using this method, you can simply pass a CSS selector rule as a string.

 

Beautiful Soup will return a list of elements matching this rule:

# Find all <a> tags
html_soup.select('a')
# Find the element with the info id
html_soup.select('#info')
# Find <div> tags with both classa and classb CSS classes
html_soup.select('div.classa.classb')
# Find <a> tags with an href attribute starting with http://example.com/
html_soup.select('a[href^="http://example.com/"]')
# Find <li> tags which are children of <ul> tags with class lst
html_soup.select('ul.lst > li')

 

Once you start getting used to CSS selectors, this method can be very powerful indeed. For instance, if we want to find out the citation links from our Game of Thrones Wikipedia page, we can simply run:

for link in html_soup.select('ol.references cite a[href]'):
    print(link.get('href'))

 

However, the CSS selector rule engine in Beautiful Soup is not as powerful as the one found in a modern web browser. The following rules are valid selectors, but will not work in Beautiful Soup.

# This will not work:
# cite a[href][rel=nofollow]
# Instead, you can use:
tags = [t for t in html_soup.select('cite a[href]')
        if 'nofollow' in t.get('rel', [])]
# This will not work:
# cite a[href][rel=nofollow]:not([href*="archive.org"])
# Instead, you can use:
tags = [t for t in html_soup.select('cite a[href]')
        if 'nofollow' in t.get('rel', [])
        and 'archive.org' not in t.get('href')]

Luckily, cases where you need to resort to such complex selectors are rare, and remember that you can still use find, find_all, and friends too (try playing around with the two samples above and rewriting them without using select at all).

 

All Roads Lead to Your Element: Observant readers will have noticed that there are often multiple ways to write a CSS selector to get the same result.

 

That is, instead of writing "cite a", you can also go overboard and write "body div.reflist ol.references li cite.citation a" and get the same result. In general, however, it is good practice to only make your selectors as granular as necessary to get the content you want.

 

Websites often change, and if you’re planning to use a web scraper in production for a significant amount of time, you can save yourself some headache by finding a good trade-off between precision and robustness.

 

That way, you can try to be as future-proof as possible in case the site owners decide to play around with the HTML structure, class names, attributes, and so on.

 

This being said, there might always be a moment when a change is so significant that it ends up breaking your selectors. Including extra checks in your code and providing early warning signs can help a lot to build robust web scrapers.

 

Finally, there is one more detail to mention regarding Beautiful Soup. So far, we've been talking mostly about the BeautifulSoup object itself (the html_soup variable in our examples above), as well as the Tag objects retrieved by find, find_all, and other search operations.

 

There are two more object types in Beautiful Soup that, although less commonly used, are useful to know about:

 

NavigableString objects: these are used to represent text within tags, rather than the tags themselves. Some Beautiful Soup functions and attributes will return such objects, such as the string attribute of tags, for instance.

 

Attributes such as descendants will also include these in their listings. In addition, if you use the find or find_all methods and supply a string argument value without a name argument, then these will return NavigableString objects as well, instead of Tag objects.

 

Comment objects: these are used to represent HTML comments (found in comment tags, “<!-- ... -->”). These are very rarely useful when web scraping.
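A minimal sketch on a made-up fragment shows both object types popping up:

from bs4 import BeautifulSoup

comment_soup = BeautifulSoup('<p>Hello <!-- a comment --> world</p>', 'html.parser')
for child in comment_soup.find('p').children:
    print(type(child).__name__, repr(child))
# Will show NavigableString objects for the text pieces and a Comment object
# for the comment itself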

 

Feel free to play around some more with Beautiful Soup if you like. Take some time to explore the documentation of the library over at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

 

Note that Beautiful Soup’s documentation is a bit less well- structured than requests’ documentation, so it reads more like an end-to-end document instead of a reference guide. 
