Web Crawling (Best Tutorial 2019)


What Is Web Crawling? (Tutorial 2019)

Web crawlers are heavily used by search engines like Google to retrieve the contents of a URL, examine that page for other links, retrieve the URLs behind those links, and so on. This tutorial explains web crawling in Python, with worked examples.

 

When writing web crawlers, there are some subtle design choices that come into play that will change the scope and nature of your project:

  • In many cases, crawling will be restricted to a well-defined set of pages, for example, product pages of an online shop. These cases are relatively easy to handle, as you’re staying within the same domain and have an expectation about what each product page will look like, or about the types of data you want to extract.

 

  • In other cases, you will restrict yourself to a single website (a single domain name), but do not have a clear target regarding information extraction in mind. Instead, you simply want to create a copy of the site.

 

  • In such cases, manually writing a scraper is not an advisable approach. There are many tools available (for Windows, Mac, and Linux) that will help you to make an offline copy of a website, including lots of configurable options. To find these, look for “website mirror tool.”

 

  • Finally, in even more extreme cases, you’ll want to keep your crawling very open-ended.

 

  • For example, you might wish to start from a series of keywords, Google each of them, crawl the top 10 results for every query, and crawl those pages for, say, images, tables, articles, and so on. Obviously, this is the most advanced use case to handle.

 

  • Writing a robust crawler requires that you put various checks in place and think carefully about the design of your code, since crawlers can end up on any part of the web and will often run for long periods of time.

 

  • You’ll need to think carefully about stopping conditions, about keeping track of which pages you visited before (and whether it’s already time to visit them again), and about how you will store the results.

 

  • You’ll also need to make sure that a crashed script can be restarted without losing its current progress. The following overview provides some general best practices and food for thought:

 

  • Think carefully about which data you actually want to gather: Can you extract what you need by scraping a set of predefined websites, or do you really need to discover websites you don’t know about yet? The first option will always lead to easier code in terms of writing and maintenance.

 

  • Use a database: It’s best to use a database to keep track of links to visit, visited links, and gathered data. Make sure to timestamp everything so you know when something was created and last updated.

 

  • Separate crawling from scraping: Most robust crawlers separate the “crawling” part (visiting websites, extracting links, and putting them in a queue, that is, gathering the pages you wish to scrape) from the actual “scraping” part (extracting information from pages). Doing both in one and the same program or loop is quite error-prone.

 

In some cases, it might be a good idea to have the crawler store a complete copy of a page’s HTML contents so that you don’t need to revisit it once you want to scrape out information.


  • Stop early: When crawling pages, it’s always a good idea to incorporate stopping criteria right away. That is, if you can already determine that a link is not interesting at the moment of seeing it, don’t put it in the “to crawl” queue.

 

  • The same applies when you scrape information from a page. If you can quickly determine that the contents are not interesting, then don’t bother continuing with that page.

 

  • Retry or abort: Note that the web is a dynamic place, and links can fail to work or pages can be unavailable. Think carefully about how many times you’ll want to retry a particular link.

 

  • Crawling the queue: That said, the way you deal with your queue of links is important as well.

 

  • If you just apply a simple FIFO (first in first out) or LIFO (last in first out) approach, you might end up retrying a failing link in quick succession, which might not be what you want to do. Building in cooldown periods is hence important as well.

 

  • Parallel programming: In order to make your program efficient, you’ll want to write it in such a way that you can spin up multiple instances that all work in parallel.

 

  • Hence the need for a database-backed data store as well. Always assume that your program might crash at any moment and that a fresh instance should be able to pick up the tasks right away.

 

  • Keep in mind the legal aspects of scraping: In the next blog, we take a closer look at legal concerns. Some sites will not be too happy about scrapers visiting them.

 

  • Also, make sure that you do not “hammer” a site with a storm of HTTP requests. Although many sites are relatively robust, some might even go down and will be unable to serve normal visitors if you’re firing hundreds of HTTP requests to the same site.
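As a minimal illustration of such a politeness measure (the URLs and the one-second delay below are just placeholders), you could simply sleep between consecutive requests to the same site:

import time
import requests

# A sketch of a simple cooldown between requests; URLs and delay are examples.
urls_to_visit = [
    'http://www.webscraping.com/crawler/?r=1',
    'http://www.webscraping.com/crawler/?r=2',
]
for url in urls_to_visit:
    r = requests.get(url)
    print(url, r.status_code)
    time.sleep(1)  # wait a second before hitting the same site again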

 

We’re not going to create a Google competitor in what follows, but we will give you a few pointers regarding crawling in the next section, using two self-contained examples.

 

Crawling In Python


Not Easy to Guess  You’ll note that the pages of the example site use an “r” URL parameter with non-guessable values.

 

If that were not the case, that is, if all values for the URL parameter fell within a well-defined range or looked like successive numbers, writing a scraper would be much easier, as the list of URLs you want to obtain would then be well-defined.

 

Keep this in mind when exploring a page to see how viable it is to scrape it and to figure out which approach you’ll have to take.

 

A first attempt to start scraping this site looks as follows:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'http://www.webscraping.com/crawler/'
links_seen = set()

def visit(url, links_seen):
    html = requests.get(url).text
    html_soup = BeautifulSoup(html, 'html.parser')
    links_seen.add(url)
    for link in html_soup.find_all("a"):
        link_url = link.get('href')
        if link_url is None:
            continue
        full_url = urljoin(url, link_url)
        if full_url in links_seen:
            continue
        print('Found a new page:', full_url)
        # Normally, we'd store the results here too
        visit(full_url, links_seen)

visit(base_url, links_seen)

Note that we’re using the urljoin function here. The reason we do so is that the “href” attribute of links on the page refers to a relative URL, for instance, “?r=f01e7f02e91239a2003bdd35770e1173”, which we need to convert to an absolute one.

 

We could just do this by prepending the base URL, that is, base_url + link_url, but once we’d start to follow links and pages deeper in the site’s URL tree, that approach would fail to work.

 

Using urljoin is the proper way to take an existing absolute URL, join a relative URL, and get a well-formatted new absolute URL.
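As a quick illustration (using made-up URLs), urljoin resolves relative links against the page’s URL like this:

from urllib.parse import urljoin

# A relative query-string link, as on the example site:
print(urljoin('http://www.webscraping.com/crawler/',
              '?r=f01e7f02e91239a2003bdd35770e1173'))
# -> http://www.webscraping.com/crawler/?r=f01e7f02e91239a2003bdd35770e1173

# Relative paths are resolved against the page's directory:
print(urljoin('http://www.webscraping.com/crawler/a/page.html', '../other.html'))
# -> http://www.webscraping.com/crawler/other.html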

 

What About Absolute href Values?  The urljoin approach will even work with absolute “href” link values. Using:

urljoin('http://example.org', 'https://www.other.com/dir/') will return “https://www.other.com/dir/”, so this is always a good function to rely on when crawling.

 

If you run this script, you’ll see that it will start visiting the different URLs. If you let it run for a while, however, this script will certainly crash, and not only because of network hiccups.

 

The reason for this is that we use recursion: the visit function calls itself over and over again, without ever getting the opportunity to go back up in the call tree, as every page contains links to other pages:

Traceback (most recent call last):
  File "C:\Users\Seppe\Desktop\firstexample.py", line 23, in <module>
    visit(url, links_seen)
  [...]
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison

 

As such, relying on recursion for web crawling is generally not a robust idea. We can rewrite our code as follows without recursion:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

links_todo = ['http://www.webscraping.com/crawler/']
links_seen = set()

def visit(url, links_seen):
    html = requests.get(url).text
    html_soup = BeautifulSoup(html, 'html.parser')
    new_links = []
    for link in html_soup.find_all("a"):
        link_url = link.get('href')
        if link_url is None:
            continue
        full_url = urljoin(url, link_url)
        if full_url in links_seen:
            continue
        # Normally, we'd store the results here too
        new_links.append(full_url)
    return new_links

while links_todo:
    url_to_visit = links_todo.pop()
    links_seen.add(url_to_visit)
    print('Now visiting:', url_to_visit)
    new_links = visit(url_to_visit, links_seen)
    print(len(new_links), 'new link(s) found')
    links_todo += new_links

You can let this code run for a while. This solution is better, but it still has several drawbacks. If our program crashes (e.g., when your Internet connection or the website is down), you’ll have to restart from scratch again. Also, we have no idea how large the links_seen set might become.

 

Normally, your computer will have plenty of memory available to easily store thousands of URLs, though we might wish to resort to a database to store intermediate progress information as well as the results.

 

Storing Results in a Database


Let’s adapt our example to make it more robust against crashes by storing progress and result information in a database.

 

We’re going to use the “records” library to manage an SQLite database (a file-based though powerful database system), in which we’ll store our queue of links and the numbers retrieved from the pages we crawl. It can be installed using pip:

pip install -U records
The adapted code, then, looks as follows:
import requests
import records
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from sqlalchemy.exc import IntegrityError

db = records.Database('sqlite:///crawler_database.db')

db.query('''CREATE TABLE IF NOT EXISTS links (
    url text PRIMARY KEY,
    created_at datetime,
    visited_at datetime NULL)''')
db.query('''CREATE TABLE IF NOT EXISTS numbers (
    url text,
    number integer,
    PRIMARY KEY (url, number))''')

def store_link(url):
    try:
        db.query('''INSERT INTO links (url, created_at)
                    VALUES (:url, CURRENT_TIMESTAMP)''', url=url)
    except IntegrityError as ie:
        # This link already exists, do nothing
        pass

def store_number(url, number):
    try:
        db.query('''INSERT INTO numbers (url, number)
                    VALUES (:url, :number)''', url=url, number=number)
    except IntegrityError as ie:
        # This number already exists, do nothing
        pass

def mark_visited(url):
    db.query('''UPDATE links SET visited_at=CURRENT_TIMESTAMP
                WHERE url=:url''', url=url)

def get_random_unvisited_link():
    link = db.query('''SELECT * FROM links
                       WHERE visited_at IS NULL
                       ORDER BY RANDOM() LIMIT 1''').first()
    return None if link is None else link.url

def visit(url):
    html = requests.get(url).text
    html_soup = BeautifulSoup(html, 'html.parser')
    new_links = []
    for td in html_soup.find_all("td"):
        store_number(url, int(td.text.strip()))
    for link in html_soup.find_all("a"):
        link_url = link.get('href')
        if link_url is None:
            continue
        full_url = urljoin(url, link_url)
        new_links.append(full_url)
    return new_links

store_link('http://www.webscraping.com/crawler/')
url_to_visit = get_random_unvisited_link()
while url_to_visit is not None:
    print('Now visiting:', url_to_visit)
    new_links = visit(url_to_visit)
    print(len(new_links), 'new link(s) found')
    for link in new_links:
        store_link(link)
    mark_visited(url_to_visit)
    url_to_visit = get_random_unvisited_link()

 

From SQL to ORM


We’re writing SQL (Structured Query Language) statements manually here to interact with the database, which is fine for smaller projects.

For more complex projects, however, it’s worthwhile to investigate some of the ORM (object-relational mapping) libraries that are available in Python, such as SQLAlchemy or peewee.

These allow for a smoother and more controllable “mapping” between a relational database and Python objects, so that you can work with the latter directly, without having to write SQL statements.

 

In the examples blog, we’ll use another library called “dataset” that also offers a convenient way to quickly dump information to a database without having to write SQL. Also note that you don’t have to use the “records” library to work with SQLite databases in Python.

 

Python already comes with a built-in “sqlite3” module (which is being used by records), which you could use as well (see https://docs.python.org/3/library/sqlite3.html). The reason we use records here is that it involves a bit less boilerplate code.
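For comparison, a minimal sketch of the same idea using only the built-in sqlite3 module might look as follows (note the extra boilerplate compared to records):

import sqlite3

# Same table and insert logic as above, but with the standard-library module.
conn = sqlite3.connect('crawler_database.db')
conn.execute('''CREATE TABLE IF NOT EXISTS links (
    url text PRIMARY KEY,
    created_at datetime,
    visited_at datetime NULL)''')
conn.execute('INSERT OR IGNORE INTO links (url, created_at) '
             'VALUES (?, CURRENT_TIMESTAMP)',
             ('http://www.webscraping.com/crawler/',))
conn.commit()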

 

Try running this script for a while and then stop it (either by closing the Python window or pressing Control+C). You can take a look at the database (“crawler_database.db”) using an SQLite client such as “DB Browser for SQLite,” which can be obtained from http://sqlitebrowser.org/.

 

SQLite  Note that the SQLite database we’re using here is fine for smaller projects, but it might give trouble once you start parallelizing your programs.

 

Starting multiple instances of the script above will most likely work fine up to a certain degree, but might crash from time to time due to the SQLite database file being locked (i.e., used for too long by another script’s process).

 

Similarly, SQLite can be pretty daunting to work with when using multithreaded setups in your scripts as well. Switching to a client-server database like MySQL or PostgreSQL might be a good option in such cases.
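Because records speaks SQLAlchemy-style connection URLs, such a switch is mostly a matter of changing the connection string. A hedged sketch follows (the credentials and database names are placeholders, and you would need the appropriate driver installed, e.g. psycopg2 or pymysql):

import records

# Placeholders: swap in your own host, credentials, and database name.
db = records.Database('postgresql://user:password@localhost:5432/crawler')
# or, for MySQL:
# db = records.Database('mysql+pymysql://user:password@localhost/crawler')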

 

Let’s now try to use the same framework to build a crawler for Wikipedia. Our plan here is to store page titles, as well as keep track of “(from, to)” links on each page, starting from the main page. Note that our database schema looks a bit different here:

import requests
import records
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urldefrag
from sqlalchemy.exc import IntegrityError

db = records.Database('sqlite:///wikipedia.db')

db.query('''CREATE TABLE IF NOT EXISTS pages (
    url text PRIMARY KEY,
    page_title text NULL,
    created_at datetime,
    visited_at datetime NULL)''')
db.query('''CREATE TABLE IF NOT EXISTS links (
    url text,
    url_to text,
    PRIMARY KEY (url, url_to))''')

base_url = 'https://en.wikipedia.org/wiki/'

def store_page(url):
    try:
        db.query('''INSERT INTO pages (url, created_at)
                    VALUES (:url, CURRENT_TIMESTAMP)''', url=url)
    except IntegrityError as ie:
        # This page already exists
        pass

def store_link(url, url_to):
    try:
        db.query('''INSERT INTO links (url, url_to)
                    VALUES (:url, :url_to)''', url=url, url_to=url_to)
    except IntegrityError as ie:
        # This link already exists
        pass

def set_visited(url):
    db.query('''UPDATE pages SET visited_at=CURRENT_TIMESTAMP
                WHERE url=:url''', url=url)

def set_title(url, page_title):
    db.query('UPDATE pages SET page_title=:page_title WHERE url=:url',
             url=url, page_title=page_title)

def get_random_unvisited_page():
    link = db.query('''SELECT * FROM pages
                       WHERE visited_at IS NULL
                       ORDER BY RANDOM() LIMIT 1''').first()
    return None if link is None else link.url

def visit(url):
    print('Now visiting:', url)
    html = requests.get(url).text
    html_soup = BeautifulSoup(html, 'html.parser')
    page_title = html_soup.find(id='firstHeading')
    page_title = page_title.text if page_title else ''
    print('  page title:', page_title)
    set_title(url, page_title)
    for link in html_soup.find_all("a"):
        link_url = link.get('href')
        if link_url is None:
            # No href, skip
            continue
        full_url = urljoin(base_url, link_url)
        # Remove the fragment identifier part
        full_url = urldefrag(full_url)[0]
        if not full_url.startswith(base_url):
            # This is an external link, skip
            continue
        store_link(url, full_url)
        store_page(full_url)
    set_visited(url)

store_page(base_url)
url_to_visit = get_random_unvisited_page()
while url_to_visit is not None:
    visit(url_to_visit)
    url_to_visit = get_random_unvisited_page()

 

Note that we incorporate some extra measures to prevent visiting links we don’t want to include: we make sure that we don’t follow external links outside the base URL.

 

We also use the urldefrag function to remove the fragment identifier (i.e., the part after “#”) from link URLs, as we don’t want to visit the same page again just because it has a fragment identifier attached; doing so would be equivalent to visiting and parsing the page without the fragment identifier.

 

In other words, we don’t want to include both “page.html#part1” and “page.html#part2” in our queue; simply including “page.html” is sufficient.
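In code, this boils down to the following (the Wikipedia URL is just an example):

from urllib.parse import urldefrag

# urldefrag returns a (url, fragment) tuple; we keep only the first part:
print(urldefrag('https://en.wikipedia.org/wiki/Web_crawler#History')[0])
# -> https://en.wikipedia.org/wiki/Web_crawler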

 

If you let this script run for a while, there are some fun things you can do with the collected data.

 

For instance, you can try using the “NetworkX” library in Python to create a graph using the “(from, to)” links on each page. In the final blog, we’ll explore such use cases a bit further.
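As a rough sketch of what such a use case could look like (assuming the wikipedia.db file produced by the script above and the networkx package installed; this is not part of the original example):

import records
import networkx as nx

# Build a directed graph from the (from, to) pairs stored in the links table.
db = records.Database('sqlite:///wikipedia.db')
G = nx.DiGraph()
for row in db.query('SELECT url, url_to FROM links'):
    G.add_edge(row.url, row.url_to)
print(G.number_of_nodes(), 'pages,', G.number_of_edges(), 'links')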

 

Note that we assume here that we know in advance which information we want to get out of the pages (the page title and links, in this case). In case you don’t know beforehand what you’ll want to get out of crawled pages, you might want to split up the crawling and parsing of pages even further.

 

For example, you could store a complete copy of the HTML contents, to be parsed by a second script. This is the most versatile setup, but it comes with additional complexity, as the following code shows:

import requests
import records
import re
import os, os.path
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urldefrag
from sqlalchemy.exc import IntegrityError

db = records.Database('sqlite:///wikipedia.db')

# This table keeps track of crawled and to-crawl pages
db.query('''CREATE TABLE IF NOT EXISTS pages (
    url text PRIMARY KEY,
    created_at datetime,
    html_file text NULL,
    visited_at datetime NULL)''')
# This table keeps track of a-tags
db.query('''CREATE TABLE IF NOT EXISTS links (
    url text,
    link_url text,
    PRIMARY KEY (url, link_url))''')
# This table keeps track of img-tags
db.query('''CREATE TABLE IF NOT EXISTS images (
    url text,
    img_url text,
    img_file text,
    PRIMARY KEY (url, img_url))''')

base_url = 'https://en.wikipedia.org/wiki/'
file_store = './downloads/'

if not os.path.exists(file_store):
    os.makedirs(file_store)

def url_to_file_name(url):
    url = str(url).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', url)

def download(url, filename):
    r = requests.get(url, stream=True)
    with open(os.path.join(file_store, filename), 'wb') as the_image:
        for byte_chunk in r.iter_content(chunk_size=4096*4):
            the_image.write(byte_chunk)

def store_page(url):
    try:
        db.query('''INSERT INTO pages (url, created_at)
                    VALUES (:url, CURRENT_TIMESTAMP)''', url=url)
    except IntegrityError as ie:
        pass

def store_link(url, link_url):
    try:
        db.query('''INSERT INTO links (url, link_url)
                    VALUES (:url, :link_url)''', url=url, link_url=link_url)
    except IntegrityError as ie:
        pass

def store_image(url, img_url, img_file):
    try:
        db.query('''INSERT INTO images (url, img_url, img_file)
                    VALUES (:url, :img_url, :img_file)''',
                 url=url, img_url=img_url, img_file=img_file)
    except IntegrityError as ie:
        pass

def set_visited(url, html_file):
    db.query('''UPDATE pages
                SET visited_at=CURRENT_TIMESTAMP, html_file=:html_file
                WHERE url=:url''', url=url, html_file=html_file)

def get_random_unvisited_page():
    link = db.query('''SELECT * FROM pages
                       WHERE visited_at IS NULL
                       ORDER BY RANDOM() LIMIT 1''').first()
    return None if link is None else link.url

def should_visit(link_url):
    link_url = urldefrag(link_url)[0]
    if not link_url.startswith(base_url):
        return None
    return link_url

def visit(url):
    print('Now visiting:', url)
    html = requests.get(url).text
    html_soup = BeautifulSoup(html, 'html.parser')
    # Store a-tag links
    for link in html_soup.find_all("a"):
        link_url = link.get('href')
        if link_url is None:
            continue
        link_url = urljoin(base_url, link_url)
        store_link(url, link_url)
        full_url = should_visit(link_url)
        if full_url:
            # Queue for crawling
            store_page(full_url)
    # Store img-src files
    for img in html_soup.find_all("img"):
        img_url = img.get('src')
        if img_url is None:
            continue
        img_url = urljoin(base_url, img_url)
        filename = url_to_file_name(img_url)
        download(img_url, filename)
        store_image(url, img_url, filename)
    # Store the HTML contents
    filename = url_to_file_name(url) + '.html'
    fullname = os.path.join(file_store, filename)
    with open(fullname, 'w', encoding='utf-8') as the_html:
        the_html.write(html)
    set_visited(url, filename)

store_page(base_url)
url_to_visit = get_random_unvisited_page()
while url_to_visit is not None:
    visit(url_to_visit)
    url_to_visit = get_random_unvisited_page()

 

There is quite a lot going on here. First, three tables are created:

  • A “pages” table to keep track of URLs, whether we already visited them or not (just like in the example before), but now also a “html_file” field referencing a file containing the HTML contents of the page;
  • A “links” table to keep track of links found on pages;
  • An “images” table to keep track of downloaded images.

 

For every page we visit, we extract all the “<a>” tags and store their links, and we use the should_visit function we defined to determine whether the link should also be queued in the “pages” table for crawling.

 

Next, we inspect all “<img>” tags and download the images they refer to to disk. A url_to_file_name function is defined to make sure that we end up with proper filenames, using a regular expression.

 

Finally, the HTML contents of the visited page are saved to disk as well, with the filename stored in the database. If you let this script run for a while, you’ll end up with a large collection of HTML files and downloaded images.

 

There are certainly many other modifications you could make to this setup. For instance, it might not be necessary to store links and images during the crawling of pages, since we could gather this information from the saved HTML files as well.

 

On the other hand, you might want to take a “save as much as possible, as early as possible” approach, and include other media elements in your download list as well.

 

What should be clear is that web crawling scripts come in many forms and sizes, without there being a “one size fits all” to tackle every project. Hence, when working on a web crawler, carefully consider all design angles and how you’re going to implement them in your setup.

 

Going Further  To mention some other modifications that are still possible to explore: you might, for instance, also wish to exclude “special” or “talk” pages, or pages that refer to a file, from being crawled.

 

Also, keep in mind that our examples above do not account for the fact that information gathered from pages might become “stale.” That is, once a page has been crawled, we’ll never consider it again in our script.

 

Try thinking about a way to change this (hint: the “visited_at” timestamp could be used to determine whether it’s time to visit a page again, for instance). How would you balance exploring new pages versus refreshing older ones? Which set of pages would get priority, and how?
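One possible sketch of such a policy, not part of the original example, would be to treat a page as due for a visit when it has never been visited or was last visited more than, say, seven days ago (the threshold is arbitrary):

import records

db = records.Database('sqlite:///wikipedia.db')

def get_random_page_to_revisit():
    # Pick a page that was never visited, or whose last visit is older than 7 days.
    link = db.query('''SELECT * FROM pages
                       WHERE visited_at IS NULL
                          OR visited_at < datetime('now', '-7 days')
                       ORDER BY RANDOM() LIMIT 1''').first()
    return None if link is None else link.url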

 

This concludes our journey through HTTP, requests, HTML, Beautiful Soup, JavaScript, Selenium, and crawling. The final blog contains some fully worked-out examples showing off some larger projects using real-life sites on the World Wide Web.

 

First, however, we’ll take a step back from the technicalities and discuss the managerial and legal aspects of web scraping.

 

Working with Forms and POST Requests


Websites provide a much better way to supply input and send it to a web server, one that you have no doubt already encountered: web forms.

 

Whether it is to provide a “newsletter sign up” form, a “buy ticket” form, or simply a “login” form, web forms are used to collect the appropriate data.

 

Web forms are shown in a web browser simply by including the appropriate tags in the HTML. That is, each web form on a page corresponds to a block of HTML code enclosed in “<form>” tags:

<form> [...]

</form>

 

Inside of these, there are a number of tags that represent the form fields themselves. Most of these are provided through an “<input>” tag, with the “type” attribute specifying what kind of field it should represent:

 

<input type="text"> for simple text fields;
<input type="password"> for password entry fields;
<input type="button"> for general-purpose buttons;
<input type="reset"> for a “reset” button (when clicked, the browser will reset all form values to their initial state, but this button
is rarely encountered these days);
<input type="submit"> for a “submit” button (more on this later);
<input type="checkbox"> for check boxes;
 <input type="radio"> for radio boxes;
<input type="hidden"> for hidden fields, which will not be shown to the user but can still contain a value.

 

Apart from these, you’ll also find pairs of tags being used (“<input>” does not come with a closing tag):

<button>...</button> as another way to define buttons;
<select>...</select> for drop-down lists. Within these, every choice is defined by using <option>...</option> tags;
<textarea>...</textarea> for larger text entry fields.

 

Navigate to http://www.webscraping.com/basicform/ to see a basic web form in action.

 

Take some time to inspect the corresponding HTML source using your web browser. You’ll notice that some HTML attributes seem to play a particular role here, that is, the “name” and “value” attributes for the form tags. As it turns out, these attributes will be used by your web browser once a web form is “submitted.”

 

To do so, try pressing the “Submit my information” button on the example web page (feel free to fill in some dummy information first). What do you see?

 

Notice that upon submitting a form, your browser fires off a new HTTP request and includes the entered information in that request.

 

For this simple form, a plain HTTP GET request is used, basically converting the fields in the form to key-value URL parameters.
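Replicating such a submission with requests is then just a matter of passing the form fields through the params argument. A minimal sketch (the field names are assumptions based on the example form):

import requests

url = 'http://www.webscraping.com/basicform/'
r = requests.get(url, params={'name': 'Seppe', 'gender': 'M'})
print(r.request.url)   # the form fields end up as URL parameters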

 

Submitting Forms 


A “submit” button is not the only way that forms information can be submitted. Sometimes, forms consist of simply one text field, which will be submitted when pressing enter.

 

Alternatively, some extra piece of JavaScript can also be responsible for sending the form. Even when a “<form>” block does not expose any obvious way to submit itself, one could still instruct a web browser to submit the form anyway, through, for example, the console in Chrome’s Developer Tools.

 

By the way: it is perfectly fine to have multiple “<form>” blocks inside of the same web page (e.g., one for a “search” form and one for a “contact us” form), though only one form (and its enclosed information) will typically be submitted once a “submit” action is undertaken by the user.

 

This way of submitting a form matches pretty well with our discussion from a few paragraphs back: URL parameters are one way that input can be sent to a web server.

 

Instead of having to enter all the information manually in the URL (imagine what a horrible user experience that would be), we have now at least seen how web forms can make this process a little more user-friendly.

 

However, in cases where we have to submit a lot of information (try filling the “comments” form field on the example page with lots of text, for instance), URLs become unusable to submit information, due to their maximum length restriction.

 

Even if URLs were unbounded in length, they would still not provide a fully appropriate mechanism for submitting information.

 

What would happen, for instance, if you copy-pasted such a URL in an e-mail and someone else clicked on it, hence “sending” the information once again to the server? Or what happens if you accidentally refresh such a URL?

 

In cases where we are sending information to a web server and can expect that the submission will make a permanent change (such as deleting a user, changing your profile information, clicking “accept” for a bank transfer, and so on), it is perhaps not a good idea to simply allow such requests to come in as HTTP GET requests.

 

Luckily, the HTTP protocol also provides a number of different “methods” (or “verbs”) other than the GET method we’ve been working with so far.

 

More specifically, apart from GET requests, there is another type of request that your browser will often use when you wish to submit information to a web server: the POST request.

 

To introduce it, navigate to the page over at http://www.webscraping.com/platform/.

 

Notice that this page looks exactly the same as the one from above, except for one small difference in the HTML source: there’s an extra “method” attribute for the “<form>” tag now:

<form method="post"> [...]

</form>

 

The default value for the “method” attribute is “get”, basically instructing your browser that the contents of this particular form should be submitted through an HTTP GET request.

 

When set to “post”, however, your browser will be instructed to send the information through an HTTP POST request. Instead of including all the form information as URL parameters, the POST HTTP request includes this input as part of the HTTP request body instead.

 

Try this by pressing “Submit my information” while making sure that Chrome’s Developer Tools are monitoring network requests.

 

If you inspect the HTTP request, you’ll notice that the request method is now set to “POST” and that Chrome includes an extra piece of information called “Form Data” to show you which information was included as part of the HTTP request body; 

 

Finally, try navigating to http://www.webscraping.com/postform2/.

 

This page works in exactly the same way as the one before, though with one notable difference: when you submit the form, the server doesn’t send back the same contents as before but provides an overview of the information that was just submitted.

 

Since a web server is able to read the HTTP request line containing the HTTP request method and URL, it can dynamically generate a different response depending on the type of request that was received, as well as undertake different actions.

 

When a POST request comes in, for instance, the web server might decide to store the submitted information in a database before sending its reply.

 

Finally, there is one more thing that we need to mention regarding web forms. In all the previous examples, the data submitted through the form was sent to the web server in an HTTP request using the same URL as the page the form was on, though this is not necessarily always the case.

 

To indicate that a “submit” action should spawn a request to a different URL, the “action” attribute can be used as follows:

<form action="search.html">

<!-- E.g. a search form -->

<input type="text" name="query">

<input type="submit" value="Search!">

</form>

This form snippet can be included on various pages, each with its own URL, though submitting the form will always lead to a page such as, for example, “search.html?query=Test”.

 

Apart from GET and POST, there are some other HTTP methods to discuss, though these are far less commonly used, at least when surfing the web through a web browser. We’ll take a look at these later; let us first summarize the two HTTP methods we know so far:

 

  • GET: Used when typing in a URL in the address bar and pressing enter, clicking a link, or submitting a GET form. Here, the assumption is that the same request can be executed multiple times without “damaging” the user’s experience.

 

  • For instance, it’s fine to refresh the URL “search.html?query=Test”, but perhaps not the URL “doMoneyTransfer?to=Bart&from=Seppe&amount=100”.

 

  • POST: Used when submitting a POST form. Here, the assumption is that this request will lead to an action being undertaken that should not be executed multiple times.

 

  • Most browsers will actually warn you if you try refreshing a page that resulted from a POST request (“you’ll be resubmitting the same information again — are you sure this is what you want to do?”).

 

Before moving on to other HTTP request methods, let’s see how we can execute POST requests using Python.

 

In case a web form uses a GET request to submit information, we’ve already seen how you can handle this use case simply by using the requests.get method with the params argument to embed the information as URL parameters.

 

For a POST request, we just need to use a new method (requests.post) and a new data argument:

import requests

url = 'http://www.webscraping.com/postform2/'
# First perform a GET request
r = requests.get(url)
# Followed by a POST request
formdata = {
    'name': 'Seppe',
    'gender': 'M',
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': ''
}
r = requests.post(url, data=formdata)
print(r.text)

Just like params, the data argument is supplied as a Python dictionary object representing name-value pairs. Take some time to play around with this URL in your web browser to see how data for the various input elements is actually submitted.

 

In particular, note that the radio buttons all bear the same “name” attribute to indicate that they belong to the same selection group, though their values (“M”, “F”, and “N”) differ. If no selection is made at all, this element will not be included in the submitted form data.

 

For the checkboxes, note that they all have a different name, but that all their values are the same (“like”). If a checkbox is not selected, a web browser will not include the name-value pair in the submitted form data.

 

Hence, some web servers don’t even bother to check the value of such fields to determine whether they were selected, but just whether the name is present in the submitted data, though this can differ from web server to web server.

 

Duplicate Names  Some websites, such as those built using the PHP language, also allow us to define a series of checkbox elements with the same name attribute value.

 

For sites built with PHP, you’ll see these names ending with “[]”, as in “food[]”. This signposts to the web server that the incoming values should be treated as an array.

 

On the HTTP side of things, this means that the same field name will end up multiple times in the request body. The same intricacy is present for URL parameters: technically, there is nothing preventing you from specifying multiple parameters with the same name and different values.

 

The way they’ll be treated differs from server to server, though you might be wondering how we’d handle such a use case in requests, as both params and data are dictionary objects, which cannot contain the same key twice. To work around this, both the data and params arguments also accept a list of “(name, value)” tuples to handle this issue.
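For instance (with a hypothetical URL and field names):

import requests

# Repeated field names passed as a list of (name, value) tuples.
url = 'http://www.example.com/form'
formdata = [('food[]', 'pizza'), ('food[]', 'fries'), ('name', 'Seppe')]
r = requests.post(url, data=formdata)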

 

For the “comments” field, note that it is included in the submitted form data even when nothing was filled in, in which case an empty value will be submitted. Again, in some cases, you might as well leave out the name-value pair altogether, but it is up to the web server to determine how “picky” it wants to be.

 


 

Pickiness


Even submit buttons can be named, in which case they’ll also end up in the submitted form data (with their value being equal to their “value” HTML attribute).

 

Sometimes, web servers will use this information to determine which button was clicked in case more than one submit button is present in the same form and, in other cases, the name-value pair will end up being ignored.

 

Just as was the case with URL parameters, web servers can differ widely in terms of how strict or flexible they are when interpreting requests.

 

Nothing is stopping you from also putting in other name-value pairs when submitting a POST request using requests, but again, the results might vary from site to site. Trying to match as closely as possible what’s happening in the browser is generally a good recommendation.

 

Note that in our example above, we’re still “polite” in the sense that we first execute a normal GET request before sending the information through a POST, though this is not even required.

 

We could simply comment out the requests.get line and still submit our information. In some cases, however, web pages will be “smart” enough to prevent you from doing so.

 

Navigate to http://www.webscraping.com/postform3/ in your browser and submit the form; it behaves just like before. Now try the same again, but wait a minute or two before pressing “Submit my information.” The web page will inform you that “You waited too long to submit this information.” Let’s try submitting this form using requests:

import requests

url = 'http://www.webscraping.com/postform3/'
# No GET request needed?
formdata = {
    'name': 'Seppe',
    'gender': 'M',
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': ''
}
r = requests.post(url, data=formdata)
print(r.text)

 

# Will show: Are you trying to submit information from somewhere else?

This is strange: How does the web server know that we’re trying to perform a POST request from Python in this case? The answer lies in one additional form element that is now present in the HTML source code (your value might differ):

 

<input type="hidden" name="protection" value="2c17abf5d5b4e326bea802600ff88405">

 

As can be seen, this form incorporates a new hidden field that will be submitted with the rest of the form data, conveniently named “protection.” How about including it directly in our Python source code as follows:

import requests

url = 'http://www.webscraping.com/postform3/'
formdata = {
    'name': 'Seppe',
    'gender': 'M',
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': '',
    'protection': '2c17abf5d5b4e326bea802600ff88405'
}
r = requests.post(url, data=formdata)
print(r.text)
# Will show: You waited too long to submit this information. Try <a href="./">again</a>.

 

Assuming you waited a minute before running this piece of code, the web server will now reply with a message indicating that it doesn’t want to handle this request. Indeed, we can confirm (using our browser), that the “protection” field appears to change every time we refresh the page, seemingly randomly.

 

To work our way around this, we have no alternative but to first fetch the form’s HTML source using a GET request, get the value of the “protection” field, and then use that value in the subsequent POST request. By bringing in Beautiful Soup again, this can easily be done:

import requests
from bs4 import BeautifulSoup

url = 'http://www.webscraping.com/postform3/'
# First perform a GET request
r = requests.get(url)
# Get out the value for protection
html_soup = BeautifulSoup(r.text, 'html.parser')
p_val = html_soup.find('input', attrs={'name': 'protection'}).get('value')
# Then use it in a POST request
formdata = {
    'name': 'Seppe',
    'gender': 'M',
    'pizza': 'like',
    'haircolor': 'brown',
    'comments': '',
    'protection': p_val
}
r = requests.post(url, data=formdata)
print(r.text)

The example above illustrates a protective measure that you will encounter from time to time in real-life situations.

 

Website administrators do not necessarily include such extra measures as a means to prevent web scraping (it makes a scraper’s life a bit harder, but we’ve seen how we can work around it), but mainly for reasons of safety and improving the user experience.

 

For instance, to prevent the same information from being submitted twice (using the same “protection” value), or to prevent attacks where users are tricked into visiting a certain web page on which a piece of JavaScript code will try to perform a POST request to another site, for instance to initiate a money transfer or to obtain sensitive information.

 

Secure websites will hence often include such additional checks on their pages.

 

View State  There is one web server technology stack that is pretty famous for using such fields: Microsoft’s ASP and ASP.NET.

 

Websites built using this technology will, in the majority of cases, include a hidden input element in all forms with the name set to a cryptic “__VIEWSTATE”, together with an encrypted value that can be very long.

 

Not including this form element when trying to perform POST requests to sites built using this stack will lead to results not showing what you’d expect, and is oftentimes the first real-life annoyance web scrapers run into. The solution is simple: just include these fields in your POST requests.

 

Note that the resulting page sent in the HTTP reply might again contain such a “__VIEWSTATE” element, so you’ll have to make sure to fetch its value again and again, and include it in every subsequent POST request.

 

There are a few more things worth mentioning before we can wrap up this section. First, you’ll no doubt have noticed that we can now use a params and data argument, which look very similar.

 

If GET requests use URL parameters, and POST requests send data as part of the HTTP request body, why do we need two separate arguments when we can already indicate the type of request by using either the requests.get or requests.post method?

 

The answer lies in the fact that it is perfectly fine for an HTTP POST request to include both a request URL with parameters, as well as a request body containing form data. Hence, if you encounter the “<form>” tag definition in a page’s source code:

<form action="submit.html?type=student" method="post"> [...]
</form>
You’ll have to write the following in Python:
r = requests.post(url, params={'type': 'student'}, data=formdata)

 

You might also be wondering what would happen if we’d try to include the same information both in the URL parameters and in the form data:

import requests

url = 'http://www.webscraping.com/postform2/'
paramdata = {'name': 'Totally Not Seppe'}
formdata = {'name': 'Seppe'}
r = requests.post(url, params=paramdata, data=formdata)
print(r.text)

 

This particular web page will simply ignore the URL parameter and take into account the form data instead, but this is not necessarily always the case.

 

Also, even though a “<form>” specifies “post” as its “method” attribute, there might be rare cases where you can just as well submit the same information as URL parameters using a simple GET request.

 

These situations are rare, but they can happen. Nevertheless, the best recommendation is to stick as close as you can to the behavior that you observe by using the page normally.

 

Finally, there is one type of form element we haven’t discussed before. Sometimes, you will encounter forms that allow you to upload files from your local machine to a web server:

<form action="upload.php" method="post" enctype="multipart/form-data">
<input type="file" name="profile_picture">
<input type="submit" value="Upload your profile picture">
</form>

 

Note the “file” input element in this source snippet, as well as the “enctype” attribute now present in the “<form>” tag. To understand what this attribute means, we need to talk a little bit about form encoding.

 

Put simply, web forms will first “encode” the information contained in the form before embedding it in the HTTP POST request body.

 

Currently, the HTML standard foresees three ways in which this encoding can be done (which will end up as values for the “Content-Type” request header):

 

application/x-www-form-urlencoded (the default): here, the request body is formatted similarly to what we’ve seen with URL parameters, hence the name “urlencoded”, that is, using ampersands (“&”) and equals signs (“=”) to separate data fields and name-value parts.

 

Just as in URLs, certain characters should be encoded in a specific way, which requests will do automatically for us.

 

text/plain: introduced by HTML 5 and generally only used for debugging purposes and hence extremely rare in real-life situations.

 

multipart/form-data: this encoding method is significantly more complicated, but it allows us to include a file’s contents in the request body, which might be binary, non-textual data, hence the need for a separate encoding mechanism.

 

As an example, consider an HTTP POST request with some request data included:

POST /postform2/ HTTP/1.1
Host: www.webscraping.com
Content-Type: application/x-www-form-urlencoded
[... Other headers]

name=Seppe&gender=M&pizza=like

Now consider an HTTP POST request with the request data encoded using “multipart/form-data”:

POST /postform2/ HTTP/1.1
Host: www.webscraping.com
Content-Type: multipart/form-data; boundary=BOUNDARY
[... Other headers]

--BOUNDARY
Content-Disposition: form-data; name="name"

Seppe
--BOUNDARY
Content-Disposition: form-data; name="gender"

M
--BOUNDARY
Content-Disposition: form-data; name="pizza"

like
--BOUNDARY
Content-Disposition: form-data; name="profile_picture"; filename="me.jpg"
Content-Type: application/octet-stream

[... binary contents of me.jpg]

 

Definitely, the request body here looks more complicated, though we can see where the “multipart” moniker comes from: the request data is split up in multiple parts using a “boundary” string, which is determined (randomly, in most cases) by the request invoker. Luckily for us, we don’t need to care too much about this when using requests.

 

To upload a file, we simply use another argument, named files (which can be used together with the data argument):

import requests

url = 'http://www.webscraping.com/postform2/'
formdata = {'name': 'Seppe'}
filedata = {'profile_picture': open('me.jpg', 'rb')}
r = requests.post(url, data=formdata, files=filedata)

The library will take care of setting the appropriate headers in the POST request (including picking a boundary) as well as encoding the request body correctly for you.

 

Multiple Files


For forms where multiple files can be uploaded, you’ll mostly find that they utilize multiple “<input>” tags, each with a different name.

 

Submitting multiple files then boils down to putting more key-value pairs in the files argument dictionary. The HTML standard also foresees a way to provide multiple files through one element only, using the “multiple” HTML attribute.

 

To handle this in requests, you can pass a list to the files argument, with each element being a tuple with two entries: the form field name (which can then appear multiple times throughout the list) and the file info, itself a tuple containing the open call and other information about the file being sent.
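A short sketch of what that looks like (the URL, field name, and filenames here are hypothetical):

import requests

# Several files uploaded under the same field name via a list of tuples.
url = 'http://www.example.com/upload'
files = [
    ('images', ('first.jpg', open('first.jpg', 'rb'), 'image/jpeg')),
    ('images', ('second.jpg', open('second.jpg', 'rb'), 'image/jpeg')),
]
r = requests.post(url, files=files)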

 

More information on this can be found under “POST Multiple Multipart-Encoded Files” in the requests documentation, though it is rather rare to encounter such upload forms in practice.

 

Other HTTP Request Methods


Now that we’ve seen how HTTP GET and POST requests work, we can take a brief moment to discuss the other HTTP methods that exist in the standard:

 

GET:

the GET method requests a representation of the specified URL. Requests using GET should only retrieve data and should have no other effect, such as saving or changing user information or performing other actions.

 

In other words, GET requests should be “idempotent,” meaning that it should be safe to execute the same request multiple times.

 

Keep in mind that URL parameters can be included in the request URL (this is also the case for any other HTTP method). Although GET requests can technically include an optional request body as well, this is not recommended by the HTTP standard.

 

As such, a web browser doesn’t include anything in the request body when it performs GET requests, and neither do most (if not all) APIs.

 

POST:

the POST method indicates that data is being submitted as part of a request to a particular URL, for example, a forum message, a file upload, a filled-in form, and so on.

 

Contrary to GET, POST requests are not expected to be idempotent, meaning that submitting a POST request can bring about changes on the web server’s end of things.

 

For example, updating your profile, confirming a money transaction, a purchase, and so on. A POST request encodes the submitted data as part of the request body.

 

HEAD:

the HEAD method requests a response just like the GET request does, but it indicates to the web server that it does not need to send the response body. This is useful in case you only want the response headers and not the actual response contents. HEAD requests cannot have a request body.

 

PUT:

the PUT method requests that the submitted data should be stored under the supplied request URL, thereby creating it if it does not exist already. Just as with a POST, PUT requests have a request body.

 

DELETE:

the DELETE method requests that the data listed under the request URL should be removed. The DELETE request does not have a request body.

 

CONNECT, OPTIONS, TRACE, and PATCH:

these are some less-commonly encountered request methods. CONNECT is generally used to request a web server to set up a direct TCP network connection between the client and the destination (web proxy servers will use this type of request).

 

TRACE instructs the web server to just send the request back to the client (used for debugging to see if a middleman in the connection has changed your request somewhere in-between).

 

OPTIONS requests the web server to list the HTTP methods it accepts for a particular URL (which might seem helpful, though it is rarely used). PATCH, finally, allows us to request a partial modification of a specific resource.

 

Based on the above, it might seem that the set of HTTP methods corresponds pretty well to the basic set of SQL (Structured Query Language) commands, used to query and update information in relational databases.

 

That is, GET to “SELECT” a resource given a URL, POST to “UPDATE” it, PUT to “UPSERT” it (“UPDATE” or “INSERT” if it does not exist), and DELETE to “DELETE” it. This being said, this is not how web browsers work.

 

We’ve seen above that most web browsers will work using GET and POST requests only.

 

That means that if you create a new profile on a social network site, for instance, your form data will simply be submitted through a POST request, and not a PUT. If you change your profile later on, another POST request is used. Even if you want to delete your profile, this action will be requested through a POST request.

 

This doesn’t mean, however, that requests don’t support these methods.

Apart from requests.get and requests.post, you can also use the requests.head, requests.put, requests.delete, requests.patch, and requests.options methods.
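As a quick example of one of these (using the example domain from earlier; the exact headers returned depend on the server):

import requests

# A HEAD request returns only status and headers, no response body.
r = requests.head('http://www.webscraping.com/')
print(r.status_code)
print(r.headers.get('Content-Type'))
print(len(r.content))   # 0: HEAD responses carry no body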

 

A Word About APIs  Even though web browsers might stick to GET and POST requests only, there is a variety of networking protocols that layer themselves on top of HTTP and do use the other request methods as well.

 

Also, you’ll find that many modern APIs, as offered by Facebook, Twitter, LinkedIn, and so on, also expose their functionality through HTTP and might use other HTTP request methods as well, a practice commonly referred to as REST (Representational State Transfer).

 

It is then helpful to know that you can just use requests to access these as well. The difference between web scraping and using an API hence lies mainly in how structured the requests and replies are. With an API, you’ll get back content in a structured format (such as XML or JSON), which can easily be parsed by computer programs.

 

On the “regular” web, content is returned mainly as HTML-formatted text. This is nice for human readers to work with once a web browser is done with it, but not very convenient for computer programs; hence the need for something like Beautiful Soup.
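As a small sketch of the difference (the API URL and parameters below are placeholders for whatever API you are working with):

import requests

# An HTTP-based (REST-style) API returns structured JSON that requests can
# parse directly, with no HTML parsing needed.
r = requests.get('https://api.example.com/v1/items', params={'q': 'web scraping'})
data = r.json()   # a Python dict or list
print(data)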

 

Note, however, that not all APIs are built on top of HTTP; some of them use other protocols, such as SOAP (Simple Object Access Protocol), which then requires another set of libraries to access them.

 

More on Headers

Now that we’re finished with an overview of HTTP request methods, it’s time to take a closer look at another part of HTTP and how it comes into play when web scraping sites: the request headers.

 

Up until now, we’ve been relying on requests to construct and send these headers for us. There are various cases, however, where we’ll have to modify them ourselves.

 

Let’s get started right away with the following example:

import requests

url = 'http://www.webscraping.com/usercheck/'
r = requests.get(url)
print(r.text)
# Shows: It seems you are using a scraper
print(r.request.headers)

Note that the website responds with “It seems you are using a scraper.” How does it know? When we open the same page in a normal browser, we see “Welcome, normal user,” instead. The answer lies in the request headers that the requests library is sending:

{
    'User-Agent': 'python-requests/2.18.4',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
    'Connection': 'keep-alive'
}

The requests library tries to be polite and includes a “User-Agent” header to announce itself. Of course, websites that want to keep scrapers out can build in a simple check that blocks particular user agents from accessing their contents.

 

As such, we’ll have to modify our request headers to “blend in,” so to speak. In requests, sending custom headers is easily done through yet another argument: headers:

import requests

url = 'http://www.webscraping.com/usercheck/'
my_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
                  'AppleWebKit/537.36 (KHTML, like Gecko) ' +
                  'Chrome/61.0.3163.100 Safari/537.36'
}
r = requests.get(url, headers=my_headers)
print(r.text)
print(r.request.headers)

 

This works. Note that the headers argument does not completely overwrite the default headers, but updates them instead, keeping the default entries around as well.

 

Apart from the “User-Agent” header, there is another header that deserves special mention: the “Referer” header (originally a misspelling of referrer and kept that way since then). Browsers will include this header to indicate the URL of the web page that linked to the URL being requested.

 

Some websites will check this header to prevent “deep links” from working. To test this out, navigate to http://www.webscraping.com/referercheck/ in your browser and click the “secret page” link.

 

You’ll be linked to another page (http://www.webscraping.com/referercheck/secret.php) containing the text “This is a totally secret page.”

 

Now try opening this URL directly in a new browser tab. You’ll see a message “Sorry, you seem to come from another web page” instead. The same happens in requests:

import requests

url = 'http://www.webscraping.com/referercheck/secret.php'
r = requests.get(url)
print(r.text)
# Shows: Sorry, you seem to come from another web page

 

Try inspecting the requests your browser is making using your browser’s developer tools and see if you can spot the “Referer” header being sent. You’ll note that it says “http://www.webscraping.com/referercheck/” for the GET request to the secret page.

 

When linked to from another website, or opened in a fresh tab, this referrer field will be different or not included in the request headers.

 

Especially sites hosting image galleries will often resort to this tactic to prevent images from being embedded directly in other web pages (they’d like the images to be visible only from their own site, and want to avoid paying hosting costs for other pages that use them). When encountering such checks in requests, we can simply spoof the “Referer” header as well:

import requests

url = 'http://www.webscraping.com/referercheck/secret.php'
my_headers = {
    'Referer': 'http://www.webscraping.com/referercheck/'
}
r = requests.get(url, headers=my_headers)
print(r.text)

 

Just as we’ve seen on various occasions before, remember that web servers can get very picky in terms of headers that are being sent as well. Rare edge cases such as the order of headers, multiple header lines with the same header name, or custom headers being included in requests can all occur in real-life situations.

 

If you see that requests are not returning the results you expect and have observed when using the site in your browser, inspect the headers through the developer tools to see exactly what is going on and duplicate it as well as possible in Python.

 

Duplicate Request and Response Headers  Just like the data and params arguments, headers can accept an OrderedDict object in case the ordering of the headers is important. Passing a list, however, is not permitted here, as the HTTP standard does not allow multiple request header lines bearing the same name.

 

What is allowed is to provide multiple values for the same header by separating them with a comma, as in the line “Accept-Encoding: gzip, deflate”. In that case, you can just pass the value as is with requests.

 

That said, some extremely weird websites or apps might still use a setup that deviates from the standard and checks for the same header on multiple lines in the request.

 

In that case, you’ll have no choice but to implement a hack to extend requests. Note that response headers can contain multiple lines with the same name; requests will automatically join them using a comma and put them under one entry when you access r.headers.
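
As a rough sketch of both points, the snippet below sends its headers in a fixed order using an OrderedDict (as mentioned above) and then inspects the joined, case-insensitive response headers; the usercheck URL from before is reused purely for illustration:

import requests
from collections import OrderedDict

# Reusing the earlier example URL purely for illustration
url = 'http://www.webscraping.com/usercheck/'

# An OrderedDict can be used when the order of the outgoing headers matters
my_headers = OrderedDict([
    ('User-Agent', 'Mozilla/5.0'),
    ('Accept-Encoding', 'gzip, deflate'),
])

r = requests.get(url, headers=my_headers)

# Response headers end up in a single, case-insensitive dictionary;
# multiple response header lines with the same name are joined with a comma
print(r.headers)
print(r.headers.get('content-type'))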

 

Finally, we should also take a closer look at the HTTP reply headers, starting first with the different HTTP response status codes. In most cases, the status code will be 200, the standard response for successful requests. The complete range of status codes can be categorized as follows:

 

1XX: informational status codes, indicating that a request was received and understood, but where the server indicates that the client should wait for a further response. These are rarely encountered on the normal web.

 

2XX: success status codes, indicating that a request was received and understood, and processed successfully. The most prevalent status code here is 200 (“OK”), though 204 (“No Content” — indicating that the server will not return any content) and 206 (“Partial Content” — indicating that the server is delivering only a part of a resource, such as a video fragment) are sometimes used as well.

 

3XX: redirection status codes, indicating that the client must take additional action to complete the request, most often by performing a new request where the actual content can be found.

 

301 (“Moved Permanently”), for instance, indicates that this and all future requests should be directed to the given URL instead of the one used, 302 (“Found”) and 303 (“See Other”) indicate that the response to the request can be found under another URL.

 

304 (“Not Modified”) is used to indicate that the resource has not been modified since the version specified by the web browser in its cache-related headers and that the browser can just reuse its previously downloaded copy.

 

307 (“Temporary Redirect”) and 308 (“Permanent Redirect”) indicate that a request should be repeated with another URL, either temporarily or permanently. More on redirects and working with them using requests later on.

 

4XX: client error status codes, indicating that an error occurred that was caused by the requester.

The most commonly known status code here is 404 (“Not Found”), indicating that a requested resource could not be found but might become available later on. 410 (“Gone”) indicates that a requested resource was available once but will not be available any longer.

 

400 (“Bad Request”) indicates that the HTTP request was formatted incorrectly, 401 (“Unauthorized”) indicates that the requested resource is not available without proper authentication, whereas 403 (“Forbidden”) indicates that the request is valid, including its authentication, but that the user does not have the necessary permissions to access this resource.

 

405 (“Method Not Allowed”) should be used to indicate that an incorrect HTTP request method was used. There’s also 402 (“Payment Required”), 429 (“Too Many Requests”), and even 451 (“Unavailable For Legal Reasons”) defined in the standard, though these are less commonly used.

 

5XX: server error status codes, indicating that the request appears valid, but that the server failed to process it. 500 (“Internal Server Error”) is the most generic and most widely encountered status code in this set, indicating that there’s perhaps a bug present in the server code or something else went wrong.

 

Who Needs Standards?  Even though there are a lot of status codes available to tackle a variety of different outcomes and situations, most web servers will not be too granular or specific in using them. It’s hence not uncommon to get a 500 status code where a 400, 403, or 405 would have been more appropriate;

 

or to get a 404 result code even when the page existed before and a 410 might be better. Also, the different 3XX status codes are sometimes used interchangeably. As such, it’s best not to overthink the definitions of the status codes and just see what a particular server is replying instead.
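
In practice, you’ll hence mostly just check which status code a server returns and act accordingly. The short sketch below shows the relevant attributes in requests, reusing the redirect example URL purely for illustration:

import requests

url = 'http://www.webscraping.com/redirect/'
r = requests.get(url)

print(r.status_code)   # The numeric status code, e.g. 200
print(r.reason)        # The accompanying reason phrase, e.g. 'OK'

# raise_for_status() raises an HTTPError for 4XX and 5XX replies,
# which is a convenient catch-all check in scraping scripts
try:
    r.raise_for_status()
except requests.exceptions.HTTPError as error:
    print('Request failed:', error)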

 

From the above listing, there are two topics that warrant a closer look: redirection and authentication. Let’s take a closer look at redirection first. Open the page http://www.webscraping.com/redirect/ in your browser. You’ll see that you’re immediately sent to another page (“destination.php”).

 

Now do the same again while inspecting the network requests in your browser’s developer tools (in Chrome, you should enable the “Preserve log” option to prevent Chrome from cleaning the log after the redirect happens). Note how two requests are being made by your browser: the first to the original URL, which now returns a 302 status code.

 

This status code instructs your browser to perform a second request to the “destination.php” URL. How does the browser know what the URL should be? By inspecting the original URL’s response, you’ll note that there is now a “Location” response header present, which contains the URL to be redirected to.

 

Note that we’ve also included another header in the HTTP response here: “SECRET-CODE,” which we’ll use in the Python examples later on. First, let’s see how requests deal with redirection:

import requests

url = 'http://www.webscraping.com/redirect/'
r = requests.get(url)

print(r.text)
print(r.headers)

 

Note that we get the HTTP reply corresponding with the final destination (“you’ve been redirected here from another page!”). In most cases, this default behavior is quite helpful: requests is smart enough to “follow” redirects on its own when it receives 3XX status codes. But what if this is not what we want?

 

What if we’d like to get the contents of the original page? This isn’t shown in the browser either, but there might be relevant response content present. What if we want to inspect the contents of the “Location” and “SECRET-CODE” headers ourselves?

 

To do so, you can simply turn off requests’ default behavior of following redirects through the allow_redirects argument:

import requests
url = 'http://www.webscraping.com/redirect/'
r = requests.get(url, allow_redirects=False)
print(r.text)
print(r.headers)
Which will now show:
You will be redirected... bye bye!
{'Date': 'Fri, 13 Oct 2017 13:00:12 GMT',
 'Server': 'Apache/2.4.18 (Ubuntu)',
 'SECRET-CODE': '1234',
 'Location': 'http://www.webscraping.com/redirect/destination.php',
 'Content-Length': '34',
 'Keep-Alive': 'timeout=5, max=100',
 'Connection': 'Keep-Alive',
 'Content-Type': 'text/html; charset=UTF-8'}

 

There aren’t many situations where you’ll need to turn off redirect following, though it might be necessary in cases where you first wish to fetch the response headers (such as “SECRET-CODE” here) before moving on. You’ll then have to retrieve the “Location” header manually to perform the next requests.get call.
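
As a minimal sketch of this manual approach, the fragment below fetches the original reply, reads out the “SECRET-CODE” and “Location” headers, and only then follows the redirect by hand:

import requests

url = 'http://www.webscraping.com/redirect/'

# Fetch the original reply without following the redirect
r = requests.get(url, allow_redirects=False)

# Read out the headers we're interested in first
print('Secret code:', r.headers.get('SECRET-CODE'))
next_url = r.headers.get('Location')

# Then follow the redirect manually
if next_url:
    r = requests.get(next_url)
    print(r.text)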

 

Redirects

Redirects using 3XX status codes are often used by websites, for instance, in the HTTP response following a POST request, after the submitted data has been processed, to send the browser to a confirmation page (for which it can then use a GET request).

 

This is another measure taken to prevent users from submitting the same POST request twice in a row. Note that 3XX status codes are not the only way that browsers can be sent to another location.

 

Redirect instructions can also be provided by means of a “<meta>” tag in an HTML document, which can include an optional timeout (web pages like these will often show something of the form “You’ll be redirected after 5 seconds”), or through a piece of JavaScript code, which can fire off navigation instructions as well.
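
Requests will only follow HTTP-level (3XX) redirects for you, so meta- or JavaScript-based redirects have to be detected in the page contents yourself. The rough sketch below handles the meta tag case; the regular expression assumes a simple, double-quoted content attribute, and the URL is just a placeholder:

import re
import requests

# Placeholder URL for a page that uses a meta refresh redirect
url = 'http://www.example.com/meta-redirect-page'

r = requests.get(url)

# A meta refresh typically looks like:
#   <meta http-equiv="refresh" content="5; url=http://www.example.com/target">
match = re.search(r'http-equiv="refresh"\s+content="\d+;\s*url=([^"]+)"',
                  r.text, re.IGNORECASE)

if match:
    target = match.group(1)
    print('Page redirects to:', target)
    r = requests.get(target)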

 

Finally, let’s take a closer look at the 401 (“Unauthorized”) status code, which seems to indicate that HTTP provides some sort of authentication mechanism. Indeed, the HTTP standard includes a number of authentication mechanisms, one of which can be seen by accessing the URL http://www.webscraping.com/authentication/.

 

You’ll note that this site requests a username and password through your browser. If you press “Cancel,” you’ll note that the website responds with a 401 (“Unauthorized”) result.

 

Try refreshing the page and entering any username and password combination. The server will now respond with a normal 200 (“OK”) reply.

 

What actually goes on here is the following:

  • Your browser performs a normal GET request to the page, and no authentication information is included.
  • The website responds with a 401 reply and a “WWW-Authenticate” header.

 

  • Your browser will take this as an opportunity to ask for a username and password. If “Cancel” is pressed, the 401 response is shown at this point.

 

  • If the user provides a username and password, your browser will perform an additional GET request with an “Authorization” header included, containing the username and password in encoded form (Base64, which is an encoding rather than strong encryption).

 

The web server then checks this request, for example, to verify the username and password that were sent. If everything looks good, the server replies with a 200 page. Otherwise, a 403 (“Forbidden”) is sent (if the password was incorrect, for instance, or the user doesn’t have access to this page).

 

In requests, we could perform a request with basic authentication by including an “Authorization” header ourselves, though we’d still need to figure out how to encode the username and password. Instead of doing this ourselves, requests provides another means to do so, using the auth argument:

import requests

url = 'http://www.webscraping.com/authentication/'
r = requests.get(url, auth=('myusername', 'mypassword'))
print(r.text)

print(r.request.headers)
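
If you’re curious what the auth argument does behind the scenes for basic authentication, the sketch below builds the “Authorization” header by hand: the value is simply “Basic ” followed by the Base64-encoded “username:password” string, an encoding rather than encryption:

import base64
import requests

url = 'http://www.webscraping.com/authentication/'

# Basic authentication: 'Basic ' + base64("username:password")
token = base64.b64encode(b'myusername:mypassword').decode('ascii')
my_headers = {'Authorization': 'Basic ' + token}

r = requests.get(url, headers=my_headers)
print(r.text)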

 

Apart from this basic authentication mechanism, which is pretty insecure (and should only be used by websites in combination with HTTPS; otherwise your information is transmitted using an encoding that can be easily reversed), HTTP also supports other “schemes” such as the digest-based authentication mechanism, which requests supports as well.
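
If you ever run into a site using digest authentication, requests provides an HTTPDigestAuth helper that can be passed through the same auth argument. A minimal sketch (the URL below is just a placeholder, as our example site only uses basic authentication):

import requests
from requests.auth import HTTPDigestAuth

# Placeholder URL for a page protected with digest authentication
url = 'http://www.example.com/digest-protected/'

r = requests.get(url, auth=HTTPDigestAuth('myusername', 'mypassword'))
print(r.status_code)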

 

Although some older sites sometimes still use HTTP authentication, you won’t find this component of HTTP being used that often any longer. Most sites will prefer to handle their authentication using cookies instead, which we’ll deal with in the next section.

 

Dealing with Cookies


All things considered, HTTP is a rather simple networking protocol. It is text-based and follows a simple request-and-reply-based communication scheme.

 

In the simplest case, every request-reply cycle in HTTP involves setting up a fresh new underlying network connection as well, though the 1.1 version of the HTTP standard allows us to set up “keep alive” connections.

 

With keep-alive, a network connection is kept open for some period of time so that multiple request-reply HTTP messages can be exchanged over the same connection.

 

This simple request-reply-based approach poses some problems for websites, however. From a web server’s point of view, every incoming request is completely independent of any previous ones and can be handled on its own.

 

This is not, however, what users expect from most websites. Think for instance about an online shop where items can be added to a cart.

 

When visiting the checkout page, we expect the web server to “remember” the items we selected and added previously.

 

Similarly, when providing a username and password in a web form to access a protected page, the web server needs to have some mechanism to remember us, that is, to establish that an incoming HTTP request is related to a request that came in before.

 

In short, it didn’t take long after the introduction of HTTP before a need arose to add a state mechanism on top of it, or, in other words, to add the ability for HTTP servers to “remember” information over the duration of a user’s “session,” in which multiple pages can be visited.

 

Note that based on what we’ve seen above, we can already think of some ways to add such functionality to a website:

 

  • We could include a special identifier as a URL parameter that “links” multiple visits to the same user, for example, “checkout.html?visitor=20495”.

 

  • For POST requests, we could either use the same URL parameter or include the “session” identifier in a hidden form field.
  • Some older websites indeed use such mechanisms, though this comes with several drawbacks:

 

  • What happens if an unsuspecting user copies the link and pastes it in an e-mail? Another party opening this link would then be considered as the same user, and would be able to look through all of that user’s information.

 

  • What happens if we close and reopen our browser? We’d have to log in again and go through all steps again as we’re starting from a fresh session.

 

Linking Requests  Note that you might come up with other ways to link requests together as well. What about using the IP address (perhaps combined with the “User-Agent” header) of a visiting user?

 

Sadly, these approaches all come with similar security issues and drawbacks too. IP addresses can change, and it is possible for multiple computers to share the same public-facing IP address, meaning that all your office computers would appear as the same “user” to a web server.

 

To tackle this issue in a more robust way, two headers were standardized in HTTP in order to set and send “cookies,” small textual bits of information. The way this works is relatively straightforward. When sending an HTTP response, a web server can include “Set-Cookie” headers as follows:

HTTP/1.1 200 OK

Content-type: text/html

Set-Cookie: sessionToken=20495; Expires=Wed, 09 Jun 2021 10:10:10 GMT
Set-Cookie: siteTheme=dark

[...]

 

Note that the server is sending two headers here with the same name. Alternatively, the full header can be provided as a single line as well, where each cookie will be separated by a comma, as follows:

HTTP/1.1 200 OK
Content-type: text/html
Set-Cookie: sessionToken=20495; Expires=Wed, 09 Jun 2021 10:10:10 GMT, siteTheme=dark
[...]

 

Capitalize on It  Some web servers will also use “set-cookie” in all lowercase to send back cookie headers. The value of the “Set-Cookie” headers follows a well-defined standard:

 

  • A cookie name and cookie value are provided, separated by an equals sign, “=”. In the example above, for instance, the “sessionToken” cookie is set to “20495” and might be an identifier the server will use to recognize a subsequent page visit as belonging to the same session.

 

  • Another cookie, called “siteTheme,” is set to the value “dark,” and might be used to store a user’s preference regarding the site’s color theme.

 

  • Additional attributes can be specified, separated by a semicolon (“;”). In the example above, an “Expires” attribute is set for the “sessionToken,” indicating that a browser should store the cookie until the provided date.

 

  • Alternatively, a “Max-Age” attribute can be used to achieve a similar result. If neither of these is specified, the browser will remove the cookie once the browser window is closed.

 

  • Manual Deletion  Note that setting an “Expires” or “Max-Age” attribute should not be regarded as a strict instruction. Users are free to delete cookies manually, for instance, or might simply switch to another browser or device.

 

  • A “Domain” and “Path” attribute can be set as well to define the scope of the cookie. They essentially tell the browser what website the cookie belongs to and hence in which cases to include the cookie information in subsequent requests (more on this later on). 

 

Cookies can only be set on the current resource’s top domain and its subdomains, and not for another domain and its subdomains, as otherwise, websites would be able to control the cookies of other domains.

 

A “Secure” attribute can also be set, instructing the browser to only include the cookie in requests sent over HTTPS. Finally, the “HttpOnly” attribute directs browsers not to expose cookies through channels other than HTTP (and HTTPS) requests. This means that the cookie cannot be accessed through, for example, JavaScript.

 

Secure Sessions  Note that care needs to be taken when defining a value for a session-related cookie such as “sessionToken” above.

 

If this were set to an easy-to-guess value, like a user ID or e-mail address, it would be very easy for malicious actors to simply spoof the value, as we’ll see later on. Therefore, most session identifiers end up being constructed randomly in a hard-to-guess manner.

 

It is also good practice for websites to frequently expire session cookies or to replace them with a new session identifier from time to time, to prevent so-called “cookie hijacking”: stealing another user’s cookies to pretend that you’re them.

 

When a browser receives a “Set-Cookie” header, it will store its information in its memory and will include the cookie information in all following HTTP requests to the website (provided the “Domain,” “Path,” “Secure,” and “HttpOnly” checks pass). To do so, another header is used, this time in the HTTP request, simply named “Cookie”:

GET /anotherpage.html HTTP/1.1
Host: www.example.com
Cookie: sessionToken=20495; siteTheme=dark
[...]

 

Note that here, the cookie names and values are simply included in one header line, and are separated by a semicolon (“;”), not a comma as is the case for other multi-valued headers.

 

The web server is then able to parse these cookies on its end, and can then derive that this request belongs to the same session as a previous one, or do other things with the provided information (such as determine which color theme to use).
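
As a quick illustration of how requests exposes this information on the client side, the sketch below fetches a page and loops over the cookies the server sets; the login page used here will be introduced shortly, and any page that sets cookies would do:

import requests

# Any page that sets cookies will do; this login page is used later in this section
url = 'http://www.webscraping.com/cookielogin/'
r = requests.get(url)

# r.cookies is a RequestsCookieJar; iterating over it yields cookie objects
for cookie in r.cookies:
    print(cookie.name, '=', cookie.value,
          '| domain:', cookie.domain, '| path:', cookie.path)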

 

Evil Cookies


Cookies are an essential component for the modern web to work, but they have gotten a bad reputation over the past years, especially after the EU Cookie Directive was passed and cookies were mentioned in news articles as a way for social networks to track you across the internet.

 

In themselves, cookies are in fact harmless, as they can be sent only to the server setting them or a server in the same domain. However, a web page may contain images or other components stored on servers in other domains, and to fetch those, browsers will send the cookies belonging to those domains as well in the request.

 

That is, you might  be visiting a page on “www.example.com,” to which only cookies belonging to that domain will be sent, but that site might host an image coming from another website, such as “www.facebook.com/image.jpg.”

 

To fetch this image, a new request will be fired off, now including Facebook’s cookies. Such cookies are called “third-party cookies,” and are frequently used by advertisers and others to track users across the internet: 

 

If Facebook (or advertisers) instructs the original site to set the image URL to something like “www.facebook.com/image.jpg?i_came_from=www-example-org,”

 

it will be able to stitch together the provided information and determine which users are visiting which sites. Many privacy activists have warned against the use of such cookies, and many browser vendors have built-in ways to block sending such cookies.

 

Fingerprinting  

Because of the increasing backlash against third-party cookies, many publishers on the web have been looking for other means to track users.

 

JSON Web Tokens, IP addresses, ETag headers, web storage, Flash, and many other approaches have been developed to either set information in a browser that can be retrieved later on, so that users can be remembered, or to “fingerprint”

 

a device and browser in such a way that the fingerprint is unique across the whole visitor population and can also be used as a unique identifier.

 

Some particularly annoying approaches will use a combination of various techniques to set “evercookies,” which are particularly hard to wipe from a device. Not surprisingly, browser vendors continue to implement measures to prevent such practices.

 

Let’s now go over some examples to learn how we can deal with cookies in requests. The first example we’ll explore can be found at http://www.webscraping.com/cookielogin/.

 

You’ll see a simple login page. After successfully logging in (you can use any username and password in this example), you’ll be able to access a secret page over at http://www.webscraping.com/cookielogin/secret.php.

 

Try closing and reopening your browser (or just open an Incognito or Private Mode browser tab) and accessing the secret URL directly.

 

You’ll see that the server detects that you’re not sending the right cookie information and blocks you from seeing the secret code. The same can be observed when trying to access this page directly using requests:

import requests

url = 'http://www.webscraping.com/cookielogin/secret.php'
r = requests.get(url)

print(r.text)

# Shows: Hmm... it seems you are not logged in

 

Obviously, we need to set and include a cookie. To do so, we’ll use a new argument, called cookies.

 

Note that we could use the headers argument (which we’ve seen before) to include a “Cookie” header, but we’ll see that cookies is a bit easier to use, as requests will take care of formatting the header appropriately.

 

The question is now where to get the cookie information from. We could fall back on our browser’s developer tools, get the cookie from the request headers there, and include it as follows:

import requests
url = 'http://www.webscraping.com/cookielogin/secret.php'
my_cookies = {'PHPSESSID': 'ijfatbjege43lnsfn2b5c37706'}
r = requests.get(url, cookies=my_cookies)
print(r.text)
# Shows: This is a secret code: 1234

However, if we want to use this scraper later on, this particular session identifier might have been flushed and become invalid by then.

 

PHP Scripting

We use the PHP scripting language to power our examples, which is why the cookie used to identify a user’s session is named “PHPSESSID”.

 

Other websites might use “session,” “SESSION_ID,” “session_id,” or any other name as well. Do note, however, that the value representing a session should be constructed randomly in a hard-to-guess manner.

 

Simply setting a cookie “is_logged_in=true” or “logged_in_user=Seppe” would, of course, be very easy to guess and spoof.

 

We hence need to resort to a more robust system as follows: we’ll first perform a POST request simulating a login, get out the cookie value from the HTTP response, and use it for the rest of our “session.” In requests, we can do this as follows:

import requests
url = 'http://www.webscraping.com/cookielogin/'

# First perform a POST request
r = requests.post(url, data={'username': 'dummy', 'password': '1234'})

# Get the cookie value, either from r.headers or r.cookies
print(r.cookies)
my_cookies = r.cookies

# r.cookies is a RequestsCookieJar object which can also
# be accessed like a dictionary. The following also works:
my_cookies['PHPSESSID'] = r.cookies.get('PHPSESSID')

# Now perform a GET request to the secret page using the cookies
r = requests.get(url + 'secret.php', cookies=my_cookies)
print(r.text)
# Shows: This is a secret code: 1234

 

This works, though there are some real-life cases where you’ll have to deal with more complex login (and cookie) flows.

 

Navigate to the next example over at http://www.webscraping.com/redirlogin/. You’ll see the same login page again, but note that you’re now immediately redirected to the secret page after successfully logging in.

 

If you use the same Python code as in the fragment above, you’ll note that you’re not able to log in correctly and that the cookies being returned from the POST request are empty.

 

The reason behind this is related to something we’ve seen before: requests will automatically follow HTTP redirect status codes, but the “Set-Cookie” response header is present in the response following the HTTP POST request, and not in the response for the redirected page. We’ll hence need to use the allow_redirects argument once again:

import requests
url = 'http://www.webscraping.com/redirlogin/'

# First perform a POST request -- do not follow the redirect
r = requests.post(url, data={'username': 'dummy', 'password': '1234'},
                  allow_redirects=False)

# Get the cookie value, either from r.headers or r.cookies
print(r.cookies)
my_cookies = r.cookies

# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url + 'secret.php', cookies=my_cookies)
print(r.text)
# Shows: This is a secret code: 1234

 

As a final example, navigate to http://www.webscraping.com/trickylogin/. This site works in more or less the same way (explore it in your browser), though note that the “<form>” tag now includes an “action” attribute. We might hence change our code as follows:

import requests
url = 'http://www.webscraping.com/trickylogin/'
# First perform a POST request -- do not follow the redirect
# Note that the ?p=login parameter needs to be set
r = requests.post(url, params={'p': 'login'},
                  data={'username': 'dummy', 'password': '1234'},
                  allow_redirects=False)

# Set the cookies
my_cookies = r.cookies

# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)

 

# Hmm... where is our secret code?

This doesn’t seem to work for this example. The reason is that this particular example also checks whether we’ve actually visited the login page, and are not just trying to submit the login information directly. In other words, we need to add in another GET request first:

import requests

url = 'http://www.webscraping.com/trickylogin/'

# First perform a normal GET request to get the form
r = requests.get(url)

# Then perform the POST request -- do not follow the redirect
r = requests.post(url, params={'p': 'login'},
                  data={'username': 'dummy', 'password': '1234'},
                  allow_redirects=False)

# Set the cookies
my_cookies = r.cookies

# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)
# Hmm... still no secret code?

 

This also does not seem to work yet. Let’s think about this for a second... Obviously, the way that the server would “remember” that we’ve seen the login screen is by setting a cookie, so we need to retrieve that cookie after the first GET request to get the session identifier at that moment:

import requests
url = 'http://www.webscraping.com/trickylogin/'

# First perform a normal GET request to get the form
r = requests.get(url)

# Set the cookies already at this point!
my_cookies = r.cookies

# Then perform the POST request -- do not follow the redirect
# We already need to use our fetched cookies for this request!
r = requests.post(url, params={'p': 'login'},
                  data={'username': 'dummy', 'password': '1234'},
                  allow_redirects=False,
                  cookies=my_cookies)

# Now perform a GET request manually to the secret page using the cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)

 

# Still no secret?

Again, this fails... the reason for this (you can verify this as well in your browser) is that this site changes the session identifier after logging in as an extra security measure.

The following code shows what happens — and finally gets out our secret code:

import requests
url = 'http://www.webscraping.com/trickylogin/'

# First perform a normal GET request to get the form
r = requests.get(url)

# Set the cookies
my_cookies = r.cookies
print(my_cookies)

# Then perform the POST request -- do not follow the redirect
# Use the cookies we got before
r = requests.post(url, params={'p': 'login'},
                  data={'username': 'dummy', 'password': '1234'},
                  allow_redirects=False,
                  cookies=my_cookies)

# We need to update our cookies again
# Note that the PHPSESSID value will have changed
my_cookies = r.cookies
print(my_cookies)

# Now perform a GET request manually to the secret page
# using the updated cookies
r = requests.get(url, params={'p': 'protected'}, cookies=my_cookies)
print(r.text)
# Shows: Here is your secret code: 3838.

 

The above examples show a simple truth about dealing with cookies, which should not sound surprising now that we know how they work: every time an HTTP response comes in, we should update our client-side cookie information accordingly.

 

In addition, we need to be careful when dealing with redirects, as the “Set-Cookie” header might be “hidden” inside the original HTTP response, and not in the redirected page’s response.

 

This is quite troublesome and will indeed quickly lead to messy scraping code, though fear not, as requests provides another abstraction that makes all of this much more straightforward: sessions.

 

Using Sessions with Requests


Let’s immediately jump in and introduce requests’ sessions mechanism. Our “tricky login” example above can simply be rewritten as follows:

import requests
url = 'http://www.webscraping.com/trickylogin/'
my_session = requests.Session()

r = my_session.post(url)
r = my_session.post(url, params={'p': 'login'},
                    data={'username': 'dummy', 'password': '1234'})
r = my_session.get(url, params={'p': 'protected'})
print(r.text)
# Shows: Here is your secret code: 3838.

 

You’ll notice a few things going on here: first, we’re creating a requests.Session object and using it to perform HTTP requests, using the same methods (get, post) as above. The example now works, without us having to worry about redirects or dealing with cookies manually.

 

This is exactly what requests’ session mechanism aims to offer: it specifies that various requests belong together, that is, to the same session, and that requests should hence deal with cookies automatically behind the scenes.

 

This is a  huge benefit in terms of user-friendliness and makes requests shine compared to other HTTP libraries in Python.

 

Note that sessions also offer an additional benefit apart from dealing with cookies: if you need to set global header fields, such as the “User-Agent” header, this can simply be done once instead of passing the headers argument with every request:

import requests
url = 'http://www.webscraping.com/trickylogin/'
my_session = requests.Session()
my_session.headers.update({'User-Agent': 'Chrome!'})

# All requests in this session will now use this User-Agent header:
r = my_session.post(url)
print(r.request.headers)

r = my_session.post(url, params={'p': 'login'},
                    data={'username': 'dummy', 'password': '1234'})
print(r.request.headers)
r = my_session.get(url, params={'p': 'protected'})
print(r.request.headers)

 

Even if you think a website is not going to perform header checks or use cookies, it is still a good idea to create a session nonetheless and use that.

 

Clearing Cookies  If you ever need to “clean” a session by clearing its cookies, you can either set up a new session or simply call my_session.cookies.clear(). This works since RequestsCookieJar objects (which represent a collection of cookies in requests) behave like normal Python dictionaries.

 

Binary, JSON, and Other Forms of Content

We’re almost done covering everything requests has to offer. There are a few more intricacies we need to discuss, however.

 

So far, we’ve only used requests to fetch simple textual or HTML-based content, though remember that to render a web page, your web browser will typically fire off a lot of HTTP requests, including requests to fetch images. Additionally, files, like a PDF file, say, are also downloaded using HTTP requests.

 

PDF Scraping 

We’ll show you how to download files, though it is worth knowing that “PDF scraping” is an interesting area on its own. You might be able to set up a scraping solution using requests to download a collection of PDF files, though extracting information from such files might still be challenging.

 

However, several tools have been developed to also help you in this task, which is out of scope here. Take a look at the “PDFMiner” and “slate” libraries, for instance, to extract text, or “tabula-py,” to extract tables. If you’re willing to switch to Java, “PDF Clown” is an excellent library to work with PDF files as well.

 

Finally, for those annoying PDF files containing scanned images, OCR software such as “tesseract” might come in handy to automate your data extraction pipeline as well.

 

To explore how this works in requests, we’ll be using an image containing a lovely picture of a kitten at http://www.webscraping.com/files/kitten.jpg. You might be inclined to just use the following approach:

import requests

url = 'http://www.webscraping.com/files/kitten.jpg'
r = requests.get(url)

print(r.text)

 

However, this will not work and leave you with a “UnicodeEncodeError.” This is not too unexpected: we’re downloading binary data now, which cannot be represented as Unicode text.

 

Instead of using the text attribute, we need to use content, which returns the contents of the HTTP response body as a Python bytes object, which you can then save to a file:

import requests

url = 'http://www.webscraping.com/files/kitten.jpg'
r = requests.get(url)

with open('image.jpg', 'wb') as my_file:
    my_file.write(r.content)

 

Don’t Print  It’s not a good idea to print out the r.content attribute, as a large amount of text may easily crash your Python console window. However, note that when using this method, Python will store the full file contents in memory before writing them to your file.

 

When dealing with huge files, this can easily overwhelm your computer’s memory capacity. To tackle this, requests also allows you to stream a response by setting the stream argument to True:

import requests

url = 'http://www.webscraping.com/files/kitten.jpg'
r = requests.get(url, stream=True)

# You can now use r.raw

# r.iter_lines

# and r.iter_content

 

Once you’ve indicated that you want to stream back a response, you can work with the following attributes and methods:

 

  • r.raw provides a file-like object representation of the response. This is not often used directly and is included for advanced purposes.
  • The iter_lines method allows you to iterate over a content body line by line. This is handy for large textual responses.
  • The iter_content method does the same for binary data. Let’s use iter_content to complete our example above:
import requests

url = 'http://www.webscraping.com/files/kitten.jpg'
r = requests.get(url, stream=True)

with open('image.jpg', 'wb') as my_file:
    # Read in chunks of 4KB
    for byte_chunk in r.iter_content(chunk_size=4096):
        my_file.write(byte_chunk)

 

There’s another form of content you’ll encounter a lot when working with websites: JSON (JavaScript Object Notation), a lightweight textual data interchange format that is both relatively easy for humans to read and write and easy for machines to parse and generate.

 

It is based on a subset of the JavaScript programming language, but its usage has become so widespread that virtually every programming language is able to read and generate it.

 

You’ll see this format used a lot by various web APIs these days to provide content messages in a structured way. There are other data interchange formats as well, such as XML and YAML, though JSON is by far the most popular one.

 

As such, it is interesting to know how to deal with JSON-based requests and response messages, not only when planning to use requests to access a web API that uses JSON, but also in various web scraping situations.

 

To explore an example, head over to http://www.webscraping.com/jsonajax/. This page shows a simple lotto number generator. Open your browser’s developer tools, and try pressing the “Get lotto numbers” button a few times...

 

By exploring the source code of the page, you’ll notice a few things going on:

  • Even though there’s a button on this page, it is not wrapped by a “<form>” tag.
  • When pressing the button, part of the page is updated without completely reloading the page.
  • The “Network” tab in Chrome will show that HTTP POST requests are being made when pressing the button.
  • You’ll notice a piece of code in the page source wrapped inside “<script>” tags.

 

This page uses JavaScript (inside the “<script>” tags) to perform so-called AJAX requests. AJAX stands for Asynchronous JavaScript And XML and refers to the use of JavaScript to communicate with a web server.

 

Although the name refers to XML, the technique can be used to send and receive information in various formats, including JSON, XML, HTML, and simple text files.

 

AJAX’s most appealing characteristic lies in its “asynchronous” nature, meaning that it can be used to communicate with a web server without having to refresh a page completely.

 

Many modern websites use AJAX to, for example, fetch new e-mails, fetch notifications, update a live news feed, or send data, all without having to perform a full page refresh or submit a form.

 

For this example, we don’t need to focus too much on the JavaScript side of things, but can simply look at the HTTP requests it is making to see how it works:

 

  • POST requests are being made to “results.php”.
  • The “Content-Type” header is set to “application/x-www-form-urlencoded,” just like before. The client-side JavaScript will make sure to reformat a JSON string to an encoded equivalent.
  • An “api_code” is submitted in the POST request body.
  • The HTTP response has a “Content-Type” header set to “application/json,” instructing the client to interpret the result as JSON data.

Working with JSON-formatted replies in requests is easy as well. We can just use text as before and, for example, convert the returned result to a Python structure manually (Python provides a JSON module to do so), but requests also provides a helpful json method to do this in one go:

import requests

url = 'http://www.webscraping.com/jsonajax/results.php'
r = requests.post(url, data={'api_code': 'C123456'})

print(r.json())
print(r.json().get('results'))
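
If you prefer, you can achieve the same result with Python’s built-in json module instead of the json method; the following sketch is equivalent:

import json
import requests

url = 'http://www.webscraping.com/jsonajax/results.php'
r = requests.post(url, data={'api_code': 'C123456'})

# Parse the textual reply manually; equivalent to calling r.json()
data = json.loads(r.text)
print(data)
print(data.get('results'))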

 

There’s one important remark here, however. Some APIs and sites will also use an “application/json” “Content-Type” for formatting the request and hence submit the POST data as plain JSON. Using requests’ data argument will not work in this case.

 

Instead, we need to use the json argument, which will basically instruct requests to format the POST data as JSON:

import requests
url = 'http://www.webscraping.com/jsonajax/results2.php'

# Use the json argument to encode the data as JSON:
r = requests.post(url, json={'api_code': 'C123456'})
# Note the Content-Type header in the request:
print(r.request.headers)
print(r.json())

 

Internal APIs 

Even if the website you wish to scrape does not provide an API, it’s always recommended to keep an eye on your browser’s developer tools network information to see if you can spot JavaScript-driven requests to URL endpoints that return nicely structured JSON data.

 

Even though such an API might not be documented, fetching the information directly from these “internal APIs” is always a good idea, as this will avoid having to deal with the HTML soup.
