SQL Joins and Left Outer Join

Understanding SQL Joins and Left Outer Join

A join operation allows you to retrieve data from multiple tables in a single SELECT query. Two tables can be joined by a single join operator, and the result can be joined again with other tables.

 

There must be a common or similar column between the tables being joined. When an entire database system is designed using good design principles like normalization, joins are often required to give a complete picture in answer to a user's query.

 

To show a report listing authors and which programming language they created, we would have to use a join.

Contents of newlang_tbl
id language year standard

1 Prolog 1972 ISO
2 Perl 1987 (null)
3 APL 1964 ANSI
4 Tcl 1988 (null)
5 BASIC 1964 ANSI

Contents of authors_tbl
author_id author language_id

1 Colmerauer 1
2 Wall 2
3 Ousterhout 4
4 Iverson 3
5 Kemeny 5
6 Kurtz 5

 

We now form a query to show our desired output – the list of all authors with the corresponding language they developed. We choose our join column as the language_id field from the authors table. This corresponds to the id field in the languages table.

 

Running a join operation on our two tables

SELECT author, language
FROM authors_tbl, newlang_tbl
WHERE language_id = id;
author language
Colmerauer Prolog
Wall Perl
Ousterhout APL
Iverson Tcl
Kemeny BASIC

 

The output of our query combines a column from both tables giving us a better report. The language_id = id is called the join condition. Since the operator used in the join condition is an equality operator (=), this join is called an equijoin.

 

Another important thing to note is that the columns participating in the join condition need not be the ones we choose to be in the result of the query.

 

Remember that the joining of tables to view a resultset does not affect the tables at all. Nothing physically changes in the tables themselves with respect to their structure or data; the implicit connection exists only for the lifetime of the join query's execution.

 

Alternative Join Syntax


You would have noticed that we formed our join query without much special syntax, using our regular FROM/WHERE combination. The SQL-92 standard introduced the JOIN keyword to allow us to form join queries. Since the FROM/WHERE syntax came first, it is still quite popular for joins.

 

But now that the majority of database vendors have implemented most of the SQL-92 standard, the JOIN syntax is also in widespread use. Below is the JOIN syntax equivalent of the query we just wrote to display which author created which programming language.

 

Rewriting our query using the JOIN (SQL-92) syntax

SELECT author, language
FROM authors_tbl JOIN newlang_tbl
ON language_id = id;

 

Notice that instead of separating the two tables using a comma (thereby making it a list), we use the JOIN keyword. The columns that participate in the join condition are preceded by the ON keyword. The WHERE clause can then be used after the join condition specification (ON clause) to specify any further conditions if needed.
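For instance, a quick sketch combining the ON clause with a further WHERE condition (the year filter is purely illustrative):

SELECT author, language
FROM authors_tbl JOIN newlang_tbl
ON language_id = id
WHERE year > 1980;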

 

The kind of joins where all rows that don’t match the join condition exactly are eliminated are called inner joins. Thus we can optionally use the full keyword INNER JOIN in our queries without affecting the resultset.

Specifying INNER JOIN explicitly
SELECT author, language
FROM authors_tbl INNER JOIN newlang_tbl
ON language_id = id;

 

Resolving Ambiguity in Join Columns

In our example, the join condition fields had distinct names – id and language_id. But what if in our languages table we kept the key field's name as language_id?

 

This would create an ambiguity in the join condition, which would become the confusing language_id = language_id. To resolve this, we qualify the column by prefixing it with the name of the table it belongs to, followed by a period (.).

 

Resolving the naming ambiguity by qualifying the columns

CREATE TABLE languages_tbl
(language_id INTEGER, language VARCHAR(20));
INSERT INTO languages_tbl VALUES (4, 'Tcl');
SELECT author, language
FROM authors_tbl JOIN languages_tbl
ON language_id = language_id;
=> ERROR: column reference "language_id" is ambiguous
SELECT author, language
FROM authors_tbl JOIN languages_tbl
ON authors_tbl.language_id = languages_tbl.language_id;
author language
Ousterhout Tcl

Another way to solve such ambiguity is to qualify the columns using table aliases. The concept is to give a short name to a table and then use this to qualify the columns instead of a long, unwieldy table name.

Using table aliases

SELECT author, language
FROM authors_tbl a JOIN newlang_tbl l
ON a.language_id = l.id;

 

Here the authors table is given the alias a and the languages table is given the alias l. It is generally considered good practice to qualify the column names of a join condition regardless of whether there is a name ambiguity or not.

 

Left Outer Joins

Encountering inner joins gives us a clue to the existence of outer joins. In this kind of join, the resultset consists of the rows that match the join condition plus the rows from one of the tables that don't match it.

 

If the rows from the first table that don’t match the condition are desired in the resultset, we use a left outer join. Otherwise, when rows from the second table are required, we use a right outer join.

 

This sounds a bit confusing at first, so let's clarify the concept using an example. We'll add a single row to the newlang_tbl for the Lisp programming language, but we will not make any entry for it in the authors table.

Adding a new row to illustrate outer joins

INSERT INTO newlang_tbl
VALUES (6, 'Lisp', 1958, 'ANSI');

 

If we ran an inner join query on the two tables as before, we would get the same output as earlier. The new row we added would not feature in the resultset.

 

But let’s try a left outer join where we explicitly want this new row to be in the results despite not having an entry in the author's table. Our first table must then be the newlang_tbl and the query would be as below.

A left outer join example

SELECT language, author
FROM newlang_tbl n LEFT OUTER JOIN authors_tbl a
ON n.id = a.language_id;
language author
Prolog Colmerauer
Perl Wall
APL Ousterhout
Tcl Iverson
BASIC Kemeny
BASIC Kurtz
Lisp (null)

 

Aha, success! The LEFT OUTER JOIN allowed us to sneak the Lisp row into the resultset with a null author value. Looking at the query listing, if you immediately think that just by switching the order of the joined tables, we can convert this into a RIGHT OUTER JOIN, then you are absolutely right.

A right outer join example
SELECT language, author
FROM authors_tbl a RIGHT OUTER JOIN newlang_tbl n
ON n.id = a.language_id;

 

Notice that our join condition specified in the ON clause did not need any order change. Since all right outer joins can be written as left outer joins (and vice versa), it is rare to find many real-world usages of right outer joins. SQLite goes as far as not supporting right outer joins at all, which is just as well.

 

Cross Joins

You might wonder what would happen if we left out the join condition from our query. Well, what happens in the background of running a join query is that first all possible combinations of rows are made from the tables participating in the join.

 

Then the rows that satisfy the join condition are chosen for the output (or further processing). If we leave out the join condition, we get as the output all possible combinations of records. This is called a Cross Join or Cartesian Product of the tables, usually denoted by the sign ×.

Query for showing the cartesian product of our tables

SELECT author, language
FROM authors_tbl, newlang_tbl;
author language
Colmerauer Prolog
Colmerauer Perl
Colmerauer APL
Colmerauer Tcl
Colmerauer BASIC
Colmerauer Lisp
Wall Prolog
Wall Perl
Wall APL
Wall Tcl
Wall BASIC
Wall Lisp
Ousterhout Prolog
… …

 

The output of the query is truncated here, but when you run it on your computer you should get 36 rows in the result containing each author and language combination. Another way to rewrite this query is to actually use the JOIN keyword with a preceding argument CROSS as shown below.

 

Rewriting our query using CROSS JOIN

SELECT author, language
FROM authors_tbl CROSS JOIN newlang_tbl;

 

Notice the lack of the ON clause, which means no join condition.
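If you'd like to verify the combination count before eyeballing all the rows, a small sketch (6 authors × 6 languages at this point):

SELECT COUNT(*)
FROM authors_tbl CROSS JOIN newlang_tbl;
count
36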

What if we were selecting more than one column from the newlang_tbl, say both language and year? Would the number of combinations increase dramatically from our cross join above?

 

It turns out that no, the number of records in the resultset would be exactly the same as before. A Cartesian product combines records from the tables participating in the join; selecting more columns from one table does not create new record combinations.

 

Selecting multiple columns from a table participating in a cross join

SELECT author, language, year
FROM authors_tbl CROSS JOIN newlang_tbl;
author language year
Colmerauer Prolog 1972
Colmerauer Perl 1987
Colmerauer APL 1964
… … …

 

A cross join is not something you will come across often. It is of some utility when one of the tables is small, consisting of just a few rows, and you need all of its values combined with a bigger table.

 

However, I’d advise against running cross joins on actual production database servers unless you really understand why you need them in the scenario.

 

Self Joins

Sometimes a table holds meaningful data within its own columns, but one (or more) of its fields refers to another field in the same table.

 

For example, if we have a table in which we capture programming languages that influenced other programming languages and denote the influence relationship by the language id, to show the resolved output we would have to join the table with itself.

 

This is also called a self-join. Consider the table created below and pay close attention to the data being inserted.

 

Creating and populating our language influence table

CREATE TABLE inflang_tbl (id INTEGER PRIMARY KEY,
language VARCHAR(20) NOT NULL,
influenced_by INTEGER);
INSERT INTO inflang_tbl (id, language)
VALUES (1, 'Fortran');
INSERT INTO inflang_tbl (id, language, influenced_by) VALUES (2, 'Pascal', 3);
INSERT INTO inflang_tbl (id, language, influenced_by) VALUES (3, 'Algol', 1);
id language influenced_by
1 Fortran (null)
2 Pascal 3
3 Algol 1

 

Our goal is to now write a self-join query to display which language influenced which one, that is, resolve the influenced_by column.

 

Running a self-join query

SELECT l1.id,
l1.language,
l2.language AS influenced
FROM inflang_tbl l1, inflang_tbl l2
WHERE l1.id = l2.influenced_by;
id language influenced
3 Algol Pascal
1 Fortran Algol

 

Notice the use of table aliases to treat the join condition columns as separate, and the use of the AS keyword, which renames the column in the output.

 

What if we wanted to use the alternative SQL-92 JOIN syntax for our self-join? Well as it turns out, there is no special self-join keyword or clause because it is not needed.

 

To the SQL query interpreter, you have created an inner join on two tables that just happen to have identical contents. So we can rewrite the query using our familiar JOIN keyword as below.

 

 Running a self-join query using JOIN

SELECT l1.id,
l1.language,
l2.language AS influenced
FROM inflang_tbl l1 JOIN inflang_tbl l2
ON l1.id = l2.influenced_by;

 

Non-Equi Joins

The joins we have seen until now have largely dealt with equality in their join condition. While this is the most common way of joining tables together, we are by no means restricted to using only equality. Let's put another join condition between the newlang_tbl and authors_tbl, this time between the id and author_id fields.

 

A non-equijoin

SELECT id,
author_id,
author,
language
FROM authors_tbl, newlang_tbl
WHERE id < author_id;
id author_id author language
1 2 Wall Prolog
1 3 Ousterhout Prolog
2 3 Ousterhout Perl
1 4 Iverson Prolog
2 4 Iverson Perl
3 4 Iverson APL
1 5 Kemeny Prolog
2 5 Kemeny Perl
3 5 Kemeny APL
4 5 Kemeny Tcl
1 6 Kurtz Prolog
2 6 Kurtz Perl
3 6 Kurtz APL
4 6 Kurtz Tcl
5 6 Kurtz BASIC

 

While not the most logical of results, it does satisfy our non-equality join condition: in each row, the id field value is less than the corresponding author_id field value. We can also freely mix this with an equality condition using the familiar AND operator within the same query.

 

Using the equality and non-equality conditions

SELECT id,
author_id,
author,
language
FROM authors_tbl, newlang_tbl
WHERE id < author_id
AND id = language_id;
id author_id author language
3 4 Iverson APL
5 6 Kurtz BASIC

The result now consists of records where the author of a language has an author_id value greater than the id of the language they created.

 


 

Using the Substring Function

Another common string operation is a substring, which returns only a part of the string field value. For example, if we needed to get only the first two characters of each programming language, we would use the SUBSTR function. The general syntax of this function is given below.

General syntax of substring
SELECT SUBSTR(<field name>, <starting position>, <length>),
...
FROM <table>;

 

The starting position is the character you wish to start extracting from. Unlike most programming languages, the string index positions here don’t start from 0 but 1. The third argument length specifies how many characters should be a part of the result. For the first two characters of a programming language, the starting position would be 1 and the length would be 2.
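Here is a small sketch of that on our table (the short_name alias is just for illustration):

SELECT language,
SUBSTR(language, 1, 2) AS short_name
FROM proglang_tbl;
language short_name
Prolog Pr
Perl Pe
… …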

 

Aggregation and Grouping

SQL has maintained its prominent position in the technical world due to its ability to cater to a wide range of business intelligence and analytic requests. While databases are often used for finding a needle in a haystack, that is, narrowing down to a single row, a lot of interactive usage of SQL revolves around generating aggregated insights from a bunch of rows.

 

Indeed, a major advantage that SQL-based systems have over NoSQL data storage solutions is how intuitive grouping and aggregation are in the former.

 

Aggregate Functions

An aggregate function is used to compute summarization information from a table or tables. We have already seen the COUNT aggregate function that counts the records matched.

 

Similarly, there are other aggregation functions in SQL like AVG for calculating averages; SUM for computing totals; and the extreme functions MAX, MIN for finding out maxima and minima values respectively.

 

The count and extreme functions work with all data types, but functions like SUM and AVG make sense only with numeric types and thus work only with them.

 

Let’s now try to use AVG on the only sensible numeric choice we have in our table – year. You can think of the below query as a way of finding out the mean year value of all the programming language records we have in our table.

 

Finding the Average year of creation in our programming languages table

SELECT AVG(year) FROM proglang_tbl;

 

avg
1970.4285714285714286

We can see the result as a decimal number with a default of 16 digits after the decimal point in PostgreSQL. This is slightly lowered to 10 digits after the decimal point in SQLite, but that is still more than enough to cover all but the rarest scenarios.

 

While the average value was calculated accurately, one can argue that a value like this to specify year is not useful. What we really want is a readable year value that looks like an actual year, an integer value specifically. Thus we go about casting this average value to an integer.

 

Casting values to integers

 style="margin: 0px; width: 970px; height: 63px;">SELECT CAST(AVG(year) AS INTEGER)
FROM proglang_tbl;
avg
1970

 

Conversion of data types using CAST only works with compatible data types like numerics and integers. If you try to convert a varchar into an integer, the DBMS will spit out an error message as is proper.

Casting incompatible types
testdb=# SELECT CAST(language AS INTEGER)
FROM proglang_tbl;
ERROR: invalid input syntax for integer: "Prolog"

 

The exception here is SQLite, which will obediently convert the varchar value to 0. This may be the most sensible choice for it because of its underlying engine implementation; however, I would warn you to stay away from such surprising cast operations.
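To see this behavior for yourself, here is a sketch of what SQLite would give back for our varchar column, one zero per row:

sqlite> SELECT CAST(language AS INTEGER) FROM proglang_tbl;
0
0
…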

 

Let’s try the other numeric aggregate function that is commonly used – SUM. Suppose we wish to find the sum of the year values in our table – the query would be written in a straightforward way using SUM.

 

Using the SUM aggregate function

 

SELECT SUM(year)
FROM proglang_tbl;
sum
13793
As before, if we use SUM on a varchar field, PostgreSQL would spit out an error while SQLite would quietly give a value of 0.0.
Using SUM on a varchar field in PostgreSQL
testdb=# SELECT SUM(language)
FROM proglang_tbl;
ERROR: function sum(character varying) does not exist
LINE 1: SELECT SUM(language) FROM proglang_tbl;
^
HINT: No function matches the given name and argument types.
You might need to add explicit type casts.
Using the Extreme Functions – MAX and MIN

MAX and MIN are collectively called the extreme functions because they essentially find the extreme values from a set of column values. Their most intuitive application is with numeric data, but these functions can be applied to other data types as well.

 

These functions are fairly straightforward to understand and use. Let’s take the MIN first. It looks at a particular set of rows and finds the minimum value of the column that is provided as an argument to it.

 

For example, in our table we wish to find out from which year we have records of programming languages, that is, the earliest language year. Analyzing the problem at hand, we see that if we apply the aggregate function MIN to the year field in our table, we should get the desired output.

 

Using the MIN extreme aggregate function

SELECT MIN(year)
FROM proglang_tbl;
min
1959

We had two languages in our table corresponding to the year 1959 – APT and JOVIAL. But since this is the minimum value, it appears in the result only once, regardless of how many languages share it. The function MAX is similar, but its result signifies the latest year in which one of the languages in our table was created.

 

Using the MAX extreme aggregate function

SELECT MAX(year)
FROM proglang_tbl;
max
1988
Of course, you can combine them in a single query to get both results in a single row.
Using MAX and MIN together
SELECT MAX(year),
MIN(year)
FROM proglang_tbl;
max min
1988 1959

 

As we read before, these functions are not limited to numeric types. So let's combine finding the MAX year value with the MIN language value.

Mixing MAX and MIN types
SELECT MAX(year),
MIN(language)
FROM proglang_tbl;
max min
1988 APL

 

The MIN language found was APL since it comes first alphabetically. Notice that APT was not chosen, since L < T when comparing the third letters. We need to be careful while reading this result though.

 

At first glance, it gives a misleading view that 1988 corresponds to APL, which is not the case. Our query simply gives the extreme values of these two fields in our table, whether they are from the same record or not.

 

Grouping Data

The GROUP BY clause of a SELECT query is used to group records based on their field values. This clause is placed after the WHERE conditional.

 

For example, in our sample table, we can group data by which committee decided on their standard. Let’s see this action after we insert another record in our table so that the logical nature of grouping becomes clearer.

 Grouping records by its fields

INSERT INTO proglang_tbl
(id, language, author, year, standard)
VALUES
(8, 'Fortran', 'Backus', 1957, 'ANSI');
SELECT standard
FROM proglang_tbl
WHERE standard IS NOT NULL
GROUP BY standard;
standard
ECMA
ANSI
ISO
US-DOD

 

Notice how the different standards are grouped into a single value, regardless of how many times they occur in the table. Let’s try to add the language column to the output of the above query.

 

Trying to add language to our output fields

SELECT standard,
language
FROM proglang_tbl
WHERE standard IS NOT NULL
GROUP BY standard;

ERROR: column "proglang_tbl.language" must appear in the GROUP BY clause or be used in an aggregate function

 

The database engine gave us an error for this query. This makes sense: while it bunched the different standards together because of our grouping clause, it is ambiguous which language value it should choose to display alongside each one.

 

Let us take the error message’s first suggestion and also include the language field in the GROUP BY clause.

 

Adding language to our output fields

SELECT standard,
language
FROM proglang_tbl
WHERE standard IS NOT NULL
GROUP BY standard, language;
standard language
ECMA PL/I
ANSI APL
US-DOD JOVIAL
ISO Prolog
ISO APT
ANSI Fortran

 

The interesting thing to note here is the rule that the columns listed in the SELECT clause must be present in the GROUP BY clause. This leads us to the following two corollaries.

 

1. You cannot group by a column that is not present in the SELECT list.

2. You must specify all the columns in the grouping clause that are present in the SELECT list.

 

Bare Columns in SQLite

If you tried executing the previous listing in SQLite, you would not get an error. This is because of the bare column feature in SQLite, which allows a column to appear without aggregation in the SELECT clause while being absent from the GROUP BY clause.

 

This is quite different from most DBMS systems out there, and I would suggest that you stay away from relying on this feature because of the undefined behavior involved. However, it is good to read up on the details on the SQLite website, under SQLite Query Language: SELECT.
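As a sketch of the bare-column behavior, the query that PostgreSQL rejected above runs without complaint in SQLite; just remember that the language shown against each standard is chosen arbitrarily:

SELECT standard,
language
FROM proglang_tbl
WHERE standard IS NOT NULL
GROUP BY standard;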

 

Grouping and Aggregate Functions

Another useful way to use grouping is to combine the operation with an aggregate function. Suppose we wish to count how many standards a particular organization has in our table. This can be achieved by combining the GROUP BY clause with the COUNT aggregate function as given below.

 

Using GROUP BY with aggregate functions

SELECT standard,
COUNT(*)
FROM proglang_tbl
GROUP BY standard;
standard count
<null> 2
ECMA 1
ANSI 2
ISO 2
US-DOD 1

The output is intuitive enough to warrant no further explanation, but the query itself is interesting. Notice that the GROUP BY clause consisted of only the standard column; the aggregate function is then computed over each bunch of grouped rows.

 

Grouping truly makes sense in SQL when used judiciously with aggregate functions. A lot of utility and intelligence from databases is derived from analysts applying a combination of these to a well-designed model. Let's see another example, combining GROUP BY with multiple aggregate functions this time.

 

Suppose we wish to find out how many languages were made in the same year, and of those languages, which comes first alphabetically. We can immediately see that a GROUP BY on year is needed here, along with a couple of different aggregate functions.

 

Using GROUP BY with multiple aggregate functions

SELECT year,
MIN(language),
COUNT(*)
FROM proglang_tbl
GROUP BY year;
year min count
1972 Prolog 1
1957 Fortran 1
1988 Tcl 1
1987 Perl 1
1964 APL 2
1959 APT 2

 

The HAVING Clause

Like a WHERE clause places conditions on the fields of a query, the HAVING clause places conditions on the groups created by GROUP BY. It must be placed immediately after the GROUP BY but before the ORDER BY clause.

 

Demonstration of the HAVING clause

SELECT language,
standard,
year
FROM proglang_tbl
GROUP BY standard,
year,
language
HAVING year < 1980;
language standard year
APT ISO 1959
JOVIAL US-DOD 1959
APL ANSI 1964
Fortran ANSI 1957
PL/I ECMA 1964
Prolog ISO 1972

From the output we can clearly see that the records for Perl and Tcl are left out since they do not satisfy the HAVING conditional of being created before 1980.

 

You might wonder why we need two different filtering clauses – WHERE and HAVING. A WHERE clause does not allow aggregate functions in its conditionals; such conditions are a prime target for the HAVING clause.

 

For example, suppose we wish to check which standard values exist more than once in our table. Our first stab at this using the GROUP BY clause might look something like this.

 

Trying aggregate functions in WHERE

SELECT standard
FROM proglang_tbl
WHERE COUNT(standard) > 1
GROUP BY standard;
ERROR: aggregate functions are not allowed in WHERE

 

Just like we thought, our SQL interpreter did not allow such a travesty. Instead, we'll use the same conditional in the HAVING clause.

HAVING clause with aggregate functions
SELECT standard
FROM proglang_tbl
GROUP BY standard
HAVING COUNT(standard) > 1;
standard
ANSI
ISO

It correctly gave us the names of the two standard values with more than one occurrence. Interestingly, if we tweak the conditional to COUNT(*), we get an additional row.

 

Changing the aggregate function behavior

SELECT standard
FROM proglang_tbl
GROUP BY standard
HAVING COUNT(*) > 1;
standard
(null)
ANSI
ISO

 

The filtering clause is now not restricted to non-null values of the standard column. Since there are multiple records with null values in that field, the null group is also included in the result.
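A quick side-by-side sketch makes the distinction clear (the column aliases are illustrative, and the exact counts depend on the table contents at this point):

SELECT COUNT(*) AS all_rows,
COUNT(standard) AS with_standard
FROM proglang_tbl;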

 

Subqueries

A subquery, simply put, is a query written as a part of a bigger statement. Think of it as a SELECT statement inside another one. The result of the inner SELECT can then be used in the outer query. Let us take a simple example to illustrate this.

 

Consider the same source tables as the ones in the joins blog – authors_tbl and newlang_tbl. We will try to write a query (and a subquery) to display the author of a particular language.
A simple subquery example
SELECT author FROM authors_tbl
WHERE language_id IN
( SELECT id FROM newlang_tbl
WHERE language="Tcl");
author
Ousterhout

The subquery SELECT id FROM newlang_tbl WHERE language = 'Tcl' picks the correct language id from the newlang_tbl and passes it on to the outer query on the authors table. This frees us from the responsibility of joining the two tables using the language id field.

 

We can visualize the intermediate step where the subquery has already resolved to a value.

The query would now look something like SELECT author FROM authors_tbl WHERE language_id IN (4).

 

Which approach to take in certain situations – a join, a subquery, or a combination of both – is mostly a matter of personal preference. Other times, one approach will be clearly the superior choice. Remember that all joins can be rewritten as subqueries, but the reverse is not true in all cases.
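For instance, our scalar subquery example could be rewritten as a join; a sketch of the equivalent query:

SELECT author
FROM authors_tbl a JOIN newlang_tbl n
ON a.language_id = n.id
WHERE n.language = 'Tcl';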

 

Types of Subqueries

We can broadly classify subqueries into three categories.

1. Scalar subqueries: A subquery that returns only a single column of a single row as its output. The example in the previous section, where the subquery returns the id for Tcl, is a scalar subquery.

 

2. Row subqueries: A subquery that returns a single row but more than one column. These are the least important type of subqueries, since most database management systems, including SQLite, do not support them.

 

3. Table subqueries: A table subquery can return more than a single row and many columns per row. In essence, it can return a table itself to take part in your outer query.

To illustrate the usage of table subqueries, let us take an example where we wish to display all the programming language writers who created a language after 1980.

 

A table subquery example

SELECT author, language
FROM authors_tbl a,
(SELECT id, language
FROM newlang_tbl
WHERE year > 1980) n
WHERE a.language_id = n.id;
author language
Wall Perl
Ousterhout Tcl

 

Carefully study the FROM clause of the query above. Our table subquery is placed within it, and it returns a set of languages that were created after 1980. The result consists of two rows and two columns, one of which, language, is picked up to be displayed in the final output.

 

Existence Tests in Subqueries

The keyword EXISTS tests for the presence of any number of rows returned from a subquery. We usually don't care about the columns returned, only the mere existence of rows satisfying a specific criterion. Let's use an EXISTS test to display the languages that have an author entry in the authors_tbl.

Using an existence test
SELECT language,
year
FROM newlang_tbl
WHERE EXISTS (SELECT * FROM authors_tbl
WHERE newlang_tbl.id = language_id);
language year
Prolog 1972
Perl 1987
APL 1964
Tcl 1988
BASIC 1964

 

Notice the subquery's WHERE clause in this case. It references a field of the outer table using newlang_tbl.id. For whichever languages the existence test is satisfied, the outer query adds them to the resultset. We can add NOT to the existence test to find the complement of the result.

 

Using NOT in the existence test

SELECT language,
year
FROM newlang_tbl
WHERE NOT EXISTS (SELECT * FROM authors_tbl
WHERE newlang_tbl.id = language_id);
language year
Lisp 1958

 

Recall that we never put a corresponding entry for Lisp in the authors table in the last blog. So who created Lisp anyway?

 

Lisp is the second oldest programming language whose major dialect is still in active use. John McCarthy created Lisp in 1958 as a part of his research, and other people chipped in to help implement it on the computers of that era.

 

McCarthy by all accounts was a genius-level intellect, widely admired by his peers. He was one of the pioneers of the field of Artificial Intelligence and even coined the term. With the creation of Lisp, he advanced the field of programming language design by leaps and bounds.

 

Over the past three decades, features from Lisp are slowly trickling into mainstream programming languages. Many renowned technologists still marvel at the design of the decades-old Lisp dialects – Common Lisp and Scheme.

 

Using Subqueries in INSERT Statements

We can even use subqueries inside another SQL statement like INSERT. Let us try to add a new language and a new author to our tables, easing our task of remembering id numbers just a bit by using subqueries.

 

Inserting a new programming language

INSERT INTO newlang_tbl
(id, language, year, standard)
VALUES (7, 'Pascal', 1970, 'ISO');

Contents of newlang_tbl
id language year standard
1 Prolog 1972 ISO
2 Perl 1987 (null)
3 APL 1964 ANSI
4 Tcl 1988 (null)
5 BASIC 1964 ANSI
6 Lisp 1958 ANSI
7 Pascal 1970 ISO

 

While inserting a new entry into the authors_tbl, we can either remember that we used the language_id 7 for Pascal or use a subquery. Let us see an example of the latter approach. After all, the title of the blog gave away our approach!

Inserting a new author using a subquery

INSERT INTO authors_tbl
(author_id, author, language_id)
VALUES (7, 'Wirth',
(SELECT id FROM newlang_tbl WHERE language = 'Pascal') );

 

We believe that this should put the correct language id for Mr. Wirth since he created Pascal. Let us verify this belief by looking at the contents of the table.

author_id author language_id
1 Colmerauer 1
2 Wall 2
3 Ousterhout 4
4 Iverson 3
5 Kemeny 5
6 Kurtz 5
7 Wirth 7

You can even use subqueries to control your UPDATE and DELETE statements. The logic remains much the same as with using subqueries in SELECT and INSERT.
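For instance, a hypothetical DELETE driven by a subquery might look like the sketch below; it would remove Mr. Wirth's freshly inserted row, so treat it as illustration only:

DELETE FROM authors_tbl
WHERE language_id = (SELECT id FROM newlang_tbl
WHERE language = 'Pascal');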

 

Using ANY and ALL

The ANY operator, used with the arithmetic comparison operators, checks a column value in comparison to the value(s) generated by the subquery. For example, if we wanted to display all the languages but exclude the oldest one from the result, we could combine > and ANY to achieve this.

Using the ANY operator
SELECT language
FROM newlang_tbl
WHERE year > ANY (SELECT year FROM newlang_tbl);
language
Prolog
Perl
APL
Tcl
BASIC
Pascal

 

Only Lisp fails to have a creation year greater than at least one of the values returned by the subquery. Obviously, this is because the smallest value returned is the creation year of Lisp itself, and thus it does not feature in the final result.

 

Now, what would happen if we reversed our comparison operator to < ANY? The result would include all languages whose year value is less than any one of the creation years returned by the subquery.

Using the ANY operator with <
SELECT language
FROM newlang_tbl
WHERE year < ANY (SELECT year FROM newlang_tbl);
language
Prolog
Perl
APL
BASIC
Lisp
Pascal

We notice that Lisp has snuck into the resultset but Tcl is notably absent. This is because the year of Tcl, that is, 1988, is not less than any of the values returned by the subquery. Equal to? Sure, but not distinctly less than.

 

The other comparison conjunction we can use with ANY is =, but that is rarely seen because it is equivalent to using IN, which is much more intuitive.

SQLite does not support the ANY or ALL operators. If you tried running the above examples in SQLite, you would get an error message as below.

 

Error: near "SELECT": syntax error

SQLite currently does not support these keywords, but we can still achieve the same results using what we have to work with. Let's attempt to rewrite the earlier query displaying all languages but the oldest one.

 

SELECT language FROM newlang_tbl
WHERE year <> (SELECT MIN(year) FROM newlang_tbl);

 

The above query computes our desired resultset just fine and is pretty readable. Some, including yours truly, actually prefer it to the ANY syntax. If you are wondering about <>, it means not equal to.

 

The ALL operator works similarly, but the value in the WHERE clause must hold true for all of the values returned from the subquery. One scenario where ALL gets usage is finding data related to extreme values like minima and maxima. You are of course free to choose the built-in functions MAX and MIN for the purpose too.

Using the ALL operator
SELECT language
FROM newlang_tbl
WHERE year <= ALL (SELECT year FROM newlang_tbl);
language
Lisp

 

Only Lisp being the oldest language in our table would satisfy the ALL criteria of having a year value less than or equal to all the values from the subquery. Similarly, we can use ALL to find the latest language too.

Using the ALL operator with >=
SELECT language
FROM newlang_tbl
WHERE year >= ALL (SELECT year FROM newlang_tbl);
language
Tcl

 

Working in Sets

Set theory is a branch of discrete mathematics that deals with a collection of objects. There is a lot of conceptual overlap between set theory and relational database concepts. It is no wonder that the output of a query is frequently called a resultset.

Primitive set-theoretic operations like union, intersection, and difference are increasingly supported in various implementations. We will now explore the theory behind these operations and how to use them in SQL.

 

Union

The union is an operation that combines elements of two sets. Let’s say we have the following two sets consisting of a bunch of numbers.

Two sets containing numbers

set1 = { 1, 3, 5 }

set2 = { 1, 2, 3 }

 

The resulting union set will be a set consisting of all of these elements repeated exactly once, that is, no duplicates are allowed. Note that the order of a set is unimportant. Think of it as a bag of elements rather than an ordered collection.

The mathematical UNION operation
set1 UNION set2 = { 1, 3, 5, 2 }
Let's now look at how to simulate the union operation in SQL.
Contents of proglang_tbl
id language author year standard
1 Prolog Colmerauer 1972 ISO
2 Perl Wall 1987 (null)
3 APL Iverson 1964 ANSI
4 JOVIAL Schwartz 1959 US-DOD
5 APT Ross 1959 ISO
6 PL/I IBM 1964 ECMA
7 Tcl Ousterhout 1988 (null)
8 Fortran Backus 1957 ANSI

 

If we wanted to get the list of creation years of languages standardized by either ANSI or ISO, we could use the UNION keyword to achieve this.

 

Using a UNION operator

SELECT year FROM proglang_tbl
WHERE standard='ANSI'
UNION
SELECT year FROM proglang_tbl
WHERE standard='ISO';
year
1959
1957
1964
1972

 

Since we had four entries in our table with a standard value as ANSI or ISO, we got our expected four rows in the resultset. Note that there were no duplicate entries to be processed. But what if there were duplicate entries to process with the UNION operation?

 

Using a UNION operator to eliminate duplicate values

SELECT standard FROM proglang_tbl
WHERE language = 'Fortran'
UNION
SELECT standard FROM proglang_tbl
WHERE language = 'APL';
standard
ANSI

 

Both the languages we specified in our WHERE clauses were standardized by ANSI. The UNION operation, just like in discrete math, removed the duplicated value and gave out a single row as the result.

 

There is another related SQL operation UNION ALL that will simulate the act of combination but will not eliminate duplicates. The advantage you get by using this is performance improvement since the SQL engine does not have to bother with checking for duplicates.

 

If you have constructed your participating queries in such a way that there are no repeated values, using a UNION ALL would improve your query processing time.

 

 Using a UNION ALL operator

SELECT standard FROM proglang_tbl
WHERE language = 'Fortran'
UNION ALL
SELECT standard FROM proglang_tbl
WHERE language = 'APL';
standard
ANSI
ANSI

 

Intersection

The intersection operation outputs only the common elements in the input sets. If we apply an intersection to the two sets in the previous section, we get a resulting set of two elements.

 

The mathematical INTERSECTION operation

set1 INTERSECTION set2 = { 1, 3 }

 

As with union, each common value is displayed only once. Duplicates are removed from the final result set.

Translating this to SQL is pretty simple; instead of using UNION we use the keyword INTERSECT to get common elements.

Using the INTERSECT in SQL
SELECT standard FROM proglang_tbl
WHERE year=1964
INTERSECT
SELECT standard FROM proglang_tbl
WHERE year=1957;
standard
ANSI

 

Something to keep in mind here is that the INTERSECT operator finds the common values between the entire records returned by the two queries surrounding it, not just common values from a part of each record. While in the previous example our resultset had only one column to give back – standard – let's see what happens when we add another column to the result list.

 

Using the INTERSECT with multiple columns in the results

SELECT year, standard FROM proglang_tbl
WHERE year=1964
INTERSECT
SELECT year, standard FROM proglang_tbl
WHERE year=1957;
=> (0 rows)

 

The output is no rows at all. The first query would select the records for PL/I and APL while the second selects Fortran. But these languages all have different combined (year, standard) values, giving us a net zero result.

 

Note that while the ANSI SQL standard does provision for an INTERSECT ALL operator, I'm yet to come across a database management system that implements it. PostgreSQL happily ignores that you wrote the ALL clause and simply gives back an INTERSECT result.

 

Difference

The difference operation between sets, written as set1 - set2, is a list of all elements in set1 that do not occur in set2. If an element is only in set2, it will not be captured by the plain difference operation.

The mathematical DIFFERENCE operation
set1 DIFFERENCE set2 = { 5 }
set2 DIFFERENCE set1 = { 2 }

 

Let's try and write a SQL statement to emulate this logic with our familiar IN and NOT IN operators. But first, let’s insert a row into our table so that we can see the difference operation in action.

 

Inserting a new row for RPG

INSERT INTO proglang_tbl
(id, language, author, year, standard)
VALUES
(9, 'RPG', 'IBM', 1964, 'ISO');

 

Suppose we wish to list the years of creation of languages that were standardized by ISO but not by ANSI. From our source table, we find that three languages were standardized by ISO, with the years 1972, 1959, and 1964.

 

But since APL, created in 1964, was eventually standardized by ANSI, we should ideally be left with the answers 1972 and 1959.

Trying to write set difference with IN
SELECT year FROM proglang_tbl
WHERE standard IN ('ISO')
AND standard NOT IN ('ANSI');
year
1972
1959
1964

 

Whoa, what sorcery is this!?! We thought 1964 would be ineligible because of ANSI standardization, but clearly this is not the case. What actually happened is that first there was a scan for ISO rows, giving us three values. Then ANSI rows were discounted, though not from the first result but from the table as a whole.

 

So while APL's 1964 was left out, the freshly inserted RPG's 1964 still remained, effectively making our second condition worthless. The correct way to achieve this is by using the set difference operator EXCEPT, as below.

 

Set difference with EXCEPT

SELECT year FROM proglang_tbl WHERE standard IN ('ISO')
EXCEPT
SELECT year FROM proglang_tbl WHERE standard IN ('ANSI');
year
1972
1959

 

Voila, this seems to yield the correct answer! If you happen to be using an Oracle system, replace EXCEPT with MINUS to achieve the exact same result.

 

When we write more than a single SELECT as a part of a single query and join them using a set-theoretic operator, such statements are called compound queries.

 

Do note that many database management systems restrict the use of compound queries as subqueries. Sybase Adaptive Server Enterprise is one such popular DBMS that doesn’t allow you to write a UNION inside a subquery.

 

Views

 

One of the beautiful aspects of the relational data model and SQL is that the output of a query is also a table, a relation to be precise. It may consist of a single column or a single row, but it is a table nonetheless. A view is a query that can be used as a table.

 

Think of it as a virtual table that stores for the viewer’s convenience a pre-computed resultset. It does not truly exist like a base table but provides a different angle to view the data without the tedium of details.

 

Why Are Views Needed

Most production database systems would contain a lot of tables. It is also possible that some of these tables consist of a lot of fields because of the complexity of the domain.

 

Views would come to the rescue of the casual database user, people who are not experts in all parts of the database system. They have a specific, repetitive need, and views provide them with a simpler interface to the data they need.

 

Another advantage that views bring to the table is security. We can restrict access to base tables and provide views containing only the data a particular group of users is allowed to see.

 

Good database design rules often force sensitive columns to be lumped together with oft-accessed fields. Views come to the rescue in such cases by effectively hiding the sensitive columns if you so choose.
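As an illustrative sketch, imagine the authors table also carried sensitive fields; a view like the hypothetical authors_public below would expose only the author names:

CREATE VIEW authors_public AS
SELECT author
FROM authors_tbl;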

 

For the database designers, views provide independence. To a reasonable degree, views allow the underlying base tables to change their structure to cater to evolving needs and yet views can remain the same. In other cases, views can be re-created with a different query underlying it but will contain the same data in the same format, providing a continuity to the user.

 

Creating a View

The general syntax of creating a view is pretty straightforward. In fact, it probably is as minimal and natural as you can get.

 

The general syntax of view creation

CREATE VIEW <view name> AS <query>

Now let us create a view for ourselves – language_chronology that will have only two fields, namely, languages and their years of creation.

Creating a language_chronology view
CREATE VIEW language_chronology AS
SELECT language, year
FROM proglang_tbl
ORDER BY year ASC;

 

Notice how we have explicitly added the ordering clause to the view creation. There are very few restrictions on what is allowed in the query part of CREATE VIEW. Let us now verify the results by running a query on the view exactly the same way as we would on a table.

 

Listing the contents of a view

SELECT * FROM language_chronology;
language year
Fortran 1957
JOVIAL 1959
APT 1959
RPG 1964
APL 1964
PL/I 1964
Prolog 1972
Perl 1987
Tcl 1988

 

We can also include calculated fields in the query part of view creation. The only thing we must keep in mind is to rename the calculated field column, failing which we would undoubtedly suffer a loss of clarity.

 

Creating a view with a calculated field

CREATE VIEW language_decade AS
SELECT language,
'The '||((year/10)*10)||'s' decade
FROM proglang_tbl;
language decade
Prolog The 1970s
Perl The 1980s
APL The 1960s
JOVIAL The 1950s
APT The 1950s
Tcl The 1980s
PL/I The 1960s
Fortran The 1950s
RPG The 1960s

 

If we had failed to rename the column as decade, our DBMS would execute the view creation, but the resultant view would be practically unusable. PostgreSQL would have named the column a mysterious ?column?,

 

whereas SQLite would have put the entire expression 'The '||((year/10)*10)||'s' as the name of the second field. Needless to say, we are better off renaming the fields of the view.

 

Another way to rename the fields is to specify it in the view definition clause rather than the query that populates it. This works just as well and is arguably clearer because it lists the fields up front.

 

 Renaming the field in the view definition clause

CREATE VIEW language_era (lang, era) AS
SELECT language,
'The '||((year/10)*10)||'s'
FROM proglang_tbl
WHERE year < 1971;
lang era
APL The 1960s
JOVIAL The 1950s
APT The 1950s
PL/I The 1960s
Fortran The 1950s
RPG The 1960s

 

If you choose this method of renaming columns, you must specify the names of all the columns in the view. Note that renaming a field has no effect on its data type or null status.

 

Modifying Data Through Views

There are wide-ranging opinions on whether data modification through views is a good idea. Some people prefer to treat views as a read-only listing of contents, but most DBMS systems provide some data modification ability through views.

 

Let us first try a simple update of the year column through our language_chronology view. Remember we pulled the year field into the view along with the language name.

 

Updating a value through a view

UPDATE language_chronology
SET year=1977
WHERE language='Fortran';

 

The statement executes fine in PostgreSQL. Now to verify whether it actually made a difference, let us verify the contents of the view first.

 Checking the contents of our modified view

SELECT * FROM language_chronology WHERE language='Fortran';
language year
Fortran 1977

 

All seems to be as expected in the view. While we have an inkling that the base table would also have been updated, let’s verify this too.

Checking the contents of our base table

SELECT * FROM proglang_tbl
WHERE language='Fortran';
id language author year standard
8 Fortran Backus 1977 ANSI

 

This also means that the other view language_era that was dependent on proglang_tbl would not contain the row for Fortran since its creation involved the use of the condition WHERE year < 1971. Records move in and out of views as the underlying base table contents change over time.
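We can verify this with a quick query on that view (assuming the year update above has been applied):

SELECT * FROM language_era WHERE lang='Fortran';
=> (0 rows)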

 

View modification in SQLite

If you attempted to execute this UPDATE command in SQLite, it would throw an error like:

 

Error: cannot modify language_chronology because it is a view

SQLite has clearly stated that it would not stand for data modification through a view, a design choice I happen to agree with.

 

Let’s attempt another data modification, but this time we will try to update the calculated field inside the view language_era. We know that JOVIAL was made in the year 1959, so we wish to round off the value and make its era to The 1960s.

 

Attempting to modify a calculated field of a view

UPDATE language_era SET era='The 1960s' WHERE lang='JOVIAL';
> ERROR: cannot update column "era" of view "language_era"
> DETAIL: View columns that are not columns of their base relation are not updatable.

 

The DBMS has rejected our request to update a calculated field because era does not exist in the base table proglang_tbl. If we think about it, this makes sense because the SQL interpreter would not know what year value to put in the base table.

 

A value of 1960, a value of 1969, and everything in between would make the era value The 1960s. The DBMS will not attempt to pick one at random because any choice would be ambiguous.

 

Changing the lang field of the same view is perfectly unambiguous and hence allowed.

Query to modify a non-calculated field of a view
UPDATE language_era SET lang='Jovial' WHERE lang='JOVIAL';
> UPDATE 1
SELECT * FROM proglang_tbl
WHERE id=4;
id language author year standard
4 Jovial Schwartz 1959 US-DOD

Similarly, we can create a view with aggregated columns using the GROUP BY clause, but modifying the contents of such a view is not allowed.

 

Creating a view with aggregate columns

CREATE VIEW standards AS
SELECT standard, count(*)
FROM proglang_tbl
GROUP BY standard;
standard count
ECMA 1
ANSI 2
ISO 3
US-DOD 1

 

We know from previous experience that adding a new row or modifying the aggregated column would be ambiguous and thus not allowed. But what if we attempted to just update the standard field value just like we did with JOVIAL? Would the update be reflected in all rows containing the field value?

 

Trying to modify a field value in an aggregated view

UPDATE standards SET standard='IS' WHERE standard='ISO';
> ERROR: cannot update view "standards"
> DETAIL: Views containing GROUP BY are not automatically updatable.
> HINT: To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.

 

The operation was disallowed but PostgreSQL gave us a hint on how to go about achieving this. While we won’t cover that technique, it’s good to know that in the rare case you do need it, it’s available in some database systems.

 

Deleting Views

To delete or remove a view in its entirety, we use the DROP VIEW command.

DROP VIEW standards;
Note that you cannot accidentally drop a table using DROP VIEW, which is a relief.
Dropping a view
DROP VIEW proglang_tbl;
> ERROR: "proglang_tbl" is not a view
> HINT: Use DROP TABLE to remove a table.

 

Indexing

Databases have long been the primary data storage components from which insights are derived. As businesses increasingly adopt technology-enabled workflows, the rate of data generation has grown substantially. This trend has accelerated with the adoption of the Internet and mobile computing.

 

A well-sized relational database used to run into hundreds of megabytes in the 1990s. Nowadays it is not uncommon to hear of database systems running into hundreds of gigabytes or even a few terabytes. As a professional, you will frequently encounter tables with millions of rows in them.

 

Until now we have seen the parts of SQL that allow you to perform operations and run queries, but we haven’t touched anything close to performance tuning. When you want to run your queries on multimillion record tables, taking a hard look at performance optimization is not optional.

 

One of the most common performance optimization techniques is indexing. An index allows the SQL engine to quickly look up specific records in a table.

This is not unlike jumping directly to the letter W in a dictionary when you wish to know the meaning of wistful. It would be quite tedious to go through all the pages sequentially starting with A until we reach our desired word.

 

A lot of commands in this blog are specific to the DBMS at hand. While the general concept of indexes and the basic commands to create and delete indexes remain similar across products, there is no getting around the fact that as we get deeper into our SQL journey, vendor-based differences become more visible.

 

Creating an Index

As with most statements in SQL, index creation is pretty straightforward. You use the CREATE INDEX command to achieve this.

 

General syntax of CREATE INDEX

CREATE INDEX <index name> ON <table name> (<column list>);

 

Let’s create a simple index on the language column of proglang_tbl. We make a reasonable assumption that there are going to be a lot of queries using the language field in the WHERE clause, and creating an index on it would increase performance.

 

Creating an index on proglang_tbl

CREATE INDEX language_idx ON proglang_tbl(language);

 

If this command succeeds in psql, you would not get an error back, but there is no other visual indication either. Let's verify our index creation attempt by listing the table description as we did in the blog. Instead of the detailed \d+ <table name> option, we will use the slightly more compact \d <table name> command.

 

Verifying index creation in PostgreSQL

\d proglang_tbl;
Table "public.proglang_tbl"
Column | Type | Modifiers
---------- +----------------------- +-----------
id | integer |
language | character varying(20) |
author | character varying(25) |
year | integer |
standard | character varying(10) |
Indexes:
"language_idx" btree (language)

 

We can see at the very end of the output, there is an entry for our index language_idx. Running the same CREATE INDEX on SQLite also succeeds, and there are two primary ways to verify the creation.

 

 Verifying index creation in SQLite using .schema

sqlite> .schema proglang_tbl
CREATE TABLE proglang_tbl (
id INTEGER NOT NULL,
language VARCHAR(20) NOT NULL,
author VARCHAR(25) NOT NULL,
year INTEGER NOT NULL,
standard VARCHAR(10) NULL);
CREATE INDEX language_idx ON proglang_tbl (language);

 

This showed us the DDL commands that were used to create the table and its related entities like indexes. Another approach is to use an SQLite pragma to list all indexes on a table. Pragmas are statements provided by SQLite to query its own metadata such as index information.

 

Verifying index creation in SQLite using a pragma

sqlite> PRAGMA index_list(proglang_tbl);
seq name unique origin partial
---------- ------------ ---------- ---------- ----------
0 language_idx 0 c 0

 

Using EXPLAIN to See Indexes at Work

We have now seen that the index we created actually exists, but how do we see it in action? We should be able to get a measurable speed up on a large table. The EXPLAIN command would come to our rescue here. But first, let’s go about creating a large table to run our index-enabled queries on.

 

A quick and dirty way to get a large table is to use a Cartesian product, or cross join. We already have our proglang_tbl with 5 columns and 9 rows in it. Cross joining five copies of the table, taking one field from each copy, should yield us 9 to the power 5 = 59049 rows.

 

This is not a huge table by any means but it would allow us to see the effect an index has on query execution.

 

Creating a big table using cross joins in PostgreSQL

SELECT a.language,
b.author,
c.year,
d.standard,
e.id
INTO biglang_tbl
FROM proglang_tbl a, proglang_tbl b, proglang_tbl c, proglang_tbl d, proglang_tbl e;
SELECT count(*) FROM biglang_tbl;
count
-------
59049

 

Now that we have a populated table, let us try to analyze the query time for finding all Fortran rows using EXPLAIN. While this command is not in the SQL standard, most relational database systems implement it.

 

Using EXPLAIN on a query in PostgreSQL

EXPLAIN SELECT * FROM biglang_tbl WHERE language = 'Fortran';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on biglang_tbl (cost=0.00..1150.11 rows=6486 width=24)
Filter: ((language)::text = 'Fortran'::text)
(2 rows)

 

Well, there is some output, even though it is not entirely evident yet what we are seeing. A query plan is a term for how the SQL engine is going to execute your query. Let us now contrast this output with the one we get after creating an index on the language field.

 

Using EXPLAIN on a query in PostgreSQL after creating an index

CREATE INDEX biglang_idx ON biglang_tbl (language);
EXPLAIN SELECT * FROM biglang_tbl WHERE language='Fortran';
QUERY PLAN
-------------------------------------------------------------------------------
Bitmap Heap Scan on biglang_tbl (cost=126.56..619.63 rows=6486 width=24)
Recheck Cond: ((language)::text = 'Fortran'::text)
-> Bitmap Index Scan on biglang_idx (cost=0.00..124.94 rows=6486 width=0)
Index Cond: ((language)::text = 'Fortran'::text)
(4 rows)

 

We immediately see that this output is different from the previous one, and it involves the use of our recently created index. The previous output mentioned a Seq Scan, or sequential scan, which, as the name suggests, involves going through our records one by one.

 

The current output, however, mentions a Bitmap Heap Scan and a Bitmap Index Scan, which sound much faster than a simple full scan. The details of how these particular scans work are out of the scope of this text, but we can look at another parameter in the output to get a relative sense of how efficient our index is in this case.

 

Both plans mention a parameter of the form cost=<1st value>..<2nd value>. The first value is the estimated startup cost, and the second is the estimated total cost of query execution. The smaller the total cost, the more efficient the query execution.

 

In the first output without the index, this value was estimated at 1150.11, while after index creation it drops to 619.63, a big win for us.
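Keep in mind that these cost figures are the planner’s estimates, not measured times. If you want actual execution timings, PostgreSQL also supports EXPLAIN ANALYZE, which runs the query for real and reports what happened; a minimal sketch follows (the timings you see will vary by machine).

Using EXPLAIN ANALYZE to see measured execution times

EXPLAIN ANALYZE SELECT * FROM biglang_tbl WHERE language='Fortran';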

 

While index creation on SQLite is similar to other databases, if you had tried to execute the SELECT ... INTO statement above in it, you would have gotten an error saying something along the lines of Error: near "INTO": syntax error.

 

The supported way to create the biglang_tbl in SQLite is the CREATE TABLE <table name> AS <query> form.

 

Creating the biglang_tbl in SQLite


CREATE TABLE biglang_tbl AS
SELECT a.language,
b.author,
c.year,
d.standard,
e.id
FROM proglang_tbl a, proglang_tbl b, proglang_tbl c, proglang_tbl d, proglang_tbl e;
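The EXPLAIN QUERY PLAN output shown a little further below references the biglang_idx index, so we first recreate it on this new SQLite table using the same statement that worked in PostgreSQL.

Creating the language index on the SQLite biglang_tbl

CREATE INDEX biglang_idx ON biglang_tbl (language);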

Also, instead of using plain EXPLAIN, which gives a huge and rather incomprehensible output at first glance, we use the more succinct EXPLAIN QUERY PLAN statement, as below.

 

Using EXPLAIN QUERY PLAN in SQLite

EXPLAIN QUERY PLAN SELECT * FROM biglang_tbl WHERE language='Fortran';
selectid order from detail
---------- ---------- ---------- ---------------------------
0 0 0 SEARCH TABLE biglang_tbl
USING INDEX biglang_idx
(language=?)

 

While the output is quite small compared to the one from PostgreSQL, we can clearly see that it is going to use our index to search the table for Fortran rows.

 

Unique Indexes

We can optionally specify the keyword UNIQUE during index creation to make an index that only allows non-duplicate values. This makes the index have a dual responsibility of data integrity along with performance enhancement.

 

The general syntax of UNIQUE index creation

CREATE UNIQUE INDEX <index name> ON <table name> (<column list>)
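As a concrete sketch of this syntax, consider a unique index on the language column of our original proglang_tbl, where every language name occurs exactly once; the index name language_uidx here is just an assumed choice.

Creating a unique index on proglang_tbl (index name assumed)

CREATE UNIQUE INDEX language_uidx ON proglang_tbl (language);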

However, since it has an implied data integrity meaning, we cannot use this kind of index on a field that is already known to have duplicate values. In our newly created biglang_tbl, the ID field is actually duplicated many times due to our cross join conditions. Creating a unique index on this field would result in an error.

 

Cannot create a unique index on a field containing duplicate values

CREATE UNIQUE INDEX id_idx ON biglang_tbl (id);
ERROR: could not create unique index "id_idx"
DETAIL: Key (id)=(4) is duplicated.

Similarly, adding a duplicate value into a field that has a unique index would result in an error along the lines of ERROR: duplicate key value violates unique constraint "<index name>".
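For instance, assuming the hypothetical language_uidx index from the sketch above exists, trying to insert a second Prolog row into proglang_tbl would fail along exactly these lines.

Inserting a duplicate into a uniquely indexed column (sketch)

INSERT INTO proglang_tbl (id, language, author, year, standard)
VALUES (10, 'Prolog', 'Colmerauer', 1972, 'ISO');
-- fails with: duplicate key value violates unique constraint "language_uidx"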

 

If you have an extremely sharp memory, you’d recall that this is the same error we saw earlier in the blog when we were discussing unique constraints. If we now describe the schema of the table involved, proglang_tbluk, we can see how PostgreSQL defines the constraints in terms of indexes.

 Describing a table with both a Primary Key and a Unique index

\d proglang_tbluk;
Table "public.proglang_tbluk"
Column | Type | Modifiers
---------------- + ------------------------ +-----------
id | integer | not null
language | character varying(20) | not null
author | character varying(30) | not null
year | integer | not null
standard | character varying(10) |
current_status | character varying(32) |

 

Indexes

"proglang_tbluk_pkey" PRIMARY KEY, btree (id) "proglang_tbluk_language_key" UNIQUE CONSTRAINT, btree (language)

 

How Do Indexes Work

Having a high-level overview of how indexes work can help the user write effective, fast-executing queries. Most SQL users ignore the conceptual understanding of indexes, but they are not really hard to grasp.

 

At the beginning of the blog, we compared using a database index to looking up a word in a dictionary. That lookup is made easier by the alphabetical ordering of a dictionary.

 

Similarly, a blog index allows you to look up concepts discussed in the blog by listing them alphabetically with a page number where the concept is discussed.

 

This is very similar to an actual database index. While the underlying details vary from implementation to implementation, it is helpful to think of it as an ordered lookup table. The values of the field being indexed are sorted and stored along with the pointers to the locations of the actual record in the base table.

 

With such a structure, the SQL interpreter would not have to traverse the whole table to find, say, the two rows with ANSI as the standard field. This inefficient whole-table traversal is what is sometimes referred to as a full table scan or a sequential scan. The point of an index is to avoid this kind of scan.

 

When someone writes a query with a WHERE clause looking for a specific value of standard, this index comes into effect automatically, without the user having to ask for it. Adding or deleting rows with different values of this field automatically updates the index too, so that it always reflects the latest data in the table.
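To make this concrete, here is a sketch of such an index on our example table; the name standard_idx is assumed.

Creating an index on the standard column (sketch)

CREATE INDEX standard_idx ON proglang_tbl (standard);
-- a query like this can now locate the ANSI rows without a full table scan
SELECT * FROM proglang_tbl WHERE standard = 'ANSI';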

 

Index Overheads

With all the niceties that indexes bring to the user without much effort in terms of arcane commands, sometimes users want to create indexes for every column possible. After all, there doesn’t seem to be a downside to index creation yet.

 

Well, as it turns out, there is no free lunch here either. If there is an index on every column of a table with N fields, then every DML statement like INSERT, UPDATE, or DELETE has to keep all N indexes in sync. This makes changing data slow for large tables, sometimes annoyingly so and sometimes worryingly so.

 

Another serious overhead that too many indexes bring is their storage requirements. Indexes occupy physical space on the disk just like a table. While storage has become cheap in recent times, database administrators are not known for their cavalier attitude toward server free space.

 

Let’s check how the disk is impacted by index creation. We will keep our focus on the biglang_tbl and its index biglang_idx. First, let’s find the total space occupied by the table and its related objects.

 

Displaying the total size of a table and its related objects in PostgreSQL

SELECT pg_size_pretty(pg_total_relation_size('biglang_tbl'));
pg_size_pretty
----------------
4608 kB
(1 row)

 

The function pg_total_relation_size returns the disk space occupied by the table, its indexes, and a few other things. pg_size_pretty converts the output into a more human-readable unit such as kB rather than a raw byte count.

 

As is evident from these function names, they are specific to PostgreSQL. Check your DBMS manual for commands to query the database catalog in case you are using a different product.

Now let’s find how much space the table and index take out of this figure. That should give us a relative idea about how big indexes get.

 

Displaying the size of a table and its index in PostgreSQL

testdb=# SELECT pg_size_pretty(pg_relation_size('biglang_tbl'));
pg_size_pretty
----------------
3296 kB
(1 row)
testdb=# SELECT pg_size_pretty(pg_relation_size('biglang_idx'));
pg_size_pretty
----------------
1312 kB
(1 row)

Our index is roughly 39% of the size of our table! Not to mention it occupies 28% of the total relation size of biglang_tbl. Keep in mind that we are talking about a single index on one column here. Clearly, we need to be parsimonious with our index creation.
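As an aside, when a table carries several indexes, PostgreSQL provides the pg_indexes_size function to report the combined size of all indexes on a table in one go.

Displaying the combined size of all indexes on a table in PostgreSQL

SELECT pg_size_pretty(pg_indexes_size('biglang_tbl'));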

 

A good rule of thumb is to lean on the primary key and unique indexes in your queries. Over time you will start seeing patterns of slow-running queries.

 

If these queries are not run often, perhaps we can live with the extra time taken. But if the slow-running queries are run regularly and often contain the same field in the WHERE clause that is not indexed, it is a perfect candidate for index creation.
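For example, if queries filtering on the author column kept turning up in your slow-query list, the remedy would be a sketch like the following; the index name author_idx is assumed.

Indexing a frequently filtered column (sketch)

CREATE INDEX author_idx ON biglang_tbl (author);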

 

Index size in SQLite

You might have noticed that we discussed index sizes in PostgreSQL only.

 

The helper functions like pg_size_pretty, pg_total_relation_size, and pg_relation_size are specific to PostgreSQL and are not a part of the SQL standard.

 

I haven’t found a good way to calculate index sizes in SQLite. Since it is a self-contained, single-file database, you could check the file size before and after index creation to get a rough estimate.
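A rough alternative that stays inside SQLite is to multiply the database’s page count by its page size, which gives the database size in bytes; comparing this figure before and after CREATE INDEX approximates the index’s footprint.

Estimating database size in SQLite using pragmas (rough sketch)

sqlite> PRAGMA page_count;
sqlite> PRAGMA page_size;
-- multiplying the two results gives the database size in bytes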

 

Deleting an Index

If you no longer need an index, SQL gives you the DROP INDEX command to delete an index. The general syntax of this command is simple enough.

General syntax of DROP INDEX

DROP INDEX <index name>

 

Note that this does not change the data of the underlying table in any way. All you are doing is dropping the index, so queries that relied on it may become slower.

DROP INDEX doesn’t change the contents of the underlying table
DROP INDEX biglang_idx;
SELECT COUNT(*) FROM biglang_tbl;
count
-------
59049
(1 row)

 

Access Control Statements

Due to their ease of use, extensive feature sets, and reliability, SQL-based relational database management systems have become the golden source of truth for enterprises everywhere.

 

When large companies think of storing critical data, the only question in their minds is which relational DBMS vendor to use, not what kind of data store.

 

When such important data is being stored in the system and when the DBMS becomes the central data store across the organization, some level of access control is absolutely essential.

 

Access control refers to permissions within a software system. When you log onto a server or sometimes even your own computer, you have been given permission to access the resources of the system.

 

Often when you wish to install new software on your machine, you require root or administrator privileges. These are the operating system’s access control mechanisms at work.

 

Relational databases understandably also have very powerful access control mechanisms. While systems vary widely in how they provide access control, almost all vendors provide the Data Control Language (DCL) SQL commands GRANT and REVOKE.

 

Access Control in SQLite

Permissions truly come into play in multi-user systems, that is, when multiple users have access to a system but not equally. This is typically the case in client-server database systems like PostgreSQL, Sybase ASE, etc.

 

SQLite is a single file-based system typically used in scenarios where we need to embed a simple database within tight constraints or within an application. It is not truly meant for multi-user access, though there are provisions in it to allow this to some degree.

 

This does not mean that it is not a capable RDBMS. I personally think the world would be better off using SQLite in half the cases where more expensive and resource-hungry systems have been deployed, but that is a discussion for another time.

 

Being single file based, SQLite relies on the operating system to grant or restrict access to its data file. Consequently, it does not implement the GRANT or REVOKE commands. The rest of the blog focuses on access control mechanisms using PostgreSQL examples.

 

Creating New Users in PostgreSQL

For running the examples in the blog, we have been using the user postgres. This has served us well as a catch-all account with all rights and permissions.

 

Just to recap, we have been starting our PSQL session by specifying the username in the -U option, as below.

The PSQL session start command with user postgres

/opt/PostgreSQL/9.5/bin/psql -U postgres

 

But this does not accurately reflect a real-world setup. Usually, you would have your own user account with fewer privileges than the administrator account. This is safer both for you and the database administrators, since a single compromised account does not affect everything in the system.

 

Let us go about creating a new user in PostgreSQL through PSQL, and then we’ll go on to specifying the rights that the particular user gets.

 

Creating a new user in PSQL

postgres=# CREATE USER primer WITH PASSWORD 'hunter2';

CREATE ROLE

postgres=#

 

When you execute this command, a new user by the name primer is created with the password hunter2 and the console displays the message CREATE ROLE.

 

Since the session was created with the user postgres, and the prompt before the =# remains the same, we can see that the CREATE USER command does not switch to the new user but continues the same session as postgres.

 

The legend of hunter2

Before social networks were popular, Internet Relay Chat (IRC) ruled the instant messaging landscape. People connected to an IRC server, joined one or more chat rooms called channels, and talked with like-minded people.

 

If a conversation was particularly funny, people would post it to a quote database site. Around 2004, somebody posted a funny exchange between two users in which the first convinces the other that when they type their real password in IRC, everyone else sees only ******’s.

 

This was of course not true, but funny nonetheless. Whether the conversation truly happened can also not be verified, but the password in question was ‘hunter2’.

We will now try to verify whether the user was indeed created. Like before, PSQL gives us a short command for this – \du.

 

Checking if the user primer has been created

postgres=# \du
List of roles
Role name | Attributes                                       | Member of
----------+--------------------------------------------------+-----------
postgres  | Superuser, Create role, Create DB, Replication,  | {}
          | Bypass RLS                                       |
primer    |                                                  | {}

We see that indeed our user has been created, though the list of its attributes is empty. Don’t worry, we will get to that in a bit. If you want to verify the same information without using the PSQL metacommand, we can query the inbuilt database catalog information.

 

Querying user information from the database catalog

SELECT usename,
usesysid,
usecreatedb,
usesuper
FROM pg_user;
usename  | usesysid | usecreatedb | usesuper
---------+----------+-------------+----------
postgres |       10 | t           | t
primer   |    41095 | f           | f

 

Grant Privileges to Users


Let us try to open a PSQL session using this newly created user. We will pass the value primer to the -U option of PSQL.

Trying to log in using our newly created user
/opt/PostgreSQL/9.5/bin/psql -U primer
Password for user primer:
psql.bin: FATAL: database "primer" does not exist

Along with authenticating the user, PSQL also tried to open the default database for the new user, whose name it assumed to be the same as the username. We’ll remedy this by explicitly stating that we want to operate on testdb.

 

Connecting to testdb as the primer user

/opt/PostgreSQL/9.5/bin/psql -U primer -d testdb
Password for user primer:
psql.bin (9.5.8)
Type "help" for help.
testdb=> SELECT * FROM proglang_tbl;
ERROR: permission denied for relation proglang_tbl

 

Logging into the testdb database worked. But when we ran a basic query on one of the tables, we immediately got a permission denied error. This seems logical in retrospect, since we haven’t granted any access to primer. We don’t want new users to immediately gain access to our meticulously created tables.

 

We will use the GRANT statement to give specific privileges to our newly created user. The general syntax of a GRANT is given below.

 

The general syntax of GRANT

GRANT <privilege> ON <table name> TO <user>

The most obvious privilege we wish to give the primer user is the ability to query proglang_tbl. This is equivalent to giving the user read-only access to that particular table. We run this statement as the superuser postgres, who bestows privileges on other users.

Granting SELECT to primer

GRANT SELECT ON proglang_tbl TO primer;

 

Now we can exit the PSQL session as postgres and reopen testdb as the user primer. We will attempt to first query and then update a row in the table, just to see whether the GRANT statement worked as advertised.

 

Verifying whether primer can query or update the table

testdb=> SELECT count(*) FROM proglang_tbl;
count
-------
9
(1 row)
testdb=> UPDATE proglang_tbl SET year=1982 WHERE author='Ross';
ERROR: permission denied for relation proglang_tbl

 

The query worked fine but the row update did not, so everything is working as expected. As you might have guessed, we would need to GRANT the UPDATE privilege too for the second statement to work.
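A sketch of that additional grant, again run as the superuser postgres, is below; once it is in place, the UPDATE statement above should succeed.

Granting UPDATE to primer (sketch)

GRANT UPDATE ON proglang_tbl TO primer;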

 

SELECT and UPDATE are not the only privileges available for fine-grained access control. You can specify other privileges like INSERT and DELETE too.

 

Finally, there is an ALL privilege that grants all the available privileges on that particular database object to the user specified. If you wish to specify multiple privileges in one go, you can specify them like a list.

 

Granting multiple privileges in one go

GRANT SELECT, UPDATE, INSERT ON proglang_tbl TO primer;

Granting privileges is usually done when the users who wish to access the table or database object in question are not the ones who created it. If a user has created a table, they get all privileges on it by default. There are other privileges in most DBMS systems out there than the four basic ones we covered.

 

However, their use is usually of interest to the database administrators rather than query users. Feel free to refer to your DB manual to know more about the supported privileges.

 

Revoking Privileges

The REVOKE command is the exact opposite of GRANT. It allows you to remove privileges from a user for a database object. Its general syntax is similar to GRANT with the exception that it uses FROM instead of TO.

General syntax of REVOKE

REVOKE <privilege> ON <table name> FROM <user>

 

Earlier in the blog, we mentioned how views help with data security by providing a virtual table containing only the fields you want to show to others. But this plan would be foiled if users could query the base table too.

 

Using REVOKE here is a good idea. We allow users to query the view but not the underlying table. This way we ensure that usability is not hampered, while still keeping together in the base table all the fields that belong together at a data-modeling level.

 

Revoking access on the base table

testdb=# GRANT SELECT ON language_decade TO primer;
testdb=# REVOKE SELECT ON proglang_tbl FROM primer;

 

We are running these commands using the superuser Postgres. Let us now log in as the user primer and see how these statements have affected our privileges.

 

 Checking privileges of the user primer

SELECT * FROM proglang_tbl;
ERROR: permission denied for relation proglang_tbl
SELECT * FROM language_decade WHERE decade='The 1950s';
language | decade
---------- + -----------
Jovial | The 1950s
APT | The 1950s
(2 rows)

 

We notice that the user can query the view but not the base table. This is, in fact, a very common access control workflow in large databases. Right after the data definition process, different views are created on the basis of how we expect the data to be queried, and then base table privileges are revoked for interactive querying.

 

While we can specify a list of users in GRANT and REVOKE, we cannot reasonably expect the list of users of a database system to remain the same over time. Most DBMS software provides a keyword PUBLIC to refer to all current and future users of a system. This can be used with our access control statements to minimize the need for routine access administration.

 

Using PUBLIC with access control statements

testdb=# GRANT ALL ON proglang_tbl TO PUBLIC;
testdb=# REVOKE DELETE ON proglang_tbl FROM PUBLIC;

 

What these two statements in succession would achieve is to first open up the proglang_tbl for everyone and then remove only the DELETE privilege.

 

The other privileges like INSERT, UPDATE, etc., would be available to all users of the system without us having to list them one by one. If a new user is created, these access control levels would be applicable to them too.
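If you ever want to double-check what privileges are currently in effect on a table, PSQL offers the \dp metacommand; the exact output columns vary between versions.

Inspecting the current privileges on a table in PSQL

testdb=# \dp proglang_tbl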
