Join (SQL)
A join combines records from two tables in a relational database and results in a new (temporary) table, also called joined table. In the Structured Query Language (SQL:2003), there are two types of joins: inner and outer.
A join predicate is applied to identify the records to be joined. If the predicate evaluates true, then the combined record inserts into the joined table; otherwise, it does not contribute. The join predicate can be any predicate supported by SQL, for example, WHERE clauses.
As a special case, a table (base table, view, or joined table) can be joined with itself again. This is called self-join.
Mathematically, a join is a relation composition. It is the fundamental operation in relational algebra and generalizes function composition.
Sample tables
All subsequent explanations on join types make use of the following two tables. The rows in these tables are used to illustrate the effect of different types of joins and join predicates.
DepartmentID | DepartmentName |
---|---|
31 | Sales |
33 | Engineering |
34 | Clerical |
35 | Marketing |
LastName | DepartmentID |
---|---|
Rafferty | 31 |
Jones | 33 |
Steinberg | 33 |
Robinson | 34 |
Smith | 34 |
Jasper | 36 |
Inner join
An inner join essentially combines the records from two tables A and B based on a given join predicate. The cross product of all records in table is computed. Thus, each record in table A is combined with every record in table B. Only those records in the joined table that satisfy the join predicate remain. This is the most common type of join used in applications, and is considered the default join type.
SQL:2003 specifies two different syntactical ways to express joins. The first, called explicit join notation, uses the keyword JOIN
, whereas the second is the implicit join notation. The implicit join notation uses commas to separate the tables to be joined in the FROM clause of a SELECT statement. Thus, it always computes a cross join and the WHERE clause may apply additional filter predicates. Those filter predicates are comparable to join predicates in the explicit notation.
Inner joins can be further classified into equi-joins, natural joins, and cross joins.
Special care must be taken when joining tables on columns that can be NULL since NULL will never match any other value or NULL itself, unless the join condition uses explicitly the IS NULL
or IS NOT NULL
predicates.
As an example, the following query takes all the records from table Employee and finds the matching record(s) in table Department based on the join predicate. The join predicate compares the values in the DepartmentID column in both tables. If no match is found, i.e. the department-id of an employee does not match with the current department-id from the Department table, then the joined record is not included in the joined table, i.e. the (intermediate) result of the join.
Example of an explicit inner join:
SELECT * FROM employee INNER JOIN department ON employee.DepartmentID = department.DepartmentID
Example of an implicit inner join:
SELECT * FROM employee, department WHERE employee.DepartmentID = department.DepartmentID
Inner join result:
Employee.LastName | Employee.DepartmentID | Department.DepartmentName | Department.DepartmentID |
---|---|---|---|
Smith | 34 | Clerical | 34 |
Jones | 33 | Engineering | 33 |
Robinson | 34 | Clerical | 34 |
Steinberg | 33 | Engineering | 33 |
Rafferty | 31 | Sales | 31 |
Notice that employee Jasper and department Marketing do not appear. Neither of these have any matching records in the respective other table, i.e. there is no department with the ID 36 and no employee has the department ID 35. Thus, the information for both does not appear in the joined table.
Equi-join
An equi-join is a specific type of comparator-based join, or theta join, that uses only equality comparisons in the join predicate. Using other comparison operators such as <
disqualifies a join as equi-join. The query 2 sections above is already an example for equi-joins.
SELECT * FROM employee INNER JOIN department ON employee.DepartmentID = department.DepartmentID
The resulting joined table contains two columns named DepartmentID, one from table Employee and one from table Department.
SQL:2003 does not have a specific syntax to express equi-joins, but some database engines provide a shorthand syntax, e.g. MySQL supports USING(DepartmentID)
in addition to the ON ...
syntax.
Natural join
A natural join is a further specialization of equi-joins. The join predicate is implicitly given by comparing all columns in both tables that have the same column name in the tables to be joined. The resulting joined table contains only one column for each pair of equally-named columns.
The above sample query for inner joins can be expressed as natural join in the following way:
SELECT * FROM employee NATURAL JOIN department
The result is slightly different, however, because only one DepartmentID column occurs in the joined table.
Employee.LastName | DepartmentID | Department.DepartmentName |
---|---|---|
Smith | 34 | Clerical |
Jones | 33 | Engineering |
Robinson | 34 | Clerical |
Steinberg | 33 | Engineering |
Rafferty | 31 | Sales |
Using the NATURAL JOIN keyword to express joins is ambiguous at best and leaves you open to problems if schema evolution occurs in the database system. For example, the removal, addition, or renaming of columns changes the semantics of a natural join. Thus, it is usually better to explicitly code the join condition using a regular inner join.
Cross join
A cross join or cartesian join is the foundation upon which all types of inner joins are built. A cross join returns the cartesian product of the sets of records from the two joined tables. Thus, it is an inner join where the join condition always evaluates to True.
If A and B are two sets then cross join = A X B.
The SQL code for a cross join lists the tables to be joined (FROM
), but does not include any filtering join predicate.
Example of an explicit cross join:
SELECT * FROM employee CROSS JOIN department
Example of an implicit cross join:
SELECT * FROM employee, department;
Employee.LastName | Employee.DepartmentID | Department.DepartmentName | Department.DepartmentID |
---|---|---|---|
Rafferty | 31 | Sales | 31 |
Jones | 33 | Sales | 31 |
Steinberg | 33 | Sales | 31 |
Smith | 34 | Sales | 31 |
Robinson | 34 | Sales | 31 |
Jasper | 36 | Sales | 31 |
Rafferty | 31 | Engineering | 33 |
Jones | 33 | Engineering | 33 |
Steinberg | 33 | Engineering | 33 |
Smith | 34 | Engineering | 33 |
Robinson | 34 | Engineering | 33 |
Jasper | 36 | Engineering | 33 |
Rafferty | 31 | Clerical | 34 |
Jones | 33 | Clerical | 34 |
Steinberg | 33 | Clerical | 34 |
Smith | 34 | Clerical | 34 |
Robinson | 34 | Clerical | 34 |
Jasper | 36 | Clerical | 34 |
Rafferty | 31 | Marketing | 35 |
Jones | 33 | Marketing | 35 |
Steinberg | 33 | Marketing | 35 |
Smith | 34 | Marketing | 35 |
Robinson | 34 | Marketing | 35 |
Jasper | 36 | Marketing | 35 |
The cross join does not apply any predicate to filter records from the joined table. These joins are very rarely used. Still, the results of a cross join can be further filtered using a WHERE clause.
Outer joins
Outer joins do not require that each record in the two joined tables has a matching record in the other table. The record is retained in the joined table if no other matching record exists. Outer joins are subdivided further into left outer joins, right outer joins, and full outer joins, depending from which table the rows shall be retained (left, right, or both).
There is no implicit join notation for outer joins in SQL:2003.
Left outer join
The result of a left outer join for tables A and B always contains all records of the "left" table (A), even if the join condition does not find any matching record in the "right" table (B). This means that if the ON clause matches 0 (zero) records in B, a row in the result will still be returned — but with NULL in each column from B.
- A left outer join returns all the values from left table and matched values from right table (or NULL in case of no matching join predicate).
For example, this allows us to find the employee's departments, but still show the employee even when their department does not exist (contrary to the inner join example above, where employees in non-existent departments were filtered out).
Example of a left outer join:
SELECT distinct * FROM employee LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID
Employee.LastName | Employee.DepartmentID | Department.DepartmentName | Department.DepartmentID |
---|---|---|---|
Jones | 33 | Engineering | 33 |
Rafferty | 31 | Sales | 31 |
Robinson | 34 | Clerical | 34 |
Smith | 34 | Clerical | 34 |
Jasper | 36 | NULL | NULL |
Steinberg | 33 | Engineering | 33 |
Right outer join
A right outer join is much like a left outer join, except that the tables are reversed. Every record from the "right" table (B) will be in the joined table at least once. If there is no matching row from the "left" table (A), and NULL will be used for columns from A for those records that have any match in A.
- A right outer join returns all the values from right table and matched values from left table (or NULL in case of no matching join predicate).
Example right outer join :
SELECT * FROM employee RIGHT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID
Employee.LastName | Employee.DepartmentID | Department.DepartmentName | Department.DepartmentID |
---|---|---|---|
Smith | 34 | Clerical | 34 |
Jones | 33 | Engineering | 33 |
Robinson | 34 | Clerical | 34 |
Steinberg | 33 | Engineering | 33 |
Rafferty | 31 | Sales | 31 |
NULL | NULL | Marketing | 35 |
Full outer join
A full outer join combines the results of both left and right outer joins. The joined table will contain all records from both tables, and fill in NULLs for missing matches on either side.
Example full outer join:
SELECT * FROM employee FULL OUTER JOIN department ON employee.DepartmentID = department.DepartmentID
Employee.LastName | Employee.DepartmentID | Department.DepartmentName | Department.DepartmentID |
---|---|---|---|
Smith | 34 | Clerical | 34 |
Jones | 33 | Engineering | 33 |
Robinson | 34 | Clerical | 34 |
Jasper | 36 | NULL | NULL |
Steinberg | 33 | Engineering | 33 |
Rafferty | 31 | Sales | 31 |
NULL | NULL | Marketing | 35 |
Some database systems do not support this functionality, but it can be emulated through the use of left and right outer joins and unions. The same example can be expressed in this way:
SELECT * FROM employee LEFT JOIN department ON employee.DepartmentID = department.DepartmentID UNION SELECT * FROM employee RIGHT JOIN department ON employee.DepartmentID = department.DepartmentID WHERE employee.DepartmentID IS NULL
Implementation
The efficient implementation of joins has been the goal of much work in database systems, because joins are both extremely common and difficult to execute efficiently. The difficulty results from the fact that (inner) joins are both commutative and associative. In practice, this means that the user merely supplies the list of tables to be joined and the join conditions to be used, and the database system has the task of determining the most efficient way to perform the operation. Determining how to execute a query containing joins is done by the query optimizer. It has two basic freedoms:
- Join order
- Because joins are commutative, the order in which tables are joined does not change the final result set of the query. However, join order does have an enormous impact on the cost of the join operation, so choosing the best join order is very important.
- Join method
- Given two tables and a join condition, there are multiple algorithms to produce the result set of the join. Which algorithm is most efficient depends on the sizes of the input tables, the number of rows from each table that match the join condition, and the operations required by the rest of the query.
Many join algorithms treat their inputs differently. The inputs to a join are referred to as the outer and inner join operands, or left and right, respectively. In the case of nested loops, for example, the entire inner relation will be scanned for each row of the outer relation. Query plans involving joins can be classified as:
- Left-deep
- the inner operand of each join in the plan is a base table (rather than another join).
- Right-deep
- the outer operand of each join in the plan is a base table.
- Bushy
- neither left-deep nor right-deep; both inputs to a join may be joins themselves.
These names are derived from the appearance of the query plan if drawn as a tree, with the outer join relation on the left and the inner relation on the right (as is the convention).
Join algorithms
There are three fundamental algorithms to perform a join operation.
Nested loops
This is the simplest join algorithm. For each tuple in the outer join relation, the entire inner join relation is scanned, and any tuples that match the join condition are added to the result set. Naturally, this algorithm performs poorly if either the inner or outer join relation is very large. The performance can be enhanced if the inner relation has an index on columns in the join predicate.
A refinement to this technique is called "block nested loops" (BNL): for every block in the outer relation, the entire inner relation is scanned. For each match between the current inner tuple and one of the tuples in the current block of the outer relation, a tuple is added to the join result set. This variant means that more computation is done for each tuple of the inner relation, but far fewer scans of the inner relation are required.
Merge join
If both join relations are sorted by the join attribute(s), the join can be performed trivially:
- For each tuple in the outer relation,
- Consider the current "group" of tuples from the inner relation; a group consists of a set of contiguous tuples in the inner relation with the same value in the join attribute.
- For each matching tuple in the current inner group, add a tuple to the join result. Once the inner group has been exhausted, both the inner and outer scans can be advanced to the next group.
This is one reason why many optimizers keep track of the sort order of query nodes — if one or both input relations to a merge join is already sorted on the join attribute, an additional sort is not required. Otherwise, the DBMS will need to perform the sort, usually using an external sort to avoid consuming too much memory.
Hash join
A hash join algorithm can be used for equi-joins. The access to the tables to be joined is preformed by building hash tables on the join attributes. The lookup in hash tables is much faster than through index trees. However, hashed values can only be compared for equality.