SQL Joins

Use CROSS JOIN +to Force a Particular Join Ordering

The SQLite query optimizer +will attempt to reorder the tables in a join in order to find the most efficient +way to evaluate the join. The optimizer usually does this job well, but occasionally +it will make a bad choice. When that happens, it might be necessary to override +the optimizer's choice by explicitly specifying the order of tables in the +SELECT statement.

To illustrate the problem, consider the following +schema:

+CREATE TABLE node( + id INTEGER PRIMARY KEY, + name TEXT +); + +CREATE INDEX node_idx ON node(name); + +CREATE TABLE edge( + orig INTEGER REFERENCES node, + dest INTEGER REFERENCES node, + PRIMARY KEY(orig, dest) +); + +CREATE INDEX edge_idx ON edge(dest,orig); +

This schema defines a directed graph with the ability to store +a name on each node of the graph. Similar designs (though usually more complicated) +arise frequently in application development. Now consider a three-way join +against the above schema:

+SELECT e.* +FROM edge AS e, + node AS n1, + node AS n2 +WHERE n1.name = 'alice' + AND n2.name = 'bob' + AND e.orig = n1.id + AND e.dest = n2.id; +

This query asks for information about all edges that go from +nodes labelled “alice” over to nodes labelled “bob”.

There are many +ways that the optimizer might choose to implement this query, but they all +boil down to two basic designs. The first option looks for edges between all +pairs of nodes. The following pseudocode illustrates:

+foreach n1 where n1.name='alice' do: + foreach n2 where n2.name='bob' do: + foreach e where e.orig=n1.id and e.dest=n2.id do: + return e.* + end + end +end +

The second design is to loop over all 'alice' nodes and follow +edges off of those nodes looking for nodes named 'bob'. (The roles of 'alice' +and 'bob' might be reversed here without changing the fundamental character +or the algorithm):

+foreach n1 where n1.name='alice' do: + foreach e where e.orig=n1.id do: + foreach n2 where n2.id=e.dest and n2.name='bob' do: + return e.* + end + end +end +

The first algorithm above corresponds to a join order of n1-n2-e. +The second algorithm corresponds to a join order of n1-e-n2.

The question +the optimizer has to answer is which of these two algorithms is likely to +give the fastest result, and it turns out that the answer depends on the nature +of the data stored in the database.

Let the number of alice nodes +be M and the number of bob nodes be N. Consider two scenarios: In the first +scenario, M and N are both 2 but there are thousands of edges on each node. +In this case, the first algorithm is preferred. With the first algorithm, +the inner loop checks for the existence of an edge between a pair of nodes +and outputs the result if found. But because there are only 2 alice and bob +nodes each, the inner loop only has to run 4 times and the query is very quick.

The +second algorithm would take much longer here. The outer loop of the second +algorithm only executes twice, but because there are a large number of edges +leaving each 'alice' node, the middle loop has to iterate many thousands of +times. So in the first scenario, we prefer to use the first algorithm.

Now +consider the case where M and N are both 3500. But suppose each of these nodes +is connected by only one or two edges. In this case, the second algorithm +is preferred.

With the second algorithm, the outer loop still has +to run 3500 times, but the middle loop only runs once or twice for each outer +loop and the inner loop will only run once for each middle loop, if at all. +So the total number of iterations of the inner loop is around 7000.

The +first algorithm, on the other hand, has to run both its outer loop and its +middle loop 3500 times each, resulting in 12 million iterations of the middle +loop. Thus in the second scenario, second algorithm is nearly 2000 times faster +than the first.

In this particular example, if you run ANALYZE on +your database to collect statistics on the tables, the optimizer will be able +to figure out the best algorithm to use. But if you do not want to run ANALYZE +or if you do not want to waste database space storing the SQLITE_STAT1 statistics +table that ANALYZE generates, you can manually override the decision of the +optimizer by specifying a particular order for tables in a join. You do this +by substituting the keyword phrase “CROSS JOIN” in place of commas in the +FROM clause.

The “CROSS JOIN” phrase forces the table to the left +to be used before the table to the right. For example, to force the first +algorithm, write the query this way:

+SELECT * +FROM node AS n1 CROSS JOIN + node AS n2 CROSS JOIN + edge AS e +WHERE n1.name = 'alice' + AND n2.name = 'bob' + AND e.orig = n1.id + AND e.dest = n2.id; +

And to force the second algorithm, write the query like this:

+SELECT * +FROM node AS n1 CROSS JOIN + edge AS e CROSS JOIN + node AS n2 +WHERE n1.name = 'alice' + AND n2.name = 'bob' + AND e.orig = n1.id + AND e.dest = n2.id; +

The CROSS JOIN keyword phrase is perfectly valid SQL syntax +according to the SQL standard, but it is syntax that is rarely if ever used +in real-world SQL statements. Because it is so rarely used otherwise, SQLite +has appropriated the phrase as a means to override the table reordering decisions +of the query optimizer.

The CROSS JOIN connector is rarely needed +and should probably never be used prior to the final performance tuning phase +of application development. Even then, SQLite usually gets the order of tables +in a join right without any extra help. But on those rare occasions when SQLite +gets it wrong, the CROSS JOIN connector is an invaluable way of tweaking the +optimizer to do what you want.