SQL Joins

SQL JoinsThis guide explains how to use CROSS JOIN phrases to override the -optimizer's ordering of tables. -

Introduction

SQLite uses the “CROSS JOIN” phrase -as a means to override the table reordering decisions of the query optimizer. -The CROSS JOIN connector is rarely needed and should probably never be used -prior to the final performance tuning phase of application development. Even -then, SQLite usually gets the order of tables in a join right without any -extra help. But on those rare occasions when SQLite gets it wrong, the CROSS -JOIN connector is an invaluable way of tweaking the optimizer to do what you -want.

Intended -audience:

This document is intended to be used by Symbian platform -licensees and third party application developers.

Use CROSS JOIN -to Force a Particular Join Ordering

The SQLite query optimizer -will attempt to reorder the tables in a join in order to find the most efficient -way to evaluate the join. The optimizer usually does this job well, but occasionally -it will make a bad choice. When that happens, it might be necessary to override -the optimizer's choice by explicitly specifying the order of tables in the -SELECT statement.

To illustrate the problem, consider the following -schema:

-CREATE TABLE node( - id INTEGER PRIMARY KEY, - name TEXT -); - -CREATE INDEX node_idx ON node(name); - -CREATE TABLE edge( - orig INTEGER REFERENCES node, - dest INTEGER REFERENCES node, - PRIMARY KEY(orig, dest) -); - -CREATE INDEX edge_idx ON edge(dest,orig); -

This schema defines a directed graph with the ability to store -a name on each node of the graph. Similar designs (though usually more complicated) -arise frequently in application development. Now consider a three-way join -against the above schema:

-SELECT e.* -FROM edge AS e, - node AS n1, - node AS n2 -WHERE n1.name = 'alice' - AND n2.name = 'bob' - AND e.orig = n1.id - AND e.dest = n2.id; -

This query asks for information about all edges that go from -nodes labelled “alice” over to nodes labelled “bob”.

There are many -ways that the optimizer might choose to implement this query, but they all -boil down to two basic designs. The first option looks for edges between all -pairs of nodes. The following pseudocode illustrates:

-foreach n1 where n1.name='alice' do: - foreach n2 where n2.name='bob' do: - foreach e where e.orig=n1.id and e.dest=n2.id do: - return e.* - end - end -end -

The second design is to loop over all 'alice' nodes and follow -edges off of those nodes looking for nodes named 'bob'. (The roles of 'alice' -and 'bob' might be reversed here without changing the fundamental character -or the algorithm):

-foreach n1 where n1.name='alice' do: - foreach e where e.orig=n1.id do: - foreach n2 where n2.id=e.dest and n2.name='bob' do: - return e.* - end - end -end -

The first algorithm above corresponds to a join order of n1-n2-e. -The second algorithm corresponds to a join order of n1-e-n2.

The question -the optimizer has to answer is which of these two algorithms is likely to -give the fastest result, and it turns out that the answer depends on the nature -of the data stored in the database.

Let the number of alice nodes -be M and the number of bob nodes be N. Consider two scenarios: In the first -scenario, M and N are both 2 but there are thousands of edges on each node. -In this case, the first algorithm is preferred. With the first algorithm, -the inner loop checks for the existence of an edge between a pair of nodes -and outputs the result if found. But because there are only 2 alice and bob -nodes each, the inner loop only has to run 4 times and the query is very quick.

The -second algorithm would take much longer here. The outer loop of the second -algorithm only executes twice, but because there are a large number of edges -leaving each 'alice' node, the middle loop has to iterate many thousands of -times. So in the first scenario, we prefer to use the first algorithm.

Now -consider the case where M and N are both 3500. But suppose each of these nodes -is connected by only one or two edges. In this case, the second algorithm -is preferred.

With the second algorithm, the outer loop still has -to run 3500 times, but the middle loop only runs once or twice for each outer -loop and the inner loop will only run once for each middle loop, if at all. -So the total number of iterations of the inner loop is around 7000.

The -first algorithm, on the other hand, has to run both its outer loop and its -middle loop 3500 times each, resulting in 12 million iterations of the middle -loop. Thus in the second scenario, second algorithm is nearly 2000 times faster -than the first.

In this particular example, if you run ANALYZE on -your database to collect statistics on the tables, the optimizer will be able -to figure out the best algorithm to use. But if you do not want to run ANALYZE -or if you do not want to waste database space storing the SQLITE_STAT1 statistics -table that ANALYZE generates, you can manually override the decision of the -optimizer by specifying a particular order for tables in a join. You do this -by substituting the keyword phrase “CROSS JOIN” in place of commas in the -FROM clause.

The “CROSS JOIN” phrase forces the table to the left -to be used before the table to the right. For example, to force the first -algorithm, write the query this way:

-SELECT * -FROM node AS n1 CROSS JOIN - node AS n2 CROSS JOIN - edge AS e -WHERE n1.name = 'alice' - AND n2.name = 'bob' - AND e.orig = n1.id - AND e.dest = n2.id; -

And to force the second algorithm, write the query like this:

-SELECT * -FROM node AS n1 CROSS JOIN - edge AS e CROSS JOIN - node AS n2 -WHERE n1.name = 'alice' - AND n2.name = 'bob' - AND e.orig = n1.id - AND e.dest = n2.id; -

The CROSS JOIN keyword phrase is perfectly valid SQL syntax -according to the SQL standard, but it is syntax that is rarely if ever used -in real-world SQL statements. Because it is so rarely used otherwise, SQLite -has appropriated the phrase as a means to override the table reordering decisions -of the query optimizer.

The CROSS JOIN connector is rarely needed -and should probably never be used prior to the final performance tuning phase -of application development. Even then, SQLite usually gets the order of tables -in a join right without any extra help. But on those rare occasions when SQLite -gets it wrong, the CROSS JOIN connector is an invaluable way of tweaking the -optimizer to do what you want.

-SQL Overview - -SQL Server -Guide -SQL DB Overview - -Avoid Transient -Tables -Prevent Datafile -Corruption -SQL Index -Tips -SQL Insertion -Tips -SQL Schema -Tips -SQL Expressions - -SQL Statement -Tips - ANALYZE -Command - SQL WHERE -CLause Tips + + + + + +SQL JoinsThis guide explains how to use CROSS JOIN phrases to override the +optimizer's ordering of tables. +

Introduction

SQLite uses the “CROSS JOIN” phrase +as a means to override the table reordering decisions of the query optimizer. +The CROSS JOIN connector is rarely needed and should probably never be used +prior to the final performance tuning phase of application development. Even +then, SQLite usually gets the order of tables in a join right without any +extra help. But on those rare occasions when SQLite gets it wrong, the CROSS +JOIN connector is an invaluable way of tweaking the optimizer to do what you +want.

Intended +audience:

This document is intended to be used by Symbian platform +licensees and third party application developers.

Use CROSS JOIN +to Force a Particular Join Ordering

The SQLite query optimizer +will attempt to reorder the tables in a join in order to find the most efficient +way to evaluate the join. The optimizer usually does this job well, but occasionally +it will make a bad choice. When that happens, it might be necessary to override +the optimizer's choice by explicitly specifying the order of tables in the +SELECT statement.

To illustrate the problem, consider the following +schema:

+CREATE TABLE node( + id INTEGER PRIMARY KEY, + name TEXT +); + +CREATE INDEX node_idx ON node(name); + +CREATE TABLE edge( + orig INTEGER REFERENCES node, + dest INTEGER REFERENCES node, + PRIMARY KEY(orig, dest) +); + +CREATE INDEX edge_idx ON edge(dest,orig); +

This schema defines a directed graph with the ability to store +a name on each node of the graph. Similar designs (though usually more complicated) +arise frequently in application development. Now consider a three-way join +against the above schema:

+SELECT e.* +FROM edge AS e, + node AS n1, + node AS n2 +WHERE n1.name = 'alice' + AND n2.name = 'bob' + AND e.orig = n1.id + AND e.dest = n2.id; +

This query asks for information about all edges that go from +nodes labelled “alice” over to nodes labelled “bob”.

There are many +ways that the optimizer might choose to implement this query, but they all +boil down to two basic designs. The first option looks for edges between all +pairs of nodes. The following pseudocode illustrates:

+foreach n1 where n1.name='alice' do: + foreach n2 where n2.name='bob' do: + foreach e where e.orig=n1.id and e.dest=n2.id do: + return e.* + end + end +end +

The second design is to loop over all 'alice' nodes and follow +edges off of those nodes looking for nodes named 'bob'. (The roles of 'alice' +and 'bob' might be reversed here without changing the fundamental character +or the algorithm):

+foreach n1 where n1.name='alice' do: + foreach e where e.orig=n1.id do: + foreach n2 where n2.id=e.dest and n2.name='bob' do: + return e.* + end + end +end +

The first algorithm above corresponds to a join order of n1-n2-e. +The second algorithm corresponds to a join order of n1-e-n2.

The question +the optimizer has to answer is which of these two algorithms is likely to +give the fastest result, and it turns out that the answer depends on the nature +of the data stored in the database.

Let the number of alice nodes +be M and the number of bob nodes be N. Consider two scenarios: In the first +scenario, M and N are both 2 but there are thousands of edges on each node. +In this case, the first algorithm is preferred. With the first algorithm, +the inner loop checks for the existence of an edge between a pair of nodes +and outputs the result if found. But because there are only 2 alice and bob +nodes each, the inner loop only has to run 4 times and the query is very quick.

The +second algorithm would take much longer here. The outer loop of the second +algorithm only executes twice, but because there are a large number of edges +leaving each 'alice' node, the middle loop has to iterate many thousands of +times. So in the first scenario, we prefer to use the first algorithm.

Now +consider the case where M and N are both 3500. But suppose each of these nodes +is connected by only one or two edges. In this case, the second algorithm +is preferred.

With the second algorithm, the outer loop still has +to run 3500 times, but the middle loop only runs once or twice for each outer +loop and the inner loop will only run once for each middle loop, if at all. +So the total number of iterations of the inner loop is around 7000.

The +first algorithm, on the other hand, has to run both its outer loop and its +middle loop 3500 times each, resulting in 12 million iterations of the middle +loop. Thus in the second scenario, second algorithm is nearly 2000 times faster +than the first.

In this particular example, if you run ANALYZE on +your database to collect statistics on the tables, the optimizer will be able +to figure out the best algorithm to use. But if you do not want to run ANALYZE +or if you do not want to waste database space storing the SQLITE_STAT1 statistics +table that ANALYZE generates, you can manually override the decision of the +optimizer by specifying a particular order for tables in a join. You do this +by substituting the keyword phrase “CROSS JOIN” in place of commas in the +FROM clause.

The “CROSS JOIN” phrase forces the table to the left +to be used before the table to the right. For example, to force the first +algorithm, write the query this way:

+SELECT * +FROM node AS n1 CROSS JOIN + node AS n2 CROSS JOIN + edge AS e +WHERE n1.name = 'alice' + AND n2.name = 'bob' + AND e.orig = n1.id + AND e.dest = n2.id; +

And to force the second algorithm, write the query like this:

+SELECT * +FROM node AS n1 CROSS JOIN + edge AS e CROSS JOIN + node AS n2 +WHERE n1.name = 'alice' + AND n2.name = 'bob' + AND e.orig = n1.id + AND e.dest = n2.id; +

The CROSS JOIN keyword phrase is perfectly valid SQL syntax +according to the SQL standard, but it is syntax that is rarely if ever used +in real-world SQL statements. Because it is so rarely used otherwise, SQLite +has appropriated the phrase as a means to override the table reordering decisions +of the query optimizer.

The CROSS JOIN connector is rarely needed +and should probably never be used prior to the final performance tuning phase +of application development. Even then, SQLite usually gets the order of tables +in a join right without any extra help. But on those rare occasions when SQLite +gets it wrong, the CROSS JOIN connector is an invaluable way of tweaking the +optimizer to do what you want.

+SQL Overview + +SQL Server +Guide +SQL DB Overview + +Avoid Transient +Tables +Prevent Datafile +Corruption +SQL Index +Tips +SQL Insertion +Tips +SQL Schema +Tips +SQL Expressions + +SQL Statement +Tips + ANALYZE +Command + SQL WHERE +CLause Tips

\ No newline at end of file