SQL Index +Tips

Introduction

You can use indexes to speed up access. +You create indexes automatically using PRIMARY KEY and UNIQUE.

Intended +audience:

This document is intended to be used by Symbian platform +licensees and third party application developers.

Use an Index +to Speed up Access

Suppose you have a table like this:

+CREATE TABLE demo5( + id INTEGER, + content BLOB +); +

Further suppose that this table contains thousands or millions +of rows and you want to access a single row with a particular ID:

+SELECT content FROM demo5 WHERE id=? +

The only want that SQLite can perform this query, and be certain +to get every row with the chosen ID, is to examine every single row, check +the ID of that row, and return the content if the ID matches. Examining every +single row this way is called a full table scan.

Reading and +checking every row of a large table can be very slow, so you want to avoid +full table scans. The usual way to do this is to create an index on the column +you are searching against. In the example above, an appropriate index would +be this:

+CREATE INDEX demo5_idx1 ON demo5(id); +

With an index on the ID column, SQLite is able to use a binary +search to locate entries that contain a particular value of ID. So if the +table contains a million rows, the query can be satisfied with about 20 accesses +rather than 1000000 accesses. This is a huge performance improvement.

One +of the features of the SQL language is that you do not have to figure out +what indexes you may need in advance of coding your application. It is perfectly +acceptable, even preferable, to write the code for your application using +a database without any indexes. Then once the application is running and you +can make speed measurements, add whatever indexes are needed in order to make +it run faster.

When you add indexes, the query optimizer within the +SQL compiler is able to find new more efficient bytecode procedures for carrying +out the operations that your SQL statements specify. In other words, by adding +indexes late in the development cycle you have the power to completely reorganize +your data access patterns without changing a single line of code.

Create Indexes +Automatically Using PRIMARY KEY and UNIQUE

Any column of a table +that is declared to be the PRIMARY KEY or that is declared UNIQUE will be +indexed automatically. There is no need to create a separate index on that +column using the CREATE INDEX statement. So, for example, this table declaration:

+CREATE TABLE demo39a( + id INTEGER, + content BLOB +); + +CREATE INDEX demo39_idx1 ON demo39a(id); +

Is roughly equivalent to the following:

+CREATE TABLE demo39b( + id INTEGER UNIQUE, + content BLOB +); +

The two examples above are “roughly” equivalent, but not exactly +equivalent. Both tables have an index on the ID column. In the first case, +the index is created explicitly. In the second case, the index is implied +by the UNIQUE keyword in the type declaration of the ID column. Both table +designs use exactly the same amount of disk space, and both will run queries +such as

+SELECT content FROM demo39 WHERE id=? +

using exactly the same bytecode. The only difference is that +table demo39a lets you insert multiple rows with the same ID whereas table +demo39b will raise an exception if you try to insert a new row with the same +ID as an existing row.

If you use the UNIQUE keyword in the CREATE +INDEX statement of demo39a, like this:

+CREATE UNIQUE INDEX demo39_idx1 ON demo39a(id); +

Then both table designs really would be exactly the same in +every way. In fact, whenever SQLite sees the UNIQUE keyword on a column type +declaration, all it does is create an automatic unique index on that column.

The +PRIMARY KEY modifier on a column type declaration works like UNIQUE; it causes +a unique index to be created automatically. The main difference is that you +are only allowed to have a single PRIMARY KEY. This restriction of only allowing +a single PRIMARY KEY is part of the official SQL language definition.

The +idea is that a PRIMARY KEY is used to order the rows on disk. Some SQL database +engines actually implement PRIMARY KEYs this way. But with SQLite, a PRIMARY +KEY is like any other UNIQUE column, with only one exception: INTEGER PRIMARY +KEY is a special case which is handled differently, as described in the next +section.

Use Multi-Column +Indexes

SQLite is able to make use of multi-column indexes. The +rule is that if an index is over columns X 0 , X 1 , X 2 , +..., X n of some table, then the index can be used if the +WHERE clause contains equality constraints for some prefix of those columns X 0 , X 1 , X 2 , +..., X i where i is less than n.

As +an example, suppose you have a table and index declared as follows:

+CREATE TABLE demo314(a,b,c,d,e,f,g); +CREATE INDEX demo314_idx ON demo314(a,b,c,d,e,f); +

Then the index might be used to help with a query that contained +a WHERE clause like this:

+... WHERE a=1 AND b='Smith' AND c=1 +

All three terms of the WHERE clause would be used together +with the index in order to narrow the search. But the index could not be used +if there WHERE clause said:

+... WHERE b='Smith' AND c=1 +

The second WHERE clause does not contain equality terms for +a prefix of the columns in the index because it omits a term for the “a” column.

In +a case like this:

+... WHERE a=1 AND c=1 +

Only the “a=1” term in the WHERE clause could be used to help +narrow the search. The “c=1” term is not part of the prefix of terms in the +index which have equality constraints because there is no equality constraint +on the “b” column.

SQLite only allows a single index to be used per +table within a simple SQL statement. For UPDATE and DELETE statements, this +means that only a single index can ever be used, since those statements can +only operate on a single table at a time.

In a simple SELECT statement +multiple indexes can be used if the SELECT statement is a join – one index +per table in the join. In a compound SELECT statement (two or more SELECT +statements connected by UNION or INTERSECT or EXCEPT) each SELECT statement +is treated separately and can have its own indexes. Likewise, SELECT statements +that appear in subexpressions are treated separately.

Some other SQL +database engines (for example PostgreSQL) allow multiple indexes to be used +for each table in a SELECT. For example, if you had a table and index in PostgreSQL +like this:

+CREATE TABLE pg1(a INT, b INT, c INT, d INT); +CREATE INDEX pg1_ix1 ON pg1(a); +CREATE INDEX pg1_ix2 ON pg1(b); +CREATE INDEX pg1_ix3 ON pg1(c); +

And if you were to run a query like the following:

+SELECT d FROM pg1 WHERE a=5 AND b=11 AND c=99; +

Then PostgreSQL might attempt to optimize the query by using +all three indexes, one for each term of the WHERE clause.

SQLite does +not work this way. SQLite is compelled to select a single index to use in +the query. It might select any of the three indexes shown, depending on which +one the optimizer things will give the best speedup. But in every case it +will only select a single index and only a single term of the WHERE clause +will be used.

SQLite prefers to use a multi-column index such as this:

+CREATE INDEX pg1_ix_all ON pg1(a,b,c); +

If the pg1_ix_all index is available for use when the SELECT +statement above is prepared, SQLite will likely choose it over any of the +single-column indexes because the multi-column index is able to make use of +all 3 terms of the WHERE clause.

You can trick SQLite into using multiple +indexes on the same table by rewriting the query. Instead of the SELECT statement +shown above, if you rewrite it as this:

+SELECT d FROM pg1 WHERE RowID IN ( + SELECT RowID FROM pg1 WHERE a=5 + INTERSECT + SELECT RowID FROM pg1 WHERE b=11 + INTERSECT + SELECT RowID FROM pg1 WHERE c=99 +) +

Then each of the individual SELECT statements will using a +different single-column index and their results will be combined by the outer +SELECT statement to give the correct result. The other SQL database engines +like PostgreSQL that are able to make use of multiple indexes per table do +so by treating the simpler SELECT statement shown first as if they where the +more complicated SELECT statement shown here.

Use Inequality +Constraints on the Last Index Term

Terms in the WHERE clause of +a query or UPDATE or DELETE statement are mostly likely to trigger the use +of an index if they are an equality constraint – in other words if the term +consists of the name of an indexed column, an equal sign (“=”), and an expression.

So, +for example, if you have a table and index that look like this:

+CREATE TABLE demo315(a,b,c,d); +CREATE INDEX demo315_idx1 ON demo315(a,b,c); +

And a query like this:

+SELECT d FROM demo315 WHERE a=512; +

The single “a=512” term of the WHERE clause qualifies as an +equality constraint and is likely to provoke the use of the demo315_idx1 index.

SQLite +supports two other kinds of equality constraints. One is the IN operator:

+SELECT d FROM demo315 WHERE a IN (512,1024); +SELECT d FROM demo315 WHERE a IN (SELECT x FROM someothertable); +

There other is the IS NULL constraint:

+SELECT d FROM demo315 WHERE a IS NULL; +

SQLite allows at most one term of an index to be constrained +by an inequality such as less than “<”, greater than “>”, less than or +equal to “<=”, or greater than or equal to “>=”.

The column that +the inequality constrains will be the right-most term of the index that is +used. So, for example, in this query:

+SELECT d FROM demo315 WHERE a=5 AND b>11 AND c=1; +

Only the first two terms of the WHERE clause will be used +with the demo315_idx1 index. The third term, the “c=1” constraint, cannot +be used because the “c” column occurs to the right of the “b” column in the +index and the “b” column is constrained by an inequality.

SQLite allows +up to two inequalities on the same column as long as the two inequalities +provide an upper and lower bound on the column. For example, in this query:

+SELECT d FROM demo315 WHERE a=5 AND b>11 AND b<23; +

All three terms of the WHERE clause will be used because the +two inequalities on the “b” column provide an upper and lower bound on the +value of “b”.

SQLite will only use the four inequalities mentioned +above to help constrain a search: “<”, “>”, “<=”, and “>=”. Other inequality +operators such as not equal to (“!=” or “<>”) and NOT NULL are not helpful +to the query optimizer and will never be used to control an index and help +make the query run faster.

Use Indexes +To Help ORDER BY Clauses Evaluate Faster

The default method for +evaluating an ORDER BY clause in a SELECT statement is to first evaluate the +SELECT statement and store the results in a temporary tables, then sort the +temporary table according to the ORDER BY clause and scan the sorted temporary +table to generate the final output.

This method always works, but +it requires three passes over the data (one pass to generate the result set, +a second pass to sort the result set, and a third pass to output the results) +and it requires a temporary storage space sufficiently large to contain the +entire results set.

Where possible, SQLite will avoid storing and +sorting the result set by using an index that causes the results to emerge +from the query in sorted order in the first place.

The way to get +SQLite to use an index for sorting is to provide an index that covers the +same columns specified in the ORDER BY clause. For example, if the table and +index are like this:

+CREATE TABLE demo316(a,b,c,data); +CREATE INDEX idx316 ON demo316(a,b,c); +

And you do a query like this:

+SELECT data FROM demo316 ORDER BY a,b,c; +

SQLite will use the idx316 index to implement the ORDER BY +clause, obviating the need for temporary storage space and a separate sorting +pass.

An index can be used to satisfy the search constraints of a +WHERE clause and to impose the ORDER BY ordering of outputs all at once. The +trick is for the ORDER BY clause terms to occur immediately after the WHERE +clause terms in the index. For example, one can write:

+SELECT data FROM demo316 WHERE a=5 ORDER BY b,c; +

The “a” column is used in the WHERE clause and the immediately +following terms of the index, “b” and “c” are used in the ORDER BY clause. +So in this case the idx316 index would be used both to speed up the search +and to satisfy the ORDER BY clause.

This query also uses the idx316 +index because, once again, the ORDER BY clause term “c” immediate follows +the WHERE clause terms “a” and “b” in the index:

+SELECT data FROM demo316 WHERE a=5 AND b=17 ORDER BY c; +

But now consider this:

+SELECT data FROM demo316 WHERE a=5 ORDER BY c; +

Here there is a gap between the ORDER BY term “c” and the +WHERE clause term “a”. So the idx316 index cannot be used to satisfy both +the WHERE clause and the ORDER BY clause. The index will be used on the WHERE +clause and a separate sorting pass will occur to put the results in the correct +order.

Add Result +Columns To The End Of Indexes

Queries will sometimes run faster +if their result columns appear in the right-most entries of an index. Consider +the following example:

+CREATE TABLE demo317(a,b,c,data); +CREATE INDEX idx317 ON demo316(a,b,c); +

A query where all result column terms appears in the index, +such as

+SELECT c FROM demo317 WHERE a=5 ORDER BY b; +

will typically run about twice as fast or faster than a query +that uses columns that are not in the index, e.g.

+SELECT data FROM demo317 WHERE a=5 ORDER BY b; +

The reason for this is that when all information is contained +within the index entry only a single search has to be made for each row of +output. But when some of the information is in the index and other parts are +in the table, first there must be a search for the appropriate index entry +then a separate search is made for the appropriate table row based on the +RowID found in the index entry. Twice as much searching has to be done for +each row of output generated.

The extra query speed does not come +for free, however. Adding additional columns to an index makes the database +file larger. So when developing an application, the programmer will need to +make a space versus time trade-off to determine whether the extra columns +should be added to the index or not.

Note that if any column of the +result must be obtained from the original table, then the table row will have +to be searched for anyhow. There will be no speed advantage, so you might +as well omit the extra columns from the end of the index and save on storage +space. The speed-up described in this section can only be realized when every +column in a table is obtainable from the index.

Taking into account +the results of the previous few sections, the best set of columns to put in +an index can be described as follows:

The first columns in +the index should be columns that have equality constraints in the WHERE clause +of the query.
The second group of +columns should match the columns specified in the ORDER BY clause.
Add additional columns +to the end of the index that are used in the result set of the query.

Resolve Indexing +Ambiguities Using the Unary “+” Operator

The SQLite query optimizer +usually does a good job of choosing the best index to use for a particular +query, especially if ANALYZE has been run to provide it with index performance +statistics. But occasions do arise where it is useful to give the optimizer +hints.

One of the easiest ways to control the operation of the optimizer +is to disqualify terms in the WHERE clause or ORDER BY clause as candidates +for optimization by using the unary “+” operator.

In SQLite, a unary +“+” operator is a no-op. It makes no change to its operand, even if the operand +is something other than a number. So you can always prefix a “+” to an expression +in without changing the meaning of the expression. As the optimizer will only +use terms in WHERE, HAVING, or ON clauses that have an index column name on +one side of a comparison operator, you can prevent such a term from being +used by the optimizer by prefixing the column name with a “+”.

For +example, suppose you have a database with a schema like this:

+CREATE TABLE demo321(a,b,c,data); +CREATE INDEX idx321a ON demo321(a); +CREATE INDEX idx321b ON demo321(b); +

If you issue a query such as this:

+SELECT data FROM demo321 WHERE a=5 AND b=11; +

The query optimizer might use the “a=5” term with idx321a +or it might use the “b=11” term with the idx321b index. But if you want to +force the use of the idx321a index you can accomplish that by disqualifying +the second term of the WHERE clause as a candidate for optimization using +a unary “+” like this:

+SELECT data FROM demo321 WHERE a=5 AND +b=11; +

The “+” in front of the “b=11” turns the left-hand side of +the equals comparison operator into an expression instead of an indexed column +name. The optimizer will then not recognize that the second term can be used +with an index and so the optimizer is compelled to use the first “a=5” term.

The +unary “+” operator can also be used to disable ORDER BY clause optimizations. +Consider this query:

+SELECT data FROM demo321 WHERE a=5 ORDER BY b; +

The optimizer has the choice of using the “a=5” term of the +WHERE clause with idx321a to restrict the search. Or it might choose to use +do a full table scan with idx321b to satisfy the ORDER BY clause and thus +avoid a separate sorting pass. You can force one choice or the other using +a unary “+”.

To force the use of idx321a on the WHERE clause, add +the unary “+” in from of the “b” in the ORDER BY clause:

+SELECT data FROM demo321 WHERE a=5 ORDER BY +b; +

To go the other way and force the idx321b index to be used +to satisfy the ORDER BY clause, disqualify the WHERE term by prefixing with +a unary “+”:

+SELECT data FROM demo321 WHERE +a=5 ORDER BY b; +

The reader is cautioned not to overuse the unary “+” operator. +The SQLite query optimizer usually picks the best index without any outside +help. Premature use of unary “+” can confuse the optimizer and cause less +than optimal performance. But in some cases it is useful to be able override +the decisions of the optimizer, and the unary “+” operator is an excellent +way to do this when it becomes necessary.

Avoid Indexing +Large BLOBs and CLOBs

SQLite stores indexes as b-trees. Each b-tree +node uses one page of the database file. In order to maintain an acceptable +fan-out, the b-tree module within SQLite requires that at least 4 entries +must fit on each page of a b-tree. There is also some overhead associated +with each b-tree page. So at the most there is about 250 bytes of space available +on the main b-tree page for each index entry.

If an index entry exceeds +this allotment of approximately 250 bytes excess bytes are spilled to overflow +pages. There is no arbitrary limit on the number of overflow pages or on the +length of a b-tree entry, but for maximum efficiency it is best to avoid overflow +pages, especially in indexes. This means that you should strive to keep the +number of bytes in each index entry below 250.

If you keep the size +of indexes significantly smaller than 250 bytes, then the b-tree fan-out is +increased and the binary search algorithm used to search for entries in an +index has fewer pages to examine and therefore runs faster. So the fewer bytes +used in each index entry the better, at least from a performance perspective.

For +these reasons, it is recommended that you avoid indexing large BLOBs and CLOBs. +SQLite will continue to work when large BLOBs and CLOBs are indexed, but there +will be a performance impact.

On the other hand, if you need to lookup +entries using a large BLOB or CLOB as the key, then by all means use an index. +An index on a large BLOB or CLOB is not as fast as an index using more compact +data types such as integers, but it is still many order of magnitude faster +than doing a full table scan. So to be more precise, the advice of this section +is that you should design your applications so that you do not need to lookup +entries using a large BLOB or CLOB as the key. Try to arrange to have compact +keys consisting of short strings or integers.

Note that many other +SQL database engines disallow the indexing of BLOBs and CLOBs in the first +place. You simple cannot do it. SQLite is more flexible that most in that +it does allow BLOBs and CLOBs to be indexed and it will use those indexes +when appropriate. But for maximum performance, it is best to use smaller search +keys.

Avoid Excess +Indexes

Some developers approach SQL-based application development +with the attitude that indexes never hurt and that the more indexes you have, +the faster your application will run. This is definitely not the case. There +is a costs associated with each new index you create:

Each new index takes +up additional space in the database file. The more indexes you have, the larger +your database files will become for the same amount of data.
Every INSERT and UPDATE +statement modifies both the original table and all indexes on that table. +So the performance of INSERT and UPDATE decreases linearly with the number +of indexes.
Compiling new SQL statements +using Prepare() takes longer when there are more indexes +for the optimizer to choose between.
Surplus indexes give +the optimizer more opportunities to make a bad choice.

Your policy on indexes should be to avoid them wherever you can. +Indexes are powerful medicine and can work wonders to improve the performance +of a program. But just as too many drugs can be worse than none at all, so +also can too many indexes cause more harm than good.

When building +a new application, a good approach is to omit all explicitly declared indexes +in the beginning and only add indexes as needed to address specific performance +problems.

Take care to avoid redundant indexes. For example, consider +this schema:

+CREATE TABLE demo323a(a,b,c); +CREATE INDEX idx323a1 ON demo323(a); +CREATE INDEX idx323a2 ON demo323(a,b); +

The idx323a1 index is redundant and can be eliminated. Anything +that the idx323a1 index can do the idx323a2 index can do better.

Other +redundancies are not quite as apparent as the above. Recall that any column +or columns that are declared UNIQUE or PRIMARY KEY (except for the special +case of INTEGER PRIMARY KEY) are automatically indexed. So in the following +schema:

+CREATE TABLE demo323b(x TEXT PRIMARY KEY, y INTEGER UNIQUE); +CREATE INDEX idx323b1 ON demo323b(x); +CREATE INDEX idx323b2 ON demo323b(y); +

Both indexes are redundant and can be eliminated with no loss +in query performance. Occasionally one sees a novice SQL programmer use both +UNIQUE and PRIMARY KEY on the same column:

+CREATE TABLE demo323c(p TEXT UNIQUE PRIMARY KEY, q); +

This has the effect of creating two indexes on the “p” column +– one for the UNIQUE keywords and another for the PRIMARY KEY keyword. Both +indexes are identical so clearly one can be omitted. A PRIMARY KEY is guaranteed +to always be unique so the UNIQUE keyword can be removed from the demo323c +table definition with no ambiguity or loss of functionality.

It is +not a fatal error to create too many indexes or redundant indexes. SQLite +will continue to generate the correct answers but it may take longer to produce +those answers and the resulting database files might be a little larger. So +for best results, keep the number of indexes to a minimum.

Avoid Tables +and Indexes with an Excessive Number of Columns

SQLite places no +arbitrary limits on the number of columns in a table or index. There are known +commercial applications using SQLite that construct tables with tens of thousands +of columns each. And these applications actually work.

However the +database engine is optimized for the common case of tables with no more than +a few dozen columns. For best performance you should try to stay in the optimized +region. Furthermore, we note that relational databases with a large number +of columns are usually not well normalized. So even apart from performance +considerations, if you find your design has tables with more than a dozen +or so columns, you really need to rethink how you are building your application.

There +are a number of places in Prepare() that run in time O(N²) +where N is the number of columns in the table. The constant of proportionality +is small in these cases so you should not have any problems for N of less +than one hundred but for N on the order of a thousand, the time to run Prepare() can +start to become noticeable.

When the bytecode is running and it needs +to access the i-th column of a table, the values of the previous i-1 columns +must be accessed first. So if you have a large number of columns, accessing +the last column can be an expensive operation. This fact also argues for putting +smaller and more frequently accessed columns early in the table.

There +are certain optimizations that will only work if the table has 30 or fewer +columns. The optimization that extracts all necessary information from an +index and never refers to the underlying table works this way. So in some +cases, keeping the number of columns in a table at or below 30 can result +in a 2-fold speed improvement.

Indexes will only be used if they contain +30 or fewer columns. You can put as many columns in an index as you want, +but if the number is greater than 30, the index will never improve performance +and will never do anything but take up space in your database file.