SQL Index Tips

This document includes several tips for using SQL indexes.

Introduction

You can use indexes to speed up access to table rows. SQLite also creates indexes automatically for columns declared PRIMARY KEY or UNIQUE.

Intended audience:

This document is intended to be used by Symbian OS licensees and third party application developers.

Use an Index to Speed up Access

Suppose you have a table like this:

CREATE TABLE demo5(
  id INTEGER,
  content BLOB
);

Further suppose that this table contains thousands or millions of rows and you want to access a single row with a particular ID:

SELECT content FROM demo5 WHERE id=?

The only way that SQLite can perform this query, and be certain to get every row with the chosen ID, is to examine every single row, check the ID of that row, and return the content if the ID matches. Examining every single row this way is called a full table scan.

Reading and checking every row of a large table can be very slow, so you want to avoid full table scans. The usual way to do this is to create an index on the column you are searching against. In the example above, an appropriate index would be this:

CREATE INDEX demo5_idx1 ON demo5(id);

With an index on the ID column, SQLite is able to use a binary search to locate entries that contain a particular value of ID. So if the table contains a million rows, the query can be satisfied with about 20 accesses rather than 1000000 accesses. This is a huge performance improvement.
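The effect of the index can be observed with EXPLAIN QUERY PLAN. The following is a minimal sketch that assumes Python's standard sqlite3 module as a stand-in for the SQL API you actually use; the plan() helper is a hypothetical convenience function, and the exact wording of the plan text depends on the SQLite version. The last lines check the "about 20 accesses" figure, since a binary search over n entries needs roughly log2(n) probes.

```python
import math
import sqlite3

def plan(conn, sql, params=()):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo5(id INTEGER, content BLOB)")

query = "SELECT content FROM demo5 WHERE id=?"

# No index exists yet, so the only available strategy is a full table scan.
plan_without_index = plan(conn, query, (42,))

# After the index is created, the same unchanged query can search the index.
conn.execute("CREATE INDEX demo5_idx1 ON demo5(id)")
plan_with_index = plan(conn, query, (42,))

# A binary search over n sorted entries needs roughly log2(n) probes.
probes_for_a_million = math.ceil(math.log2(1_000_000))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
print("probes for 1,000,000 rows:", probes_for_a_million)
```

Note that the query text is identical in both runs; only the schema changed.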

One of the features of the SQL language is that you do not have to figure out what indexes you may need in advance of coding your application. It is perfectly acceptable, even preferable, to write the code for your application using a database without any indexes. Then once the application is running and you can make speed measurements, add whatever indexes are needed in order to make it run faster.

When you add indexes, the query optimizer within the SQL compiler is able to find new, more efficient bytecode procedures for carrying out the operations that your SQL statements specify. In other words, by adding indexes late in the development cycle you have the power to completely reorganize your data access patterns without changing a single line of code.

Create Indexes Automatically Using PRIMARY KEY and UNIQUE

Any column of a table that is declared to be the PRIMARY KEY or that is declared UNIQUE will be indexed automatically. There is no need to create a separate index on that column using the CREATE INDEX statement. So, for example, this table declaration:

CREATE TABLE demo39a(
  id INTEGER,
  content BLOB
);

CREATE INDEX demo39_idx1 ON demo39a(id);

Is roughly equivalent to the following:

CREATE TABLE demo39b(
  id INTEGER UNIQUE,
  content BLOB
);

The two examples above are “roughly” equivalent, but not exactly equivalent. Both tables have an index on the ID column. In the first case, the index is created explicitly. In the second case, the index is implied by the UNIQUE keyword in the type declaration of the ID column. Both table designs use exactly the same amount of disk space, and both will run queries such as

SELECT content FROM demo39 WHERE id=?

using exactly the same bytecode. The only difference is that table demo39a lets you insert multiple rows with the same ID whereas table demo39b will raise an exception if you try to insert a new row with the same ID as an existing row.

If you use the UNIQUE keyword in the CREATE INDEX statement of demo39a, like this:

CREATE UNIQUE INDEX demo39_idx1 ON demo39a(id);

Then both table designs really would be exactly the same in every way. In fact, whenever SQLite sees the UNIQUE keyword on a column type declaration, all it does is create an automatic unique index on that column.
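The difference between the two designs can be checked directly. This is a sketch only, assuming Python's standard sqlite3 module; the helper functions index_names() and insert_twice() are hypothetical names introduced for the illustration. It lists the indexes on each table and then tries to insert two rows with the same ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo39a(id INTEGER, content BLOB)")
conn.execute("CREATE INDEX demo39_idx1 ON demo39a(id)")
conn.execute("CREATE TABLE demo39b(id INTEGER UNIQUE, content BLOB)")

def index_names(table):
    """List the indexes on a table; PRAGMA index_list puts the name in column 1."""
    return [row[1] for row in conn.execute(f"PRAGMA index_list({table})")]

def insert_twice(table):
    """Insert two rows with id=1 and report whether the second one is rejected."""
    conn.execute(f"INSERT INTO {table}(id, content) VALUES (1, x'00')")
    try:
        conn.execute(f"INSERT INTO {table}(id, content) VALUES (1, x'01')")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

names_a = index_names("demo39a")   # the explicitly created index
names_b = index_names("demo39b")   # the index implied by UNIQUE
result_a = insert_twice("demo39a")
result_b = insert_twice("demo39b")

print(names_a, result_a)
print(names_b, result_b)
```

In this Python binding the uniqueness violation surfaces as sqlite3.IntegrityError; other APIs report the same condition in their own way.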

The PRIMARY KEY modifier on a column type declaration works like UNIQUE; it causes a unique index to be created automatically. The main difference is that you are only allowed to have a single PRIMARY KEY. This restriction of only allowing a single PRIMARY KEY is part of the official SQL language definition.

The idea is that a PRIMARY KEY is used to order the rows on disk. Some SQL database engines actually implement PRIMARY KEYs this way. But with SQLite, a PRIMARY KEY is like any other UNIQUE column, with only one exception: INTEGER PRIMARY KEY is a special case which is handled differently, as described in the next section.

Use Multi-Column Indexes

SQLite is able to make use of multi-column indexes. The rule is that if an index is over columns X0, X1, X2, ..., Xn of some table, then the index can be used if the WHERE clause contains equality constraints for some prefix of those columns X0, X1, X2, ..., Xi, where i is less than or equal to n.

As an example, suppose you have a table and index declared as follows:

CREATE TABLE demo314(a,b,c,d,e,f,g);
CREATE INDEX demo314_idx ON demo314(a,b,c,d,e,f);

Then the index might be used to help with a query that contained a WHERE clause like this:

... WHERE a=1 AND b='Smith' AND c=1

All three terms of the WHERE clause would be used together with the index in order to narrow the search. But the index could not be used if the WHERE clause said:

... WHERE b='Smith' AND c=1

The second WHERE clause does not contain equality terms for a prefix of the columns in the index because it omits a term for the “a” column.

In a case like this:

... WHERE a=1 AND c=1

Only the “a=1” term in the WHERE clause could be used to help narrow the search. The “c=1” term is not part of the prefix of terms in the index which have equality constraints because there is no equality constraint on the “b” column.
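The prefix rule can be seen in the query plans for the three WHERE clauses. The sketch below assumes Python's standard sqlite3 module and a hypothetical plan() helper; it selects column g, which is not part of the index, so that the plan reflects only how the WHERE clause is used. Plan wording varies between SQLite versions, and newer versions may apply further optimizations once ANALYZE statistics exist, so treat the output as illustrative.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo314(a,b,c,d,e,f,g)")
conn.execute("CREATE INDEX demo314_idx ON demo314(a,b,c,d,e,f)")

queries = {
    # Equality terms on a, b and c form a prefix of the index columns.
    "a, b, c": "SELECT g FROM demo314 WHERE a=1 AND b='Smith' AND c=1",
    # No term on a, so there is no usable prefix at all.
    "b, c": "SELECT g FROM demo314 WHERE b='Smith' AND c=1",
    # The prefix stops at a because b has no equality term.
    "a, c": "SELECT g FROM demo314 WHERE a=1 AND c=1",
}
plans = {label: plan(conn, sql) for label, sql in queries.items()}

for label, text in plans.items():
    print(f"{label:8} -> {text}")
```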

SQLite only allows a single index to be used per table within a simple SQL statement. For UPDATE and DELETE statements, this means that only a single index can ever be used, since those statements can only operate on a single table at a time.

In a simple SELECT statement multiple indexes can be used if the SELECT statement is a join – one index per table in the join. In a compound SELECT statement (two or more SELECT statements connected by UNION or INTERSECT or EXCEPT) each SELECT statement is treated separately and can have its own indexes. Likewise, SELECT statements that appear in subexpressions are treated separately.

Some other SQL database engines (for example PostgreSQL) allow multiple indexes to be used for each table in a SELECT. For example, if you had a table and index in PostgreSQL like this:

CREATE TABLE pg1(a INT, b INT, c INT, d INT);
CREATE INDEX pg1_ix1 ON pg1(a);
CREATE INDEX pg1_ix2 ON pg1(b);
CREATE INDEX pg1_ix3 ON pg1(c);

And if you were to run a query like the following:

SELECT d FROM pg1 WHERE a=5 AND b=11 AND c=99;

Then PostgreSQL might attempt to optimize the query by using all three indexes, one for each term of the WHERE clause.

SQLite does not work this way. SQLite is compelled to select a single index to use in the query. It might select any of the three indexes shown, depending on which one the optimizer thinks will give the best speedup. But in every case it will only select a single index and only a single term of the WHERE clause will be used.

SQLite prefers to use a multi-column index such as this:

CREATE INDEX pg1_ix_all ON pg1(a,b,c);

If the pg1_ix_all index is available for use when the SELECT statement above is prepared, SQLite will likely choose it over any of the single-column indexes because the multi-column index is able to make use of all 3 terms of the WHERE clause.
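One way to see which index the optimizer picks is to compare the query plan before and after the multi-column index exists. This is a sketch under the assumption that Python's standard sqlite3 module is available; plan() is a hypothetical helper, and the choice of index is ultimately up to the optimizer of the SQLite version in use.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pg1(a INT, b INT, c INT, d INT);
    CREATE INDEX pg1_ix1 ON pg1(a);
    CREATE INDEX pg1_ix2 ON pg1(b);
    CREATE INDEX pg1_ix3 ON pg1(c);
""")

query = "SELECT d FROM pg1 WHERE a=5 AND b=11 AND c=99"

# Only single-column indexes exist: one of them is chosen.
plan_single = plan(conn, query)

# With the multi-column index available, all three terms can be used at once.
conn.execute("CREATE INDEX pg1_ix_all ON pg1(a,b,c)")
plan_multi = plan(conn, query)

print("single-column indexes only:", plan_single)
print("with pg1_ix_all:           ", plan_multi)
```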

You can trick SQLite into using multiple indexes on the same table by rewriting the query. Instead of the SELECT statement shown above, if you rewrite it as this:

SELECT d FROM pg1 WHERE RowID IN (
  SELECT RowID FROM pg1 WHERE a=5
  INTERSECT
  SELECT RowID FROM pg1 WHERE b=11
  INTERSECT
  SELECT RowID FROM pg1 WHERE c=99
)

Then each of the individual SELECT statements will use a different single-column index and their results will be combined by the outer SELECT statement to give the correct result. Other SQL database engines, such as PostgreSQL, that are able to make use of multiple indexes per table do so by treating the simpler SELECT statement shown first as if it were the more complicated SELECT statement shown here.
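The rewritten query returns the same rows as the simple one, which can be confirmed on a small data set. The following sketch assumes Python's standard sqlite3 module; the sample rows are invented for the illustration, with d encoding the values of a, b and c so that each row is identifiable.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pg1(a INT, b INT, c INT, d INT);
    CREATE INDEX pg1_ix1 ON pg1(a);
    CREATE INDEX pg1_ix2 ON pg1(b);
    CREATE INDEX pg1_ix3 ON pg1(c);
""")

# Hypothetical sample data: every combination of a few a, b and c values.
rows = [(a, b, c, a * 10000 + b * 100 + c)
        for a, b, c in itertools.product(range(4, 7), range(10, 13), range(98, 101))]
conn.executemany("INSERT INTO pg1(a, b, c, d) VALUES (?, ?, ?, ?)", rows)

simple = "SELECT d FROM pg1 WHERE a=5 AND b=11 AND c=99"
rewritten = """
    SELECT d FROM pg1 WHERE RowID IN (
        SELECT RowID FROM pg1 WHERE a=5
        INTERSECT
        SELECT RowID FROM pg1 WHERE b=11
        INTERSECT
        SELECT RowID FROM pg1 WHERE c=99
    )
"""

simple_result = sorted(conn.execute(simple).fetchall())
rewritten_result = sorted(conn.execute(rewritten).fetchall())
print(simple_result, rewritten_result)
```

Whether the rewrite is actually faster depends on the data; measure before adopting it.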

Use Inequality Constraints on the Last Index Term

Terms in the WHERE clause of a query or UPDATE or DELETE statement are most likely to trigger the use of an index if they are an equality constraint – in other words if the term consists of the name of an indexed column, an equal sign (“=”), and an expression.

So, for example, if you have a table and index that look like this:

CREATE TABLE demo315(a,b,c,d);
CREATE INDEX demo315_idx1 ON demo315(a,b,c);

And a query like this:

SELECT d FROM demo315 WHERE a=512;

The single “a=512” term of the WHERE clause qualifies as an equality constraint and is likely to provoke the use of the demo315_idx1 index.

SQLite supports two other kinds of equality constraints. One is the IN operator:

SELECT d FROM demo315 WHERE a IN (512,1024);
SELECT d FROM demo315 WHERE a IN (SELECT x FROM someothertable);

The other is the IS NULL constraint:

SELECT d FROM demo315 WHERE a IS NULL;

SQLite allows at most one term of an index to be constrained by an inequality such as less than “<”, greater than “>”, less than or equal to “<=”, or greater than or equal to “>=”.

The column that the inequality constrains will be the right-most term of the index that is used. So, for example, in this query:

SELECT d FROM demo315 WHERE a=5 AND b>11 AND c=1;

Only the first two terms of the WHERE clause will be used with the demo315_idx1 index. The third term, the “c=1” constraint, cannot be used because the “c” column occurs to the right of the “b” column in the index and the “b” column is constrained by an inequality.

SQLite allows up to two inequalities on the same column as long as the two inequalities provide an upper and lower bound on the column. For example, in this query:

SELECT d FROM demo315 WHERE a=5 AND b>11 AND b<23;

All three terms of the WHERE clause will be used because the two inequalities on the “b” column provide an upper and lower bound on the value of “b”.

SQLite will only use the four inequalities mentioned above to help constrain a search: “<”, “>”, “<=”, and “>=”. Other inequality operators such as not equal to (“!=” or “<>”) and NOT NULL are not helpful to the query optimizer and will never be used to control an index and help make the query run faster.
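The inequality rules in this section can be checked against the query plans. The sketch below assumes Python's standard sqlite3 module and a hypothetical plan() helper; the third query, using “!=”, is an extra case added for the illustration. The constraints shown in parentheses in the plan text are the terms the index is actually used for, and the exact wording may differ between SQLite versions.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo315(a,b,c,d)")
conn.execute("CREATE INDEX demo315_idx1 ON demo315(a,b,c)")

# The inequality on b ends the usable part of the index; c=1 is checked row by row.
plan_stop_at_b = plan(conn, "SELECT d FROM demo315 WHERE a=5 AND b>11 AND c=1")

# Two inequalities on the same column form a lower and an upper bound.
plan_range = plan(conn, "SELECT d FROM demo315 WHERE a=5 AND b>11 AND b<23")

# A not-equal term cannot drive the index, so only a=5 is used.
plan_not_equal = plan(conn, "SELECT d FROM demo315 WHERE a=5 AND b!=11")

for text in (plan_stop_at_b, plan_range, plan_not_equal):
    print(text)
```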

Use Indexes To Help ORDER BY Clauses Evaluate Faster

The default method for evaluating an ORDER BY clause in a SELECT statement is to first evaluate the SELECT statement and store the results in a temporary table, then sort the temporary table according to the ORDER BY clause and scan the sorted temporary table to generate the final output.

This method always works, but it requires three passes over the data (one pass to generate the result set, a second pass to sort the result set, and a third pass to output the results) and it requires temporary storage space large enough to contain the entire result set.

Where possible, SQLite will avoid storing and sorting the result set by using an index that causes the results to emerge from the query in sorted order in the first place.

The way to get SQLite to use an index for sorting is to provide an index that covers the same columns specified in the ORDER BY clause. For example, if the table and index are like this:

CREATE TABLE demo316(a,b,c,data);
CREATE INDEX idx316 ON demo316(a,b,c);

And you do a query like this:

SELECT data FROM demo316 ORDER BY a,b,c;

SQLite will use the idx316 index to implement the ORDER BY clause, obviating the need for temporary storage space and a separate sorting pass.

An index can be used to satisfy the search constraints of a WHERE clause and to impose the ORDER BY ordering of outputs all at once. The trick is for the ORDER BY clause terms to occur immediately after the WHERE clause terms in the index. For example, one can write:

SELECT data FROM demo316 WHERE a=5 ORDER BY b,c;

The “a” column is used in the WHERE clause and the immediately following terms of the index, “b” and “c” are used in the ORDER BY clause. So in this case the idx316 index would be used both to speed up the search and to satisfy the ORDER BY clause.

This query also uses the idx316 index because, once again, the ORDER BY clause term “c” immediately follows the WHERE clause terms “a” and “b” in the index:

SELECT data FROM demo316 WHERE a=5 AND b=17 ORDER BY c;

But now consider this:

SELECT data FROM demo316 WHERE a=5 ORDER BY c;

Here there is a gap between the ORDER BY term “c” and the WHERE clause term “a”. So the idx316 index cannot be used to satisfy both the WHERE clause and the ORDER BY clause. The index will be used on the WHERE clause and a separate sorting pass will occur to put the results in the correct order.
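A separate sorting pass shows up in the query plan as a temporary b-tree. The sketch below assumes Python's standard sqlite3 module; plan() and needs_sort() are hypothetical helpers, and the check relies on the plan text containing "TEMP B-TREE", which is how recent SQLite versions describe the sort step.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo316(a,b,c,data)")
conn.execute("CREATE INDEX idx316 ON demo316(a,b,c)")

def needs_sort(sql):
    """True if the plan includes a temporary b-tree, i.e. a separate sorting pass."""
    return "TEMP B-TREE" in plan(conn, sql)

# ORDER BY terms follow the WHERE term directly in the index: no sort needed.
sort_b_c = needs_sort("SELECT data FROM demo316 WHERE a=5 ORDER BY b,c")
sort_c_after_a_b = needs_sort("SELECT data FROM demo316 WHERE a=5 AND b=17 ORDER BY c")

# Column b is skipped, leaving a gap between a and c: a sort is required.
sort_c_with_gap = needs_sort("SELECT data FROM demo316 WHERE a=5 ORDER BY c")

print(sort_b_c, sort_c_after_a_b, sort_c_with_gap)
```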

Add Result Columns To The End Of Indexes

Queries will sometimes run faster if their result columns appear in the right-most entries of an index. Consider the following example:

CREATE TABLE demo317(a,b,c,data);
CREATE INDEX idx317 ON demo317(a,b,c);

A query where all result column terms appear in the index, such as

SELECT c FROM demo317 WHERE a=5 ORDER BY b;

will typically run about twice as fast or faster than a query that uses columns that are not in the index, e.g.

SELECT data FROM demo317 WHERE a=5 ORDER BY b;

The reason for this is that when all information is contained within the index entry only a single search has to be made for each row of output. But when some of the information is in the index and other parts are in the table, first there must be a search for the appropriate index entry then a separate search is made for the appropriate table row based on the RowID found in the index entry. Twice as much searching has to be done for each row of output generated.

The extra query speed does not come for free, however. Adding additional columns to an index makes the database file larger. So when developing an application, the programmer will need to make a space versus time trade-off to determine whether the extra columns should be added to the index or not.

Note that if any column of the result must be obtained from the original table, then the table row will have to be searched for anyhow. There will be no speed advantage, so you might as well omit the extra columns from the end of the index and save on storage space. The speed-up described in this section can only be realized when every column the query needs is obtainable from the index.
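Recent SQLite versions label this case in the query plan as a covering index. The sketch below assumes Python's standard sqlite3 module and a hypothetical plan() helper, and compares a query answered entirely from the index with one that must also read the table row.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo317(a,b,c,data)")
conn.execute("CREATE INDEX idx317 ON demo317(a,b,c)")

# a, b and c are all stored in the index entry, so the table is never read.
plan_index_only = plan(conn, "SELECT c FROM demo317 WHERE a=5 ORDER BY b")

# data lives only in the table, so each index hit needs a second lookup by RowID.
plan_with_lookup = plan(conn, "SELECT data FROM demo317 WHERE a=5 ORDER BY b")

print("index only: ", plan_index_only)
print("with lookup:", plan_with_lookup)
```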

Taking into account the results of the previous few sections, the best set of columns to put in an index can be described as follows:

The first columns in the index should be columns that have equality constraints in the WHERE clause of the query.
The second group of columns should match the columns specified in the ORDER BY clause.
Add additional columns to the end of the index that are used in the result set of the query.

Resolve Indexing Ambiguities Using the Unary “+” Operator

The SQLite query optimizer usually does a good job of choosing the best index to use for a particular query, especially if ANALYZE has been run to provide it with index performance statistics. But occasions do arise where it is useful to give the optimizer hints.

One of the easiest ways to control the operation of the optimizer is to disqualify terms in the WHERE clause or ORDER BY clause as candidates for optimization by using the unary “+” operator.

In SQLite, a unary “+” operator is a no-op. It makes no change to its operand, even if the operand is something other than a number. So you can always prefix a “+” to an expression without changing the meaning of the expression. As the optimizer will only use terms in WHERE, HAVING, or ON clauses that have an index column name on one side of a comparison operator, you can prevent such a term from being used by the optimizer by prefixing the column name with a “+”.

For example, suppose you have a database with a schema like this:

CREATE TABLE demo321(a,b,c,data);
CREATE INDEX idx321a ON demo321(a);
CREATE INDEX idx321b ON demo321(b);

If you issue a query such as this:

SELECT data FROM demo321 WHERE a=5 AND b=11;

The query optimizer might use the “a=5” term with idx321a or it might use the “b=11” term with the idx321b index. But if you want to force the use of the idx321a index you can accomplish that by disqualifying the second term of the WHERE clause as a candidate for optimization using a unary “+” like this:

SELECT data FROM demo321 WHERE a=5 AND +b=11;

The “+” in front of the “b=11” turns the left-hand side of the equals comparison operator into an expression instead of an indexed column name. The optimizer will then not recognize that the second term can be used with an index and so the optimizer is compelled to use the first “a=5” term.

The unary “+” operator can also be used to disable ORDER BY clause optimizations. Consider this query:

SELECT data FROM demo321 WHERE a=5 ORDER BY b;

The optimizer has the choice of using the “a=5” term of the WHERE clause with idx321a to restrict the search. Or it might choose to do a full table scan with idx321b to satisfy the ORDER BY clause and thus avoid a separate sorting pass. You can force one choice or the other using a unary “+”.

To force the use of idx321a on the WHERE clause, add the unary “+” in front of the “b” in the ORDER BY clause:

SELECT data FROM demo321 WHERE a=5 ORDER BY +b;

To go the other way and force the idx321b index to be used to satisfy the ORDER BY clause, disqualify the WHERE term by prefixing with a unary “+”:

SELECT data FROM demo321 WHERE +a=5 ORDER BY b;

The reader is cautioned not to overuse the unary “+” operator. The SQLite query optimizer usually picks the best index without any outside help. Premature use of unary “+” can confuse the optimizer and cause less than optimal performance. But in some cases it is useful to be able to override the decisions of the optimizer, and the unary “+” operator is an excellent way to do this when it becomes necessary.
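The effect of the unary “+” on index selection can be inspected in the query plan. This is a sketch only, assuming Python's standard sqlite3 module and a hypothetical plan() helper; it shows which index each variant of the query ends up with, and the plan wording depends on the SQLite version.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo321(a,b,c,data)")
conn.execute("CREATE INDEX idx321a ON demo321(a)")
conn.execute("CREATE INDEX idx321b ON demo321(b)")

# "+b" is an expression rather than a column name, so only a=5 can use an index.
plan_force_a = plan(conn, "SELECT data FROM demo321 WHERE a=5 AND +b=11")

# The mirror image: "+a" is disqualified, leaving b=11 for idx321b.
plan_force_b = plan(conn, "SELECT data FROM demo321 WHERE +a=5 AND b=11")

# "+b" in ORDER BY rules out idx321b for sorting, so idx321a serves the WHERE clause.
plan_order_force_a = plan(conn, "SELECT data FROM demo321 WHERE a=5 ORDER BY +b")

# "+a" in WHERE rules out idx321a, leaving idx321b available for the ORDER BY.
plan_order_force_b = plan(conn, "SELECT data FROM demo321 WHERE +a=5 ORDER BY b")

for text in (plan_force_a, plan_force_b, plan_order_force_a, plan_order_force_b):
    print(text)
```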

Avoid Indexing Large BLOBs and CLOBs

SQLite stores indexes as b-trees. Each b-tree node uses one page of the database file. In order to maintain an acceptable fan-out, the b-tree module within SQLite requires that at least 4 entries must fit on each page of a b-tree. There is also some overhead associated with each b-tree page. So at the most there is about 250 bytes of space available on the main b-tree page for each index entry.

If an index entry exceeds this allotment of approximately 250 bytes, the excess bytes are spilled to overflow pages. There is no arbitrary limit on the number of overflow pages or on the length of a b-tree entry, but for maximum efficiency it is best to avoid overflow pages, especially in indexes. This means that you should strive to keep the number of bytes in each index entry below 250.

If you keep the size of index entries significantly smaller than 250 bytes, then the b-tree fan-out is increased and the binary search algorithm used to search for entries in an index has fewer pages to examine and therefore runs faster. So the fewer bytes used in each index entry the better, at least from a performance perspective.
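The arithmetic behind these figures can be sketched as follows. The page size of 1024 bytes is an assumption (the text above does not state it), per-page overhead is ignored, and btree_levels() is a deliberately simplified model in which every page holds exactly the same number of entries; real b-trees are less regular.

```python
def bytes_per_entry(page_size, min_entries=4):
    """Upper bound on main-page bytes per entry, ignoring per-page overhead."""
    return page_size // min_entries

def btree_levels(entries, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) holds all entries."""
    levels, capacity = 0, 1
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels

# Assumed 1024-byte page: 256 bytes per entry before overhead, close to the
# "about 250 bytes" quoted in the text.
budget = bytes_per_entry(1024)

# Pages to visit when locating one of a million entries.
levels_fanout_4 = btree_levels(1_000_000, 4)    # large entries, minimum fan-out
levels_fanout_40 = btree_levels(1_000_000, 40)  # entries about ten times smaller

print(budget, levels_fanout_4, levels_fanout_40)
```

In this simplified model, shrinking the entries by a factor of ten cuts the number of pages visited per lookup from 10 to 4.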

For these reasons, it is recommended that you avoid indexing large BLOBs and CLOBs. SQLite will continue to work when large BLOBs and CLOBs are indexed, but there will be a performance impact.

On the other hand, if you need to look up entries using a large BLOB or CLOB as the key, then by all means use an index. An index on a large BLOB or CLOB is not as fast as an index using more compact data types such as integers, but it is still many orders of magnitude faster than doing a full table scan. So to be more precise, the advice of this section is that you should design your applications so that you do not need to look up entries using a large BLOB or CLOB as the key. Try to arrange to have compact keys consisting of short strings or integers.

Note that many other SQL database engines disallow the indexing of BLOBs and CLOBs in the first place. You simply cannot do it. SQLite is more flexible than most in that it does allow BLOBs and CLOBs to be indexed and it will use those indexes when appropriate. But for maximum performance, it is best to use smaller search keys.

Avoid Excess Indexes

Some developers approach SQL-based application development with the attitude that indexes never hurt and that the more indexes you have, the faster your application will run. This is definitely not the case. There are costs associated with each new index you create:

Each new index takes up additional space in the database file. The more indexes you have, the larger your database files will become for the same amount of data.
Every INSERT and UPDATE statement modifies both the original table and all indexes on that table. So the performance of INSERT and UPDATE decreases linearly with the number of indexes.
Compiling new SQL statements using Prepare() takes longer when there are more indexes for the optimizer to choose between.
Surplus indexes give the optimizer more opportunities to make a bad choice.
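The storage cost listed above is easy to measure. The following sketch assumes Python's standard sqlite3, os and tempfile modules; the table t, its columns and the row count are hypothetical values chosen for the illustration. It builds the same data with zero, one and three indexes and compares the resulting file sizes.

```python
import os
import sqlite3
import tempfile

def database_size(index_count, row_count=2000):
    """Build a small database with the given number of indexes; return its size in bytes."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE t(a, b, c)")
        # Index up to three columns, one single-column index each.
        for n, column in enumerate("abc"[:index_count]):
            conn.execute(f"CREATE INDEX t_idx{n} ON t({column})")
        # Every INSERT updates the table and each of its indexes.
        conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                         ((i, i * 7, i * 13) for i in range(row_count)))
        conn.commit()
        conn.close()
        return os.path.getsize(path)
    finally:
        os.remove(path)

size_no_index = database_size(0)
size_one_index = database_size(1)
size_three_indexes = database_size(3)
print(size_no_index, size_one_index, size_three_indexes)
```

The same harness could be extended with a timer around the inserts to observe the write-side cost, though timings vary too much between machines to quote here.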

Your policy on indexes should be to avoid them wherever you can. Indexes are powerful medicine and can work wonders to improve the performance of a program. But just as too many drugs can be worse than none at all, so also can too many indexes cause more harm than good.

When building a new application, a good approach is to omit all explicitly declared indexes in the beginning and only add indexes as needed to address specific performance problems.

Take care to avoid redundant indexes. For example, consider this schema:

CREATE TABLE demo323a(a,b,c);
CREATE INDEX idx323a1 ON demo323a(a);
CREATE INDEX idx323a2 ON demo323a(a,b);

The idx323a1 index is redundant and can be eliminated. Anything that the idx323a1 index can do the idx323a2 index can do better.

Other redundancies are not quite as apparent as the above. Recall that any column or columns that are declared UNIQUE or PRIMARY KEY (except for the special case of INTEGER PRIMARY KEY) are automatically indexed. So in the following schema:

CREATE TABLE demo323b(x TEXT PRIMARY KEY, y INTEGER UNIQUE);
CREATE INDEX idx323b1 ON demo323b(x);
CREATE INDEX idx323b2 ON demo323b(y);

Both indexes are redundant and can be eliminated with no loss in query performance. Occasionally one sees a novice SQL programmer use both UNIQUE and PRIMARY KEY on the same column:

CREATE TABLE demo323c(p TEXT UNIQUE PRIMARY KEY, q);

This has the effect of creating two indexes on the “p” column – one for the UNIQUE keyword and another for the PRIMARY KEY keyword. Both indexes are identical so clearly one can be omitted. A PRIMARY KEY is guaranteed to always be unique so the UNIQUE keyword can be removed from the demo323c table definition with no ambiguity or loss of functionality.

It is not a fatal error to create too many indexes or redundant indexes. SQLite will continue to generate the correct answers but it may take longer to produce those answers and the resulting database files might be a little larger. So for best results, keep the number of indexes to a minimum.
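Redundant indexes of the demo323b kind can be spotted by listing the indexes on a table. The sketch below assumes Python's standard sqlite3 module and a hypothetical plan() helper; it shows the automatic indexes sitting alongside the explicit ones, then drops the explicit ones and confirms that a lookup on x is still served by an index.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of every EXPLAIN QUERY PLAN row, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo323b(x TEXT PRIMARY KEY, y INTEGER UNIQUE)")
conn.execute("CREATE INDEX idx323b1 ON demo323b(x)")
conn.execute("CREATE INDEX idx323b2 ON demo323b(y)")

# PRAGMA index_list reports one row per index; the name is in column 1.
names = [row[1] for row in conn.execute("PRAGMA index_list(demo323b)")]
automatic = sorted(n for n in names if n.startswith("sqlite_autoindex"))
explicit = sorted(n for n in names if not n.startswith("sqlite_autoindex"))

# Remove the redundant explicit indexes; the automatic ones remain.
for name in explicit:
    conn.execute(f"DROP INDEX {name}")
plan_after_drop = plan(conn, "SELECT y FROM demo323b WHERE x='key'")

print("automatic:", automatic)
print("explicit: ", explicit)
print("lookup after drop:", plan_after_drop)
```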

Avoid Tables and Indexes with an Excessive Number of Columns

SQLite places no arbitrary limits on the number of columns in a table or index. There are known commercial applications using SQLite that construct tables with tens of thousands of columns each. And these applications actually work.

However the database engine is optimized for the common case of tables with no more than a few dozen columns. For best performance you should try to stay in the optimized region. Furthermore, we note that relational databases with a large number of columns are usually not well normalized. So even apart from performance considerations, if you find your design has tables with more than a dozen or so columns, you really need to rethink how you are building your application.

There are a number of places in Prepare() that run in time O(N²) where N is the number of columns in the table. The constant of proportionality is small in these cases so you should not have any problems for N of less than one hundred but for N on the order of a thousand, the time to run Prepare() can start to become noticeable.

When the bytecode is running and it needs to access the i-th column of a table, the values of the previous i-1 columns must be accessed first. So if you have a large number of columns, accessing the last column can be an expensive operation. This fact also argues for putting smaller and more frequently accessed columns early in the table.

There are certain optimizations that will only work if the table has 30 or fewer columns. The optimization that extracts all necessary information from an index and never refers to the underlying table works this way. So in some cases, keeping the number of columns in a table at or below 30 can result in a 2-fold speed improvement.

Indexes will only be used if they contain 30 or fewer columns. You can put as many columns in an index as you want, but if the number is greater than 30, the index will never improve performance and will never do anything but take up space in your database file.