ANALYZE Command

This guide gives details of how to use the ANALYZE command to help the optimizer.

Introduction

The ANALYZE command is used to provide the query optimizer with statistical information about the distribution of values in a database so that it can make better decisions about which indexes to use.

Intended audience:

This document is intended to be used by Symbian platfomr licensees and third party application developers.

Use the ANALYZE Command to Help the Optimizer

A big part of the job of the SQL compiler is to choose which of many implementation options should be used for a particular SQL statement.

Often a table will have two or more indexes and the optimizer will need to choose which one of those indexes to use. Or the optimizer will have to decide between using an index to implement the WHERE clause or the ORDER BY clause. For a join, the optimizer has to pick an appropriate order for the table in the join. And so forth.

By default, the only information the optimizer has to go on when trying to choose between two or more indexes is the database schema. In many cases the schema alone is not sufficient to make a wise choice. Suppose, for example, you have a table with two indexes:


CREATE TABLE demo318(a,b,c,data);
CREATE INDEX idx318a ON demo381(a);
CREATE INDEX idx318b ON demo381(b);

And further suppose SQLite is trying to compile a statement such as the following:


SELECT data FROM demo318 WHERE a=0 AND b=11;

The optimizer has to decide whether it should use idx318a with the “a=5” term of the WHERE clause or if it should use idx318b with the “b=11” term.

Without additional information, the two approaches are equivalent and so the choice is arbitrary. But the choice of indexes might have a big impact on performance.

Suppose that the demo318 table contains one million rows of which 55% have a=0, 40% have a=1, and the rest have a=2 or a=3. If the idx381a index is chosen, it will only narrow down the set of candidate rows by half. In this situation you are unlikely to notice any performance gain at all. In fact, in this extreme case, using the index is likely to make the query slower!

On the other hand, let us assume that there are never more than 2 or 3 rows with the same value for column b. In this case using index idx381b allows us to reduce the set of candidate rows from one million to 2 or 3 which will yield a huge speed improvement.

So clearly, in this example we hope that the optimizer chooses index idx318b. But without knowledge of the table contents and the statistical distributions of values for the “a” and “b” columns, the optimizer has no way of knowing which of the idx318a and idx318b indexes will work better.

The ANALYZE command is used to provide the query optimizer with statistical information about the distribution of values in a database so that it can make better decisions about which indexes to use.

When you run the ANALYZE command, SQLite creates a special table named SQLITE_STAT1. It then reads the entire database, gathers statistics and puts those statistics in the SQLITE_STAT1 table so that the optimizer can find them. ANALYZE is a relatively slow operation, since it has to read every single row of every single table in your database.

The statistical information generated by ANALYZE is not kept up-to-date by subsequent changes to the database. But many databases have a relatively constant statistical profile so running ANALYZE once and using the same statistical profile forever thereafter is often sufficient.

For embedded applications, you can often get by without ever having to run ANALYZE on a real database on the embedded device. Instead, generate a template database that contains typical data for your application and then run ANALYZE on that database on a developer workstation. Then just copy the content of the resulting SQLITE_STAT1 table over to the embedded device.

The SQLITE_STAT1 table is special in that you cannot CREATE or DROP it, but you can still SELECT, DELETE, and UPDATE the SQLITE_STAT1 table. So to create the SQLITE_STAT1 table on the embedded device, just run ANALYZE when the database is empty. That will take virtually no time and will leave you with an empty SQLITE_STAT1 table. Then copy the content of the SQLITE_STAT1 table that was created on the developer workstation into the database on the embedded device and the optimizer will begin using it just as if it had been created locally.

There is one entry in the SQLITE_STAT1 table for each index in your schema. Each entry contains the name of the table being indexed, the name of the index itself, and a string composed of two or more integers separated by spaces. These integers are the statistics that the ANALYZE command collects for each index.

The first integer is the total number of entries in the table. The second integer is the average number of rows in a result from a query that had an equality constraint on the first term of the index. The third integer (if there is one) is the average number of rows that would result if the index were used with equality constraints on the first two terms of the index. And so forth for as many terms as there are in the index.

The smaller the second and subsequent integers are, the more selective an index is. And the more selective the index is, the fewer rows have to be processed when using that index. So the optimizer tries to use indexes with small second and subsequent numbers.

The details are complex and do not really add any understanding to what is going on. The key point is that the second and subsequent integers in the SQLITE_STAT1 table are a measure of the selectivity of an index and that smaller numbers are better.

The statistics gathered by the ANALYZE command reflect the state of the database at a single point in time. While making the final performance tweaks on an application, some developers may want to adjust these statistics manually to better reflect the kinds of data they expect to encounter in the field. This is easy to do. The SQLITE_STAT1 table is readable and writable using ordinary SQL statements. You are free to adjust or modify the statistics in anyway that seems to performance benefit.