SQL Schema Tips

This guide gives tips for using the database schema.

Introduction

A schema is a way of organizing a set of related database objects, such as tables. A schema contains at least one object and can contain up to every table in the database. This section provides some handy tips for working with schemas in Symbian SQL.

The following tips or techniques are discussed:

  • Handling SQL schema errors

  • Designing tables for more efficient access

  • Storing Large BLOBs Separately from the Database

  • Using INTEGER PRIMARY KEY

  • Efficient Primary Key and Row ID values

  • Keep Schemas Small and Constant

Handling SQL schema errors

In order to translate a SQL statement into bytecode, Prepare() needs to know the database schema. For any given SQL statement there are typically many different ways of implementing it in bytecode. The compiler (or, more precisely, the part of the compiler commonly referred to as the optimizer) tries to pick the most efficient implementation based on what is known about the content of the various tables and which indexes are available. The bytecode depends so heavily on the database schema that even a slight change to the schema can cause completely different bytecode to be generated for the same SQL statement.

When Prepare() runs, it compiles the bytecode against the database schema as it was the last time the database was accessed. So if the schema has been changed by another program, or if the schema is changed after Prepare() completes but before Exec() or Next() is run, the schema of the database will no longer match the schema used to prepare the bytecode.

To guard against any problems, the first few bytecode instructions in a prepared statement read the current schema cookie for the database and check that it is the same as it was when Prepare() was run. If the schema cookie has changed, then Exec() and Next() terminate with the error KSqlErrSchema.

When this occurs, a new prepared statement should be created by calling Prepare() again with the same SQL and the original prepared statement should be discarded.

It is important to recognize that a KSqlErrSchema error can occur on any call to Exec() or Next(). A common programming error is to omit the necessary error handling or to do it incorrectly. The error is, unfortunately, often overlooked during testing because the usual cause of a KSqlErrSchema error is an external program making a schema change at just the wrong moment, and external programs are not normally running during testing.
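
The sketch below shows one way to structure that handling, assuming an already-open RSqlDatabase session and an invented query; it illustrates the pattern rather than reproducing code from the Symbian SQL library.

#include <sqldb.h>  // Symbian SQL API: RSqlDatabase, RSqlStatement, KSqlErrSchema

// Hypothetical query, used only for illustration.
_LIT(KQuery, "SELECT name FROM contact");

TInt RunQueryWithRetry(RSqlDatabase& aDatabase)
    {
    RSqlStatement stmt;
    TInt err = stmt.Prepare(aDatabase, KQuery);
    if (err != KErrNone)
        {
        stmt.Close();
        return err;
        }
    TInt rc;
    while ((rc = stmt.Next()) == KSqlAtRow)
        {
        // Read the current row here with the RSqlStatement column accessors.
        }
    if (rc == KSqlErrSchema)
        {
        // The schema changed behind our back: discard the old statement,
        // prepare the same SQL again and restart the iteration.
        stmt.Close();
        err = stmt.Prepare(aDatabase, KQuery);
        if (err != KErrNone)
            {
            stmt.Close();
            return err;
            }
        while ((rc = stmt.Next()) == KSqlAtRow)
            {
            // Re-read the rows.
            }
        }
    stmt.Close();
    return (rc == KSqlAtEnd) ? KErrNone : rc;
    }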

Designing tables for more efficient access

Put smaller and frequently accessed columns at the beginning of tables. SQLite stores a row of a table by gathering the data for all columns in that row and packing the data together in column order into as few bytes as possible. Column order means that the data for the first declared column of the table – the column that appears first in the CREATE TABLE statement – is packed into the bundle first. The data for the second column is packed into the bundle second. And so forth.

SQLite goes to some trouble to use as few bytes as possible when storing data. A text string that is 5 bytes long, for example, is stored using just 5 bytes; the '\0' terminator on the end is omitted. No padding bytes or overhead are added even if the column is declared with a larger type such as VARCHAR(1000).

Small integers (between -128 and +127) are stored as a single byte. As the magnitude of the integer increases, additional bytes of storage are used as necessary, up to 8 bytes. Floating point numbers are normally stored as 8-byte IEEE floats, but if a floating point number can be represented as a smaller integer without loss of information, it is converted and stored that way to save space. BLOBs are likewise stored using the minimum number of bytes necessary.

These efforts to save space help to keep SQLite database files small and also improve performance: the smaller each record is, the more information fits in a given amount of space and the less disk I/O needs to occur.

The downside of this compressed storage is that the data in each column has an unpredictable size, so within a row SQLite does not know in advance where one column ends and the next begins. To extract the data from the N-th column of a row, SQLite has to decode the data of the N-1 prior columns first. For example, if a query reads just the fourth column of a table, the bytecode that implements the query has to read and discard the data of the first, second, and third columns simply to find where the data for the fourth column begins.

For this reason, it is best to put smaller and more frequently accessed columns of a table early in the CREATE TABLE statement and put large CLOBs and BLOBs and infrequently accessed information toward the end.
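
As an illustration, the following sketch declares a hypothetical message table with its small, frequently read columns first and the body BLOB last; the table and column names are invented for this example.

#include <sqldb.h>

// Hypothetical schema: small, frequently read columns first, the large and
// rarely needed body BLOB last, so listing queries never skip over the BLOB.
_LIT(KCreateMessageTable, "CREATE TABLE message(msgId INTEGER PRIMARY KEY, msgFlags INTEGER, msgDate INTEGER, msgSubject TEXT, msgBody BLOB)");

TInt CreateMessageTable(RSqlDatabase& aDatabase)
    {
    return aDatabase.Exec(KCreateMessageTable); // negative error code on failure
    }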

Storing Large BLOBs Separately from the Database

SQLite does not place arbitrary constraints on the size of BLOBs and CLOBs that it will store. The only real restriction is that the entire BLOB or CLOB must fit in memory all at once. But just because you can do this does not mean you should.

Although SQLite is able to handle really big BLOBs, it is not optimized for that case. If performance and storage efficiency are a concern, you will be better served by storing large BLOBs and CLOBs in separate disk files and storing the names of those files in the database in place of the actual content.

When SQLite encounters a large table row – e.g. because it contains one or more large BLOBs – it tries to store as much of the row as it can on a single page of the database. The tail of the row that will not fit on a single page is spilled into overflow pages. If more than one overflow page is required, then the overflow pages form a linked list.

Storing a 1MiB BLOB in a database with a 1KiB page size requires the BLOB to be broken into a linked list of more than a thousand overflow pages, each of which is stored separately. Whenever you want to read the BLOB out of the database, SQLite has to walk this thousand-member linked list to reassemble the BLOB. This works and is reliable, but it is not especially efficient.

Another problem with large BLOBs and CLOBs is that they can be neither read nor written incrementally. The entire BLOB or CLOB must be read or written all at once. So, for example, to append 10 bytes to the end of a 10MiB BLOB, one has to read the entire BLOB, append the 10 bytes, then write the new BLOB back out again. With a disk file, on the other hand, you can append 10 bytes quickly and efficiently without having to access the other 10 million bytes of the original file.

If you absolutely need to store large BLOBs and CLOBs in your database (perhaps so that changes to these data elements will be atomic and durable) then at least consider storing the BLOBs and CLOBs in separate tables that contain only an INTEGER PRIMARY KEY and the content of the BLOB or CLOB. So, instead of writing your schema this way:

CREATE TABLE image(
    imageId INTEGER PRIMARY KEY,
    imageTitle TEXT,
    imageWidth INTEGER,
    imageHeight INTEGER,
    imageRefCnt INTEGER,
    imageBlob BLOB
);

Factor out the large BLOB into a separate table so that your schema looks more like this:

CREATE TABLE image(
    imageId INTEGER PRIMARY KEY,
    imageTitle TEXT,
    imageWidth INTEGER,
    imageHeight INTEGER,
    imageRefCnt INTEGER
);

CREATE TABLE imagedata(
    imageId INTEGER PRIMARY KEY,
    imageBlob BLOB
);

Keeping all of the small columns in a separate table increases locality of reference and allows queries that do not use the large BLOB to run much faster. You can still write single queries that return both the smaller fields and the BLOB by using a simple (and very efficient) join on the INTEGER PRIMARY KEY.
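
For instance, a lookup along the lines of the following sketch (which assumes the image and imagedata tables above and an already-open database session) returns the title and the BLOB from one statement:

#include <sqldb.h>

// Join image and imagedata on their shared INTEGER PRIMARY KEY so that the
// small fields and the BLOB come back from a single statement.
_LIT(KLoadImage, "SELECT i.imageTitle, d.imageBlob FROM image i JOIN imagedata d ON i.imageId=d.imageId WHERE i.imageId=?");

TInt LoadImage(RSqlDatabase& aDatabase, TInt aImageId)
    {
    RSqlStatement stmt;
    TInt err = stmt.Prepare(aDatabase, KLoadImage);
    if (err == KErrNone)
        {
        err = stmt.BindInt(0, aImageId);        // parameter indices are zero-based
        }
    if (err == KErrNone)
        {
        TInt rc = stmt.Next();
        if (rc == KSqlAtRow)
            {
            TPtrC title;
            TPtrC8 blob;
            err = stmt.ColumnText(0, title);    // column indices are zero-based
            if (err == KErrNone)
                {
                err = stmt.ColumnBinary(1, blob);
                }
            // Use title and blob here; the descriptors remain valid only
            // while the statement stays positioned on this row.
            }
        else if (rc != KSqlAtEnd)
            {
            err = rc;                           // e.g. KSqlErrSchema
            }
        }
    stmt.Close();
    return err;
    }

Because imageId is the INTEGER PRIMARY KEY of both tables, SQLite can resolve this join with fast RowID lookups rather than a table scan.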

Using INTEGER PRIMARY KEY

INTEGER PRIMARY KEY is a fast special case. Every row of every SQLite table has a signed 64-bit integer RowID. This RowID is the key for the b-tree that holds the table content and must be unique over all other rows in the same table. When you find, insert, or remove a row in an SQLite table, the row is first located by searching for its RowID. Searching by RowID is very fast. Everything in SQLite tables centres around RowIDs.

In the CREATE TABLE statement that defines a table, if you declare a column to be of type INTEGER PRIMARY KEY, then that column becomes an alias for the RowID. It is generally a good idea to create such a column whenever it is practical.

Looking up information by RowID or INTEGER PRIMARY KEY is usually about twice as fast as any other search method in SQLite. So for example, if we have an indexed table like this:

CREATE TABLE demo1(
    id1 INTEGER PRIMARY KEY,
    id2 INTEGER UNIQUE,
    content BLOB
);

Then queries against the INTEGER PRIMARY KEY, e.g. of the form:

SELECT content FROM demo1 WHERE id1=?;

will be about twice as fast as queries like this:

SELECT content FROM demo1 WHERE id2=?;

Note that the following two queries are identical in SQLite – not just equivalent but identical. They generate exactly the same bytecode:

SELECT content FROM demo1 WHERE id1=?;
SELECT content FROM demo1 WHERE RowID=?;

But for stylistic and portability reasons, the first form of the query is preferred.

Also observe that in order for the INTEGER PRIMARY KEY to truly be an alias for the RowID, the declared type must be exactly “INTEGER PRIMARY KEY”.

Variations such as “INT PRIMARY KEY” or “UNSIGNED INTEGER PRIMARY KEY” or “SHORTINT PRIMARY KEY” become independent columns instead of aliases for the RowID.

So be careful to always spell INTEGER PRIMARY KEY correctly in your CREATE TABLE statements.
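
To make the distinction concrete, the two invented declarations below look similar but behave differently; only the first makes the id column an alias for the RowID.

#include <sqldb.h>

// Here 'id' is an alias for the RowID, so lookups by id use the fast RowID search.
_LIT(KAliased, "CREATE TABLE t1(id INTEGER PRIMARY KEY, val TEXT)");

// Here 'id' is an ordinary column backed by a separate automatic unique index,
// and every row still carries its own hidden RowID.
_LIT(KNotAliased, "CREATE TABLE t2(id INT PRIMARY KEY, val TEXT)");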

Efficient Primary Key and Row ID values

Use small positive integers for INTEGER PRIMARY KEY or RowID values. As explained in the section Using INTEGER PRIMARY KEY, every table row has a key which is a signed 64-bit integer: the RowID or INTEGER PRIMARY KEY. This same RowID also occurs in every index entry and is used to map the index entry back to the original table row.

When writing the RowID to disk, SQLite encodes the 64-bit integer value using a Huffman code over a fixed probability distribution. The resulting encoding requires between 1 and 9 bytes of storage space, depending on the magnitude of the integer. Small, non-negative integers require less space than larger or negative integers. Table 3.1 shows the storage requirements:

RowID Magnitude                                   Bytes of Storage
0 to 127                                          1
128 to 16383                                      2
16384 to 2097151                                  3
2097152 to 268435455                              4
268435456 to 34359738367                          5
34359738368 to 4398046511103                      6
4398046511104 to 562949953421311                  7
562949953421312 to 72057594037927935              8
Less than 0 or greater than 72057594037927935     9

Table 3.1: Storage requirements for RowIDs of different magnitudes

This chart makes it clear that the best way to reduce the number of bytes of disk space devoted to storing RowIDs is to keep RowIDs small and non-negative. The database will continue to work fine with large or negative RowIDs, but the database files will take up more space.

SQLite normally assigns RowIDs automatically. An automatically generated RowID will normally be the smallest positive integer that is not already used as a RowID in the same table. The first RowID assigned will be 1, the second 2, and so forth. So the automatic RowID assignment algorithm begins by using 1-byte RowIDs, then moves to 2-byte RowIDs as the number of rows increases, then 3-byte RowIDs, and so forth. In other words, the automatic RowID assignment algorithm does a good job of selecting RowIDs that minimize storage requirements.

If a table contains an INTEGER PRIMARY KEY column, that column becomes an alias for the RowID. If you then specify a particular RowID when doing an INSERT, the RowID you specify overrides the default choice made by SQLite itself. For example, if your table is this:

CREATE TABLE demo311(
    id INTEGER PRIMARY KEY,
    content VARCHAR(100)
);

Then if you insert like this:

INSERT INTO demo311(id,content) VALUES(-17,'Hello');

The value -17 is used as the RowID, and because it is negative it requires 9 bytes of space in the disk file. If, instead of specifying the id value, you had simply omitted it:

INSERT INTO demo311(content) VALUES('Hello');

Or if you had specified a NULL as the INTEGER PRIMARY KEY value:

INSERT INTO demo311(id,content) VALUES(NULL,'Hello');

Then in either case, SQLite would have automatically selected a RowID value that was the smallest positive integer not already in use, thus minimizing the amount of disk space required.

Therefore unless you have specific requirements to the contrary, it generally works best to let SQLite pick its own RowID and INTEGER PRIMARY KEY values.

Keep Schemas Small and Constant

SQLite stores the original text of all CREATE statements in a system table of the database. When a database session is opened or when the database schema is changed, SQLite has to read and parse all of these CREATE statements in order to reconstruct its internal symbol tables.

The parser and symbol table builder inside SQLite are very fast, but they still take time which is proportional to the size of the schema. To minimize the computational overhead associated with parsing the schema:

  • Keep the number and size of CREATE statements in the schema to a minimum.

  • Avoid unnecessary calls to RSqlDatabase::Open(). Reuse the same database session where possible.

  • Avoid changing the database schema while other threads or processes might be using the database.

Whenever the schema is reparsed, all statements that were prepared using the old schema are invalidated. A call to Next() or Exec() on an RSqlStatement object that was prepared prior to the schema reparse will return KSqlErrSchema to let you know that the cause of the error was a schema change. Any prepared statements that were cached will therefore have to be discarded and re-prepared following a schema change. Refilling the prepared statement cache can be as computationally expensive as parsing the schema in the first place.

Prepared statements can also be invalidated for reasons other than schema changes, including calls to Attach() or Detach(). Avoid these calls while you have a large cache of prepared statements. The best strategy is to attach all associated databases as soon as a new database session is opened and before any statements are prepared.
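
A sketch of that strategy is shown below; the file names, secure-ID path and database alias are invented, and the point is simply the ordering: open, attach, then prepare.

#include <sqldb.h>

// Invented file names, secure-ID path and alias, for illustration only.
_LIT(KMainDb,       "c:\\private\\12345678\\main.db");
_LIT(KArchiveDb,    "c:\\private\\12345678\\archive.db");
_LIT(KArchiveAlias, "archive");

TInt OpenSession(RSqlDatabase& aDatabase)
    {
    // Open the session and attach every associated database up front...
    TInt err = aDatabase.Open(KMainDb);
    if (err == KErrNone)
        {
        err = aDatabase.Attach(KArchiveDb, KArchiveAlias);
        }
    // ...and only then prepare (and cache) RSqlStatement objects, so that no
    // later Attach() or Detach() call invalidates them.
    return err;
    }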

Related information
SQLite