Saturday, June 14, 2014

Index-Organized Tables and Clustered Indexes


The index-only scan executes an SQL statement using only the redundant data stored in the index. The original data in the heap table is not needed. If we take that concept to the next level and put all columns into the index, you may wonder why we need the heap table.
Some databases can indeed use an index as primary table store. The Oracle database calls this concept index-organized tables (IOT), other databases use the term clustered index. In this section, both terms are used to either put the emphasis on the table or the index characteristics as needed.
An index-organized table is thus a B-tree index without a heap table. This results in two benefits: (1) it saves the space for the heap structure; (2) every access on a clustered index is automatically an index-only scan. Both benefits sound promising but are hardly achievable in practice.

The drawbacks of an index-organized table become apparent when creating another index on the same table. Analogous to a regular index, a so-called secondary index refers to the original table data—which is stored in the clustered index. There, the data is not stored statically as in a heap table but can move at any time to maintain the index order. It is therefore not possible to store the physical location of the rows in the index-organized table in the secondary index. The database must use a logical key instead.
The following figures show an index lookup for finding all sales on May 23rd 2012. For comparison, we will first look at Figure 5.2 that shows the process when using a heap table. The execution involves two steps: (1) the INDEX RANGE SCAN; (2) the TABLE ACCESS BY INDEX ROWID.

Figure 5.2. Index-Based Access on a Heap Table






http://use-the-index-luke.com/img/fig05_02_index_on_heap_table.en.svg

Although the table access might become a bottleneck, it is still limited to one read operation per row because the index has the ROWID as a direct pointer to the table row. The database can immediately load the row from the heap table because the index has its exact position. The picture changes, however, when using a secondary index on an index-organized table. A secondary index does not store a physical pointer (ROWID) but only the key values of the clustered index—the so-called clustering key. Often that is the primary key of the index-organized table.

Why Secondary Indexes have no ROWID

Accessing a secondary index does not deliver a ROWID but a logical key for searching the clustered index. A single access, however, is not sufficient for searching clustered index—it requires a full tree traversal. That means that accessing a table via a secondary index searches two indexes: the secondary index once (INDEX RANGE SCAN), then the clustered index for each row found in the secondary index (INDEX UNIQUE SCAN).

Figure 5.3. Secondary Index on an IOT 

      

Figure 5.3 makes it clear, that the B-tree of the clustered index stands between the secondary index and the table data.

Accessing an index-organized table via a secondary index is very inefficient, and it can be prevented in the same way one prevents a table access on a heap table: by using an index-only scan—in this case better described as “secondary-index-only scan”. The performance advantage of an index-only scan is even bigger because it not only prevents a single access but an entire INDEX UNIQUE SCAN.

Important

Accessing an index-organized table via a secondary index is very inefficient.
Using this example we can also see that databases exploit all the redundancies they have. Bear in mind that a secondary index stores the clustering key for each index entry. Consequently, we can query the clustering key from a secondary index without accessing the index-organized table:
SELECT sale_id
  FROM sales_iot
 WHERE sale_date = ?;
-------------------------------------------------
| Id | Operation        | Name           | Cost |
-------------------------------------------------
|  0 | SELECT STATEMENT |                |    4 |
|* 1 |  INDEX RANGE SCAN| SALES_IOT_DATE |    4 |
-------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("SALE_DATE"=:DT)
 
The table SALES_IOT is an index-organized table that uses SALE_ID as clustering key. Although the index SALE_IOT_DATE is on the SALE_DATE column only, it still has a copy of the clustering key SALE_ID so it can satisfy the query using the secondary index only.
When selecting other columns, the database has to run an INDEX UNIQUE SCAN on the clustered index for each row:

SELECT eur_value
  FROM sales_iot
 WHERE sale_date = ?;
---------------------------------------------------
| Id  | Operation         | Name           | Cost |
---------------------------------------------------
|   0 | SELECT STATEMENT  |                |   13 |
|*  1 |  INDEX UNIQUE SCAN| SALES_IOT_PK   |   13 |
|*  2 |   INDEX RANGE SCAN| SALES_IOT_DATE |    4 |
---------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("SALE_DATE"=:DT)
   2 - access("SALE_DATE"=:DT)
 
Index-organized tables and clustered indexes are, after all, not as useful as it seems at first sight. Performance improvements on the clustered index are easily lost on when using a secondary index. The clustering key is usually longer than a ROWID so that the secondary indexes are larger than they would be on a heap table, often eliminating the savings from the omission of the heap table. The strength of index-organized tables and clustered indexes is mostly limited to tables that do not need a second index. Heap tables have the benefit of providing a stationary master copy that can be easily referenced.

Important

Tables with one index only are best implemented as clustered indexes or index-organized tables.
Tables with more indexes can often benefit from heap tables. You can still use index-only scans to avoid the table access. This gives you the select performance of a clustered index without slowing down other indexes.
Database support for index-organized tables and clustered index is very inconsistent. The following overview explains the most important specifics.
By default SQL Server uses clustered indexes (index-organized tables) using the primary key as clustering key. Nevertheless you can use arbitrary columns for the clustering key—even non-unique columns.
To create a heap table you must use the NONCLUSTERED clause in the primary key definition:
CREATE TABLE (
   id    NUMBER NOT NULL,
   [...]
   CONSTRAINT pk PRIMARY KEY NONCLUSTERED (id)
);
Dropping a clustered index transforms the table into a heap table.
SQL Server’s default behavior often causes performance problems when using secondary indexes.

No comments:

Popular Posts