Friday, January 18, 2013

Index Your Database, Like a Boss

During the holidays I had the opportunity to read the wonderful book, SQL Performance Explained and now I want to share some of my new found knowledge.

The book contains a lot more information such as how joins, datatypes, ranges, etc, affect performance. In this post I will write about indexing, what an index is, what should be indexed, when to use simple indexes and when to use composite.

I will be using a Postgres database since that is what I know best, but most of it can be applied to any database although the details may differ.

Why index?

There are at least two reasons for using an index. One is for making sure that invalid data is not entered, in that case the index, unique, is used as a constraint. The other reason for using indexes is what I'm writing about here, performance.

By tuning my database, I will not only improve the performance of single queries, I will also improve how many queries it will be able to handle at the same time. A well-tuned database uses less resources and does more with less.

What is an index?

An index is a B-Tree, a data structure that keeps data sorted and allows searches, sequential access, insertions and deletions in logarithmic time. --Wikipedia

The index is the reason why a database is fast. If I have a table with 1 million rows and I do a sequential search, I have to search through all the rows to know if something matches. If on the other hand I have an index on it I only have to search through log2(1 million) ~ 20 rows instead. Quite an improvement! And this improvement is for log2.

Databases expose this concept to a maximum extent and put as many entries as possible into each node—often hundreds. That means that every new index level supports a hundred times more entries. --SQL Performance Explained

This means a database has to search through log100(1 million) ~ 3 rows.

What should I index?

So, the million-dollar question, actually the e-book is only EUR 9.95 :), is: "What should I index?". The answer to this is, as always, it depends, (Damn consultants never willing to stand for anything!) :). But, it does not depend that much. I'm going to go out on a limb and write:

In a typical web application I will have a 100 times more queries than inserts, so I index everything I put in my where clauses.

So, with this premise, it is as easy as analyzing my queries finding all the where clauses and adding an index to each an every one of them.

Example

Let's say that we have a users table with four columns, the database usually adds an automatic index to the primary key.

A users table with four columns

  • id (primary key)
  • firstname
  • lastname
  • male_female

When I select a user by id, the primary key, this is how Postgres will execute it.

/* Explain how postgres will execute. */
explain select * from users where id = 1;

 Index Scan using users_pkey on users  (cost=0.00..9.37 rows=1 width=49)
   Index Cond: (id = 1)
/* Index Scan is good, it means Postgres is using the index. */

Postgres will use the users_pkey index and it will cost 9.37. (Cost is measured in units of disk page fetches).

Simple Indexes

If I instead try to find a user by firstname

/* Explain how postgres will execute. */
explain select * from users where firstname = 'Anders';

 Seq Scan on users  (cost=0.00..22805.00 rows=1 width=49)
   Filter: ((firstname)::text = 'Anders'::text)
/* Seq Scan is not good, it means Postgres is doing full table scan. */

Aiya! Postgres will do a full table scan and it will take cost 22805. That is more than 2000 times slower!

To fix this I need to put an index on the table.

/* Add index on firstname column of users. */
create index users_firstname on users (firstname);
CREATE INDEX

/* Explain how postgres will execute */
explain select * from users where firstname = 'Anders';

explain select * from users where firstname = 'Anders';
 Index Scan using users_firstname on users  (cost=0.00..9.81 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
/* Index Scan is good, it means Postgres is using the index. */

Zippedidoodaa! Postgres is performant again, by using my new index.

Alright, that was easy. So what happens if we want to find users by both firstname and lastname?

/* Explain how postgres will execute */
explain select * from users where firstname = 'Anders' and lastname = 'Janss';

 Index Scan using users_firstname on users  (cost=0.00..9.81 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
   Filter: ((lastname)::text = 'Janss'::text)
/* Index Scan is good, but the filter means Postgres is scanning the found data. */

The Filter above means that Postgres will scan the result for matching lastnames. We can do better. Let's add an index on lastname to see what happens.

/* Add index on lastname column of users. */
create index users_lastname on users(lastname);
CREATE INDEX

explain select * from users where firstname = 'Anders' and lastname = 'Janss';

 Index Scan using users_lastname on users  (cost=0.00..9.81 rows=1 width=49)
    Index Cond: ((lastname)::text = 'Janss'::text)
       Filter: ((firstname)::text = 'Anders'::text)
/* Index Scan is good, but the filter means Postgres is scanning the found data. */

Nothing happened! Postgres cannot take advantage of this index. Let's remove it again. My Mama always said "Don't leave indexes around unless you can prove that they help you." :).

Composite Indexes

To solve the problem with AND clauses I need to create a composite index.

/* Create a composite index on firstname and lastname on users. */
create index users_firstname_lastname on users(firstname, lastname);
CREATE INDEX
explain select * from users where firstname = 'Anders' and lastname = 'Janss';

 Index Scan using users_firstname on users  (cost=0.00..9.81 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
   Filter: ((lastname)::text = 'Janss'::text)
/* Postgres is not using our new index, it prefers our other index */

WTF? Postgres is ignoring our composite index, in preference to our firstname index. Why? Because it believes that it is faster! Is it? Let's find out. The \timing command turns on timing in Postgres.

\timing
Timing is on.

/* Select count instead to avoid printing everything to screen */
select count(*) from users where firstname = 'Anders' and lastname = 'Janss';
 count
-------
 12018
(1 row)
Time: 10.848 ms

Now I drop the firstname index to make sure that Postgres will use my index.

drop index users_firstname;
DROP INDEX

explain select count(*) from users where firstname = 'Anders' and lastname = 'Janss' ;

 Aggregate  (cost=10.70..10.71 rows=1 width=0)
   ->  Index Only Scan using users_firstname_lastname on users  (cost=0.00..10.69 rows=1 width=0)
         Index Cond: ((firstname = 'Anders'::text) AND (lastname = 'Janss'::text))
/* The calculated cost is higher than the cost for using the firstname index */


select count(*) from users where firstname = 'Anders' and lastname = 'Janss';
 count 
-------
 12018
(1 row)
Time: 9.566 ms

The query using my composite index is about 1ms faster, so in this case Postgres was wrong, but for other data distributions my guess may be wrong. Another reason to always measure.

If I now perform the firstname query again, Postgres can use my composite index, but if I try to search for lastname it cannot. The order of the fields in the composite index matters.

explain select * from users where firstname = 'Anders' ;
 Index Scan using users_firstname_lastname on users  (cost=0.00..10.69 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
/* Using the composite index at a slightly higher cost than simple index. */

explain select * from users where lastname = 'Janss' ;

 Seq Scan on users  (cost=0.00..23263.25 rows=1 width=49)
   Filter: ((lastname)::text = 'Janss'::text)```
/* The search for lastname has to do a full table scan. */

Alright, one last query before I'm done. What happens with an OR query?

explain select * from users where firstname = 'Anders' or lastname = 'Janss';

 Seq Scan on users  (cost=0.00..25818.30 rows=2 width=49)
   Filter: (((firstname)::text = 'Anders'::text) OR ((lastname)::text = 'Janss'::text))
/* The index is not used for the OR query. */

OR queries are like separate queries and they need separate indexes. I'll add an extra index to lastname and we should be good to go.

create index users_lastname on users(lastname);
CREATE INDEX

explain select * from users where firstname = 'Anders' or lastname = 'Janss';

 Bitmap Heap Scan on users  (cost=12.51..20.45 rows=2 width=49)
   Recheck Cond: (((firstname)::text = 'Anders'::text) OR ((lastname)::text = 'Janss'::text))
   ->  BitmapOr  (cost=12.51..12.51 rows=2 width=0)
         ->  Bitmap Index Scan on users_firstname_lastname  (cost=0.00..6.68 rows=1 width=0)
               Index Cond: ((firstname)::text = 'Anders'::text)
         ->  Bitmap Index Scan on users_lastname  (cost=0.00..5.82 rows=1 width=0)
               Index Cond: ((lastname)::text = 'Janss'::text)
/* Using both our composite index and the new lastname index. */

It worked! Postgres was able to use both our composite index and the new lastname index.

Summary

So to sum it up, to get a performant and scalable database:

  • Use indexes that covers all fields that are ANDed in where clauses.
  • Use separate indexes for fields that are in OR clauses.
  • Create reusable composite indexes.
    • When searching for (firstname, lastname) and firstname, add ONE composite index on (firstname, lastname).
    • When searching for (firstname, lastname) and lastname, add ONE composite index on (lastname, firstname).
    • When searching for (firstname, lastname), firstname, and lastname, add TWO indexes, one composite index on (firstname, lastname) (or reversed) and one simple index on lastname (or firstname)

I highly recommend the book. It is good!

No comments: