During the holidays I had the opportunity to read the wonderful book, SQL Performance Explained and now I want to share some of my new found knowledge.
The book contains a lot more information such as how joins, datatypes, ranges, etc, affect performance. In this post I will write about indexing, what an index is, what should be indexed, when to use simple indexes and when to use composite.
I will be using a Postgres database since that is what I know best, but most of it can be applied to any database although the details may differ.
Why index?
There are at least two reasons for using an index. One is for making sure that
invalid data is not entered, in that case the index, unique
, is used as a
constraint. The other reason for using indexes is what I'm writing about here,
performance.
By tuning my database, I will not only improve the performance of single queries, I will also improve how many queries it will be able to handle at the same time. A well-tuned database uses less resources and does more with less.
What is an index?
An index is a B-Tree, a data structure that keeps data sorted and allows searches, sequential access, insertions and deletions in logarithmic time. --Wikipedia
The index is the reason why a database is fast. If I have a table with 1
million rows and I do a sequential search, I have to search through all the
rows to know if something matches. If on the other hand I have an index on it I
only have to search through log2(1 million) ~ 20
rows instead. Quite an
improvement! And this improvement is for log2
.
Databases expose this concept to a maximum extent and put as many entries as possible into each node—often hundreds. That means that every new index level supports a hundred times more entries. --SQL Performance Explained
This means a database has to search through log100(1 million) ~ 3
rows.
What should I index?
So, the million-dollar question, actually the e-book is only EUR 9.95 :), is: "What should I index?". The answer to this is, as always, it depends, (Damn consultants never willing to stand for anything!) :). But, it does not depend that much. I'm going to go out on a limb and write:
In a typical web application I will have a 100 times more queries than inserts, so I index everything I put in my where clauses.
So, with this premise, it is as easy as analyzing my queries finding all the where clauses and adding an index to each an every one of them.
Example
Let's say that we have a users
table with four columns, the database usually
adds an automatic index to the primary key.
A users table with four columns
- id (primary key)
- firstname
- lastname
- male_female
When I select a user by id
, the primary key, this is how Postgres will execute it.
/* Explain how postgres will execute. */ explain select * from users where id = 1; Index Scan using users_pkey on users (cost=0.00..9.37 rows=1 width=49) Index Cond: (id = 1) /* Index Scan is good, it means Postgres is using the index. */
Postgres will use the users_pkey
index and it will cost 9.37. (Cost is
measured in units of disk page fetches).
Simple Indexes
If I instead try to find a user by firstname
/* Explain how postgres will execute. */ explain select * from users where firstname = 'Anders'; Seq Scan on users (cost=0.00..22805.00 rows=1 width=49) Filter: ((firstname)::text = 'Anders'::text) /* Seq Scan is not good, it means Postgres is doing full table scan. */
Aiya! Postgres will do a full table scan and it will take cost 22805. That is more than 2000 times slower!
To fix this I need to put an index on the table.
/* Add index on firstname column of users. */ create index users_firstname on users (firstname); CREATE INDEX /* Explain how postgres will execute */ explain select * from users where firstname = 'Anders'; explain select * from users where firstname = 'Anders'; Index Scan using users_firstname on users (cost=0.00..9.81 rows=1 width=49) Index Cond: ((firstname)::text = 'Anders'::text) /* Index Scan is good, it means Postgres is using the index. */
Zippedidoodaa! Postgres is performant again, by using my new index.
Alright, that was easy. So what happens if we want to find users by both
firstname
and lastname
?
/* Explain how postgres will execute */ explain select * from users where firstname = 'Anders' and lastname = 'Janss'; Index Scan using users_firstname on users (cost=0.00..9.81 rows=1 width=49) Index Cond: ((firstname)::text = 'Anders'::text) Filter: ((lastname)::text = 'Janss'::text) /* Index Scan is good, but the filter means Postgres is scanning the found data. */
The Filter
above means that Postgres will scan the result for matching
lastname
s. We can do better. Let's add an index on lastname
to see what
happens.
/* Add index on lastname column of users. */ create index users_lastname on users(lastname); CREATE INDEX explain select * from users where firstname = 'Anders' and lastname = 'Janss'; Index Scan using users_lastname on users (cost=0.00..9.81 rows=1 width=49) Index Cond: ((lastname)::text = 'Janss'::text) Filter: ((firstname)::text = 'Anders'::text) /* Index Scan is good, but the filter means Postgres is scanning the found data. */
Nothing happened! Postgres cannot take advantage of this index. Let's remove it again. My Mama always said "Don't leave indexes around unless you can prove that they help you." :).
Composite Indexes
To solve the problem with AND
clauses I need to create a composite index.
/* Create a composite index on firstname and lastname on users. */ create index users_firstname_lastname on users(firstname, lastname); CREATE INDEX explain select * from users where firstname = 'Anders' and lastname = 'Janss'; Index Scan using users_firstname on users (cost=0.00..9.81 rows=1 width=49) Index Cond: ((firstname)::text = 'Anders'::text) Filter: ((lastname)::text = 'Janss'::text) /* Postgres is not using our new index, it prefers our other index */
WTF? Postgres is ignoring our composite index, in preference to our firstname
index. Why? Because it believes that it is faster! Is it? Let's find out. The
\timing
command turns on timing in Postgres.
\timing Timing is on. /* Select count instead to avoid printing everything to screen */ select count(*) from users where firstname = 'Anders' and lastname = 'Janss'; count ------- 12018 (1 row) Time: 10.848 ms
Now I drop the firstname
index to make sure that Postgres will use my index.
drop index users_firstname; DROP INDEX explain select count(*) from users where firstname = 'Anders' and lastname = 'Janss' ; Aggregate (cost=10.70..10.71 rows=1 width=0) -> Index Only Scan using users_firstname_lastname on users (cost=0.00..10.69 rows=1 width=0) Index Cond: ((firstname = 'Anders'::text) AND (lastname = 'Janss'::text)) /* The calculated cost is higher than the cost for using the firstname index */ select count(*) from users where firstname = 'Anders' and lastname = 'Janss'; count ------- 12018 (1 row) Time: 9.566 ms
The query using my composite index is about 1ms faster, so in this case Postgres was wrong, but for other data distributions my guess may be wrong. Another reason to always measure.
If I now perform the firstname
query again, Postgres can use my composite
index, but if I try to search for lastname
it cannot. The order of the fields
in the composite index matters.
explain select * from users where firstname = 'Anders' ; Index Scan using users_firstname_lastname on users (cost=0.00..10.69 rows=1 width=49) Index Cond: ((firstname)::text = 'Anders'::text) /* Using the composite index at a slightly higher cost than simple index. */ explain select * from users where lastname = 'Janss' ; Seq Scan on users (cost=0.00..23263.25 rows=1 width=49) Filter: ((lastname)::text = 'Janss'::text)``` /* The search for lastname has to do a full table scan. */
Alright, one last query before I'm done. What happens with an OR
query?
explain select * from users where firstname = 'Anders' or lastname = 'Janss'; Seq Scan on users (cost=0.00..25818.30 rows=2 width=49) Filter: (((firstname)::text = 'Anders'::text) OR ((lastname)::text = 'Janss'::text)) /* The index is not used for the OR query. */
OR
queries are like separate queries and they need separate indexes. I'll add
an extra index to lastname
and we should be good to go.
create index users_lastname on users(lastname); CREATE INDEX explain select * from users where firstname = 'Anders' or lastname = 'Janss'; Bitmap Heap Scan on users (cost=12.51..20.45 rows=2 width=49) Recheck Cond: (((firstname)::text = 'Anders'::text) OR ((lastname)::text = 'Janss'::text)) -> BitmapOr (cost=12.51..12.51 rows=2 width=0) -> Bitmap Index Scan on users_firstname_lastname (cost=0.00..6.68 rows=1 width=0) Index Cond: ((firstname)::text = 'Anders'::text) -> Bitmap Index Scan on users_lastname (cost=0.00..5.82 rows=1 width=0) Index Cond: ((lastname)::text = 'Janss'::text) /* Using both our composite index and the new lastname index. */
It worked! Postgres was able to use both our composite index and the new
lastname
index.
Summary
So to sum it up, to get a performant and scalable database:
- Use indexes that covers all fields that are
AND
ed inwhere
clauses. - Use separate indexes for fields that are in
OR
clauses. - Create reusable composite indexes.
- When searching for
(firstname, lastname)
andfirstname
, add ONE composite index on(firstname, lastname)
. - When searching for
(firstname, lastname)
andlastname
, add ONE composite index on(lastname, firstname)
. - When searching for
(firstname, lastname)
,firstname
, andlastname
, add TWO indexes, one composite index on(firstname, lastname)
(or reversed) and one simple index onlastname
(orfirstname
)
- When searching for
I highly recommend the book. It is good!
6 comments:
Great samples, it's always important to EXPLAIN how indexes works :-) You may precise that INDEX ONLY SCAN appears in version 9.2, if not, some users running older version won't success when they'll replay your samples.
Regards
@Rodolphe, Thanks I'm glad you liked it. Even though index only scan didn't appear until 9.2. The examples should still work. Only the output will change.
Awesome post.. big thumbs up :-)
nice article. straight to the point and clear.
Thanks Mpho, I'm glad you liked it.
Nicely Explained!!
Post a Comment