Tuesday, April 30, 2013

A Responsible Programmer

In the last few years I have been asked to help save several web projects gone bad. The quality of the projects, code, environment, documentation, and morale has been low. To know what I think is right in situations like this, I started asking myself the question, "What would a responsible programmer do?"

Clarity

Above all else, a responsible programmer values clarity. Not only does she value clear code, but also clear documentation, clear communication and a clear vision of where she and her project are going.

Coding

Write Consistent Code

The responsible programmer writes consistent code. Consistency helps other programmers read and understand her code. It lets them know what to expect. If she names constants with SCREAMING_SNAKE_CASE, they know that they won't change. When naming attributes in CSS and HTML, she will make all of them dash-er-ized, or none of them. This is easy stuff, but important. Consistency breeds familiarity. Familiarity is good: it removes worry and increases confidence.

When the responsible programmer contributes code to other projects, she makes sure that she consistently follows the style of the project. Sometimes it is not easy to tell what style a project uses. The responsible way is to ask what style is preferred and then use that style. By just asking the question, she will often trigger a review of the code which will help set a consistent style in the future.

By writing consistent code, a responsible programmer will make the program easier to understand and easier to maintain.

Don't Quick-Fix

A responsible programmer doesn't do quick fixes. When a bug needs to be fixed she fixes the root problem instead of fixing the symptom. If an event-handler suddenly starts receiving unnamed events, the proper fix is not to ignore unnamed events but to figure out why unnamed events arrive at all when they are not expected. She knows that fixing a symptom will only make the root cause much harder to find.

Write Short Functions

Short functions are easier to understand, easier to reason about and easier to test. It is the responsible thing to write. Enough said!

Separate Commands From Queries

CQS, or Command Query Separation, has become all the rage in the DDD world, but it was coined by Bertrand Meyer in the book Object-Oriented Software Construction in 1988. A good book, read it!

The responsible programmer separates her commands from her queries because she knows that they are easier to test and that she can call the queries many times without anything bad happening. Separating your commands from your queries may be as easy as:

// Can be called whenever, no side-effects
function generateRoute(params) {
  return [params.major, params.minor, params.patch].join('/');
}

// Updates the hash with the new route.
function updateRoute(params) {
  location.hash = generateRoute(params);
}

Refactor Mercilessly

Since a responsible programmer values clarity, she refactors mercilessly when her understanding of the system changes. She knows that the time invested in making the code a little more clear will prevent bugs and frustration in the future.

Prefer Explicit

A responsible programmer prefers explicit code over implicit code. Even though her understanding of advanced concepts such as meta-programming, monads and continuations is substantial, she prefers explicit code over beautiful abstractions. New programmers (including her future self) find it a lot easier to understand code that is explicit than code that is not.

Don't Fear Advanced Techniques

Since advanced techniques can make the code a lot simpler in certain situations she never shies away from advanced techniques when she realizes that they are called for. At the end of the day meta-programming and "advanced" functional programming techniques are just tools that should be used when appropriate.

Check Boundaries

A responsible programmer always checks the boundaries of her system to make sure that invalid data doesn't enter into the core of the application. This way she can avoid defensive programming in the core domain where clarity is even more essential than anywhere else.
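A minimal sketch of what such a boundary check could look like, reusing the route parameters from the generateRoute example above. The function name parseVersionParams and the rule that each part must be a non-negative whole number are assumptions made for illustration.

```javascript
// Hypothetical boundary function: validates raw input once, so that
// code in the core can assume well-formed data.
function parseVersionParams(raw) {
  if (raw === null || typeof raw !== 'object') {
    throw new TypeError('Expected an object of version parameters');
  }
  var clean = {};
  ['major', 'minor', 'patch'].forEach(function (name) {
    var text = String(raw[name]);
    // Assumption: only non-negative whole numbers are valid.
    if (!/^\d+$/.test(text)) {
      throw new RangeError('Invalid ' + name + ': ' + text);
    }
    clean[name] = parseInt(text, 10);
  });
  return clean;
}

// Core code: no defensive checks needed here.
function generateRoute(params) {
  return [params.major, params.minor, params.patch].join('/');
}
```

Everything past parseVersionParams can then trust that major, minor and patch are numbers, which keeps the core free of repeated checks.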

Wrap External Services

External services are one of the main reasons that development takes time, so the responsible programmer makes sure to always wrap external services with a local interface. This simplifies both testing and exchanging the services.
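As a rough sketch, a wrapper could look like the following. The external client, its search method and the response shape are all invented for this example; a real service will look different.

```javascript
// Hypothetical local interface around an external geocoding client.
// The rest of the application only ever calls findCity().
function createCityService(externalClient) {
  return {
    findCity: function (name, callback) {
      externalClient.search({ q: name, type: 'city' }, function (err, response) {
        if (err) return callback(err);
        // Translate the external response into our own shape.
        var first = response.results[0];
        callback(null, first ? { name: first.label } : null);
      });
    }
  };
}

// In tests, a fake client stands in for the real service.
var fakeClient = {
  search: function (query, callback) {
    callback(null, { results: [{ label: query.q.toUpperCase() }] });
  }
};
```

Swapping the service later means writing a new client behind createCityService, while the callers of findCity stay untouched.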

External Libraries

A responsible programmer will never use an external library she doesn't trust. She will never add a library into her code base unless there is a significant reason for adding it. When she adds an external library she will learn it. She will learn how she configures it, how it is to be called, what are good practices for using it, what bugs there are, etc.

Balance

A code base can be compared to a balanced tree. A balanced tree is a data structure that rebalances itself when new items are added to it. This makes modifying the tree more expensive but has the benefit that accessing items in the tree can be performed in an optimal way.

A responsible programmer treats her code base as a balanced tree. She will never add code without thinking about balance. She knows that if the project loses its balance it may have to be entirely rewritten to regain balance again.

The balance of the code may shift as the code matures. When a major shift is called for, the responsible programmer will refactor mercilessly to obtain a new optimal balance.

Documentation

A responsible programmer writes and maintains documentation as the needs come up. The needs differ between projects, but most projects benefit from a system overview, a domain overview, a style guide, and code comments.

A responsible programmer makes assumptions all the time when coding. She writes the assumptions as comments in the code when she makes them, and she tags them to make it possible to generate a list of assumptions.

The System Overview

The system overview is a drawing and a description of all the servers that are involved in the system. This includes databases, queues, web-servers, external services, etc. The description describes how the pieces fit together.

The Domain Overview

This is a drawing and a high level description of how the core domain of the system works. It includes the major concepts of the domain and what they mean.

The Style Guide

The style guide may be as easy as referring to Github's style guide, or you can write your own that alters someone else's. Any way you do it, it is worth having it written down.

Code Comments

Comments in code should be very sparse and only added to point out idiosyncrasies. When assumptions are made, they can be written as comments with a tag to make it possible to generate a list. Example:

// ASSUMPTION: The list is expected to be small and will
// be entirely loaded from the server
function loadCities() {
}
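Generating the list from the tags can be done with a few lines of script. The sketch below assumes the tag format shown above (a // ASSUMPTION: comment, optionally continued on the comment lines directly below it); the function name listAssumptions is made up for the example.

```javascript
// Collect every "// ASSUMPTION:" comment from a source string.
// Comment lines directly below a tagged line are joined onto it.
function listAssumptions(source) {
  var assumptions = [];
  var collecting = false;
  source.split('\n').forEach(function (line) {
    var tagged = line.match(/^\s*\/\/\s*ASSUMPTION:\s*(.*)$/);
    var comment = line.match(/^\s*\/\/\s*(.*)$/);
    if (tagged) {
      assumptions.push(tagged[1].trim());
      collecting = true;
    } else if (collecting && comment) {
      assumptions[assumptions.length - 1] += ' ' + comment[1].trim();
    } else {
      collecting = false;
    }
  });
  return assumptions;
}
```

Run over the loadCities example, this returns a single entry containing the full two-line assumption.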

Testing

The responsible programmer tests! She doesn't test for the sake of testing or to increase code coverage. She tests to be sure that the code works as she expects it to.

She knows that dynamic environments such as Javascript and web browsers are fragile and that it is easy to break code without meaning to.

How to write good tests has been written about elsewhere and I won't spend any time on it here. I can recommend the last chapter in Sandi Metz' book, Practical Object-Oriented Design in Ruby, if you are interested in good techniques for testing in dynamic programming languages.

Environments

The responsible programmer owns her environments. In a project there are at least three environments to care about: Production, Test, and Development. There is also the development machine itself.

The responsible programmer can set up all project environments with a single command that installs everything that is needed, including databases, seed data, libraries, search engines, tools, environment variables, SSH-keys. Everything!

This will allow a new programmer starting in the project to be set up within minutes. It will also allow her to experiment with anything without having to worry about destroying the setup and losing days debugging the environment.

Her personal development machine is also perfectly configured at all times. If she learns a new trick, she will immediately incorporate it into her toolset and into her configuration.

She is automatically prepared for catastrophes. If her hard disk crashes she can just buy a new one at the local supermarket and install her configuration files and be ready to go within hours.

To make this happen she always keeps backups of her configuration files. She keeps the non-secret ones on Github and the secret ones, such as SSH keys and passwords, elsewhere.

Continuous Integration/Deployment

Another part of the environment is continuous integration. If a project doesn't use continuous integration it is a clear sign that it is not healthy.

The continuous integration server is just another environment, and setting up a new one is done with a single command just like the others.

The responsible programmer will set up and maintain continuous integration just like she does every other environment.

Scripting

In order to achieve the environment goals, a responsible programmer knows how to script. Scripting is not only essential for keeping your environments up to date, it is also essential for automating simple tasks. Scripts are useful for generating code, testing, refactoring, renaming, installation, automating checklists, etc.

A responsible programmer asks herself, "When did I do something once? Never!" Writing a script that does what she wants frees her from having to remember the sequence of instructions required to do a task and lets her focus on more important stuff. It also serves as runnable documentation.

Tools

A responsible programmer knows her tools. She will always try to learn more about them and she will replace them if other tools are invented that work better. But she doesn't change her tools for the latest fashion.

The command line is a very powerful tool, and so are a scripting language and a scriptable editor.

Version Control

The responsible programmer uses version control to communicate with her future self and with other programmers. She knows that a clear commit message will help her and others understand what has happened to the system.

She prunes her commits. When she has made several changes to a code base she will make sure that she commits the different changes separately by using something like git add --patch. She also knows that if she commits something by mistake she can alter the commit message or add forgotten files with git commit --amend and that she can change the contents of her history with git rebase --interactive or with git reset.

Projects

Own It

A responsible programmer owns her project; she will not allow anything bad to happen to her code. This is a difficult goal to achieve when she comes into a project that has already gone bad, but it is a worthy goal and not one that should be taken lightly. All projects must have at least one person who owns the code. When people start talking about the code as if it is someone else's, it is time to shut the project down.

If a responsible programmer decides to take ownership of a project, she makes sure that she has the authority to make the decisions that she deems necessary. No authority, no responsibility, it's as simple as that.

Estimation

Sometimes projects require estimates. Most of the time they don't, but sometimes they actually do. A responsible programmer knows how to estimate. She knows that an estimate is just a guess and that any task, however trivial, always has a small probability of taking infinitely long to finish (earthquake, meteor strike, blackout). There is also a small probability that the code is already being written at the time of estimation.

Being aware that the code may take infinitely long to finish she is very careful not to make any promises and she is very clear about her estimates being guesses.

Don't Do as They Are Told

Some people may not see this as a sign of a responsible programmer, but I beg to differ. When a programmer is told to do something, she will try to figure out what the real problem is. She may do this in several different ways. She may sit down and think through the "task" and come up with an alternate solution that solves the problem more simply or, even better, makes the problem go away completely. She may ask questions to help clarify the problem for her. Why is this a problem? Why do you do it like this? Why? Why? Why?

Some people may not like this and tell her to "Just fucking do it!". Her reply to this is something along the lines of "Just fucking do it yourself!" but she is usually a lot more polite so she may very well say "I don't understand what your problem is and, therefore, I am not the best person to solve it, please ask someone else."

She believes it is her job to understand what she is doing and why, and that life is too short to be a drone!

If she feels that she is not able to be a responsible programmer on a project, the responsible thing to do is to leave.

Summary

I have found asking myself the question "What would a responsible programmer do?" liberating. It clarifies what I should do in situations of doubt.

At the end of the day, the responsible programmer can look through the commit log and see a beautiful list of tasks that she has completed. She can look through each commit and see that they are cohesive and well described by their commit message. She can git blame the code and see that every line of code that has her name on it reads well. She can look at her day's work, and she can feel proud!

Sunday, April 07, 2013

Javascript Conditionals

As we all know, Javascript is a very flexible language. In this article I will show different ways to execute conditional code by using some common idioms from Javascript and general object-oriented techniques.

Default values

Javascript (before ES2015) does not support default values for arguments, so it is common to use an if statement or a conditional expression to set default values.

function swim(direction, speed, technique) {
  // Default value with if statement
  if (!direction) direction = 'downstream';

  // Default value with conditional operator
  var speedInMph = speed ? speed : 2;
}

I usually prefer to use an or-expression instead. The short-circuiting or, ||, avoids the repetition of the conditional operator and is, in my opinion, more readable. Another advantage of avoiding repetition is that a slow function such as fastestSwimmer() is only called once, which avoids the performance penalty of calling it twice.

function swim(direction, speed, technique) {
  // Default value with or.
  var swimTechnique = technique || 'crawl';

  // Function is only invoked once
  var swimmer = fastestSwimmer() || 'Michael Phelps';
}

Naturally, this technique is not limited to default arguments. It can be used to set default values from object literals too.

var options = { kind: 'Mountain Tapir' };
var kind = options['kind'] || 'Baird Tapir';

A simple, yet useful, technique.

Update 2013-04-13, as Jeffery mentions in a comment, the technique only works for values that are not falsy. If values such as 0 or false are acceptable values, you will have to explicitly test for undefined instead.
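The explicit test for undefined could look something like the sketch below. It reuses the swim example from above; the returned object is only there to make the result visible and is not part of the original function.

```javascript
// Explicit undefined check: 0 and false survive as real values.
function swim(direction, speed) {
  var swimDirection = (direction === undefined) ? 'downstream' : direction;
  var speedInMph = (speed === undefined) ? 2 : speed;
  return { direction: swimDirection, speedInMph: speedInMph };
}
```

With this version, swim('upstream', 0) keeps the speed at 0, where speed || 2 would have replaced it with 2.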

Call callback if present

Another idiom in Javascript, especially in Node, is passing callbacks to other functions. But sometimes the callbacks passed in are optional. In this case we can use short-circuiting and, &&, instead.

function updateStatistics(data, callback) {
  var result = doSomethingWithData(data);
  // Call the callback if it is defined
  if (callback) return callback(result);
}

function updateStatistics(data, callback) {
  var result = doSomethingWithData(data);
  // The last evaluated value of the `&&` is returned
  return callback && callback(result);
}

I, personally, prefer the first form with the explicit if because I think it communicates my intent better, but it is good to know about the technique anyway.

Update 2013-04-13: A better use for the technique is testing for the presence of objects before getting their properties.

function callService(url, options) {
  ajaxCall(url, options && options.callback);
}

Lookup Tables

If you have code that behaves differently based on the value of a property, it can often result in conditional statements with multiple else ifs or switch cases.

if (kind === 'baird')
  bairdBehavior();
else if (kind === 'malayan')
  malayanBehavior();
else if (kind === 'mountain')
  mountainBehavior();
else if (kind === 'lowland')
  lowlandBehavior();
else
  throw new Error('Invalid kind ' + kind);

I find this kind of code ugly and I don't think it looks any better with a switch statement. I prefer to use a lookup table if there are more than two options.

var kinds = {
  baird: bairdBehavior,
  malayan: malayanBehavior,
  mountain: mountainBehavior,
  lowland: lowlandBehavior
};

var func = kinds[kind];
if (!func)
  throw new Error('Invalid kind ' + kind);
func();

I find this code a lot clearer since it makes it clear that the else clause handles an exceptional case and that the normal cases work similarly.
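One caveat worth sketching: a plain object also inherits keys such as constructor and toString, so looking up kinds['constructor'] returns a function instead of undefined. A hedged variant that only accepts keys defined directly on the table is shown here. The helper name behaviorFor and the stand-in behavior functions are invented for the example.

```javascript
// Stand-in behaviors for illustration only.
function bairdBehavior() { return 'baird'; }
function malayanBehavior() { return 'malayan'; }

var kinds = {
  baird: bairdBehavior,
  malayan: malayanBehavior
};

// Only accept keys defined directly on the table, not inherited ones.
function behaviorFor(kind) {
  if (!Object.prototype.hasOwnProperty.call(kinds, kind))
    throw new Error('Invalid kind ' + kind);
  return kinds[kind];
}
```

Whether the extra check is worth it depends on where the kind value comes from. For values that arrive from outside the application it is a cheap safeguard.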

Missing Objects

If similar conditionals appear in multiple places in my code, it is a sign that I am missing an object somewhere. Since Javascript is duck typed I can use the same technique as above to create objects instead of just functions.

var kinds = {
  baird: { act: bairdBehavior, info: bairdInfo },
  malayan: { act: malayanBehavior, info: malayanInfo },
  mountain: { act: mountainBehavior, info: mountainInfo },
  lowland: { act: lowlandBehavior, info: lowlandInfo }
};

var tapir = kinds[kind];
if (!tapir)
  throw new Error('Invalid kind of tapir ' + kind);

I prefer to have this kind of code on the borders of my application. That way the code inside my core domain doesn't have to deal with complicated conditional logic. Polymorphism for the win!

Null Objects

If I notice that in many places I have to check for nulls, it is usually a sign that I haven't handled the special null case properly. In the example above I have handled it properly since I throw an Error if the kind of tapir does not exist. But sometimes it is not an error when the value is missing.

// If a nonexistent kind is used, tapir will be undefined
var tapir = kinds[kind];

// In other places of the code
if (tapir)
  tapir.act();

// Somewhere else
if (tapir)
  return tapir.info();

This type of code is rather unpleasant and it is time to break out the Null Object.

var tapir = kinds[kind];
// If a nonexistent kind is used I use a Null Object
if (!tapir)
  tapir = { act: doNothing, info: unknownTapirInfo };

// In other places of the code the conditionals are gone.
tapir.act();

// Somewhere else, no special case here.
return tapir.info();

Null Objects are not appropriate everywhere, but I often find it very enlightening to have them in mind when I write code.

Summary

There are a lot of elegant ways to deal with conditional code in Javascript. I didn't even mention inheritance since it works similarly to the object approach I showed above. But if I need multiple instances of something I would of course use polymorphism through inheritance instead.
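For completeness, here is a rough sketch of the inheritance variant using prototypes. The constructor names and the info format are invented for this example and do not come from the code above.

```javascript
// Hypothetical base constructor; shared behavior lives on its prototype.
function Tapir(name) {
  this.name = name;
}
Tapir.prototype.info = function () {
  return this.name + ' is a ' + this.kind + ' tapir';
};

// Each kind inherits from Tapir and only supplies what differs.
function MountainTapir(name) {
  Tapir.call(this, name);
}
MountainTapir.prototype = Object.create(Tapir.prototype);
MountainTapir.prototype.constructor = MountainTapir;
MountainTapir.prototype.kind = 'mountain';

function BairdTapir(name) {
  Tapir.call(this, name);
}
BairdTapir.prototype = Object.create(Tapir.prototype);
BairdTapir.prototype.constructor = BairdTapir;
BairdTapir.prototype.kind = 'baird';
```

Calling info() on any instance picks up the right kind without a single conditional in the calling code.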

Friday, February 08, 2013

Web Workers

I recently wrote a program, Word Maestro, which requires extensive calculations in Javascript. The calculations, permutations and searching, are very CPU intensive and hang the GUI when performed in the foreground.

Web workers to the rescue! Web workers are supported by most modern browsers with the exception of IE (Surprise!). IE10 release candidate supports them, but it is not very widespread yet. More info can be found at Can I Use.

How do Web Workers Work?

A web worker is just a plain Javascript file with anything in it. If you start an empty file empty-worker.js it will start up just fine and do absolutely nothing.

// empty-worker.js

To start a web worker you create a new Worker and give the constructor a URL as the only parameter. The URL must have the same origin as the page loading the Worker.

// This code goes inside a script tag or in a file loaded by <script src>
// Start the empty worker, which does nothing.
var worker = new Worker('empty-worker.js');

In order to have any use for our worker we need it to communicate with us. The way a worker communicates is by sending messages. The method that does this is called postMessage(object). It takes any type of argument, primitives as well as arrays and objects.

// eager-worker.js
postMessage('I am eager for work!');
// self.postMessage('I am eager for work!'); // Safest way

It is also possible to prefix the call with this or self. They both refer to the same WorkerGlobalScope. self is the safest way since it will not change with the calling context the way this does.

Our eager-worker.js starts up and posts a message and we need to receive it. We can do that by setting the onmessage property on our worker reference.

// In script tag or file loaded by script tag
var worker = new Worker('eager-worker.js');
worker.onmessage = function(event) {
  console.log(event);
};

Reloading the page will result in the following output in the console. Notice that the data sent by the worker is available in the data property of the MessageEvent.

MessageEvent {ports: Array[0], data: "I am eager for work!", source: null, lastEventId: "", origin: ""}

An alternative way of attaching a listener to the worker is to use addEventListener('message', listener). Adding the event listener this way has the advantage of allowing us to attach multiple listeners to the same worker. I have not had the need for this yet.

worker.addEventListener('message', function(event) {
  console.log('One', event.data);
});
worker.addEventListener('message', function(event) {
  console.log('Two', event.data);
});

Reloading the page with the above code will result in two lines in the console log. Notice that I am only logging the data part of the event.

One I am eager for work!
Two I am eager for work!

Our eager-worker.js is really eager to work, so he keeps on telling his boss every second that he wants to work.

// eager-worker.js
setInterval(function() {
  postMessage('I am eager for work!');
}, 1000);

This of course annoys the boss tremendously so he decides to tell the worker to do something by sending him a message with postMessage.

// main.js
var worker = new Worker("eager-worker.js");
worker.onmessage = function(event) {
  console.log(event.data);
};
worker.postMessage('Stop bugging me and do something!');

Our eager-worker.js is not listening yet, so the boss can scream all he wants without any success. Let's change that by implementing the onmessage method in the worker as well. addEventListener also works.

// eager-worker.js
postMessage('I am eager for work!');

var timer = setInterval(function() {
  postMessage('I am eager for work!');
}, 1000);

onmessage = function(event) {
  clearInterval(timer);
  postMessage('Alright Boss!');
};

Now the output is less annoying.

I am eager for work!
Alright Boss!

Now that we know the basics of web workers, let's look at some other interesting issues that come up.

Debugging Web Workers

If you try to use console.log in your web workers you will get an error message such as this:

Uncaught ReferenceError: console is not defined

So this is an issue with web workers: it is not possible to use console or alert to debug them.

It is not a big problem because with Chrome it is possible to debug workers. In the lower right corner of the source tab of the Chrome Developer Tools, there is a Workers panel.

Checking the checkbox Pause on start will open up a new inspector window which allows us to debug the worker just as if it was a normally loaded script. Nice!

Errors

If there are script errors in the web worker, it will send back an error event instead of a message event. The errors can be handled via the onerror property or by subscribing to the error event.

worker.onerror = function(event) {
  console.log(event);
};
// or
worker.addEventListener('error', function(event) {
  console.log(event);
});

The above code will result in an event that looks like this, showing you the error message along with the filename and line number where the error occurred.

ErrorEvent {lineno: 6, filename: "http://localhost/web-workers/eager-worker.js", message: "Uncaught ReferenceError: missing is not defined", clipboardData: undefined, cancelBubble: false}

Web Worker Script Loading

A web worker can load additional scripts with importScripts(URL, ...). The URLs can be relative and, if so, are relative to the file doing the importing.

importScripts('../data/swedish-word-list.js', 'word-maestro.js', 'messageHandler.js');

A larger example

In this example I will show how easy it is to create a delegating worker that allows me to call normal methods on an object.

The messages are sent using a simple protocol with an object containing two properties.

// The message object
var message = {
  method: 'The method I wish to call',
  args:   ['An array of arguments']
};

main.js starts the delegating-worker.js and sends messages to it.

// main.js
var worker = new Worker("delegating-worker.js");
worker.onmessage = function(event) {
  console.log(event.data);
};

setInterval(function() {
  // Call the method echo with the argument ['Work']
  worker.postMessage({method: 'echo', args: ['Work']});
}, 4200);

setInterval(function() {
  // Call the method ohce with the argument ['Work']
  worker.postMessage({method: 'ohce', args: ['Work']});
}, 1100);

The delegating-worker.js loads the external script echo.js, which declares the variable Echo in the global worker scope. In the onmessage method I unpack the event and delegate the method call to Echo via apply. I use apply since I want to allow a variable number of arguments. The reply is sent back to main.js along with the method that was called.

// Declares Echo
importScripts('echo.js');

onmessage = function(event) {
  var method = event.data.method;
  var args = event.data.args;

  // I use apply to allow a variable number of arguments
  var reply = Echo[method].apply(Echo, args);
  self.postMessage({method: method, reply: reply});
};

The Echo service is a simple object with two methods.

var Echo = {
  // Return the word received.
  echo: function(word) {
    return word;
  },
  // Reverse the word and return it.
  ohce: function(word) {
    return word.split('').reverse().join('');
  }
};

Structuring the code in this way makes it easy to reuse the functionality in a non worker context.
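To illustrate the reuse, the delegation step from the worker's onmessage can be pulled out into a plain function and called directly, with no Worker involved. The helper name dispatch is an assumption made for this sketch. Echo is repeated from echo.js so the example stands on its own.

```javascript
// Same Echo object as in echo.js above.
var Echo = {
  echo: function (word) { return word; },
  ohce: function (word) { return word.split('').reverse().join(''); }
};

// Hypothetical helper: the delegation logic from the worker's
// onmessage, extracted so it can run outside a worker as well.
function dispatch(service, message) {
  var reply = service[message.method].apply(service, message.args);
  return { method: message.method, reply: reply };
}
```

In a browser without worker support, or in a unit test, dispatch(Echo, {method: 'ohce', args: ['Work']}) gives the same reply object that the worker would have posted back.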

Limitations of Web Workers

Since web workers are working in the background, they do not have access to the DOM, window, document or even the console. Any communication with these objects will have to be done by sending messages back to the main script.
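One way to work around the missing console is to send log messages back to the main script. The sketch below is only an illustration: the message shape {type: 'log', args: [...]} and the names createLogger and handleWorkerMessage are assumptions, not part of any worker API.

```javascript
// Hypothetical logging helper for use inside a worker. `post` is the
// function used to send messages (self.postMessage in a real worker).
function createLogger(post) {
  return function log() {
    var args = Array.prototype.slice.call(arguments);
    post({ type: 'log', args: args });
  };
}

// On the main side, a handler can route log messages to an output
// function such as console.log. Returns true if the message was a log.
function handleWorkerMessage(data, output) {
  if (data && data.type === 'log') {
    output.apply(null, data.args);
    return true;
  }
  return false; // not a log message, let other code handle it
}
```

Inside the worker, the logger would be created with a function that calls self.postMessage. In the main script, the worker's onmessage handler would pass event.data to handleWorkerMessage before doing anything else with it.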

Checking for Web Worker support

Checking for Web Worker support is easy. Just check if window.Worker is defined and show an error page or use an alternative solution if it is not.

function workersSupported() {
  return !!window.Worker;
}

if (!workersSupported()) {
  window.location = './unsupported-browser.html';
}

Wrap up

Using workers is easy. If you want to see a more thorough example, check out the source code (in Coffeescript) for Word-Maestro.

Friday, January 18, 2013

Index Your Database, Like a Boss

During the holidays I had the opportunity to read the wonderful book SQL Performance Explained, and now I want to share some of my newfound knowledge.

The book contains a lot more information such as how joins, datatypes, ranges, etc, affect performance. In this post I will write about indexing, what an index is, what should be indexed, when to use simple indexes and when to use composite.

I will be using a Postgres database since that is what I know best, but most of it can be applied to any database although the details may differ.

Why index?

There are at least two reasons for using an index. One is for making sure that invalid data is not entered. In that case a unique index is used as a constraint. The other reason for using indexes is what I'm writing about here: performance.

By tuning my database, I will not only improve the performance of single queries, I will also improve how many queries it will be able to handle at the same time. A well-tuned database uses fewer resources and does more with less.

What is an index?

An index is a B-Tree, a data structure that keeps data sorted and allows searches, sequential access, insertions and deletions in logarithmic time. --Wikipedia

The index is the reason why a database is fast. If I have a table with 1 million rows and I do a sequential search, I have to search through all the rows to know if something matches. If on the other hand I have an index on it I only have to search through log2(1 million) ~ 20 rows instead. Quite an improvement! And this improvement is for log2.

Databases expose this concept to a maximum extent and put as many entries as possible into each node—often hundreds. That means that every new index level supports a hundred times more entries. --SQL Performance Explained

This means a database has to search through log100(1 million) ~ 3 rows.
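The arithmetic can be checked with a few lines of code. This sketch counts how many levels are needed before a tree with a given number of entries per node can hold all rows; the function name treeDepth is made up, and whole-number multiplication is used to sidestep floating point rounding in logarithms.

```javascript
// Number of levels needed to reach one of `rows` entries when each
// node holds `fanout` entries, i.e. log base `fanout` of `rows`
// rounded up to a whole number.
function treeDepth(rows, fanout) {
  var depth = 0;
  var capacity = 1;
  while (capacity < rows) {
    capacity *= fanout;
    depth += 1;
  }
  return depth;
}
```

treeDepth(1000000, 2) gives 20 and treeDepth(1000000, 100) gives 3, matching the figures above.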

What should I index?

So, the million-dollar question, actually the e-book is only EUR 9.95 :), is: "What should I index?". The answer to this is, as always, it depends, (Damn consultants never willing to stand for anything!) :). But, it does not depend that much. I'm going to go out on a limb and write:

In a typical web application I will have 100 times more queries than inserts, so I index everything I put in my where clauses.

So, with this premise, it is as easy as analyzing my queries, finding all the where clauses and adding an index to each and every one of them.

Example

Let's say that we have a users table with four columns. The database usually adds an automatic index to the primary key.

A users table with four columns

  • id (primary key)
  • firstname
  • lastname
  • male_female

When I select a user by id, the primary key, this is how Postgres will execute it.

/* Explain how postgres will execute. */
explain select * from users where id = 1;

 Index Scan using users_pkey on users  (cost=0.00..9.37 rows=1 width=49)
   Index Cond: (id = 1)
/* Index Scan is good, it means Postgres is using the index. */

Postgres will use the users_pkey index and it will cost 9.37. (Cost is measured in units of disk page fetches).

Simple Indexes

If I instead try to find a user by firstname

/* Explain how postgres will execute. */
explain select * from users where firstname = 'Anders';

 Seq Scan on users  (cost=0.00..22805.00 rows=1 width=49)
   Filter: ((firstname)::text = 'Anders'::text)
/* Seq Scan is not good, it means Postgres is doing full table scan. */

Aiya! Postgres will do a full table scan at an estimated cost of 22805. That is more than 2000 times the cost of the index scan!

To fix this I need to put an index on the table.

/* Add index on firstname column of users. */
create index users_firstname on users (firstname);
CREATE INDEX

/* Explain how postgres will execute */
explain select * from users where firstname = 'Anders';

 Index Scan using users_firstname on users  (cost=0.00..9.81 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
/* Index Scan is good, it means Postgres is using the index. */

Zippedidoodaa! Postgres is performant again, by using my new index.

Alright, that was easy. So what happens if we want to find users by both firstname and lastname?

/* Explain how postgres will execute */
explain select * from users where firstname = 'Anders' and lastname = 'Janss';

 Index Scan using users_firstname on users  (cost=0.00..9.81 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
   Filter: ((lastname)::text = 'Janss'::text)
/* Index Scan is good, but the filter means Postgres is scanning the found data. */

The Filter above means that Postgres will scan the result for matching lastnames. We can do better. Let's add an index on lastname to see what happens.

/* Add index on lastname column of users. */
create index users_lastname on users(lastname);
CREATE INDEX

explain select * from users where firstname = 'Anders' and lastname = 'Janss';

 Index Scan using users_lastname on users  (cost=0.00..9.81 rows=1 width=49)
   Index Cond: ((lastname)::text = 'Janss'::text)
   Filter: ((firstname)::text = 'Anders'::text)
/* Index Scan is good, but the filter means Postgres is scanning the found data. */

Nothing improved! Postgres switched to the new index, but it still has to filter the rows it finds, at the same cost. Let's remove it again. My Mama always said, "Don't leave indexes around unless you can prove that they help you." :)

Composite Indexes

To solve the problem with AND clauses I need to create a composite index.

/* Create a composite index on firstname and lastname on users. */
create index users_firstname_lastname on users(firstname, lastname);
CREATE INDEX
explain select * from users where firstname = 'Anders' and lastname = 'Janss';

 Index Scan using users_firstname on users  (cost=0.00..9.81 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
   Filter: ((lastname)::text = 'Janss'::text)
/* Postgres is not using our new index, it prefers our other index */

WTF? Postgres is ignoring our composite index in favor of our firstname index. Why? Because it believes that it is faster! Is it? Let's find out. The \timing command turns on timing in psql.

\timing
Timing is on.

/* Select count instead to avoid printing everything to screen */
select count(*) from users where firstname = 'Anders' and lastname = 'Janss';
 count
-------
 12018
(1 row)
Time: 10.848 ms

Now I drop the firstname index to make sure that Postgres will use my index.

drop index users_firstname;
DROP INDEX

explain select count(*) from users where firstname = 'Anders' and lastname = 'Janss' ;

 Aggregate  (cost=10.70..10.71 rows=1 width=0)
   ->  Index Only Scan using users_firstname_lastname on users  (cost=0.00..10.69 rows=1 width=0)
         Index Cond: ((firstname = 'Anders'::text) AND (lastname = 'Janss'::text))
/* The calculated cost is higher than the cost for using the firstname index */


select count(*) from users where firstname = 'Anders' and lastname = 'Janss';
 count 
-------
 12018
(1 row)
Time: 9.566 ms

The query using my composite index is about 1 ms faster, so in this case Postgres was wrong, but for other data distributions my guess may be the wrong one. Another reason to always measure.

If I now perform the firstname query again, Postgres can use my composite index, but if I try to search for lastname it cannot. The order of the fields in the composite index matters.

explain select * from users where firstname = 'Anders' ;
 Index Scan using users_firstname_lastname on users  (cost=0.00..10.69 rows=1 width=49)
   Index Cond: ((firstname)::text = 'Anders'::text)
/* Using the composite index at a slightly higher cost than simple index. */

explain select * from users where lastname = 'Janss' ;

 Seq Scan on users  (cost=0.00..23263.25 rows=1 width=49)
   Filter: ((lastname)::text = 'Janss'::text)
/* The search for lastname has to do a full table scan. */

Alright, one last query before I'm done. What happens with an OR query?

explain select * from users where firstname = 'Anders' or lastname = 'Janss';

 Seq Scan on users  (cost=0.00..25818.30 rows=2 width=49)
   Filter: (((firstname)::text = 'Anders'::text) OR ((lastname)::text = 'Janss'::text))
/* The index is not used for the OR query. */

OR queries are like separate queries and they need separate indexes. I'll add an extra index on lastname and we should be good to go.

create index users_lastname on users(lastname);
CREATE INDEX

explain select * from users where firstname = 'Anders' or lastname = 'Janss';

 Bitmap Heap Scan on users  (cost=12.51..20.45 rows=2 width=49)
   Recheck Cond: (((firstname)::text = 'Anders'::text) OR ((lastname)::text = 'Janss'::text))
   ->  BitmapOr  (cost=12.51..12.51 rows=2 width=0)
         ->  Bitmap Index Scan on users_firstname_lastname  (cost=0.00..6.68 rows=1 width=0)
               Index Cond: ((firstname)::text = 'Anders'::text)
         ->  Bitmap Index Scan on users_lastname  (cost=0.00..5.82 rows=1 width=0)
               Index Cond: ((lastname)::text = 'Janss'::text)
/* Using both our composite index and the new lastname index. */

It worked! Postgres was able to use both our composite index and the new lastname index.

Summary

So to sum it up, to get a performant and scalable database:

  • Use indexes that cover all fields that are ANDed in where clauses.
  • Use separate indexes for fields that are in OR clauses.
  • Create reusable composite indexes.
    • When searching for (firstname, lastname) and firstname, add ONE composite index on (firstname, lastname).
    • When searching for (firstname, lastname) and lastname, add ONE composite index on (lastname, firstname).
    • When searching for (firstname, lastname), firstname, and lastname, add TWO indexes: one composite index on (firstname, lastname) (or reversed) and one simple index on lastname (or firstname).

I highly recommend the book. It is good!

Tuesday, November 27, 2012

Configure Git, Like a Boss

I have just created a Git presentation. The presentation is named Git, Practical Tips, and it contains good practices that I have picked up during my four years as a Git user.

The presentation consists of six parts: A quick introduction; History manipulation with merge, rebase and reset; Finding with Git; Configuration; Under the hood; and Interacting with GitHub.

If you find this interesting and would like to hear a very practical presentation about Git tips and tricks, feel free to contact me :)

In this post I will describe how to configure Git to work well from the command line. It consists of two main parts, Git configuration and Bash configuration.

I will only describe some select samples of my configuration here. If you want to see more, my configuration files are on GitHub.

Git Configuration

The Git configuration part is just a bunch of aliases I use. Some are simple and some are more advanced. The aliases are declared in my global Git config file, ~/.gitconfig, under the [alias] section. Here are some of the most important ones.

git add --patch

[alias]
ap = "add --patch"

git ap (git add --patch) is awesome. It lets me add selected parts of the changes in my working directory, allowing me to create a consistent commit with a simple, clear commit message.

git add --update

au = "add --update"

git au adds all the changed files to the index. I use it mainly when I forget to remove a file with git rm and instead remove it with rm. In this case Git will see that the file is missing but not staged for removal. When I run git au it will be added to the index as if I had used git rm in the first place.
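Here is a small sketch of that scenario in a throwaway repository. The file name, the commit message, and the user name and email passed with -c are made up for the example; it runs git add --update directly instead of the au alias so that it works without my configuration.

```shell
# Throwaway repository in a temporary directory (all names are made up)
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "hello" > notes.txt
git add notes.txt
git -c user.name="Example" -c user.email="example@example.com" commit -q -m "Add notes"

# Remove the file with plain rm instead of git rm
rm notes.txt
before=$(git status --short)   # the deletion is seen but not staged

# Stage the deletion, which is what the au alias does
git add --update
after=$(git status --short)    # the deletion is now staged

printf 'before: [%s]\nafter:  [%s]\n' "$before" "$after"
```

In the short status format the D moves from the second column (working directory) to the first column (index), which is the same state git rm would have left me in.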

git stash save -u

ss = "stash save -u"

git ss stashes everything in my working directory, including untracked files (-u). The reason I use git stash save instead of just git stash is that it allows me to write a message for the stash, similar to a commit message.

git amend

amend = "commit --amend -c HEAD"
amendc = "commit --amend -C HEAD"

git amend lets me add more changes to the previous commit. It is very useful when I forget to add a change to the index before I commit it. It amends the new changes in the index and lets me edit the old commit message. git amendc does the same thing but reuses the old commit message.

git alias

alias = "!git config -l | grep alias | cut -c 7-"

git alias shows me all my aliases. Starting with a bang (!) is necessary to execute arbitrary shell commands. Note that the git command must be included. The code in the alias means: list the configuration, find the aliases, and show characters 7 and on of each line.
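To see what the pipeline does without touching a real configuration, here is a minimal sketch that feeds it some simulated git config -l output. The sample entries are made up; only the grep and cut steps are taken from the alias.

```shell
# Simulated output of `git config -l` (the entries are made-up samples)
config_lines='user.name=Jane Doe
alias.ap=add --patch
alias.au=add --update'

# Keep the lines that mention alias, then drop the first six
# characters ("alias.") by printing from character 7 onwards
aliases=$(printf '%s\n' "$config_lines" | grep alias | cut -c 7-)
printf '%s\n' "$aliases"
# prints:
# ap=add --patch
# au=add --update
```

Note that grep alias matches the word anywhere on the line, so a config value that happens to contain "alias" would show up too, with its first six characters cut off.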

git log --diff-filter

fa = "log --diff-filter=A --summary"
fd = "log --diff-filter=D --summary"

git fa (find added) and git fd (find deleted) show me a log of commits where files were added and deleted respectively. It is great for finding out how and when my files got deleted. I use it with a filename, git fd -- my-missing-file.rb (the -- tells Git that what follows is a path, which is needed when the file no longer exists in the working tree), or with grep, git fd | grep -C 3 missing.

grep -C 3 means show me 3 lines of context around the matching line.
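A small sketch of finding a deleted file, again in a throwaway repository. The file name, the commit messages, and the commit helper function are made up for the example, and the log options are written out instead of using the fd alias.

```shell
# Throwaway repository in a temporary directory (all names are made up)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical helper: commit with a fixed identity so no global config is needed
commit() {
    git -c user.name="Example" -c user.email="example@example.com" commit -q -m "$1"
}

echo "hello" > notes.txt
git add notes.txt
commit "Add notes"
git rm -q notes.txt
commit "Remove notes"

# Same options as the fd alias; -- separates the path from revisions,
# which matters because the file no longer exists in the working tree
deleted_log=$(git log --diff-filter=D --summary -- notes.txt)
printf '%s\n' "$deleted_log"
```

The log contains only the "Remove notes" commit, and the --summary part adds a delete mode line naming the file.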

git log-pretty

l = "!git log-hist"
log-hist = "!git log-pretty --graph"
log-pretty = "log --pretty='format:%C(blue)%h%C(red)%d%C(yellow) %s %C(green)%an%Creset, %ar'"

git l is my main logging command, and it prints a beautiful compact log. When I reuse an alias I must use the shell command form of an alias, the bang (!), since Git does not allow me to reference one alias from another directly.

log               = The log command
--graph           = Text-based graphical representation
--pretty='format' = Format according to spec
%C(color)         = Change color
%h                = Abbreviated commit hash (6b266c2)
%d                = Ref names (HEAD, origin/master)
%s                = Subject (first line of commit message)
%an               = Author name
%ar               = Author date, relative

git log --simplify-by-decoration

lt = "!git log-hist --simplify-by-decoration"

git lt (log tagged) uses --simplify-by-decoration to show a list of "important" commits. Important in this case means commits that are pointed to by a branch or tagged. It reuses the log-hist alias above.

Bash Configuration

I used to have a bunch of Bash aliases, such as ga, gd, etc., but now I use my Git aliases instead. I still have configuration for command completion and a nice informative prompt.

function g()

I use the git command more than any other command during a day's work. git status is the subcommand I use the most. I have optimized for this by creating a function g() that has status as its default argument.

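# `g` is a shortcut for git, it defaults to `git s` (status) if no argument is given.
function g() {
    local cmd=${1-s}
    # Shift only when an argument was given; shifting an empty list fails
    [ $# -gt 0 ] && shift
    # Quote the arguments so that ones containing spaces survive intact
    git "$cmd" "$@"
}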

The g() function gives me a lot of power out of a single character.

$ g
## master
 M README.md
?? doc.md

$ g l
* 4f71f8d (HEAD, heroku/master, master) Send 404 for missing ...
* ec00879 Added support for options Anders Janmyr, 5 weeks ago
* 09c178f (origin/master) id cannot be a number Anders Janmyr, 6 weeks ago
* e561d03 Send status and send in one call Anders Janmyr, 6 weeks ago
* 9615be5 Added some more logging Anders Janmyr, 6 weeks ago
* de4730e Improved the code somewhat Anders Janmyr, 6 weeks ago
* 1f3f763 Added allow methods header Anders Janmyr, 6 weeks ago
* ca3065c Added filter to documentation Anders Janmyr, 6 we
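The default argument in g() comes from the ${1-s} expansion, which means "the first argument, or s if there is none". Here is a minimal sketch of just that argument handling. show_g is a hypothetical stand-in that prints what it would pass to git instead of running it, and guarding the shift is my choice for the sketch.

```shell
# Hypothetical stand-in for g(): same argument handling, but it prints
# the subcommand and the remaining arguments instead of running git
show_g() {
    local cmd=${1-s}          # first argument, or "s" when none is given
    [ $# -gt 0 ] && shift     # drop the subcommand from the argument list
    printf 'cmd=%s args=%s\n' "$cmd" "$*"
}

no_args=$(show_g)
with_args=$(show_g l -3)
printf '%s\n%s\n' "$no_args" "$with_args"
# prints:
# cmd=s args=
# cmd=l args=-3
```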

function gg()

My second (and last) function is gg().

# Commit pending changes and quote all arguments as message
function gg() {
    git ci -m "$*"
}

gg() allows me to type a commit message without any quotes.

$ gg Added todo list to the Readme
[master 98556af] Added todo list to the Readme
 1 file changed, 1 insertion(+)
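The trick in gg() is "$*", which joins all the arguments into one string separated by spaces. A minimal sketch, where show_message is a hypothetical stand-in that prints the message instead of committing:

```shell
# Hypothetical stand-in for gg(): prints the message it would commit with
show_message() {
    printf '%s\n' "$*"    # "$*" joins all arguments with single spaces
}

message=$(show_message Added todo list to the Readme)
spaced=$(show_message two     spaces)
printf '%s\n%s\n' "$message" "$spaced"
# prints:
# Added todo list to the Readme
# two spaces
```

The shell still parses the line before the function sees it, so runs of spaces collapse and characters such as apostrophes or semicolons still need quoting.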

bash-completion

Installing bash-completion gives me command completion for commands, subcommands and more.

# An example
$ git rem<TAB> sh<TAB> o<TAB>
# will complete to
$ git remote show origin

I use Homebrew to install Git, brew install git. It gives me a recent version of Git. It also installs git-completion.bash in /usr/local/etc/bash_completion.d/.

I use the same configuration on Ubuntu, so I check for the file in /etc/bash_completion.d/ too.

# Prefer /usr/local/etc but fall back to /etc
if [ -f /usr/local/etc/bash_completion.d/git-completion.bash ]; then
    source /usr/local/etc/bash_completion.d/git-completion.bash
elif [ -f /etc/bash_completion.d/git ]; then
    source /etc/bash_completion.d/git
fi
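If the list of candidate locations grows, the same "first file that exists wins" idea can be wrapped in a function. This is only a sketch: source_first is a hypothetical helper, not something that ships with Git or bash-completion, and the demo uses a temporary file in place of a real completion script.

```shell
# Hypothetical helper: source the first file that exists among the arguments
source_first() {
    local candidate
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            . "$candidate"
            return 0
        fi
    done
    return 1    # none of the candidates exist
}

# Demo with a temporary file standing in for a completion script
demo_file=$(mktemp)
echo 'loaded_marker=yes' > "$demo_file"
source_first /nonexistent/git-completion.bash "$demo_file"
printf 'loaded_marker=%s\n' "$loaded_marker"
# prints: loaded_marker=yes
```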

This is great, but what about my beautiful little g() function? How do I make it work with command completion? It turns out to be quite easy. Include the following little snippet in a configuration file, such as .bashrc.

# Set up git command completion for g
__git_complete g __git_main

The snippet reuses the functions __git_complete and __git_main, included with git-completion.bash, to make completion work with g too. Lovely!

bash-prompt

In later versions of Git, the prompt functionality has been extracted out into its own script, git-prompt.sh. I include it like this.

if [ -f /usr/local/etc/bash_completion.d/git-prompt.sh ]
then
    source /usr/local/etc/bash_completion.d/git-prompt.sh
fi

I configure my prompt like this. It contains a little more magic than the plain Git configuration. I put it in one of my bash configuration files, such as .bashrc.

function prompt {
  # Save the exit status of the last command before any test overwrites $?
  local exit_code=$?
  if [[ $exit_code -eq 0 ]]; then
    # If it is OK (0) color the prompt ($) green
    local status=""
    local sign=$(echo -ne "\[${GREEN}\]\$\[${NO_COLOR}\]")
  else
    # If not OK (not 0) color the prompt ($) red and set status to exit code
    local status=" \[${RED}\]${exit_code}\[${NO_COLOR}\] "
    local sign=$(echo -ne "\[${RED}\]\$\[${NO_COLOR}\]")
  fi
  # Get the current SHA of the repository
  local sha=$(git rev-parse --short HEAD 2>/dev/null)
  # Set the prompt
  # \!                 - history number
  # :                  - literal :
  # \W                 - Basename of current working directory
  # $sha               - The SHA calculated above
  # $(__git_ps1 '@%s') - literal @ followed by Git branch, etc.
  # $status            - The exit status calculated above
  # $sign              - The red or green prompt, calculated above
  export PS1="[\!:${LIGHT_GRAY}\W${NO_COLOR} $sha${GREEN}$(__git_ps1 '@%s')${NO_COLOR}$status]\n$sign "
}

# Tell bash to invoke the above function when printing the prompt
PROMPT_COMMAND='prompt'
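One detail worth knowing when a prompt function reads $?: every command overwrites it, including the test that checks it. The sketch below shows the difference between reading $? after the test and saving it first. exit_with, naive_prompt and saving_prompt are hypothetical names made up for the illustration, and the || is only there to run each function right after a failing command, the way bash runs PROMPT_COMMAND after my last command.

```shell
# Hypothetical helper: finish with the exit status given as first argument
exit_with() { return "$1"; }

# Reads $? again inside the else branch, after the test has replaced it
naive_prompt() {
    if [ "$?" -eq 0 ]; then
        naive="ok"
    else
        naive="failed with $?"    # $? is now the status of the test above
    fi
}

# Saves $? in a variable before running anything else
saving_prompt() {
    local exit_code=$?
    if [ "$exit_code" -eq 0 ]; then
        saved="ok"
    else
        saved="failed with $exit_code"
    fi
}

exit_with 3 || naive_prompt
exit_with 3 || saving_prompt
printf '%s\n%s\n' "$naive" "$saved"
# prints:
# failed with 1
# failed with 3
```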

The function __git_ps1() can be configured further with some environment variables. This is what I use.

# Git prompt config
export GIT_PS1_SHOWDIRTYSTATE=true
export GIT_PS1_SHOWUNTRACKEDFILES=true
export GIT_PS1_SHOWUPSTREAM="auto"
# export GIT_PS1_SHOWSTASHSTATE=true

The resulting prompt looks like this:

The different signs to the right indicate:

# * - Changed files in working dir
# + - Changed files in index
# % - Untracked files in working dir
# < - The branch is behind upstream
# > - The branch is ahead of upstream (Yes, it can be both)

More info can be found in git-prompt.sh.

Credits

Obviously I have not figured this out all by myself. Here are some of my sources: