Thursday, March 19, 2015

A Not Very Short Introduction to Docker

These are the notes that accompany my presentation called Docker, the Future of DevOps. It turned out, quite fittingly, to be a whale-sized article :).

What Is Docker And Why Should You Care?

Contrary to many others, I believe that describing Docker as a lightweight virtual machine is actually a very good description. Another way to look at Docker is as chroot on steroids. That explanation probably doesn't help much unless you know what chroot is.

Chroot is an operation that changes the apparent root directory for the current running process and its children. A program that is run in such a modified environment cannot access files and commands outside that environmental directory tree. This modified environment is called a chroot jail.

-- From Archwiki, chroot

VM vs. Docker

The image describes the difference between a VM and Docker. Instead of a hypervisor with Guest OSes on top, Docker uses a Docker engine and containers on top. Does this really tell us anything? What is the difference between a "hypervisor" and the "Docker engine"? A nice way of illustrating this difference is through listing the running processes on the Host.

The following simplified process trees illustrate the difference.

On the Host running the VM, only a single process is visible, even though many processes are running inside the VM.

# Running processes on Host for a VM
$ pstree VM

-+= /VirtualBox.app
|--= coreos-vagrant

On the Host running the Docker Engine all the contained processes are visible. The contained processes are running directly on the Host! They can be inspected and manipulated with normal commands like ps and kill.

# Running processes on Host for a Docker Engine
$ pstree docker

-+= /docker
|--= /bin/sh
|--= node server.js
|--= go run app
|--= ruby server.rb
...
|--= /bin/bash

Now that everything is crystal clear, what does this mean? It means that Docker containers are smaller, faster, and more easily integrated with each other than VMs, as the comparison below illustrates.

The size of a small virtual machine image with Core OS is about 1.2 GB. The size of a small container with busybox is 2.5 MB.

The startup time of a fast virtual machine is measured in minutes. The startup time of a container is often less than a second.

Integrating virtual machines running on the same host must be done by setting up the networking properly. Integrating containers is supported by Docker out of the box.

So, containers are lightweight, fast and easily integrated, but that is not all.

Docker is a Contract

Docker is also the contract between Developers and Operations. Developers and Operations often have very different attitudes when it comes to choosing tools and environments.

Developers want to use the next shiny thing, we want to use Node.js, Rust, Go, Microservices, Cassandra, Hadoop, blablabla, blablabla, ...

Operations want to use the same as they used yesterday, what they used last year, because it is proven, it works!

(Yes, I know this is stereotypical, but there is some truth in it :)

But, this is where Docker shines. Operations are satisfied because they only have to care about one thing. They have to support deploying containers. Developers are also happy. They can develop with whatever the fad of the day is and then just stick it into a container and throw it over the wall to Operations. Yippie ki-yay!

But, it does not end here. Since Operations are usually better than Developers when it comes to optimizing for production, they can help developers build optimized containers that can be used for local development. Not a bad situation at all.

Better Utilization

A few years ago, before virtualization, when we needed to create a new service, we had to acquire an actual machine, hardware. It could take months, depending on the processes of the company you were working for. Once the server was in place we created the service, and most of the time it did not become the success we were hoping for. The machine was ticking along with a CPU utilization of 5%. Expensive!

Then, virtualization entered the arena and it was possible to spin up a new machine in minutes. It was also possible to run multiple virtual machines on the same hardware, so the utilization increased from 5%. But, we still needed a virtual machine per service, so we could not utilize the machine as much as we would want.

Containerization is the next step in this process. Containers can be spun up in seconds and they can be deployed at a much more granular level than virtual machines.

Dependencies

It is indeed nice that Docker can help us speed up our slow virtual machines but why can't we just deploy all our services on the same machine? You already know the answer, dependency hell. Installing multiple independent services on a single machine, real or virtual, is a recipe for disaster. Docker Inc. calls this the matrix of hell.

Docker eliminates the matrix of hell by keeping the dependencies contained inside the containers.

Speed

Speed is of course always nice, but being 100 times faster is not only nice, it changes what is possible. An increase of this magnitude enables entirely new possibilities. It is now possible to create throw-away environments. Need to change your entire development environment from Golang to Clojure? Fire up a container. Need to provide a production database for integration and performance testing? Fire up a container. Need to switch the entire production server from Apache to Nginx? Fire up a container!
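
A throw-away environment is never more than a one-liner away. A small sketch, assuming the official golang, clojure, postgres, and nginx images from Docker Hub:

# Disposable Go or Clojure development environments, gone when you exit
$ docker run -it --rm golang bash
$ docker run -it --rm clojure bash

# A disposable Postgres database for integration testing
$ docker run -d --name testdb postgres

# Try out Nginx without touching anything else on the Host
$ docker run -d -p 8080:80 nginx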

How Does Docker Work?

Docker is implemented as a client-server system; the Docker daemon runs on the Host and is accessed via a socket connection from the client. The client may, but does not have to, be on the same machine as the daemon. The Docker CLI client works the same way as any other client but it is usually connected through a Unix domain socket instead of a TCP socket.

The daemon receives commands from the client and manages the containers on the Host where it is running.
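
In practice this means that the same CLI can talk to a local or a remote daemon; it simply follows the DOCKER_HOST variable (or the -H flag). A sketch, where the TCP address is only an example:

# Talk to the local daemon over the default Unix domain socket
$ docker -H unix:///var/run/docker.sock ps

# Point the same client at a remote daemon over TCP
$ export DOCKER_HOST=tcp://192.168.59.103:2375
$ docker ps   # now lists the containers on the remote Host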

Docker Concepts and Interactions

  • Host, the machine that is running the containers.
  • Image, a hierarchy of files, with meta-data for how to run a container.
  • Container, a contained running process, started from an image.
  • Registry, a repository of images.
  • Volume, storage outside the container.
  • Dockerfile, a script for creating images.

We can build an image from a Dockerfile. We can also create an image by committing a running container. The image can be tagged and it can be pushed to and pulled from a registry. A container is started by running or creating an image. A container can be stopped and started. It can be removed with rm.
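
Put together, a typical round trip through these concepts might look like the following sketch, where myuser/myimage is a made-up image name:

# Build an image from the Dockerfile in the current directory and tag it
$ docker build -t myuser/myimage .

# Push the image to a registry, and pull it on another Host
$ docker push myuser/myimage
$ docker pull myuser/myimage

# Run a container from the image, stop it, and remove it
$ docker run -d --name myapp myuser/myimage
$ docker stop myapp
$ docker rm myapp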

Images

An image is a file structure, with meta-data for how to run a container. The image is built on a union filesystem, a filesystem built out of layers. Every command in the Dockerfile creates a new layer in the filesystem.

When a container is started all the layers are merged together into what appears to the process as a unified file system. When files are removed in the union file system they are only marked as deleted. The files will still exist in the layer where they were last present.

# Commands for interacting with images
$ docker images  # shows all images.
$ docker import  # creates an image from a tarball.
$ docker build   # creates image from Dockerfile.
$ docker commit  # creates image from a container.
$ docker rmi     # removes an image.
$ docker history # list changes of an image.
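
The layering is easy to observe with docker history, which lists one line per layer together with its size and the command that created it. For example, for the debian:jessie image used below:

# Show the layers of an image, most recent layer first
$ docker history debian:jessie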

Image Sizes

Here are some data on commonly used images:

  • scratch - this is the ultimate base image and it has 0 files and 0 size.
  • busybox - a minimal Unix weighing in at 2.5 MB and around 10000 files.
  • debian:jessie - the latest Debian is 122 MB and around 18000 files.
  • ubuntu:14.04 - Ubuntu is 188 MB and has around 23000 files.

Creating images

Images can be created with docker commit container-id, docker import url-to-tar, or docker build -f Dockerfile .

# Creating an image with commit
$ docker run -i -t debian:jessie bash
root@e6c7d21960:/# apt-get update
root@e6c7d21960:/# apt-get install postgresql
root@e6c7d21960:/# apt-get install node
root@e6c7d21960:/# node --version
root@e6c7d21960:/# curl https://iojs.org/dist/v1.2.0/iojs-v1.2.0-linux-x64.tar.gz -o iojs.tgz
root@e6c7d21960:/# tar xzf iojs.tgz
root@e6c7d21960:/# ls
root@e6c7d21960:/# cd iojs-v1.2.0-linux-x64/
root@e6c7d21960:/# ls
root@e6c7d21960:/# cp -r * /usr/local/
root@e6c7d21960:/# iojs --version
1.2.0
root@e6c7d21960:/# exit
$ docker ps -l -q
e6c7d21960
$ docker commit e6c7d21960 postgres-iojs
daeb0b76283eac2e0c7f7504bdde2d49c721a1b03a50f750ea9982464cfccb1e

As you can see from the above session, it is possible to create images with docker commit, but it is kind of messy and hard to reproduce. It is better to create images with Dockerfiles since they are clear and easily reproduced.

FROM debian:jessie
# Dockerfile for postgres-iojs

RUN apt-get update
RUN apt-get install -y postgresql
RUN curl https://iojs.org/dist/iojs-v1.2.0.tgz -o iojs.tgz
RUN tar xzf iojs.tgz
RUN cp -r iojs-v1.2.0-linux-x64/* /usr/local

Build it with

$ docker build -t postgres-iojs .

Since every command in the Dockerfile creates a new layer it is often better to run similar commands together. Group the commands with && and split them over several lines for readability.

FROM debian:jessie
# Dockerfile for postgres-iojs

RUN apt-get update && \
  apt-get install -y postgresql && \
  curl https://iojs.org/dist/iojs-v1.2.0.tgz -o iojs.tgz && \
  tar xzf iojs.tgz && \
  cp -r iojs-v1.2.0-linux-x64/* /usr/local

The ordering of the lines in the Dockerfile is important as Docker caches the intermediate images, in order to speed up image building. Order your Dockerfile by putting the lines that change more often at the bottom of the file. ADD and COPY get special treatment from the cache and are re-run whenever an affected file changes even though the line does not change.

Dockerfile Commands

The Dockerfile supports 13 commands. Some of the commands are used when you build the image and some are used when you run a container from the image. Here is a table of the commands and when they are used.

BUILD Commands

  • FROM - The image the new image will be based on.
  • MAINTAINER - Name and email of the maintainer of this image.
  • COPY - Copy a file or a directory into the image.
  • ADD - Same as COPY, but handles URLs and unpacks tarballs automatically.
  • RUN - Run a command inside the container, such as apt-get install.
  • ONBUILD - Run commands when building an inherited Dockerfile.
  • .dockerignore - Not a command, but it controls what files are added to the build context. Should include .git and other files not needed when building the image.

RUN Commands

  • CMD - Default command to run when running the container. Can be overridden with command line parameters.
  • ENV - Set environment variables in the container.
  • EXPOSE - Expose ports from the container. They must be explicitly published to the Host with -p or -P when the container is run.
  • VOLUME - Specify that a directory should be stored outside the union file system. If it is not set with docker run -v it will be created in /var/lib/docker/volumes.
  • ENTRYPOINT - Specify a command that is not overridden by giving a new command with docker run image cmd. It is mostly used to give a default executable and use commands as parameters to it, as the sketch after this list shows.
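
A rough sketch of the difference between CMD and ENTRYPOINT, using the official ubuntu image, whose default CMD is bash:

# The trailing arguments replace the image's CMD (bash in this case)
$ docker run --rm ubuntu echo 'this replaces CMD'

# Trailing arguments do not replace an ENTRYPOINT, they are passed to it.
# The ENTRYPOINT itself can only be changed with --entrypoint
$ docker run --rm --entrypoint /bin/echo ubuntu 'these words become arguments to echo'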

Both BUILD and RUN Commands

  • USER - Set the user for RUN, CMD and ENTRYPOINT.
  • WORKDIR - Sets the working directory for RUN, CMD, ENTRYPOINT, ADD and COPY.

Running Containers

When a container is started, the process gets a new writable layer in the union file system where it can execute.

Since version 1.5, it is also possible to make this top layer read-only, forcing us to use volumes for all file output such as logs and temp files.

# Commands for interacting with containers
$ docker create  # creates a container but does not start it.
$ docker run     # creates and starts a container.
$ docker stop    # stops it.
$ docker start   # will start it again.
$ docker restart # restarts a container.
$ docker rm      # deletes a container.
$ docker kill    # sends a SIGKILL to a container.
$ docker attach  # will connect to a running container.
$ docker wait    # blocks until container stops.
$ docker exec    # executes a command in a running container.

docker run

As the list above describes, docker run is the command used to start new containers. Here are some common ways to run containers.

# Run a container interactively
$ docker run -it --rm ubuntu

This is the way to run a container if you want to interact with it as a normal terminal program. If you want to pipe into the container, you should not use the -t option.

  • --interactive (-i) - send stdin to the process.
  • --tty (-t) - tell the process that a terminal is present. This affects how the process outputs data and how it treats signals such as (Ctrl-C).
  • --rm - remove the container on exit.
  • ubuntu - use the ubuntu:latest image.

# Run a container in the background
$ docker run -d hadoop

  • --detach (-d) - Run in detached mode, you can attach again with docker attach.

docker run --env

# Run a named container and pass it some environment variables
$ docker run \
  --name mydb \
  --env MYSQL_USER=db-user \
  -e MYSQL_PASSWORD=secret \
  --env-file ./mysql.env \
  mysql

  • --name - name the container, otherwise it gets a random name.
  • --env (-e) - Set an environment variable in the container.
  • --env-file - Set all environment variables listed in env-file.
  • mysql - use the mysql:latest image.

docker run --publish

# Publish container port 80 on a random port on the Host
$ docker run -p 80 nginx

# Publish container port 80 on port 8080 on the Host
$ docker run -p 8080:80 nginx

# Publish container port 80 on port 8080 on the localhost interface on the Host
$ docker run -p 127.0.0.1:8080:80 nginx

# Publish all EXPOSEd ports from the container on random ports on the Host
$ docker run -P nginx

The nginx image, for example, exposes ports 80 and 443.

  1 FROM debian:wheezy
  2
  3 MAINTAINER NGINX "docker-maint@nginx.com"
 21
 22 EXPOSE 80 443
 23

docker run --link

# Start a postgres container, named mydb
$ docker run --name mydb postgres

# Link mydb as db into myapp
$ docker run --link mydb:db myapp

Linking a container sets up networking from the linking container into the linked container. It does two things:

  • It updates /etc/hosts in the linking container with the link name given to the container, db in the example above, making it possible to access the linked container by the name db. This is very good.
  • It creates environment variables for the EXPOSEd ports. This is practically useless since I can access the same port by using a hostname:port combination anyway.

The linked networking is not constrained by the ports EXPOSEd by the image. All ports are available to the linking container.
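
A minimal sketch of what this looks like, assuming the mydb postgres container above is running; the busybox container can reach it by the link name db and sees the generated environment variables:

# The link name resolves via /etc/hosts inside the linking container
$ docker run --rm --link mydb:db busybox ping -c 1 db

# The EXPOSEd ports show up as DB_* environment variables
$ docker run --rm --link mydb:db busybox env | grep DB_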

docker run limits

It is also possible to limit how much access the container has to the Host's resources.

# Limit the amount of memory
$ docker run -m 256m yourapp

# Limit the number of shares of the CPU this process uses (out of 1024)
$ docker run --cpu-shares 512 myapp

# Change the user for the process to www instead of root (good for security)
$ docker run -u=www nginx

Setting CPU shares to 512 out of 1024 does not mean that the process gets access to half of the CPU; it means that it gets half as many shares as a container that is run without any limit. If we have two containers running with 1024 shares and one with 512 shares, the 512-container will get about a fifth of the CPU time.

docker exec container

docker exec allows us to run commands inside already running containers. This is very good for debugging among other things.

# Run a shell inside the container with id 6f2c42c0
$ docker exec -it 6f2c42c0 sh

Volumes

Volumes provide persistent storage outside the container. That means the data will not be included if you commit the container to a new image.

# Start a new nginx container with /var/log as a volume
$ docker run  -v /var/log nginx

Since no directory on the Host is given, the volume is created in /var/lib/docker/volumes/ec3c543bc..535.

The exact name of the directory can be found by running docker inspect container-id.

# Start a new nginx container with /var/log as a volume mapped to /tmp on Host
$ docker run -v /tmp:/var/log nginx

It is also possible to mount volumes from another container with --volumes-from.

# Start a db container
$ docker run -v /var/lib/postgresql/data --name mydb postgres

# Start a backup container with the volumes taken from the mydb container
$ docker run --volumes-from mydb backup
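
The backup image above is hypothetical. A concrete sketch of the same idea is to use a plain busybox container that tars the volume into a directory mounted from the Host:

# Archive the mydb volume into the current directory on the Host
$ docker run --rm --volumes-from mydb -v $(pwd):/backup busybox \
    tar cvf /backup/pgdata.tar /var/lib/postgresql/data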

Docker Registries

Docker Hub is the official registry for images. It supports public (free) and private (paid) repositories. Repositories can be tagged as official, which means that they are curated by the maintainers of the project (or someone connected with it).

Docker Hub also supports automatic builds of projects hosted on Github and Bitbucket. If automatic build is enabled an image will automatically be built every time you push to your source code repository.

If you don't want to use automatic builds, you can also docker push directly to Docker Hub. docker pull will pull images. docker run with an image that does not exist locally will automatically initiate a docker pull.
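
The manual workflow is just a tag and a push; a sketch, assuming a Docker Hub account called myuser and the postgres-iojs image from earlier:

# Log in, tag the local image with your user name, and push it
$ docker login
$ docker tag postgres-iojs myuser/postgres-iojs
$ docker push myuser/postgres-iojs

# Pull it on another machine (docker run does this implicitly if needed)
$ docker pull myuser/postgres-iojs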

It is also possible to host your images elsewhere. Docker maintains code for docker-registry on Github. But, I have found it to be slow and buggy.

Quay, Tutum, and Google also provide hosting of private Docker images.

Inspecting Containers

A lot of commands are available for inspecting containers:

$ docker ps      # shows running containers.
$ docker inspect # info on a container (incl. IP address).
$ docker logs    # gets logs from container.
$ docker events  # gets events from container.
$ docker port    # shows public facing port of container.
$ docker top     # shows running processes in container.
$ docker diff    # shows changed files in container's FS.
$ docker stats   # shows metrics: memory, cpu, filesystem.

I will only elaborate on docker ps and docker inspect since they are the most important ones.

# List all containers, (--all means including stopped)
$ docker ps --all
CONTAINER ID   IMAGE            COMMAND    NAMES
9923ad197b65   busybox:latest   "sh"       romantic_fermat
fe7f682cf546   debian:jessie    "bash"     silly_bartik
09c707e2ec07   scratch:latest   "ls"       suspicious_perlman
b15c5c553202   mongo:2.6.7      "/entrypo  some-mongo
fbe1f24d7df8   busybox:latest   "true"     db_data


# Inspect the container named silly_bartik
# Output is shortened for brevity.
$ docker inspect silly_bartik
    1 [{
    2     "Args": [
    3         "-c",
    4         "/usr/local/bin/confd-watch.sh"
    5     ],
    6     "Config": {
   10         "Hostname": "3c012df7bab9",
   11         "Image": "andersjanmyr/nginx-confd:development",
   12     },
   13     "Id": "3c012df7bab977a194199f1",
   14     "Image": "d3bd1f07cae1bd624e2e",
   15     "NetworkSettings": {
   16         "IPAddress": "",
   18         "Ports": null
   19     },
   20     "Volumes": {},
   22 }]

Tips and Tricks

Getting the id of a container is useful for scripting.

# Get the id (-q) of the last (-l) run container
$ docker ps -l -q
c8044ab1a3d0

docker inspect can take a format string, a Go template, and it allows you to be more specific about what data you are interested in. Again, useful for scripting.

$ docker inspect -f '{{ .NetworkSettings.IPAddress }}' 6f2c42c05500

172.17.0.11
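
The two commands combine nicely in scripts, for example:

# Print the name and IP address of every running container
$ for id in $(docker ps -q); do docker inspect -f '{{ .Name }}: {{ .NetworkSettings.IPAddress }}' $id; done

# Stop the most recently started container
$ docker stop $(docker ps -l -q)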

Use docker exec to interact with a running container.

# Get the environment variables of a running container.
$ docker exec -it 6f2c42c05500 env

PATH=/usr/local/sbin:/usr...
HOSTNAME=6f2c42c05500
REDIS_1_PORT=tcp://172.17.0.9:6379
REDIS_1_PORT_6379_TCP=tcp://172.17.0.9:6379
...

Use volumes to avoid having to rebuild an image every time you run it. Every time the below Dockerfile is built it copies the current directory into the container.

  1 FROM dockerfile/nodejs:latest
  2
  3 MAINTAINER Anders Janmyr "anders@janmyr.com"
  4 RUN apt-get update && \
  5   apt-get install zlib1g-dev && \
  6   npm install -g pm2 && \
  7   mkdir -p /srv/app
  8
  9 WORKDIR /srv/app
 10 COPY . /srv/app
 11
 12 CMD pm2 start app.js -x -i 1 && pm2 logs
 13
# Build and run the image
$ docker build -t myapp .
$ docker run -it --rm myapp

To avoid the rebuild, build the image once and then mount the local directory when you run it.

$ docker run -it --rm -v $(PWD):/srv/app myapp

Security

You may have heard that it is not secure to use Docker. This is not untrue, but it does not have to be a problem.

The following security problems currently exist with Docker.

  • Image signatures are not properly verified.
  • If you have root in a container you can, potentially, get root on the entire box.

Security Remedies

  • Use trusted images from your private repositories.
  • Don't run containers as root, if possible.
  • Treat root in a container as root outside a container.

If you own all the containers running on the server, you don't have to worry about them interacting with each other maliciously.

Container "Options"

I put "options" in quotes since there are not really any options at the moment, but a lot of players want to get in the game. Ubuntu is working on something called LXD and Microsoft on something called Drawbridge. But, the one that seems most interesting is the one called Rocket.

Rocket is developed by CoreOS, which is a big container (Docker) platform. The reason for developing it is that they feel that Docker Inc. is bloating Docker and also moving into the same area as CoreOS, which is container hosting in the cloud.

With this new container specification they are trying to remove some of the warts which Docker has for historical reasons and to provide a simple container with support for socket activation and security built in from the start.

Orchestration

When we split up our application into multiple different containers we get some new problems. How do we make the different parts talk to each other? On a single host? On multiple hosts?

Docker solves the problem of orchestration on a single host with links.

To simplify the linking of containers Docker provides a tool called docker-compose. It was previously called fig and was developed by another company which was recently acquired by Docker.

docker-compose

docker-compose declares the information for multiple containers in a single file, docker-compose.yml. Here is an example of a file that manages two containers, web and redis.

  1 web:
  2   build: .
  3   command: python app.py
  4   ports:
  5    - "5000:5000"
  6   volumes:
  7    - .:/code
  8   links:
  9    - redis
 10 redis:
 11   image: redis

To start the above containers, you can run the command docker-compose up.

  1 $ docker-compose up
  2 Pulling image orchardup/redis...
  3 Building web...
  4 Starting figtest_redis_1...
  5 Starting figtest_web_1...
  6 redis_1 | [8] 02 Jan 18:43:35.576 # Server
  7 started, Redis version 2.8.3
  8 web_1   |  * Running on http://0.0.0.0:5000/

It is also possible to start the containers in detached mode with docker-compose up -d and I can find out what containers are running with docker-compose ps.

  1 $ docker-compose up -d
  2 Starting figtest_redis_1...
  3 Starting figtest_web_1...
  4 $ docker-compose ps
  5 Name              Command                    State   Ports
  6 ------------------------------------------------------------
  7 figtest_redis_1   /usr/local/bin/run         Up
  8 figtest_web_1     /bin/sh -c python app.py   Up      5000->5000

It is possible to run commands that work with a single container or commands that work with all containers at once.

  1 # Get env variables for web container
  2 $ docker-compose run web env

  3 # Scale to multiple containers
  4 $ docker-compose scale web=3 redis=2

  5 # Get logs for all containers
  6 $ docker-compose logs

As you can see from the above commands, scaling is supported. The application must be written in a way that can handle multiple containers. Load-balancing is not supported out of the box.

Docker Hosting

A number of companies want to get in on the business of hosting Docker in the cloud. The image below shows a collection.

These providers try to solve different problems, from simple hosting to becoming a "cloud operating system". I will only elaborate on two of them.

Core OS

As the image shows, Core OS is a collection of services to enable hosting of multiple containers in a Core OS cluster.

  • The Core OS Linux distribution is a stripped down Linux. It uses 114MB of RAM on initial boot. It does not provide a package manager, since it uses Docker or their own Rocket container to run everything.
  • Core OS uses Docker (or Rocket) to install an application on a host.
  • It uses systemd as init-service since it has great performance, handles start-up dependencies well, has great logging, and supports socket-activation.
  • etcd is a distributed, consistent key value store for shared configuration and service discovery.
  • fleet is a cluster manager. It is an extension of systemd to work with multiple machines. It uses etcd to manage configuration and it runs on every Core OS machine; the sketch after this list shows what interacting with etcd and fleet looks like.
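
A rough sketch of what this looks like from a Core OS host; the key and the unit name myapp.service are made up:

# Write and read a value in the distributed key value store
$ etcdctl set /config/dbhost 172.17.8.101
$ etcdctl get /config/dbhost

# Schedule a systemd unit somewhere in the cluster and list the units
$ fleetctl start myapp.service
$ fleetctl list-units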

AWS

It is possible to host Docker containers on Amazon in two ways.

  • Elastic Beanstalk can deploy Docker containers. This works fine but I find it to be very slow. A new deploy takes several minutes and it does not feel right when a container can be started in seconds.
  • ECS, Elastic Container Service, is Amazon's upcoming container cluster solution. It is currently in preview 3 and it looks very promising. Just as with Amazon's other services, you interact with it through simple web service calls.

Summary

  • Docker is here to stay.
  • It fixes dependency hell.
  • Containers are fast!
  • Cluster solutions exist, but don't expect them to be seamless, yet!

Tuesday, December 16, 2014

Lambda, Javascript Micro-Services on AWS

Amazon just released a bunch of new services. My favorite is Lambda. Lambda allows me to deploy simple micro-services without having to setup any servers at all. Everything is hosted in the AWS cloud. Another cool thing about Lambda services is that the default runtime is Node.js!

To get access to AWS Lambda, you have to sign in to the AWS Console and select the Lambda service. You have to fill out a form to request access, which may take a while to come through. Once you have access you can edit the functions in a web form.

A lambda service is a Node module which exports an object with one function, the handler. In the AWS examples this is usually called handler and I'm going to follow their example.

Here is a simple function that can be edited and invoked in the online Lambda Edit/Test tool.

// hello-event.js
exports.handler = function(event, context) {
  console.log('Hello', event);
  context.done(null, 'Success');
}

The event is any JSON object and, since a String is a valid JSON object, the function can be invoked with "Tapir", which results in the following output in the Lambda tool.

Logs
----
START RequestId: 3e21d80e-7e31-11e4-912c-2f870de05098
2014-12-07T16:51:47.163Z 3e21d80e-7e31-11e4-912c-2f870de05098 Hello Tapir
END RequestId: 3e21d80e-7e31-11e4-912c-2f870de05098
REPORT RequestId: 3e21d80e-7e31-11e4-912c-2f870de05098 Duration: 3.89 ms Billed Duration: 100 ms  Memory Size: 128 MB Max Memory Used: 9 MB
Message
-------
Success

Working in the Lambda online tool is sufficient for simple examples but quickly gets annoying. Once you need to add extra modules you have to upload zip archives, which is both error prone and tedious. Here is a simple script to zip the relevant files and upload them to Lambda. Make sure to update the region and the role to your own specific properties.

#!/bin/bash
#
# upload-lambda.sh
# Zip and upload lambda function
#

program=`basename $0`

set -o errexit

function usage() {
  echo "Usage: $program <function.js>"
}

if [ $# -lt 1 ]
then
  echo 'Missing required parameters'
  usage
  exit 1
fi

main=${1%.js}
file="./${main}.js"
zip="./${main}.zip"

role='arn:aws:iam::638281126589:role/lambda_exec_role'
region='eu-west-1'

zip_package() {
  zip -r $zip $file lib node_modules
}

upload_package() {
  aws lambda upload-function \
     --region $region \
     --role $role\
     --function-name $main  \
     --function-zip $zip \
     --mode event \
     --handler $main.handler \
     --runtime nodejs \
     --debug \
     --timeout 10 \
     --memory-size 128
}

# main
zip_package
upload_package
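
Assuming the script is saved as upload-lambda.sh, is made executable, and the AWS CLI is configured with credentials that are allowed to use Lambda, uploading the earlier example looks like this:

# Zip and upload the hello-event function
$ ./upload-lambda.sh hello-event.js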

A Larger Example

Now that I know that Lambda works, it is time to try out something more elaborate. I have read that it is not only possible to use npm modules, but that I also have access to the operating system when writing my service.

My bigger example consists of something I often have use for, a way to serve media files so that I don't have to check them into git. The way I want to do this is to upload a tarball to S3 and then have Lambda unpack the archive, checksum the files and upload them into another bucket.

Something like this:

  • React to the ObjectCreated:Put event
  • Download the tarball from S3
  • Extract tarball into temp directory
  • Checksum the files and rename them with the checksum
  • Upload the checksummed file to another S3 bucket
  • Upload an index of the files with a mapping from old to new filenames.

React to ObjectCreated:Put event

An AWS S3 ObjectCreated:Put event looks something like this in a trimmed-down format:

{
  "Records": [ {
      "eventVersion": "2.0",
      "eventSource": "aws:s3",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": {
          "name": "anders-source",
        },
        "object": {
          "key": "tapirs.tgz",
          "size": 1024,
          "eTag": "d41d8cd98f00b204e9800998ecf8427e"
        }
      }
    }
  ]
}

To handle this event we need a handler function. All the handler needs to do is to extract the relevant properties from the event and then call assetify, which will do the rest of the work. Breaking up the code like this allows me to use assetify locally and not only as a Lambda handler.

assetify.handler = function(event, context) {
    console.log('Received event:');
    console.log(JSON.stringify(event, null, '  '));

    var bucket = event.Records[0].s3.bucket.name;
    var key = event.Records[0].s3.object.key;
    assetify(bucket, key, function(err, result) {
        context.done(err, util.inspect(result));
    });
};

assetify

In order to use assetify as a normal module on a local machine I export the function with module.exports. This code needs to come before the assetify.handler declaration above. When exported this way, it is possible to require the function without involving Lambda.

function assetify(sourceBucket, key, callback) {
    var tgzRegex = new RegExp('\\.tgz');
    if (!key.match(tgzRegex)) return callback('no match');
    var prefix = path.basename(key, '.tgz');

    async.waterfall([
        downloadFile.bind(null, sourceBucket, key),
        extractTarBall,
        checksumFiles,
        uploadFiles.bind(null, prefix),
        uploadIndex.bind(null, prefix)
    ], function(err, result) {
        if (err) return callback(err);
        callback(null, result);
    });
}

module.exports = assetify;
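
Exported like this, the function can be exercised without Lambda at all. A sketch, assuming the module is saved as assetify.js, the npm dependencies are installed, and the local AWS credentials can read and write the buckets:

# Run assetify directly against a tarball that is already in the source bucket
$ node -e "require('./assetify')('anders-source', 'tapirs.tgz', console.log)"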

I'm using async.waterfall in combination with bind to get a nice flat structure of the code which clearly resembles the described flow above.

Download file

The downloadFile function uses a nice feature of s3.getObject, streaming. After creating a temporary file with tmp.file, I create a request and then I stream the contents from S3 directly into a write stream. Very nice! I also need to hook up some event handlers to notify the callback once the streaming is complete.

function downloadFile(sourceBucket, key, callback) {
    console.log('downloadFile', sourceBucket, key)
    tmp.file({postfix: '.tgz'}, function tmpCreated(err, tmpfile) {
        if (err) return callback(err);
        var awsRequest = s3.getObject({Bucket: sourceBucket, Key:key});
        awsRequest.on('success', function() {
            return callback(null, tmpfile);
        });
        awsRequest.on('error', function(response) {
            return callback(response.error);
        });
        var stream = fs.createWriteStream(tmpfile);
        awsRequest.createReadStream().pipe(stream);
    });
}

Extract tarball

In order to extract the tarball I'm using the ordinary tar command instead of relying on a Node module. This works fine as Lambda seems to include a full standard Amazon Linux distribution. It is very nice to have access to all the common Unix utilities. The glob function makes it easy to traverse the full tree structure of the archive, and I use it to return (or pass on via callback) a map of filenames to the temporary files.

function extractTarBall(tarfile, callback) {
    tmp.dir(function(err, dir) {
        if (err) return callback(err);
        var cmd = 'tar -xzf ' + tarfile + ' -C ' + dir;
        exec(cmd, function (err) {
            if (err) return callback(err);
            glob(dir + '/**/*.*', function(err, files) {
                if (err) return callback(err);
                var fs = files.map(function(file) {
                    return {
                        path: file,
                        originalFile: file.replace(dir, '')
                    };
                });
                return callback(null, fs);
            });
        });
    });
}

Checksum

checksumFiles uses async.map to call the singular version checksumFile. This creates a checksum of the file and does some string manipulation in order to create a name with a checksum in it.

function checksumFiles(files, callback) {
    async.map(files, checksumFile, callback);
}

function checksumFile(file, callback) {
    checksum.file(file.path, { algorithm: 'md5'}, function(err, sum) {
        if (err) return callback(err);
        var filename = file.originalFile;
        var ext = path.extname(filename);
        var base = filename.replace(ext, '');
        var checksumFile = base + '-' + sum + ext;

        callback(null, {
            path: file.path,
            originalFile: file.originalFile,
            checksumFile: checksumFile
        });
    });
}

Upload files to S3

When the new filenames have been created the files can be uploaded to S3 via s3.putObject. Unfortunately, putObject does not support pipe, but I can use a ReadStream as the value of the Body property and this is good enough. It uses the mime module to calculate the content type from the filename. After the file is uploaded, an object with a mapping between the original name and the URL is returned.

function uploadFiles(prefix, files, callback) {
    console.log('uploadFiles', prefix, files)
    async.map(files, uploadFile.bind(null, prefix), callback);
}

function uploadFile(prefix, file, callback) {
    var stream = fs.createReadStream(file.path);
    var s3options = {
        Bucket: config.bucket,
        Key: prefix + file.checksumFile,
        Body: stream,
        ContentType: mime.lookup(file.path)
    };
    s3.putObject(s3options, function(err, data) {
        if (err) return callback(err);
        console.log('Object added', s3options);
        callback(null, {
            originalFile: file.originalFile,
            url: config.url + config.bucket + '/' + prefix + file.checksumFile
        });
    });
}

Upload the index

The last thing to do is to upload the index with the filename-to-URL map as a JSON file. This is done in a similar way as the upload of the images.

function uploadIndex(prefix, files, callback) {
    var s3options = {
        Bucket: config.bucket,
        Key: prefix + '/index.json',
        Body: JSON.stringify(files),
        ContentType: 'application/json'
    };

    s3.putObject(s3options, function(err, data) {
        if (err) return callback(err);
        console.log('Object added', s3options.Key);
        callback(null, {
            files: files,
            url: config.url + config.bucket + '/' + prefix + '/index.json'
        });
    });

}

The final index.json file looks something like this.

[{
  "originalFile": "/Tapir_standing_profile.jpg",
  "url": "https://s3-eu-west-1.amazonaws.com/anders-dest/tapirs/Tapir_standing_profile-624bd0ac55d5140a78a2ea9d1409e2f6.jpg"
},
{
  "originalFile": "/tapir-sticker.png",
  "url": "https://s3-eu-west-1.amazonaws.com/anders-dest/tapirs/tapir-sticker-8522f4228bbc995d73ee1ead9d5e8e4f.png"
},
{
  "originalFile": "/tapir.jpg",
  "url": "https://s3-eu-west-1.amazonaws.com/anders-dest/tapirs/tapir-eb09705a33f6c6896def4e452fa77272.jpg"
}]

Summary

Lambda is very simple to work with and it allows me to create small services that react to events without the need to setup any servers at all.

Apart from the integration with S3, it also integrates with Kinesis and with DynamoDB, allowing for very cool applications to be built.

Saturday, September 27, 2014

Fallacies and Biases of our Imperfect Mind

Our mind is the most advanced computer we know about. It can perform tremendous feats. Yet, it is fooling us a lot more than most of us would care to admit. The reason for this is that the mind takes shortcuts to save energy and speed up our thinking.

In this article I will present how science now believes that the brain works, the problems it has, and suggestions about what to do about it.

Our Incredible Mind

Imagine you are riding a bicycle into an intersection. Cars, motorcycles, mopeds and other bikes are coming from all directions. Your brain takes in the whole scene and makes instantaneous decisions about what route to take. You communicate both consciously and unconsciously with the other drivers and you cross the intersection as if it was nothing.

This is an example of what our incredible mind can do. But, in order to do this it takes shortcuts and these shortcuts are not always appropriate. The rest of this article will discuss the problems that occur when the shortcuts are not to our advantage.

Belief

What is belief? Why do we believe the things we do? What do we truly know? When we start to really analyze our beliefs we often realize that we don't know why we believe in something, we just do. And, we may also know that something is not correct but still act as if it is.

Can you get a cold from being cold?

No! The only way to get a cold is by being exposed to the cold virus. If you catch a cold after being cold it is only a coincidence. Yet, many of us tell our children to dress warmly to avoid getting a cold!

Perception

We think that our perception is infallible. We think we see what is real! This is not the case, our senses are easily fooled and also affected by what we expect to experience.

Shadow Illusion

Which one of A and B is lighter?

It is a trick question, they are both the same color as we see in this picture. Yet, even when we know this, it is impossible to see!

Pattern Illusion

Can you see anything in this image? Can you see the dalmatian?

If we draw the contour of the dalmatian it becomes obvious. But now, if you look at the above picture. Can you not see the dalmatian? Our perception is influenced by what we expect to see.

Attention Test

Watch this film and try to count the passes made by the white-dressed basketball players.

Did you get the count right? Did you see the gorilla? In the original study about half of the people that were shown this film didn't see the gorilla! Being focused on one thing can make us completely miss another.

This happens to us all the time in real life. People look at the same situation and interpret it completely differently.

Memory

We believe that we remember things as they actually were, but in reality our memories are reconstructed every time we remember something. We fill in new details.

Source and Truth Amnesia

We have a tendency to forget the source and the truthfulness of facts that we know. We remember the facts, but we don't know where they came from or whether they are true!

We may have heard about a correlation between vaccines and autism. But we forgot the minor detail that there is not even a weak correlation between them. Hence, we refuse to vaccinate our kids since we don't want them to become autistic.

Vivid Memories

Vivid memories, memories involving strong feelings, make us remember things more strongly. They make us more confident that our memories are correct. But just because the memories are stronger does not mean that they are more correct. We simply believe in them more.

Memory Fusion

Memories also fuse together to form new composite memories, that may not resemble what really happened at all. Do you remember your tenth birthday or do you remember what your mom told you or what you have seen in pictures?

Fake Memories

We cannot tell if our memories are fake or if they really happened. Everything we remember seems real to us!

Pattern Recognition

Humans are also very good at pattern recognition. This allows us to detect and categorize people, animals, and things. But, our pattern recognition also shows us things that are not there. Was there really a dalmatian in the spotted picture above?

Agent Detection

Agent detection is an inclination for humans and animals to detect an intelligent agent in situations that may or may not involve one. We see and hear things that aren't there.

We detect a bush blowing in the wind as a person hiding. We see a rope lying on the trail as a dangerous snake.

Confabulated Consciousness

Our mind processes our perceptions and memories and creates our reality into a coherent story. The story need not be correct, it must only be consistent. In order to keep the story consistent our mind makes up the details it needs to.

In a study of split-brain patients, the patients were shown images. One image per eye. The split-brain condition prevents the two parts of the brain from communicating with each other.

In the depicted example, the patient was shown two images: one eye was shown a chicken foot, the other eye was shown a snowy landscape. The patient then had to pick a related image from a number of other pictures. The patient in the study picked a hen and a snow-shovel with each hand respectively.

When asked why he picked the images, the verbal side of his brain answered: "I picked the hen because I saw a chicken's foot, and I picked the shovel because I need a shovel to clean out the hen house."

His mind made up a story that was consistent with why he had a shovel in his other hand.

Our mind can make things up to make our life story consistent!

Bias

A bias is a prejudice. A cognitive bias is a type of error in thinking that occurs when we are processing and interpreting information in the world around us.

Cognitive biases are often a result of our attempt to simplify information processing. They are rules of thumb that help us make sense of the world and reach decisions with relative speed.

Unfortunately, these biases sometimes trip us up, leading to poor decisions and bad judgments.

Anchoring

The anchoring effect describes the human tendency to rely too heavily on the first piece of information offered, the anchor, when making decisions.

If I ask a group of people "Do more or less than 20 percent of mammals have four legs?" and then ask the same group to guess the specific percentage of mammals that have four legs, I commonly get a lower percentage than if I initially had asked "Do more or less than 80 percent of mammals have four legs?".

We anchor to the number presented to us. This is the same technique that is used by salesmen when they offer you a good deal of only 20 thousand dollars for the second-hand Volvo.

Availability Heuristic

What percentage of the population do you think is allergic to gluten? How do you go about making such an estimate? What I often do is to think about the people around me. How many of them are allergic to gluten? It seems like quite a lot. I would guess about 10 percent of the people I know are allergic, so that is my reply.

This is the availability heuristic at work. Why should my tiny number of acquaintances have anything to do with the rest of the population in the world? But, this information is readily available to me and it is easier for me to just guess from this information than to think through the problem thoroughly.

Fundamental Attribution Error

Say you are walking in the street and stumble and fall. The common way we react to this is to make up an excuse for why we fell: a hole in the pavement, etc. It is not my fault, there was a hole in the pavement. Perhaps we even get angry, someone should really fix that!

If someone else stumbles and falls in the same spot, we readily label that person as being clumsy or careless.

We attribute our mistakes to external causes and other's mistakes to the person. We also give ourselves credit for good things we do, but other people's good deeds we attribute to luck or coincidence. This is the fundamental attribution error.

Hindsight Bias

Hindsight bias is also known as the "I-knew-it-all-along" effect. It is the tendency to see past events as being predictable at the time those events happened. (This picture does not really convey this bias, as the outcome can probably be predicted beforehand :)

An example of this is the 9/11 attacks; once the event had happened it was easy to find clues that pointed to a coming attack. Clues like this exist all the time for things that never happen, but we don't focus on those because they are not relevant.

Confirmation Bias

This is the mother of all biases! A bias that we, all of us, fall into every day. It is the tendency to search for or interpret information in a way that confirms our beliefs. Or, to notice events that confirm our beliefs while ignoring events that disconfirm them.

Do I put the seat down when I have been on the toilet? All the time, I say. Never, my wife says. How can this be? How can I and my wife come to completely different conclusions from the same data?

The reason is that I notice the times when I remember to put the seat down, since I have to think about it and therefore remember it. I don't remember the times when I don't do it, since I don't even notice them.

For my wife it is the absolute opposite, she only notices when I forget to do it and doesn't notice when I do.

When we read an article that we agree with, it is easy to think, "Yes, that is the way it is!" and move on. If we read an article that we don't agree with, we can go to great lengths to examine the "erroneous" arguments to disconfirm them.

Innumeracy

The human mind is really bad at working with large numbers and probability.

Gambler's Fallacy

The tendency to think that future probabilities are altered by past events, when in reality they are unchanged.

Flip a coin ten times in a row and it turns out tails every time. How likely is it that we will flip heads the next time? The answer is, of course, 50%. In this scenario most of us know this is correct, but in many other scenarios we tend to think that the other option is due and hence calculate it as more likely to occur.

Lottery Fallacy

What are the odds of one person winning the lottery? Not very high, maybe one in a billion, depending on which lottery it is. But oftentimes this is not the right question to ask ourselves. We should often ask: What are the odds of anyone winning the lottery? It turns out that those odds are usually pretty good.

Imagine you dream that someone dies. When you wake up the next day it has really happened. What are the odds of this happening to you? It must be a miracle. No! The correct question is: What are the odds of this happening to anyone?

Base Rate Neglect

John is a man who wears Gothic inspired clothing, has long black hair, and listens to death metal. How likely is it that he is a Christian and how likely is it that he is a Satanist?

We have a tendency to answer that it is more likely that he is a Satanist. But, this ignores the base rate. The fact that there are 2 billion Christians and only, maybe, 2 million Satanists. With that base rate in place, it is much more likely that John is a Christian who likes wearing Gothic clothing, has long black hair and listens to death metal.

Clustering Illusion

This is the tendency to overestimate the importance of small runs, streaks, or clusters in large samples of random data.

The clustering illusion explains the "hot hand" in basketball. The hot hand is the belief that a player who has made a few baskets is more likely to make the next basket since he is on a roll.

Probability

Imagine a disease that 1% of the population has. Assume there is a test with 99% certainty of being correct. 1% false positives and 1% false negatives.

What is the probability that you have the disease if after taking the test it shows positive?

Our natural inclination is to answer, "Bloody sure!". But in reality the probability of us having the disease is only 50%. Count it out: in a group of 10,000 people, roughly 99 of the 100 who have the disease test positive, but so do 99 of the 9,900 who don't, and 99 true positives out of 198 positive tests is 50%.

So What?

So, we believe things, but we don't know why. Our perception is severely influenced by what we already believe. Our memories are flawed. We see patterns and agents that don't exist. So what? This doesn't apply to us anyway, right?

It turns out that it does. Smarter people are better at rationalizing their beliefs than others. We still make the same mistakes, but we are better at coming up with credible explanations as to why they are not rationalizations.

Skepticism

"I doubt it!" is not only a proper response to what other people say. It is also an appropriate response to our own thought and ideas.

Scientific skepticism holds that science is the best way to find out things about the world and ourselves. Scientific skeptics don't trust claims made by people who reject science or who don't think that science is the best way to learn about the world.

Scientific skeptics don't say that all extraordinary claims are false. A claim isn't false just because it hasn't been proven true.

It's possible pigs can fly, but until we see the evidence we shouldn't give up what science has taught us about pigs and flying.

Meta Cognition

Thinking about thinking! When you learn new facts, be aware of all the fallacies and biases mentioned in this article. This will help prevent you from making some mistakes.

Bias Blind Spot

The bias blind spot is the cognitive bias of failing to compensate for one's own cognitive biases. Even if we know everything I've written about here, we have a tendency to underestimate our potential for self-deception. To see ourselves as rational beings is the greatest self-deception of all.

Richard Feynman


The first principle is that you must not fool yourself -
and you are the easiest person to fool.  
-- Richard Feynman


References