Loading 430 million bitcoin trades with q/kdb+

The Bitmex bitcoin exchange generously makes available its historical trade and quote data. q/kdb+ is an amazing programming environment for working with this type of data.

In this post I'll attempt to showcase the speed and power of the q language using this dataset.

This will not be an introductory tutorial; I'm mainly just looking to show off the power of q. Everything shown should be copy-and-pastable into a q session, so feel free to follow along!

What you'll need

  • A decent amount of disk space, and time to download the data. Unzipped, the trade data csvs take up 34GB, and roughly double that is needed while they are processed into the on-disk database (after which the csvs can be deleted).
  • A working installation of q, which you can download and find installation instructions for here.
  • Because the data is stored in a public S3 bucket, the easiest way to obtain it is with the AWS CLI tools, for which you can find installation instructions here.
  • A program to unzip the gzipped csv files. This tutorial will use gzip from the command line, which is included on macOS and most Linux distros (on Windows 10 it's available through WSL).

Getting the csv files

Let's start off by making a directory to work in

mkdir qbitmex && cd qbitmex

The first step in working with the Bitmex data is downloading the data itself using the AWS CLI tools.

We can call the following command to download all the csv files for the trade data we currently need.

aws --no-sign-request --region=eu-west-1 s3 sync s3://public.bitmex.com . --exclude "*" --include "data/trade/[0123456789]*.csv.gz"

This will take a while.

Once this completes, we'll need to unzip the files we downloaded:

gzip -df data/trade/*.gz

Loading the data with q

Next we'll need to actually load the files into kdb+.

The trade csvs should now be in the data/trade directory, so let's build a list of their file handles:

filelist:` sv' dir ,' key dir:`:data/trade
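
At this point filelist should be a list of full file handles, one per csv; something like the following (the exact names depend on which dates you downloaded):

/filelist
/`:data/trade/20141122.csv`:data/trade/20141123.csv ..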

First we'll make a function that can take a single csv file and parse the columns correctly:

/Z=datetime, S=symbol, J=long, F=float, G=guid; enlist "," means comma-delimited with a header row
parse_csv:("ZSSJFSGJFF";enlist ",")0:

/we can test this out as follows:
/parse_csv first filelist

Now things start to get a little tricky. If we had enough memory we could just load all the files one by one into in-memory tables, concatenate the results together into one big table, and save the result to disk, like:

/will crash without absurd amounts of memory available
/`:trade_table set (,/) parse_csv each filelist

Unfortunately we don't have enough memory for this, but kdb+ has a couple of useful concepts designed precisely for large tables like ours.

Splayed Tables: Tables stored on disk with a separate file for each column. On disk, the table is a directory, and each column is a file within it.

Partitioned Tables: Splayed tables that are further grouped by the values of a particular column, most commonly a date. In our case each trading date gets its own directory within the root directory (the dates become a "virtual column" in the resulting table). Each date directory holds one directory per table sharing the partition scheme (here just "trade", though a future "quote" table could share the same date virtual column on disk), and each of those table directories contains a file per column, holding the rows that belong to that date.
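
To make this concrete, here is roughly what our database directory will look like once the loading below is finished (the dates are just examples, and only a few of the column files are shown):

hdb/
  sym
  2014.11.22/
    trade/
      .d
      price
      size
      symbol
      ..
  2014.11.23/
    trade/
      ..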

See Chapter 14 of the excellent Q for Mortals book for a more thorough description of splaying and partitioning tables.

Here are some q functions that will help us create the table structure we need.

.Q.en: takes a directory and a table as parameters and enumerates the symbol columns of the table, which is required before a table can be splayed. It's definitely worth fully understanding enumeration and what .Q.en does behind the scenes. But in short, enumeration replaces columns of symbols with columns of integer indices into a shared sym list stored in the database root, for storage and performance reasons, while still letting you work with them and think of them as symbols.

For more on enumeration see Q for Mortals and The Kdb+ Cookbook
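
As a toy illustration (separate from our workflow, writing to a throwaway `:toydb directory), enumerating and then splaying a small table with a symbol column looks like this:

/throwaway example: enumerate symbols against `:toydb/sym, then splay (the trailing / makes set write a splayed table)
t:([] symbol:`a`b`a; price:1.0 2.0 3.0)
`:toydb/t/ set .Q.en[`:toydb;t]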

.Q.par: takes as arguments a root directory, the value being partitioned on, and a table name, and returns the full path handle for that partition. For example (using an arbitrary, illustrative date):
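
.Q.par[`:hdb;2014.11.22;`trade]    /returns `:hdb/2014.11.22/trade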

Now we can build up a partitioned table from our csv files:

First we'll create a quick helper function to log output, stolen from here

out:{-1(string .z.z)," ",x}
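
It simply prints the current UTC timestamp (.z.z) followed by the message, for example:

/out"starting"   / prints something like 2014.11.22T12:00:00.000 starting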

We'll name the directory for our db:

dbdir:`:hdb

Now we will actually load the data!

For each file in filelist we'll call parse_csv to load it into an in-memory table. We take advantage of the fact that the csvs are already separated by date: we get the date from the first row and use it to set the write path.

loadfile:{
 out"Reading file: ",string x;
 data:`timestamp xasc parse_csv x;                  /parse the csv and sort by timestamp
 out"Read ",(string count data)," rows";
 out"Enumerating";
 data:.Q.en[dbdir;data];                            /enumerate symbol columns against hdb/sym
 dt:first raze select[1] timestamp.date from data;  /the file's date, which is its partition value
 writepath:.Q.par[dbdir;dt;`$"trade/"];             /e.g. `:hdb/2014.11.22/trade/ (trailing / so upsert writes a splayed table)
 writepath upsert data;                             /append to (or create) the partition on disk
 }

\t loadfile each filelist

This will again take a while (about 5 minutes on my laptop) to run. \t times the operation, returning the number of milliseconds elapsed. Once it completes, we'll finally have a database of all the trade data!

Now we can "load" our partitioned table and begin querying it. Note that no data is actually pulled into memory until it's needed; kdb+ manages this efficiently behind the scenes.

\l hdb

We can now work with all tables under the hdb root directory, which in our case is just trade:

count trade
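
As a quick sanity check we can also count rows per date, to confirm each day ended up in its own partition:

select trades:count i by date from trade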

Analyzing the data

Now we can begin to query our data!

Let's get all the tradable symbols in our dataset, along with the max and min price for each symbol, sorted by total size traded.

`volume xdesc select volume:sum size,maxpx:max price,minpx:min price by symbol from trade

Let's turn the raw trade data into OHLCV bars on both daily and 5-minute time frames, and throw in VWAP for fun as well.

daily: select open:first price, high:max price,low:min price,close:last price,
  volume:sum size,vwap:size wavg price
  by date from trade where symbol=`XBTUSD

fivemin: select open:first price,high:max price,low:min price,close:last price,
  volume:sum size,vwap:size wavg price
  by date, 5 xbar timestamp.minute from trade where symbol=`XBTUSD
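
To eyeball the results we can unkey the bar tables with 0! and peek at the last few rows (just a spot check):

-5#0!daily
-5#0!fivemin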

Conclusion

Learning the techniques kdb+ provides for managing large amounts of data lets us get up and running with huge datasets quickly. While loading the data takes a bit of time, you'd be hard pressed to find another language or database environment that does it faster, and the same goes for the queries.

In future posts we will look at more advanced querying techniques.