Key-value data stores (KVSs) are a fast-growing segment of the NoSQL database world. Because they have the simplest possible data model (arbitrary byte strings for keys, each paired with a byte string for the corresponding value), they can be quickly incorporated into applications with simple storage needs. Their APIs tend to be equally simple, providing only primitive operations and delivering them with very little overhead.
While there are a number of client-server oriented KVSs available now, such as memcached and Redis, this test only uses embedded KVSs – i.e., KVSs that run entirely within the application process, rather than requiring communication with a DB server process. Note that KVSs of this type are nothing new; they are among the oldest families of storage software around. For example, BerkeleyDB is descended from a storage library that has been in use on Unix since the 1980s. Nor are these KVSs restricted to local (non-networked) use – BerkeleyDB, LMDB, and TokuDB are fully transactional and have been used as the underlying storage engines for servers like MySQL and OpenLDAP for many years. Indeed, inside nearly every higher-level data store, SQL, NoSQL, or otherwise, you’ll find an embedded KVS doing the actual work.
Many high-profile applications use KVSs today; for example, cryptocurrencies like Bitcoin use a KVS to store their blockchain, a complete log of all of their transactions. Project requirements that lead to choosing a KVS might include very high transaction rates (e.g. a large e-commerce site), very low latency (e.g. high-frequency trading), a very small footprint (e.g. mobile phones and other constrained environments), or some combination of all three.
The tests run here are derived from benchmark code originally written at Google by the LevelDB authors and subsequently ported to multiple other DB engines by Symas Corp [2]. The test scenario is based on benchmarks performed by Facebook’s RocksDB team [3] and on an expanded version of those tests, covering more DB engines and use cases, performed by Symas Corp.
There are two phases to a test: a bulk load phase where the records are loaded in sequential order, and then a read-write phase accessing the records in random order.
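As a rough illustration of the shape of these two phases, the sketch below uses the LevelDB C API purely as a stand-in; the database path, record count, and key format are illustrative placeholders and are not taken from the actual benchmark code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <leveldb/c.h>

#define NUM_RECORDS 100000L          /* illustrative; the real data set is ~5x RAM */

int main(void) {
    char *err = NULL, key[32], val[128];
    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "./bench.db", &err);
    if (err) { fprintf(stderr, "open failed: %s\n", err); return 1; }

    leveldb_writeoptions_t *wo = leveldb_writeoptions_create();
    leveldb_readoptions_t *ro = leveldb_readoptions_create();
    memset(val, 'x', sizeof(val));

    /* Phase 1: bulk load, inserting records in sequential key order. */
    for (long i = 0; i < NUM_RECORDS; i++) {
        snprintf(key, sizeof(key), "%016ld", i);
        leveldb_put(db, wo, key, strlen(key), val, sizeof(val), &err);
    }

    /* Phase 2: access records in random order.  In the real test one writer
     * thread updates random records while other threads read concurrently. */
    for (long i = 0; i < NUM_RECORDS; i++) {
        size_t vlen;
        snprintf(key, sizeof(key), "%016ld", rand() % NUM_RECORDS);
        char *v = leveldb_get(db, ro, key, strlen(key), &vlen, &err);
        leveldb_free(v);             /* leveldb_get returns a malloc'd copy */
    }

    leveldb_readoptions_destroy(ro);
    leveldb_writeoptions_destroy(wo);
    leveldb_close(db);
    leveldb_options_destroy(opts);
    return 0;
}
```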
All of the DB engines tested here are embedded key/value stores that store keys in sorted order, thus supporting ordered traversal of keys. As such, they all use variants of a search tree data structure. The test data set is chosen to be approximately 5 times larger than RAM on the test servers, to show space and I/O efficiency of each DB engine.
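As one illustration of the ordered traversal these engines support, the following minimal sketch walks a database in key order using the LMDB cursor API; the environment path "./testdb" is an assumption and is expected to already contain data.

```c
#include <stdio.h>
#include <lmdb.h>

int main(void) {
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_cursor *cur;
    MDB_val key, data;

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", MDB_RDONLY, 0664);  /* assumed to exist and hold data */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    mdb_cursor_open(txn, dbi, &cur);

    /* Keys come back in sorted order, regardless of insertion order. */
    while (mdb_cursor_get(cur, &key, &data, MDB_NEXT) == 0)
        printf("%.*s\n", (int)key.mv_size, (char *)key.mv_data);

    mdb_cursor_close(cur);
    mdb_txn_abort(txn);
    mdb_env_close(env);
    return 0;
}
```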
While the underlying concepts are related, the DB engines all take widely varying approaches in their implementations. Most of them attempt to manage their own data caches, in an effort to minimize the number of physical I/O operations needed for a given workload. Every approach entails a tradeoff of space versus time; spending a certain amount of space may save a certain amount of time in a given circumstance, but that savings will inevitably be paid back in some other aspect:
RocksDB trades simplicity for performance – it requires the most complex tuning, with over 40 parameters that must be set correctly to deliver its best results. TokuDB and BerkeleyDB follow in configuration complexity. LevelDB and LMDB aim for configuration simplicity, with LMDB requiring no tuning at all (see the configuration sketch below).
RocksDB, LevelDB, and TokuDB focus on write performance; the Log-Structured Merge (LSM) tree design of RocksDB and LevelDB and the Fractal Tree (FT) design of TokuDB both rely heavily on user-configured write buffers to aggregate writes in memory before committing them to disk, allowing the final writes to be performed sequentially and thus at the highest speed of the underlying media. The tradeoff is that the ideal order for writes is not the ideal order for reads, so their read performance is significantly worse. BerkeleyDB and LMDB are more traditional B+tree designs, focusing more on read performance than write performance.
WiredTiger tries to cover all bases, offering both LSM and B+tree engines in its library. This flexibility necessitates a moderate level of configuration complexity as well.
RocksDB and LevelDB strip out features in pursuit of maximum write performance. BerkeleyDB, LMDB, TokuDB, and WiredTiger focus on reliability, offering full ACID transactions. As such, the latter four engines are easily adapted for use in SQL servers, LDAP servers, and other transaction-oriented applications, while the former two are less suitable for such uses.
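As a rough illustration of the difference in configuration burden, the sketch below sets just a handful of the RocksDB options its tuning guides discuss (the values shown are arbitrary placeholders, not recommended settings), while the LMDB setup alongside it needs nothing beyond a maximum map size; the path "./testdb" and the sizes used are assumptions.

```c
#include <rocksdb/c.h>
#include <lmdb.h>

/* RocksDB: a small sample of the many knobs that affect performance.
 * The values here are placeholders, not tuned recommendations. */
static rocksdb_options_t *rocksdb_tuned_options(void) {
    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);
    rocksdb_options_set_write_buffer_size(opts, 64 << 20);           /* memtable size */
    rocksdb_options_set_max_write_buffer_number(opts, 4);
    rocksdb_options_set_level0_file_num_compaction_trigger(opts, 8);
    rocksdb_options_set_max_background_compactions(opts, 4);
    rocksdb_options_set_target_file_size_base(opts, 64 << 20);
    rocksdb_options_set_max_bytes_for_level_base(opts, 512 << 20);
    /* ...dozens of further options (block cache, bloom filters, per-level
     * compression, compaction style, etc.) are omitted from this sketch. */
    return opts;
}

/* LMDB: the only setting normally required is the maximum map size. */
static MDB_env *lmdb_env(void) {
    MDB_env *env;
    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL << 34);      /* 16 GB reservation; assumes a 64-bit system */
    mdb_env_open(env, "./testdb", 0, 0664);   /* directory assumed to exist */
    return env;
}
```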
All of the engines except LMDB use their own explicitly managed write buffers and data caches; LMDB relies solely on the operating system’s page cache. As such, all the engines besides LMDB incur significant CPU overhead from managing these buffers and copying data multiple times between them. For example, when retrieving a record from disk that was not previously cached, BerkeleyDB copies the data more than once: first the OS reads it into its page cache, then it is copied into BerkeleyDB’s own page cache, and finally it is copied from the BerkeleyDB cache into the user-provided buffer. LMDB is zero-copy: data goes directly from the storage device into the OS page cache, and that data is handed back to the user as-is. Engines like TokuDB perform even more copies, because the data on disk undergoes multiple format transformations before it is returned to the user. Each of these intermediate copies consumes memory, so the memory-copy architecture has a direct bearing on how many records can fit in RAM simultaneously, and thus on how large a working set can be operated on at full speed. Since LMDB makes no copies of its own, it can handle the largest data set for any given amount of RAM.
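To make the zero-copy point concrete, here is a minimal sketch using the LMDB API (the path and key are assumptions): the MDB_val filled in by mdb_get() points directly into the memory-mapped file, i.e. into the OS page cache, with no engine-private buffer or user-side copy involved.

```c
#include <stdio.h>
#include <string.h>
#include <lmdb.h>

int main(void) {
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, data;

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", MDB_RDONLY, 0664);   /* file is mmap'd, not read into a private cache */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    key.mv_data = "some-key";                          /* illustrative key */
    key.mv_size = strlen("some-key");
    if (mdb_get(txn, dbi, &key, &data) == 0) {
        /* data.mv_data points into the memory map (the OS page cache);
         * the record was not copied into any library- or user-owned buffer. */
        printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);
    }

    mdb_txn_abort(txn);
    mdb_env_close(env);
    return 0;
}
```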
Some of the DB engines are optimized more for write operations, and some more for reads. The bulk load phase shows their best-case write speed. The random read-write phase shows their worst-case write speed: a single thread performs writes to randomly selected records while XX threads perform concurrent reads of other randomly selected records. It also shows how each engine handles reads in a heavily loaded environment.
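The shape of that phase can be sketched with a single writer thread and several reader threads sharing one DB handle; the fragment below uses the LevelDB C API and pthreads purely as an illustration, and the thread count, record count, path, and key format are placeholders rather than the benchmark’s actual parameters.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <leveldb/c.h>

#define NUM_READERS    4          /* placeholder for the benchmark's reader-thread count */
#define NUM_RECORDS    100000L
#define OPS_PER_THREAD 50000L

static leveldb_t *db;

static void *reader(void *arg) {
    leveldb_readoptions_t *ro = leveldb_readoptions_create();
    char key[32], *err = NULL;
    unsigned seed = (unsigned)(size_t)arg;
    for (long i = 0; i < OPS_PER_THREAD; i++) {
        size_t vlen;
        snprintf(key, sizeof(key), "%016ld", (long)(rand_r(&seed) % NUM_RECORDS));
        char *v = leveldb_get(db, ro, key, strlen(key), &vlen, &err);
        leveldb_free(v);
    }
    leveldb_readoptions_destroy(ro);
    return NULL;
}

static void *writer(void *arg) {
    (void)arg;
    leveldb_writeoptions_t *wo = leveldb_writeoptions_create();
    char key[32], val[128], *err = NULL;
    memset(val, 'y', sizeof(val));
    for (long i = 0; i < OPS_PER_THREAD; i++) {
        snprintf(key, sizeof(key), "%016ld", (long)(rand() % NUM_RECORDS));
        leveldb_put(db, wo, key, strlen(key), val, sizeof(val), &err);
    }
    leveldb_writeoptions_destroy(wo);
    return NULL;
}

int main(void) {
    char *err = NULL;
    leveldb_options_t *opts = leveldb_options_create();
    db = leveldb_open(opts, "./bench.db", &err);       /* assumed already bulk-loaded */
    if (err) { fprintf(stderr, "open failed: %s\n", err); return 1; }

    pthread_t readers[NUM_READERS], w;
    pthread_create(&w, NULL, writer, NULL);            /* one thread writes... */
    for (int i = 0; i < NUM_READERS; i++)              /* ...while several read concurrently */
        pthread_create(&readers[i], NULL, reader, (void *)(size_t)(i + 1));

    pthread_join(w, NULL);
    for (int i = 0; i < NUM_READERS; i++)
        pthread_join(readers[i], NULL);

    leveldb_close(db);
    leveldb_options_destroy(opts);
    return 0;
}
```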
As the Symas tests showed, the relative performance of each DB engine is highly dependent on the size of the records being used. The record size chosen here is a rough compromise between the larger records (at which the traditional B+tree-based engines excel) and the smaller records (at which the LSM-based designs excel).