Let’s Rock

Vinodhini Chockalingam
10 min read · Nov 18, 2019


Something about RocksDB

One Google search would have told you the obvious facts. RocksDB is:

  • Embedded Key Value store (another one?)
  • Forked from LevelDB
  • Developed by Facebook
  • Optimized for fast storage

And then you will see a bunch of terms — SST, LSM, WAL, Bloom filters.

In this post, I will try my best to explain what I understood from reading through their official wiki and a few of their talks.

Case for Embedded database

Latency is dominated by the network. When a single client call passes through multiple services, each of them communicating with one or more database servers — no matter how efficient your storage is, the round-trip time spent on the network is not going away.

So you put the storage in the application server itself.

Not all products start with an embedded database. But if your product cannot live with the latency added by network calls, then of course you enter the world of embedded storage.

Architecture

  • Any read/write goes through the memtable — an in-memory data structure that holds the data.
  • The data is also written to the WAL — Write Ahead Log.
  • Writes go to the active memtable. When it becomes full, it is closed, becomes immutable, and is later flushed to disk. This process is called a memtable switch — the active memtable is closed along with the corresponding WAL, and a new memtable and WAL are opened for writing.
  • What is the WAL for? When your application crashes, you lose the data in the memtable. The WAL is then used to recover that data and restore the database to its original state.
  • Once the memtable data is flushed to disk, the corresponding WAL data is also cleared — consistency is guaranteed (i.e. the WAL only holds data that has not yet been written to disk).
  • The WAL can also be archived after writes, so the logs can be replayed to create a replica (slave) database. We will get to replication in another post.
  • SST — Static Sorted Table, also known as Sorted String Table. When I say the data is flushed to “disk”, it means the data is written to an SST file on disk. By default they are block-based, which is optimized for flash storage, and they support compression. There is a whole page in the RocksDB wiki dedicated to understanding SST files. They are just .sst files full of bytes — the data plus metadata and a footer that help read the data back from these files.
  • Over time you would have too many SST files on disk. Compaction keeps up with the growing data by merging multiple versions of a key. Remember that I said the closed memtables and SST files are immutable? So what happens when I update the value for a key? A new key-value pair is added to the active memtable. Since a read first looks into the memtables and then the SSTs, you are always given the latest value, and compaction ensures the old values are discarded during the merge. What happens if you keep calling map.put(K,V) multiple times with different values? The last value passed is stored against the key, right? Same thing.
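
To make this concrete, here is a minimal sketch of the write path using the RocksJava binding; the database path, keys, and values are purely illustrative, and error handling is kept to a minimum.

import org.rocksdb.FlushOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

public class WritePathSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocks-write-demo")) {

            // Each put lands in the active memtable and is appended to the WAL.
            db.put("k1".getBytes(), "v1".getBytes());
            db.put("k1".getBytes(), "v2".getBytes()); // the newer record shadows v1

            // A read checks the memtable(s) first and then the SST files,
            // so the latest value always wins.
            System.out.println(new String(db.get("k1".getBytes()))); // prints "v2"

            // If losing the memtable contents on a crash is acceptable,
            // the WAL write can be skipped per write.
            try (WriteOptions wo = new WriteOptions().setDisableWAL(true)) {
                db.put(wo, "k2".getBytes(), "v1".getBytes());
            }

            // Force a memtable switch and flush the contents to an SST file on disk.
            db.flush(new FlushOptions().setWaitForFlush(true));
        }
    }
}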

What happens when you delete a record? Say the disk already has k1 → v1, I update k1 to v2 (so v2 sits in the memtable), and then I delete the key as well.

Disk Table          Memtable
| k1 | v1 |         | k1 | v2 |

Disk Table          Memtable
| k1 | v1 |         ∅

If delete simply removed the entry from the memtable, we would have effectively resurrected v1. This is why deleted records are instead recorded with a tombstone — a special delete entry that compaction understands as a deletion.

Disk Table          Memtable
| k1 | v1 |         | k1 | <tombstone> |
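
Here is a minimal sketch of that behaviour in RocksJava (again, path and keys are illustrative). Because the delete writes a tombstone, compacting the SST files does not bring v1 back:

import org.rocksdb.FlushOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class TombstoneSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocks-tombstone-demo")) {

            db.put("k1".getBytes(), "v1".getBytes());
            db.flush(new FlushOptions().setWaitForFlush(true)); // v1 now lives in an SST file

            db.put("k1".getBytes(), "v2".getBytes()); // newer value, still in the memtable
            db.delete("k1".getBytes());               // writes a tombstone, no in-place erase

            db.flush(new FlushOptions().setWaitForFlush(true));
            db.compactRange(); // merge the SST files; the tombstone suppresses v1

            System.out.println(db.get("k1".getBytes())); // prints "null" — v1 was not resurrected
        }
    }
}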

With these basics, let us look at the architecture.

High level architecture

Before we dig deep into compaction, let's quickly go over one of the features of RocksDB — column families. I will elaborate on the other features in another post.

  • Column Family — No, it is not a column. One key can map to multiple values, one per column family. If you don't provide a column family, the value gets mapped to the “default” column family. Memtables and SSTs are per column family, but the WAL is not. That is, just because the memtable of CF 1 got flushed, you cannot close the existing WAL — it still contains data for CF 2, whose memtable is not flushed yet. I am curious to know how they have managed to keep their recoveries consistent, but this is all I know. Column families were introduced to logically partition the database, and each CF can be configured independently of the others.
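
Here is a rough sketch of using column families through RocksJava; the column family name “users”, the path, and the values are purely illustrative. Note that the “default” column family always has to be included when opening the DB with explicit descriptors:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class ColumnFamilySketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        // The "default" column family must always be listed, plus our own CF.
        List<ColumnFamilyDescriptor> descriptors = Arrays.asList(
                new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, new ColumnFamilyOptions()),
                new ColumnFamilyDescriptor("users".getBytes(), new ColumnFamilyOptions()));
        List<ColumnFamilyHandle> handles = new ArrayList<>();

        try (DBOptions options = new DBOptions()
                     .setCreateIfMissing(true)
                     .setCreateMissingColumnFamilies(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocks-cf-demo", descriptors, handles)) {
            ColumnFamilyHandle usersCf = handles.get(1);
            // The same key can hold different values in different column families.
            db.put("k1".getBytes(), "default-value".getBytes());        // goes to the "default" CF
            db.put(usersCf, "k1".getBytes(), "users-value".getBytes()); // goes to the "users" CF
            System.out.println(new String(db.get(usersCf, "k1".getBytes())));

            // Column family handles should be closed before the DB itself.
            for (ColumnFamilyHandle handle : handles) {
                handle.close();
            }
        }
    }
}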

Let's take another detour before LSM. It is important to know these amplification factors to understand the access patterns and tune RocksDB accordingly.

Write Amplification

The ratio of bytes written to storage versus bytes written to the database.

For example, if you are writing 10 MB/s to the database and you observe 30 MB/s disk write rate, your write amplification is 3. This is especially bad for flash storage.

Higher Write Amplification in SSDs due to GC

One of the main differences between a Hard Disk Drive (HDD) and a Solid State Drive (SSD) is how they handle data writes. While an HDD can simply overwrite data in place, an SSD always has to erase before it writes new data into the flash storage chips. This means that, except for brand new SSDs or ones that have been securely erased before they are sold, the flash storage has to be erased before it can be rewritten.

Flash storage is organized into blocks and pages. A block is made up of several pages, and a page is made up of many flash cells. The main challenge is that flash can only be erased block-wise and written page-wise. To write new data to a page, it must be completely empty. If it is not, its contents have to be erased first. However, it is not possible to erase a single page — only all the pages that belong to one block. And because the block sizes of an SSD are fixed — for example, 512 KB, 1024 KB, up to 4 MB — a block that contains a single page with only 4 KB of data will still take the full 512 KB of storage.

And compaction, which rewrites records as it merges them, adds to this factor. We will see how LSM plays in here — after we go over the two other amplification factors.

Read Amplification

The number of disk reads per query. If you need to read 5 pages to answer a query, the read amplification is 5. The read amplification factor also includes the cost of decompressing the data.

Amplification factor is defined for point queries and range queries separately.

In RocksDB, the data for a key could be present in the memtables or in many SST files that are not merged yet. So naturally a query has to look in many places — higher read amplification. Those factors would also be very different for range queries.

RocksDB provides a block cache to keep data in memory. Though such reads are cheaper than disk reads, they still impose a CPU cost.
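
As a rough sketch of how that cache is wired up in RocksJava (exact setter names can differ slightly between versions), the block cache hangs off the block-based table configuration; the 256 MB size and path below are just examples:

import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class BlockCacheSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        // Attach a 256 MB LRU block cache to the block-based table format.
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setBlockCache(new LRUCache(256L * 1024 * 1024));
        try (Options options = new Options()
                     .setCreateIfMissing(true)
                     .setTableFormatConfig(tableConfig);
             RocksDB db = RocksDB.open(options, "/tmp/rocks-cache-demo")) {
            db.put("k1".getBytes(), "v1".getBytes());
            // Repeated reads of hot blocks are served from the cache
            // instead of hitting the disk each time.
            db.get("k1".getBytes());
        }
    }
}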

Space Amplification

The ratio of the size of database files on disk to the data size. If you put 10 MB in the database and it uses 100 MB on disk, then the space amplification is 10. With higher space amplification you quickly run out of disk space. A low space amplification is more important with flash storage than disk because of the higher price per GB of storage capacity.

Usually, if you end up storing more pointers/metadata along with the user data, your space amplification increases. For example, take a doubly linked list — on top of the user data, you need to store the next and prev pointers. The same goes for any data structure. In RocksDB, space amplification arises from storing multiple records for the same key (since the records are immutable). And obviously, compaction reduces space amplification.

The RUM Conjecture

Researchers at the Harvard DB Lab defined three overheads that database systems try to optimize — RUM: Read overhead, Update overhead, and Memory overhead. It is the CAP-theorem kind of trade-off for database access patterns.

The ubiquitous fight between the Read, the Update, and the Memory overhead of access methods for modern data systems

When designing access methods, we set an upper bound for two of the RUM overheads; this implies a hard lower bound for the third overhead, which cannot be further reduced.

Read more about this here.

LSM and Leveled Compaction in RocksDB

LSM stands for Log-Structured Merge tree. I cannot explain it better than this. But let me get the basics out here.

In general, in-place-update storage structures are optimized for read performance. If you have to update the data, first you have to locate it. Then, depending on the updated data, the new value might be too big for its current page, so the data might have to be relocated to a bigger page.

LSM trees are append-only storage, which leads to sequential writes. The word “merge” in LSM trees indicates that, due to their immutability, tree contents are merged using an approach similar to merge sort. Immutable files have higher density: there is no need to allocate extra space for updates, which might require more space than the originally written records. These are particularly useful for applications where writes are far more common than reads.

The memtables are kept sorted, so they are already sorted when flushed. Hence LSM trees only have to merge a bunch of sorted files — the sorted in-memory table with the disk-resident sorted tables — similar to merging a bunch of sorted collections. If you open at most M file iterators and feed them through a min-heap, producing each merged record costs O(log M). In addition to this, there is a reconciliation step when the same key is found in multiple files: you need to figure out which record is the latest. In RocksDB, every write carries a monotonically increasing sequence number, and the record with the highest sequence number wins.
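
To make the reconciliation step concrete, here is a toy sketch (plain Java, not RocksDB code) where each record carries a sequence number and, for a key that appears in several runs, the record with the highest sequence number wins. A real merge would stream the sorted runs through one iterator per file instead of materializing them in memory:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of the reconciliation done while merging sorted runs in an LSM tree.
public class LsmMergeSketch {
    record Entry(String key, long seqNo, String value) {}

    static List<Entry> merge(List<List<Entry>> sortedRuns) {
        // For brevity, reconcile via a TreeMap keyed by the record key;
        // for each key, keep the record with the highest sequence number.
        Map<String, Entry> newest = new TreeMap<>();
        for (List<Entry> run : sortedRuns) {
            for (Entry e : run) {
                newest.merge(e.key(), e, (a, b) -> a.seqNo() >= b.seqNo() ? a : b);
            }
        }
        return new ArrayList<>(newest.values());
    }

    public static void main(String[] args) {
        List<Entry> older = List.of(new Entry("k1", 1, "v1"), new Entry("k2", 2, "x"));
        List<Entry> newer = List.of(new Entry("k1", 5, "v2"));
        // k1 resolves to v2 (seqNo 5 beats 1); k2 keeps x.
        System.out.println(merge(List.of(older, newer)));
    }
}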

LSM is simply the concept of append-only storage. SSTs are one implementation of how the in-memory and disk-resident tables are represented. Compaction, indices, and read optimizations are implementation details of LSM-based storage.

Leveled compaction is when you arrange the disk-resident SSTs into levels.

What is with the size of the levels? Have you noticed that they are exponentially increasing? Level 2 is 10% of level 3, level 1 is 10% of level 2, and so on.

  • Each level is a sorted run of SST files. With binary search it is easy to find the file containing a particular key.
  • Each level is assigned a target size. The goal of compaction is to keep the data size at a level under its target size.
  • Each level gets compacted when its target size is reached.

As mentioned earlier, RocksDB achieves a stable LSM structure by making sure each level is 10% of the size of the level below it. So roughly 90% of the data sits in the bottommost level. The remaining 10% is easy to cache, which reduces your read amplification factor. It also means that at most about 10% of the rows are obsolete, and only about 10% of temporary space needs to be reserved for compaction — bounding your space amplification factor too. RocksDB also employs various compression techniques to battle space amplification. These are explained in detail in this paper on optimizing space amplification for RocksDB.
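
For a rough idea of what these knobs look like in RocksJava (the sizes below are made-up examples, and option names may differ slightly between versions), the base level target size and the 10x multiplier between levels are directly configurable:

import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class LeveledCompactionSketch {
    public static void main(String[] args) {
        RocksDB.loadLibrary();
        // Example targets: L1 aims for 256 MB, and every level below is
        // allowed to grow to 10x the target size of the level above it.
        try (Options options = new Options()
                .setCreateIfMissing(true)
                .setCompactionStyle(CompactionStyle.LEVEL)
                .setNumLevels(7)
                .setMaxBytesForLevelBase(256L * 1024 * 1024)
                .setMaxBytesForLevelMultiplier(10)) {
            // L1 ~ 256 MB, L2 ~ 2.5 GB, L3 ~ 25 GB, ... so the bottommost
            // level ends up holding roughly 90% of the data.
            System.out.println(options.maxBytesForLevelBase());
        }
    }
}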

Also note that tombstones are not dropped during compaction right away. They are preserved until they reach the bottommost level, to ensure that no data record for the same key with a smaller sequence number is still present in some other SST.

Before we close, let's look at one other feature that improves read amplification.

Bloom Filters

How can we improve read amplification across SST files? One way is to quickly figure out whether the key I am looking for is not in an SST at all. SSTs are sorted — so if each SST file stores its key range (lowest, highest), it can tell us whether the key could be present in that file.

A Bloom filter is an improvement on this approach. It is a probabilistic data structure that can tell whether the key might be present or is definitely not present — i.e. false positives are possible, but false negatives are not.

A Bloom filter uses a large bit array and multiple hash functions. Keys are hashed to indices in the bit array. If the bits are set to 1 at all positions determined by the hash functions for a key, then the key might be present in the file. If even one of those indices is not set to 1, it is safe to say the key is definitely not present.
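
RocksDB builds such a filter per SST file (configurable through the block-based table options), but here is a toy, self-contained sketch of the idea in Java, using double hashing to derive the k probe positions:

import java.util.Arrays;
import java.util.BitSet;

// A toy Bloom filter: k hash probes over an m-bit array.
// False positives are possible; false negatives are not.
public class ToyBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash probes per key

    public ToyBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th probe position from two base hashes (double hashing).
    private int index(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | 1; // force the second hash to be odd
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(byte[] key) {
        for (int i = 0; i < k; i++) {
            bits.set(index(key, i));
        }
    }

    // true  -> the key *might* be present
    // false -> the key is definitely absent
    public boolean mightContain(byte[] key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1 << 16, 7);
        filter.add("k1".getBytes());
        System.out.println(filter.mightContain("k1".getBytes())); // true
        System.out.println(filter.mightContain("k2".getBytes())); // almost certainly false
    }
}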

Image from a Google search. Here x hashes to indices 4, 8, 12 and y hashes to indices 0, 9, 14. There could very well be a key z that hashes to indices 4, 9, 12 — the Bloom filter will then produce a false positive.

Closing words

From the RocksDB Tuning Guide: “RocksDB is very flexible, which is both good and bad. You can tune it for variety of workloads and storage technologies. Inside of Facebook we use the same code base for in-memory workload, flash devices and spinning disks. However, flexibility is not always user-friendly. We introduced a huge number of tuning options that may be confusing.”

The intent of this article is to understand the components at a high level so you know what you are dealing with. To tune the DB, or even to understand the various configurations available in RocksDB, it is important to get the basics right. I will dig into the features and the dos-and-don'ts in another post.
