Bigtable is the distributed, scalable, highly available and high performing database that powers multiple applications across Google — each having a different demand in terms of the size of data to be stored and latency with which results are expected. Though Bigtable is not open source itself, the paper was crucial in the development of powerful open source databases like Cassandra (borrows BigTable’s data model), HBase and LevelDB. The paper Bigtable: A Distributed Storage System for Structured Data describes the story of how BigTable is designed and how it works.
Bigtable is a sorted, multidimensional key-value store indexed by a row key, a column key, and a timestamp. For example in the context of storing web pages, the page URL could be the row key, various attributes of the page could be the column key and the time when the page was fetched could be the timestamp. Rows are maintained in lexicographic order and row range is dynamically partitioned with each row range being referred to as a tablet. Reads over short range are very fast due to locality and all read/write operations under a row are atomic. Column keys are grouped into sets called column family. These sets are created before any data is stored. A table should have a small number of distinct column families. A column key has the following syntax — family:qualifier. For example, for family as language, the keys could be language:english or language:hindi. A row in a Bigtable can contain multiple versions of same data, indexed by timestamps, and stored in decreasing order of timestamps. Old versions can be garbage collected in the way that the client can specify that only last n entries are to be kept or entries for only last n days(hours/weeks etc) are to be kept.
Bigtable uses Google File System (GFS) for storing logs and data in SSTablefile format. SSTable provides an immutable, ordered (key, value) map. Immutability provides certain advantages which will be discussed later. An SSTable consists of a sequence of blocks and a block index to locate the blocks. When the SSTable is opened, this index is loaded into the main memory for fast lookup.
Bigtable also uses a highly available and persistent distributed lock service called Chubby for handling synchronization issues. Chubby provides a namespace consisting of directories and small files which can be used as locks. Read and write access to a file is atomic. Clients maintain sessions with a Chubby service which needs to be renewed regularly. Bigtable depends on Chubby so much that if Chubby is unavailable for an extended period of time, Bigtable will also be unavailable.
There are 3 major components — a library linked into the client, one master server and multiple tablet servers. Master server assigns tablets to tablet servers, load balances these servers and garbage collect files in GFS. But the client data does not move through the master, clients directly contact tablet servers for read and write operations. Tablet servers manage a set of tablets — they handle the read/write operations directed to these tablets and also split very large tablets.
A 3-level hierarchy is maintained to store tablet location information. The first level is a file stored in Chubby that has the location of the root table. Root table contains the location of all tables in a METADATA tablet. Each entry in this METADATA tablet contains the location of a set of user tablets. These tablet locations are cached in the client library and can also be fetched from the above scheme by recursively moving up the hierarchy.
Each tablet is assigned to one tablet server at a time. The master server keeps track of which servers are alive, which tablet is assigned to which server and which tablets are unassigned. When a tablet server restarts, it creates and acquires an exclusive lock on a unique file in the servers directory (Chubby directory) which is monitored by the master server. If the tablet server loses its exclusive lock, it will stop serving its tablets though it will attempt to reconnect as long as the file exists in the servers directory. In case the file is deleted, the tablet server kills itself. The master tracks tablets that are not being served and reassigns them. To avoid network partition issues, the master kills itself if its Chubby session expires though its tablets are not reassigned.
A master performs following tasks at startup time:
- Grab the unique master lock to prevent concurrent master instantiation.
- Find live servers by scanning the servers directory.
- Find what tablets are assigned to each server by messaging them.
- Find the set of all tablets by scanning the METADATA tablet.
Tablet creation, deletion or merge is initiated by the master server while tablet partition is handled by tablet servers who notifies the master. In case the notification is lost, the master would be notified when it asks for the split tablet to be loaded.
Memtables are in-memory buffers to store recently committed updates to Bigtable. These updates are later written to GFS for persistent storage. Tablets can be recovered by reading the metadata from METADATA tablet which contains a set of Redo points and list of SSTables comprising the tablet.
There are 3 kinds of compactions in action:
- Minor compaction — As the memtable grows in size and reaches a certain threshold, it is frozen, a new memtable is created and the frozen memtable is written to GFS.
- Merging Compaction — Minor compaction keeps increasing the count of SSTables. Merging compaction reads the contents of a few SSTables and the memtable and writes it to a new SSTable.
- Major Compaction — In Major compaction, all the SSTables are written into a single SSTable.
Along with the described implementation, several refinements are required to make the system reliable and high performing.
Clients can club together column families into locality groups for which separate SSTables are generated. This allows for more efficient reads. If SSTables are small enough, they can be kept in main memory as well.
SSTables can be compressed in various formats. The compression ratio gets even better when multiple versions of same data are stored.
Caching — Tablet servers employ 2 levels of caching.
- Scan Cache — It caches (key, value) pairs returned by the SSTable and is useful for applications that read same data multiple times.
- Block Cache — It caches SSTable blocks read from GFS and useful when the application uses locality of reference.
BloomFilters are created for SSTables (particularly for the locality groups). They help to reduce the number of disk access by predicting if an SSTable may contain data corresponding to a particular row, column pair.
Commit Logs are maintained at tablet server level instead of tablet level to keep the number of log file small. Each tablet server maintains 2 log writing threads — each writing to its own and separate log file and only one of the threads is active at a time. If one of the threads is performing poorly (say due to network congestion), the writing switches to other thread. Log entries have sequence numbers to allow recovery process.
We earlier saw that SSTable entries are immutable. The advantage is that no synchronization is needed during read operations. This also makes it easier to split tablets. Permanently removing deleted data is taken care of by garbage collector.