Vertical partitioning of high-throughput OLTP databases

We generally divide database usage into two main categories:

  • Online Analytic Processing (OLAP), characterized by long, heavy, read-mostly transactions that process large amounts of data (e.g. terabytes of log files), typically computing aggregates over many rows, and
  • Online Transaction Processing (OLTP), characterized by very short-lived but very frequent transactions with no aggregates, or only a few aggregates over small subsets of rows.
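
To make the contrast concrete, here is a toy Python sketch of the two workload shapes against the same table. The orders data, column names and helper function are all made up for illustration:

    # Hypothetical orders table, entirely in memory.
    orders = [{"id": i, "customer": i % 100, "amount": i * 0.5}
              for i in range(100_000)]
    index = {row["id"]: row for row in orders}   # primary-key index

    # OLAP-shaped: one long query scanning every row and aggregating.
    revenue = {}
    for row in orders:
        revenue[row["customer"]] = revenue.get(row["customer"], 0.0) + row["amount"]

    # OLTP-shaped: a short transaction touching one row via a point lookup.
    def add_to_order(order_id, delta):
        index[order_id]["amount"] += delta       # tiny read-modify-write

    add_to_order(42, 9.99)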

OLAP workloads will usually benefit from a column store DBMS, with its highly efficient data retrieval and CPU utilization, while OLTP workloads should usually aim for a row store, which is efficient for writes. (BTW: Abadi, Madden and Hachem wrote an excellent paper comparing column and row stores.)
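
A minimal sketch of the two physical layouts for the same logical table, using plain Python lists as stand-ins for pages and column arrays, shows why scans favor columns and writes favor rows:

    # Row layout: one tuple per record.
    rows = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]

    # Column layout: one contiguous array per attribute.
    columns = {
        "id":     [1, 2, 3],
        "name":   ["alice", "bob", "carol"],
        "amount": [10.0, 20.0, 30.0],
    }

    # An analytic aggregate reads exactly one array in the column layout,
    # while the row layout drags every attribute through memory.
    total = sum(columns["amount"])

    # A write is a single tuple append in the row layout, but must touch
    # every array in the column layout.
    rows.append((4, "dave", 40.0))
    for col, value in zip(columns, (4, "dave", 40.0)):
        columns[col].append(value)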

Let’s take a closer look at row stores for high-throughput OLTP systems. Conventional row stores (e.g. Oracle, MySQL, SQL Server, PostgreSQL, DB2 and more) are all basically built on paradigms introduced in System R back in the seventies, but the world has changed since then:

  • most OLTP data sets now fit completely in main memory,
  • hardware is both cheap and fast, making it affordable to buy additional hardware when there is a need to scale out, and
  • transactions are so short-lived that they might just as well be executed serially, removing the need for concurrent transaction execution.

Stonebraker et al. wrote a quite revolutionary paper suggesting a complete rewrite of modern OLTP databases to match the world as it looks today. In short, they suggest a distributed, memory-only database that completely avoids disk-based storage and minimizes the need for undo and redo logs. Initial experiments on a prototype, H-store (now named VoltDB), showed drastic (!) throughput improvements, so the project is very interesting to follow.
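
To illustrate the serial-execution idea, here is a minimal sketch in Python. It is not the actual H-store/VoltDB architecture, just the concept: the whole database lives in memory and a single loop drains a transaction queue one transaction at a time, so no locks, latches or concurrency control are needed. The accounts data and transfer transaction are made up:

    from collections import deque

    # The entire "database" is an in-memory dict.
    accounts = {"a": 100, "b": 50}

    def transfer(src, dst, amount):      # a short, pre-declared transaction
        if accounts[src] < amount:
            return "abort"
        accounts[src] -= amount
        accounts[dst] += amount
        return "commit"

    txn_queue = deque([("a", "b", 30), ("b", "a", 10)])
    while txn_queue:                     # serial execution: one txn at a time
        print(transfer(*txn_queue.popleft()))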

But what does all this babbling have to do with vertical partitioning? While row stores are write-optimized, they are sub-optimal for reads: all retrieval is done in units of whole rows, even if the retrieving transaction only needs a small subset of the columns. When processing a query after its row retrieval, wide rows with superfluous columns generally result in less throughput than narrow rows without them. That is, wide rows generally waste costly I/O, even if the I/O is from main memory. A column store would not suffer from these problems, but writes are relatively costly in column stores, making them undesirable for OLTP workloads. A solution is to vertically partition the tables into (possibly non-disjoint) smaller fractions so that the number of superfluous columns retrieved by each transaction is minimized.

Given a set of sites in a cluster, the challenge is now to distribute attributes from an input schema and transactions from an input workload to these sites so that the costs of retrievals, writes and inter-site transfers are minimized. Furthermore, we would like all columns needed by a read query to be co-located with the query, since inter-site transfer may be a bottleneck for short-lived transactions; equivalently, single-sitedness of read queries should be preserved. And last, we would like to be able to prioritize between total cost minimization and load balancing across the sites. This distribution problem is non-trivial, and in fact NP-hard. A soon-to-be-submitted paper presents an exact algorithm and a heuristic for solving it; both have been experimentally shown to obtain good results.
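
To make the distribution problem concrete, here is a naive greedy sketch in Python of a simplified, disjoint variant. It is emphatically not the paper’s exact algorithm or heuristic; the workload, column names, transfer weight and cost model are all made up for illustration:

    # Each site holds one column group; a read query must visit every group
    # containing a column it needs. Per query we charge one unit for each
    # superfluous column retrieved plus TRANSFER per extra site visited
    # (the single-sitedness penalty). All names and weights are hypothetical.
    queries = [{"id", "name"}, {"id", "amount"}, {"name", "email"}]
    all_columns = set().union(*queries)
    TRANSFER = 2.0                       # relative cost of an inter-site hop

    def cost(groups):
        total = 0.0
        for q in queries:
            touched = [g for g in groups if g & q]
            retrieved = set().union(*touched)
            total += len(retrieved - q) + TRANSFER * (len(touched) - 1)
        return total

    # Start with one group per column (no superfluous I/O, maximal transfer)
    # and greedily merge the pair of groups that lowers total cost the most,
    # stopping when no merge helps.
    groups = [{c} for c in all_columns]
    improved = True
    while improved:
        improved = False
        best_delta, best_pair = 0.0, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                trial = [g for k, g in enumerate(groups) if k not in (i, j)]
                trial.append(groups[i] | groups[j])
                delta = cost(trial) - cost(groups)
                if delta < best_delta:
                    best_delta, best_pair = delta, (i, j)
        if best_pair is not None:
            i, j = best_pair
            merged = groups[i] | groups[j]
            groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
            improved = True

    print(groups, cost(groups))         # e.g. two column groups for two sites

Even this toy version exhibits the trade-off the real cost model must balance: merging column groups saves inter-site transfers but drags more superfluous columns through each query.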