Advanced Topics in Computer Systems
Fall, 2005
Joe Hellerstein & Timothy Roscoe

POSTGRES Storage System

An extremely simple solution to the complex recovery problem.

History:

POSTGRES Overview Problem: What’s wrong with this picture?
           ________________
          |     DBMS       |
           ----------------
               /       \
              /         \
           -----      -----
            DB         Log
           -----      -----


Alternative: A no-overwrite storage system.

  1. Time travel comes for free
  2. instantaneous recovery
  3. no crash recovery code
Details Each tuple has a bunch of system fields: Updates work as follows:
  1. Xmax & Cmax set to updater’s XID
  2. new replacement tuple appended to DB with:
Deleters simply set Xmax & Cmax to their XID

The first version of a record is called the Anchor Point, which has a chain of associated delta record

"Hopefully", delta records fit on the same page as their anchor point.

CC, Timestamps, Archiving:

If we actually got timestamps at xact start, we’d get timestamp ordering CC.

An aside: timestamp and multi-version concurrency control. See Bernstein and Goodman's survey for a more complete presentation.

  1. TSCC
  2. MVCC

Now, back to Postgres. We will read that restart-oriented CC is usually a loser to blocking-based CC. So Postgres chose to do 2PL for concurrency, even though it has versions for recover. Also, get timestamp at commit time, not start time!

How to set Tmin and Tmax if you don’t have the commit time?

    1. no archive: old versions not needed
    2. light archive: old versions not to be accessed often
    3. heavy archive: old versions to be accessed regularly

Time Travel

Allows queries over a table as of some wall-clock time in the past.

Rewrite queries to handle the system fields in tuples

Reading a Record: get record, follow delta chain until you’ve got the appropriate version constructed.

Indexes all live on disk. They contain entries (anchor pointers) for all versions. The actual index blocks do experience overwrites.

Archiving

    1. write archive record(s)
    2. write new anchor record
    3. reclaim space of old anchor/deltas

"Performance Comparison" vs. WAL

Tech trends discussion Assumptions: NVRAM required to make POSTGRES compete on even this benchmark.

The Real POSTGRES Storage Manager Story

Ask Not What POSTGRES Can Do For You...