Advanced Topics in Computer Systems
Fall, 2005
Joe Hellerstein & Timothy Roscoe

POSTGRES Storage System
An extremely simple solution to the complex recovery problem.
History:
- POSTGRES storage system (Stonebraker) discussed in Berkeley Tech Reports
from 1985
- LFS (Ousterhout & Douglis) discussed in Berkeley Tech Reports
from 1988
- I have been assured that the two had little or nothing to do with
each other
- though Stonebraker is thanked in the "Beating the I/O Bottleneck"
TR from 1988
- Was it the Peet's Coffee?
POSTGRES Overview
- Follow-on to INGRES
- The prototype for "Object-Relational" databases, but had many other interesting goals as well!
- extensible OO types (and methods -- i.e. download code into the DBMS)
- extensible access methods
- "active" database (triggers and rules)
- a novel storage manager
- parallel query processing (XPRS: eXtended Postgres on RAID and Sprite)
- A second system
- The extensibility features worked, and were extremely influential
in research and industry
- The active DB stuff mostly worked, and had significant impact
- The XPRS parallel query optimization stuff was influential on later designs, though the code didn't go anywhere.
- The storage manager worked slowly, and the design has had little impact.
- "When considering the POSTGRES storage system, we were guided by a
missionary zeal to do something different"
- Still, interesting to discuss.
Problem:
- WAL recovery code is really complicated (witness ARIES)
- recovery code must be flawless
- failures can be arbitrary, so testing is hard to pull off
What’s wrong with this picture?
     ________________
    |      DBMS      |
     ----------------
        /        \
       /          \
    -----        -----
   | DB  |      | Log |
    -----        -----
Alternative: A no-overwrite storage system.
- Time travel comes for free
- instantaneous recovery
- no crash recovery code
Details
- Life of a xact:
- increment and grab current global XID
- do processing
- change status to committed in log (more on this below)
- FORCE data & log to stable storage (in that order!)
- There is a "log" of sorts:
- tail of log (oldest active xact to present) needs 2 bits per transaction
to record state (committed, aborted, in progress)
- body of log needs only 1 bit per xact (committed or aborted)
- at 1 xact per second, a year of transactions (~32M xacts) fits in ~4 MB of log space!
- Detail: if this is still too big, use a Bloom filter to represent
aborted xacts (lossy compression)
- with just a little NVRAM, the log essentially never needs forcing
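The "lossy compression" idea can be sketched as a standard Bloom filter over the aborted XIDs (a hypothetical in-memory model; the notes don't pin down the exact construction). A False answer means the xact is definitely not in the aborted set, i.e. it committed; a True answer may be a false positive, which is the price of the compression (worth asking what that means for correctness here).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over aborted transaction IDs (illustrative
    sketch, not the paper's exact construction)."""
    def __init__(self, nbits=1 << 20, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, xid):
        # derive nhashes independent bit positions from the XID
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{xid}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, xid):
        for p in self._positions(xid):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, xid):
        # False = definitely committed; True = aborted, or a false positive
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(xid))
```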
Each tuple has a bunch of system fields:
- OID: a database-wide unique ID across all time
- Xmin: XID of inserter
- Tmin: commit time of Xmin
- Cmin: command ID of inserter
- Xmax: XID of deleter (if any)
- Tmax: commit time of Xmax (if any)
- Cmax: command ID of deleter (if any)
- PTR: pointer to chain of deltas
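A minimal model of these per-tuple system fields as a Python dataclass (field names follow the list above; the types and defaults are my assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TupleHeader:
    """System fields on every POSTGRES tuple (illustrative model)."""
    oid: int                      # database-wide unique ID, stable across versions
    xmin: int                     # XID of inserting xact
    cmin: int = 0                 # command ID within the inserting xact
    tmin: Optional[float] = None  # commit time of xmin (filled in lazily)
    xmax: Optional[int] = None    # XID of deleting/updating xact, if any
    cmax: Optional[int] = None    # command ID of the deleter, if any
    tmax: Optional[float] = None  # commit time of xmax, if any
    ptr: Optional[int] = None     # pointer to the chain of delta records
```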
Updates work as follows:
- Xmax set to updater’s XID, Cmax to its command ID
- new replacement tuple appended to DB with:
- OID of old record
- Xmin = XID of updater, Cmin = its command ID
- in fact, store this as delta off original tuple
Deleters simply set Xmax to their XID and Cmax to their command ID
The first version of a record is called the Anchor Point; it has a chain
of associated delta records.
"Hopefully", delta records fit on the same page as their anchor point.
- Obvious optimization for read-intensive workloads?
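The update path can be sketched over an in-memory dict model of a tuple (the dict layout and the `update_tuple` name are hypothetical; the field names follow the notes above):

```python
def update_tuple(anchor, new_values, xid, cmd_id):
    """Sketch of a no-overwrite update: mark the old version dead and
    append a delta; the old version's data is never overwritten."""
    anchor["xmax"], anchor["cmax"] = xid, cmd_id   # old version now dead
    delta = {
        "oid": anchor["oid"],          # same logical record, same OID
        "xmin": xid, "cmin": cmd_id,   # the delta belongs to the updater
        "xmax": None, "cmax": None,
        "changed": dict(new_values),   # store only the changed fields
    }
    anchor.setdefault("deltas", []).append(delta)  # PTR chain, modeled as a list
    return delta
```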
CC, Timestamps, Archiving:
If we actually got timestamps at xact start, we’d get timestamp ordering
CC.
An aside: timestamp and multi-version concurrency control. See Bernstein and Goodman's survey for a more complete presentation.
- TSCC
- Assign each transaction Ti a timestamp ts(Ti) at start time.
- Each tuple t is associated with a Read Timestamp R-ts(t) and a Write TS W-ts(t)
- To read a tuple t, xact T1 must ensure that W-ts(t) ≤ ts(T1); else abort T1 (why?). Then read the tuple and set R-ts(t)=MAX(R-ts(t), ts(T1)). Hence invariant: R-ts is the latest reader of this version.
- To write a tuple t, xact T2 must ensure that R-ts(t) ≤ ts(T2); else abort T2 (why?). Then update the tuple and set W-ts(t)=MAX(W-ts(t), ts(T2))
- A bummer: only one equivalent serial schedule, based on xact start times! Severely diminished concurrency opportunities (when does a transaction get to commit?). Can ameliorate somewhat by buffering transactions' intentions, picking their admission times, and then really running them. Is that practical?
- Another bummer: read-only transactions require updates to timestamps, and hence must do synchronous I/O before commit! As a rule, it's bad to turn reads into writes.
- MVCC
- As in TSCC, assign timestamps to transactions and tuples.
- To write a tuple t, xact T1 creates a new version with W-ts(t)=ts(T1)
- To read a tuple t, xact T2 chooses the version of t with the largest W-ts(t) s.t. W-ts(t) ≤ ts(T2). Set R-ts(t)=MAX(R-ts(t), ts(T2)). Hence invariant: R-ts is the latest reader of this version
- Careful: what if a writer's TS falls between the W-ts and R-ts of a version? Then the reader should have seen this writer's version but didn't! Writer must abort and restart. (Hence the invariant on latest R-ts).
- Note that read-only xacts never restart, which is nice.
- Similar concurrency issues due to timestamping.
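Both sets of rules above can be sketched in a few lines over in-memory stand-ins for tuples and version lists (the names and data layout are mine, not from any real system):

```python
class Abort(Exception):
    """A timestamp rule forced a restart."""

# --- TSCC: single-version timestamp ordering ---------------------------
def tscc_read(tup, ts):
    """tup = {'val': ..., 'rts': ..., 'wts': ...}"""
    if tup["wts"] > ts:                 # a later xact already wrote
        raise Abort("reader too late")
    tup["rts"] = max(tup["rts"], ts)    # invariant: R-ts = latest reader
    return tup["val"]

def tscc_write(tup, val, ts):
    if tup["rts"] > ts:                 # a later xact already read
        raise Abort("writer too late")
    tup["val"] = val
    tup["wts"] = max(tup["wts"], ts)

# --- MVCC: keep all versions; readers never restart --------------------
def _visible(versions, ts):
    """The version with the largest W-ts that is <= ts."""
    cands = [v for v in versions if v["wts"] <= ts]
    return max(cands, key=lambda v: v["wts"]) if cands else None

def mvcc_read(versions, ts):
    v = _visible(versions, ts)
    v["rts"] = max(v["rts"], ts)        # invariant: R-ts = latest reader
    return v["val"]

def mvcc_write(versions, val, ts):
    v = _visible(versions, ts)
    if v is not None and v["rts"] > ts:
        # a reader with a later TS already saw v, but should have seen us
        raise Abort("writer's TS falls between W-ts and R-ts")
    versions.append({"val": val, "wts": ts, "rts": ts})
```

Note the asymmetry: in TSCC a too-late reader aborts, while in MVCC a reader always finds a visible version and only writers can be forced to restart.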
Now, back to POSTGRES. We will read that restart-oriented CC is usually a loser compared to blocking-based CC, so POSTGRES chose 2PL for concurrency even though it has versions for recovery. Also, it gets its timestamp at commit time, not start time!
How to set Tmin and Tmax if you don’t have the commit time?
- XID is taken as an oid from the TIME relation
- at commit:
- update your appropriate TIME tuple with the wall-clock time.
- then force data pages to stable storage, change status to committed
in tail of log
- Why are we not worried about aborting, as in MVCC?
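The commit sequence above, with in-memory dicts standing in for the disk-resident TIME relation, data pages, and log tail (all names here are stand-ins, not real POSTGRES code):

```python
import time

def commit(xid, time_relation, dirty_pages, log_tail):
    """Sketch of POSTGRES commit; the ordering is the whole point."""
    # 1. record the wall-clock commit time in the xact's TIME tuple
    time_relation[xid] = time.time()
    # 2. FORCE data (including the TIME page) to stable storage...
    for page in dirty_pages:
        page["stable"] = True            # stand-in for a disk force
    # 3. ...and only then flip the status bits in the log tail
    log_tail[xid] = "committed"
```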
- 3 levels of archiving
- no archive: old versions not needed
- light archive: old versions not to be accessed often
- heavy archive: old versions to be accessed regularly
- on first access to a tuple from a "heavy archive" relation, you update
the OIDs in Tmin and Tmax with values from the TIME relation
Time Travel
Allows queries over a table as of some wall-clock time in the past.
Rewrite queries to handle the system fields in tuples
Reading a Record: get record, follow delta chain until you’ve got the
appropriate version constructed.
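The version-selection logic for such a read can be sketched over a hypothetical in-memory anchor/delta model, where each delta's Tmin is the commit time of the xact that created it:

```python
def read_as_of(anchor, t):
    """Time-travel read (sketch): reconstruct the version of a record
    valid at wall-clock time t by walking the anchor's delta chain.
    anchor = {'tmin': ..., 'fields': {...}, 'deltas': [...]},
    each delta = {'tmin': commit_time, 'changed': {...}}."""
    if t < anchor["tmin"]:
        return None                      # record didn't exist yet
    fields = dict(anchor["fields"])      # start from the anchor point
    for delta in anchor.get("deltas", []):
        if delta["tmin"] is None or delta["tmin"] > t:
            break                        # delta committed after t: stop
        fields.update(delta["changed"])  # apply the diff
    return fields
```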
Indexes all live on disk. They contain entries (anchor pointers) for all versions. The actual index blocks do experience overwrites.
Archiving
- historical data can be forced to archive via the vacuum cleaner
- write archive record(s)
- write new anchor record
- reclaim space of old anchor/deltas
- crash during vacuum?
- indexes may lose archive records, though Seq. Scan will always work
- duplicate records may be forced to archive or may co-exist in archive and in live DB: OK because POSTGRES doesn’t
do multisets
- Can build R-trees over lifetime intervals of data on archive
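The vacuuming sequence can be sketched as follows (in-memory stand-ins; the crash-safety point is the ordering: the archive is written before live space is reclaimed, so a crash leaves duplicates at worst, never lost versions):

```python
def vacuum(versions, archive, cutoff):
    """Sketch of the vacuum cleaner over an in-memory version list.
    Each version dict has 'tmax': commit time of its deleter, or None
    if the version is still live."""
    dead = [v for v in versions
            if v["tmax"] is not None and v["tmax"] < cutoff]
    live = [v for v in versions if v not in dead]
    archive.extend(dead)       # 1. write archive record(s) first
    versions[:] = live         # 2./3. write new anchor, reclaim old space
    return versions
```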
"Performance Comparison" vs. WAL
Tech trends discussion
- CPU speed scales with Moore's Law
- Disk density (and hence $/MB) scales with Moore's Law
- Disk "speed" (what does this mean?) is improving slowly.
- Hence you'll always be able to buy CPUs to keep up with your disk speeds, and you'll never be CPU-bound.
- What do you make of this analysis?
Assumptions:
- records fit on a single page
- deltas live on the same page as anchors
- single-record xacts
- update-only workload (?!)
NVRAM required to make POSTGRES compete on even this benchmark.
The Real POSTGRES Storage Manager Story
- Tuple differencing wasn't implemented
- R-trees over archive not used
- Stonebraker commercialized POSTGRES at Illustra
- Illustra never claimed to be a TP competitor
- Informix bought Illustra, and replaced the no-overwrite storage manager
with Informix’s WAL
- IBM bought Informix.
- PostgreSQL now uses a WAL recovery scheme, and ditched time travel. Rationalized vacuum accordingly. However, maintained version metadata and implemented MVCC!
Ask Not What POSTGRES Can Do For You...
- If you made a handful of obvious design changes to POSTGRES, would
it be viable today?
- Could certainly be made better
- CS262 project there
- What if you throw in:
- a little NVRAM
- flush to remote memory in a cluster
- variable-sized storage units
- limited time travel
- The Telegraph Storage manager was marching down this path...