CS262A: Advanced Topics in Computer Systems
Joe Hellerstein and Timothy Roscoe
UC Berkeley, Fall 2005

Lecture 7: ACPM
Lots of good, sensible ideas in this paper. A nice simple design as well. (Mothy may have more to say there.)

An attractive CS262 paper: the OS community learning lessons from the DB experiences – the inability to influence page replacement policy in the VM. Not all apps are well served by a single policy; it should be extensible/configurable.

Then again, maybe not! A dysfunctional interaction between the OS and DB communities? Many misconceptions about DBMSs here! Exposes deeper questions!

Exploring this kind of tension is just what we're supposed to do in this class.
ACPM Background
First, let's revisit some of V++'s basic storage units and metaphors:
- Disk block: a physical unit of transfer.
- Page frame: a block-sized unit of physical memory (RAM).
- Segment: a collection of page frames providing a run of virtual addresses (not necessarily contiguous in RAM). This is the unit of extensible page management.
- Bound region: an association between two segments – essentially a reference to a (page-aligned) sub-range of an underlying segment. Provides a level of indirection to support, e.g., copy-on-write.
- File: the traditional filesystem abstraction of a named run of blocks on disk. Can serve as backing store for a segment.
- Cached file: a segment with a block read/write interface.
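To make the segment/bound-region indirection concrete, here is a toy Python model (all names are hypothetical illustrations, not the V++ API): reads through a bound region fall through to the base segment until a write creates a private copy – the mechanism behind copy-on-write.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    # A run of virtual pages backed by (possibly non-contiguous) frames.
    frames: dict = field(default_factory=dict)  # page number -> page contents

@dataclass
class BoundRegion:
    # A (page-aligned) sub-range of an underlying segment. The extra
    # level of indirection lets reads fall through to the base segment
    # until a write forces a private copy (copy-on-write).
    base: Segment
    start_page: int
    npages: int
    private: dict = field(default_factory=dict)  # pages copied on write

    def read(self, page):
        assert 0 <= page < self.npages
        if page in self.private:
            return self.private[page]
        return self.base.frames[self.start_page + page]

    def write(self, page, data):
        assert 0 <= page < self.npages
        self.private[page] = data  # the base segment is left untouched
```

For example, two bound regions over the same base segment share pages until one of them writes, at which point only the writer sees its private copy.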
V++'s ACPM is targeted at exposing the following main issues to the application level:
- The amount of physical memory available to the application. Important if dynamic; presupposes significant multiprogramming.
- Spatiotemporal control over physical memory: being able to choose which pages are in physical memory. Relevant for both performance and correctness in a variety of settings, and also relevant to NUMA multiprocessor settings.
- Temporal control over I/O transitions into and out of memory.
DBMS Buffer Management Background
An aside: how do DBMSs manage their memory, and what do they stick into that memory?

Two or three pools of heap memory:
- Buffer pool: at boot time, allocate a buffer pool in virtual memory to serve as a cache for database blocks. This is usually sized to a large fraction of the physical memory on the machine.
- Query workspace: space for memory-intensive query processing operations like sorting and hashing (e.g., for joins).
- Miscellaneous heap space: e.g., the query parser needs some heap space, the query optimizer consumes a nontrivial amount of memory for dynamic programming, the DBMS lock table sits on the heap, etc. (As an aside, most DBMSs implement a region-based memory manager to allow for efficient allocation and leak-free deallocation of such heap space. A discussion of its own!)
DBMSs make heavy use of virtual memory, but a well-administered DBMS never causes VM to page. Why?
- DBMSs are expensive servers, run on a dedicated machine (since the dawn of time!), so there is no multiprogramming of note.
- Buffer pool and query workspace sizes are set to fit in physical memory. Combined with the first point, the DBMS basically never pages.
ACPM is most relevant to the DBMS buffer pool. Here's how buffer pools work in traditional OS contexts:
- Using the proper OS I/O interface to avoid double-buffering with the file cache (what are these interfaces in your favorite OS?), the DBMS requests individual DB blocks by address to be copied into VM, and written back to disk.
- The DBMS has its own app-level semantics/implementation for pin counts, dirty pages, reference bits, etc.
- The DBMS wants to control the order of disk writes (e.g., for write-ahead logging) for correctness – not just performance. Any virtualization of physical write confirmations gives DBMS designers fits, and causes them to work around the interface rather than with it.
- The DBMS has intimate knowledge of disk access patterns. This leads to a desire for workload-sensitive buffer replacement policies and block prefetching, both of which are based on patterns other than recency and physical contiguity. (See Chou/DeWitt for examples.)
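The app-level bookkeeping mentioned above (pin counts, dirty pages) can be sketched as a miniature buffer pool – a hypothetical illustration, not any real DBMS's code: pin counts keep pages resident, dirty bits defer write-back, and eviction scans frames in LRU order while skipping pinned ones.

```python
from collections import OrderedDict

class BufferPool:
    """Toy app-level DBMS buffer pool (hypothetical API): pin counts keep
    pages resident, dirty bits defer writes, eviction is LRU over the
    unpinned frames."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # block id -> bytes (stand-in for disk)
        self.frames = OrderedDict()   # block id -> [data, pin_count, dirty]

    def pin(self, bid):
        if bid not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[bid] = [self.disk[bid], 0, False]
        self.frames.move_to_end(bid)  # record recency for LRU
        self.frames[bid][1] += 1
        return self.frames[bid][0]

    def unpin(self, bid, dirty=False):
        frame = self.frames[bid]
        frame[1] -= 1
        frame[2] = frame[2] or dirty

    def _evict(self):
        # Scan in LRU order; a real DBMS would force the log first (WAL)
        # before writing back a dirty data page.
        for bid, (data, pins, dirty) in self.frames.items():
            if pins == 0:
                if dirty:
                    self.disk[bid] = data
                del self.frames[bid]
                return
        raise RuntimeError("all frames pinned")
```

Note that eviction never touches a pinned frame – that is exactly the residency guarantee the DBMS cannot get from a transparent VM.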
Returning to ACPM…

Could we implement the DBMS buffer pool using V++ VM? You bet!
- ACPM gives us app-level control of page replacement, which is more or less what we want – i.e., issue #2 of H&C's three issues.
- However, all accesses in a DBMS are page-oriented reads/writes, so this isn't the VM interface, it's just the cached-file interface.
- ACPM itself doesn't seem to guarantee that disk write notifications are real, though this is issue #3. V++ seems to have the right interfaces for that in its cached-file implementation.
But DB misconceptions are rife in the paper:
- The DB buffer pool is shared by all running queries, and the DBMS can arbitrate the residency of pages across the queries directly at application level – no OS support is needed. Citation [17] and follow-on work never lamented the lack of OS support; they just addressed the policy question of how to allocate buffers across queries. So this does not motivate the notion of dynamically changing the allocation of pages across processes (H&C's issue #1).
- Similarly, the DBMS can control which DB pages are "pinned" in the (local) buffer pool – e.g., index roots, etc. So this does not motivate OS-level interfaces for control over physical memory placement (H&C's issue #2).
- They assume that database indexes are mapped entirely into memory – i.e., that they're main-memory data structures (a la red-black trees) rather than disk-oriented data structures (a la B-trees). This appears to be honest naivete. But it is madness to expect a main-memory data structure in VM to have as good paging behavior as a disk-oriented data structure. More fundamentally, this underscores one of the paper's own key lessons about API design: an API that is "transparent except for performance" is a lousy API for performance-oriented applications! (Remember that!) So VM would never be used for "out-of-core" data structures and computations – it's a poor API for such things.
So, how much does ACPM help the DBMS?
- Goal #1 is irrelevant because multiprogramming and servers don't go together.
- Goal #2 is fine, but we're only using the cached-file interface, not the VM interface, so this isn't much more helpful than just direct I/O.
- Goal #3 is critical for correct write-ahead logging, and useful for performance with respect to read-ahead and write-batching. But this is an I/O subsystem issue, not a VM issue per se.

Conclusion: there's little here that makes a DBMS writer's life much easier. I suspect the same complaint applies to scientific and other applications.
The basic issue is one of metaphor and API:
- For big-data computations with complex I/O patterns, VM is almost by definition "transparent except for performance" regardless of the degree of control. The data structures and algorithms get written for paginated memory models rather than RAM pointer models. There is an algorithms literature around this ("out-of-core algorithms", "indexability theory", etc.). Achieving this I/O pattern while pretending to live in a pointer model is near-impossible. So VM is the wrong metaphor; fixing its API is misunderstanding the issue.
- One key motivation for the work – multiprogramming – is undercut by the applications that might want control. Big-data computations (scientific apps, DBMS decision-support workloads) tend to be done as batch jobs on dedicated hardware with minimal multiprogramming (the only concurrent tasks are the usual system daemons, etc.). Multi-client parallel little-data jobs (transaction processing, information retrieval, etc.) tend to use servers, which again run on dedicated HW with minimal multiprogramming.
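As a concrete contrast with the pointer model, here is a minimal out-of-core (external) merge sort – a sketch, where the hypothetical `run_size` parameter stands in for the available buffer pages. Its structure (sorted runs spilled to disk, then a k-way merge) is exactly what a paginated memory model encourages and a transparent VM obscures.

```python
import heapq
import pickle
import tempfile

def external_sort(items, run_size):
    """Sort a stream that may not fit in memory: accumulate runs of at
    most run_size items, sort and spill each run to a temp file, then
    k-way merge the runs."""
    runs, buf = [], []
    for x in items:
        buf.append(x)
        if len(buf) == run_size:
            runs.append(_spill(sorted(buf)))
            buf = []
    if buf:
        runs.append(_spill(sorted(buf)))
    # heapq.merge streams the runs, keeping only one item per run in memory.
    return list(heapq.merge(*(_replay(f) for f in runs)))

def _spill(run):
    # Write a sorted run to a temporary file and rewind it for reading.
    f = tempfile.TemporaryFile()
    pickle.dump(run, f)
    f.seek(0)
    return f

def _replay(f):
    yield from pickle.load(f)
```

The I/O pattern here – long sequential writes of runs, then a merge that touches each run sequentially – is easy to state in a block-oriented model and nearly impossible to coax out of a paging VM behind a pointer interface.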
There are lessons here about trying to please people without understanding their pain. Or, more likely, hammers in search of nails.
5 Minute Rule
Why read this paper?
- To expose you to a style of trend-driven systems thinking – a way to identify research opportunities!
- To sensitize you to choosing relevant, portable/scalable metrics of performance.
- To expose you to the disk access issues that matter to DB applications – and to discuss the way they affect buffer replacement and prefetching.
Trend-Driven Problem Identification
Jim Gray and Dave Patterson seem to push these the hardest.
- Idea: identify the pain point that is coming 5-10 years down the line, and start doing the research on it now. Remember Gretzky's Rule: "Skate to where the puck is going!" (Though this is apparently apocryphal and also bad hockey advice: http://www.fastcompany.com/magazine/36/cdu.html.)
- Doing this well is rather an art, even though hearing about it can be dry.
- This particular paper's analysis doesn't identify a research problem per se, though it dispenses with a number of proposed problems people have posed over the years.
Performance Metrics
- Economic cost (as in $$) is a useful proxy for issues in computing that are relevant to (capitalist) society. It used to be a proxy for performance or resource utilization. Now?
- Ratios of metrics can be more telling than individual metrics. When ratios remain the same, design tradeoffs tend to stay the same. When ratios shift substantially, you have a research opportunity.
- "Feature selection" – identifying the right issues to measure – is a critical challenge and a place for creativity in the realm of trend-driven research. Again, economic cost is often a good guide in our society, though this opens up a philosophical can of worms.
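The five-minute rule itself is such a ratio argument: it computes the break-even residency interval at which keeping a page in RAM costs the same as the disk arm capacity needed to re-fetch it on demand. A sketch of the arithmetic, with illustrative (not authoritative) numbers:

```python
def break_even_seconds(pages_per_mb, disk_accesses_per_sec,
                       price_per_disk, price_per_mb_ram):
    # Break-even reference interval: a technology ratio (how many page
    # re-fetches a disk arm delivers per MB of RAM's worth of pages)
    # times an economic ratio (disk price over RAM price).
    return (pages_per_mb / disk_accesses_per_sec) * \
           (price_per_disk / price_per_mb_ram)

# Illustrative mid-1990s-style numbers (assumptions, not quoted from the
# paper): 128 8KB pages per MB, 64 random accesses/sec per drive,
# $2000 per drive, $15 per MB of RAM.
print(round(break_even_seconds(128, 64, 2000, 15)))  # → 267
```

That is roughly 4.5 minutes – "five minutes" – and the point of the methodology is to watch which of the two ratios shifts over time, since a big shift changes the rule and creates a research opportunity.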
Access patterns in a DBMS (detail courtesy of Chou/DeWitt):
- Straight Sequential: LocSet size = 1. Replace that page as soon as needed.
- Clustered Sequential: LocSet size = (# of tuples in largest cluster) / (# of tuples per page). FIFO or LRU replacement works.
- Looping Sequential: LocSet size = size of relation ("file"). MRU is best.
- Independent Random: odds of revisit are low, so the LocSet size is either 1 or the magic number b from Yao's formula.
- Clustered Random: just like Clustered Sequential, but tuples are not packed onto pages, so it's just the # of tuples in the largest cluster. LRU or FIFO replacement.
- Straight Hierarchical, Hierarchical/Straight Sequential: like Straight Sequential.
- Hierarchical/Clustered Sequential: like Clustered Sequential, but replace tuples with (key, ptr) pairs.
- Looping Hierarchical: at each level h of an index, you have random access among pages. Use Yao's formula to figure out how many pages you'll access at each level in k lookups.
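The "magic number b" above comes from Yao's formula: the expected number of distinct pages touched when k records are drawn at random from a relation. A sketch, under the assumption that the n records are spread uniformly, n/m per page:

```python
def yao(n, m, k):
    # Expected number of distinct pages referenced when k records are
    # selected at random, without replacement, from n records stored
    # uniformly on m pages (n/m records per page).
    per_page = n / m
    miss_prob = 1.0  # probability a given page holds none of the k records
    for i in range(k):
        miss_prob *= (n - per_page - i) / (n - i)
    return m * (1.0 - miss_prob)
```

For example, with 4 records on 2 pages, picking 2 records at random touches 5/3 pages in expectation, and picking all 4 touches both pages. For Looping Hierarchical access, apply the formula level by level with that level's page count.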
Fully-specified models a la DBMIN? Simple stochastic models a la LRU, LRU-k? Something in between ("gray-box systems")?