CS262A: Advanced Topics in Computer Systems
Joe Hellerstein and Timothy Roscoe
UC Berkeley, Fall 2005

Lecture 7: ACPM
Lots of good, sensible ideas in this paper. A nice simple design as well. (Mothy may have more to say there.)

An attractive CS262 paper: the OS community learning lessons from the DB experiences – the inability to influence page replacement policy in the VM. Not all apps are well served by a single policy; it should be extensible/configurable.

Then again, maybe not! A dysfunctional interaction between the OS and DB communities? Many misconceptions about DBMSs here! Exposes deeper questions!

Exploring this kind of tension is just what we're supposed to do in this class.
ACPM Background
First, let's revisit some of V++'s basic storage units and metaphors:
- Disk block: a physical unit of transfer.
- Page frame: a block-sized unit of physical memory (RAM).
- Segment: a collection of page frames providing a run of virtual addresses (not necessarily contiguous in RAM). This is the unit of extensible page management.
- Bound region: an association between two segments – essentially a reference to a (page-aligned) sub-range of an underlying segment. Provides a level of indirection to support, e.g., copy-on-write.
- File: the traditional filesystem abstraction of a named run of blocks on disk. Can serve as backing store for a segment.
- Cached file: a segment with a block read/write interface.
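To make the segment/bound-region indirection concrete, here is a toy Python model (all names are hypothetical illustrations, not the V++ API): reads through a bound region fall through to the base segment until a write creates a private copy – the mechanism behind copy-on-write.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    # A run of virtual pages backed by (possibly non-contiguous) frames.
    frames: dict = field(default_factory=dict)  # page number -> page contents

@dataclass
class BoundRegion:
    # A (page-aligned) sub-range of an underlying segment. The extra
    # level of indirection lets reads fall through to the base segment
    # until a write forces a private copy (copy-on-write).
    base: Segment
    start_page: int
    npages: int
    private: dict = field(default_factory=dict)  # pages copied on write

    def read(self, page):
        assert 0 <= page < self.npages
        if page in self.private:
            return self.private[page]
        return self.base.frames[self.start_page + page]

    def write(self, page, data):
        assert 0 <= page < self.npages
        self.private[page] = data  # the base segment is left untouched
```

For example, two bound regions over the same base segment share pages until one of them writes, at which point only the writer sees its private copy.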
V++'s ACPM is targeted at exposing the following main issues to the application level:
- The amount of physical memory available to the application. Important if dynamic; presupposes significant multiprogramming.
- Spatiotemporal control over physical memory: being able to choose which pages are in physical memory. Relevant for both performance and correctness in a variety of settings, and also relevant to NUMA multiprocessor settings.
- Temporal control over I/O transitions into and out of memory.
DBMS Buffer Management Background
An aside: how do DBMSs manage their memory, and what do they stick into that memory?

Two or three pools of heap memory:
- Buffer pool: at boot time, allocate a buffer pool in virtual memory to serve as a cache for database blocks. This is usually sized to a large fraction of the physical memory on the machine.
- Query workspace: space for memory-intensive query processing operations like sorting and hashing (e.g., for joins).
- Miscellaneous heap space: e.g., the query parser needs some heap space, the query optimizer consumes a nontrivial amount of memory for dynamic programming, the DBMS lock table sits on the heap, etc. (As an aside, most DBMSs implement a region-based memory manager to allow for efficient allocation and leak-free deallocation of such heap space. A discussion of its own!)
DBMSs make heavy use of virtual memory, but a well-administered DBMS never causes VM to page. Why?
- DBMSs are expensive servers, run on a dedicated machine (since the dawn of time!), so there is no multiprogramming of note.
- Buffer pool and query workspace sizes are set to fit in physical memory. Combined with the first point, the DBMS basically never pages.
ACPM is most relevant to the DBMS buffer pool. Here's how buffer pools work in traditional OS contexts:
- Using the proper OS I/O interface to avoid double-buffering with the file cache (what are these interfaces in your favorite OS?), the DBMS requests individual DB blocks by address to be copied into VM, and written back to disk.
- The DBMS has its own app-level semantics/implementation for pin counts, dirty pages, reference bits, etc.
- The DBMS wants to control the order of disk writes (e.g., for write-ahead logging) for correctness – not just performance. Any virtualization of physical write confirmations gives DBMS designers fits, and causes them to work around the interface rather than with it.
- The DBMS has intimate knowledge of disk access patterns. This leads to a desire for workload-sensitive buffer replacement policies and block prefetching, both of which are based on patterns other than recency and physical contiguity. (See Chou/DeWitt for examples.)
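The app-level bookkeeping mentioned above (pin counts, dirty pages) can be sketched as a miniature buffer pool – a hypothetical illustration, not any real DBMS's code: pin counts keep pages resident, dirty bits defer write-back, and eviction scans frames in LRU order while skipping pinned ones.

```python
from collections import OrderedDict

class BufferPool:
    """Toy app-level DBMS buffer pool (hypothetical API): pin counts keep
    pages resident, dirty bits defer writes, eviction is LRU over the
    unpinned frames."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # block id -> bytes (stand-in for disk)
        self.frames = OrderedDict()   # block id -> [data, pin_count, dirty]

    def pin(self, bid):
        if bid not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[bid] = [self.disk[bid], 0, False]
        self.frames.move_to_end(bid)  # record recency for LRU
        self.frames[bid][1] += 1
        return self.frames[bid][0]

    def unpin(self, bid, dirty=False):
        frame = self.frames[bid]
        frame[1] -= 1
        frame[2] = frame[2] or dirty

    def _evict(self):
        # Scan in LRU order; a real DBMS would force the log first (WAL)
        # before writing back a dirty data page.
        for bid, (data, pins, dirty) in self.frames.items():
            if pins == 0:
                if dirty:
                    self.disk[bid] = data
                del self.frames[bid]
                return
        raise RuntimeError("all frames pinned")
```

Note that eviction never touches a pinned frame – that is exactly the residency guarantee the DBMS cannot get from a transparent VM.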
Returning to ACPM…

Could we implement the DBMS buffer pool using V++ VM? You bet!
- ACPM gives us app-level control of page replacement, which is more or less what we want – i.e., issue #2 of H&C's three issues.
- However, all accesses in a DBMS are page-oriented reads/writes, so this isn't the VM interface, it's just the cached-file interface.
- ACPM itself doesn't seem to guarantee that disk write notifications are real, though this is issue #3. V++ seems to have the right interfaces for that in its cached-file implementation.
But DB misconceptions are rife in the paper:
- The DB buffer pool is shared by all running queries, and the DBMS can arbitrate the residency of pages across the queries directly at application level – no OS support is needed. Citation [17] and follow-on work never lamented the lack of OS support; they just addressed the policy question of how to allocate buffers across queries. So this does not motivate the notion of dynamically changing the allocation of pages across processes (H&C's issue #1).
- Similarly, the DBMS can control which DB pages are "pinned" in the (local) buffer pool – e.g., index roots, etc. So this does not motivate OS-level interfaces for control over physical memory placement (H&C's issue #2).
- They assume that database indexes are mapped entirely into memory – i.e., that they're main-memory data structures (a la red-black trees) rather than disk-oriented data structures (a la B-trees). This appears to be honest naivete. But it is madness to expect a main-memory data structure in VM to have as good paging behavior as a disk-oriented data structure. More fundamentally, this underscores one of the paper's own key lessons about API design: an API that is "transparent except for performance" is a lousy API for performance-oriented applications! (Remember that!) So VM would never be used for "out-of-core" data structures and computations – it's a poor API for such things.
So, how much does ACPM help the DBMS?
- Goal #1 is irrelevant because multiprogramming and servers don't go together.
- Goal #2 is fine, but we're only using the cached-file interface, not the VM interface, so this isn't much more helpful than just direct I/O.
- Goal #3 is critical for correct write-ahead logging, and useful for performance with respect to read-ahead and write-batching. But this is an I/O subsystem issue, not a VM issue per se.

Conclusion: there's little here that makes a DBMS writer's life much easier. I suspect the same complaint applies to scientific and other applications.
The basic issue is one of metaphor and API:
- For big-data computations with complex I/O patterns, VM is almost by definition "transparent except for performance" regardless of the degree of control. The data structures and algorithms get written for paginated memory models rather than RAM pointer models. There is an algorithms literature around this ("out-of-core algorithms", "indexability theory", etc.). Achieving this I/O pattern while pretending to live in a pointer model is near-impossible. So VM is the wrong metaphor; fixing its API is misunderstanding the issue.
- One key motivation for the work – multiprogramming – is undercut by the applications that might want control. Big-data computations (scientific apps, DBMS decision-support workloads) tend to be done as batch jobs on dedicated hardware with minimal multiprogramming (the only concurrent tasks are the usual system daemons, etc.). Multi-client parallel little-data jobs (transaction processing, information retrieval, etc.) tend to use servers, which again run on dedicated HW with minimal multiprogramming.
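As a concrete contrast with the pointer model, here is a minimal out-of-core (external) merge sort – a sketch, where the hypothetical `run_size` parameter stands in for the available buffer pages. Its structure (sorted runs spilled to disk, then a k-way merge) is exactly what a paginated memory model encourages and a transparent VM obscures.

```python
import heapq
import pickle
import tempfile

def external_sort(items, run_size):
    """Sort a stream that may not fit in memory: accumulate runs of at
    most run_size items, sort and spill each run to a temp file, then
    k-way merge the runs."""
    runs, buf = [], []
    for x in items:
        buf.append(x)
        if len(buf) == run_size:
            runs.append(_spill(sorted(buf)))
            buf = []
    if buf:
        runs.append(_spill(sorted(buf)))
    # heapq.merge streams the runs, keeping only one item per run in memory.
    return list(heapq.merge(*(_replay(f) for f in runs)))

def _spill(run):
    # Write a sorted run to a temporary file and rewind it for reading.
    f = tempfile.TemporaryFile()
    pickle.dump(run, f)
    f.seek(0)
    return f

def _replay(f):
    yield from pickle.load(f)
```

The I/O pattern here – long sequential writes of runs, then a merge that touches each run sequentially – is easy to state in a block-oriented model and nearly impossible to coax out of a paging VM behind a pointer interface.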
There are lessons here about trying to please people without understanding their pain. Or, more likely, hammers in search of nails.
5 Minute Rule
Why read this paper?
- To expose you to a style of trend-driven systems thinking – a way to identify research opportunities!
- To sensitize you to choosing relevant, portable/scalable metrics of performance.
- To expose you to the disk access issues that matter to DB applications – and to discuss the way they affect buffer replacement and prefetching.
Trend-Driven Problem Identification
Jim Gray and Dave Patterson seem to push these the hardest.
- Idea: identify the pain point that is coming 5-10 years down the line, and start doing the research on it now. Remember Gretzky's Rule: "Skate to where the puck is going!" (Though this is apparently apocryphal and also bad hockey advice: http://www.fastcompany.com/magazine/36/cdu.html.)
- Doing this well is rather an art, even though hearing about it can be dry.
- This particular paper's analysis doesn't identify a research problem per se, though it dispenses with a number of proposed problems people have posed over the years.
Performance Metrics
- Economic cost (as in $$) is a useful proxy for issues in computing that are relevant to (capitalist) society. It used to be a proxy for performance or resource utilization. Now?
- Ratios of metrics can be more telling than individual metrics. When ratios remain the same, design tradeoffs tend to stay the same. When ratios shift substantially, you have a research opportunity.
- "Feature selection" – identifying the right issues to measure – is a critical challenge and a place for creativity in the realm of trend-driven research. Again, economic cost is often a good guide in our society, though this opens up a philosophical can of worms.
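The five-minute rule itself is such a ratio argument: it computes the break-even residency interval at which keeping a page in RAM costs the same as the disk arm capacity needed to re-fetch it on demand. A sketch of the arithmetic, with illustrative (not authoritative) numbers:

```python
def break_even_seconds(pages_per_mb, disk_accesses_per_sec,
                       price_per_disk, price_per_mb_ram):
    # Break-even reference interval: a technology ratio (how many page
    # re-fetches a disk arm delivers per MB of RAM's worth of pages)
    # times an economic ratio (disk price over RAM price).
    return (pages_per_mb / disk_accesses_per_sec) * \
           (price_per_disk / price_per_mb_ram)

# Illustrative mid-1990s-style numbers (assumptions, not quoted from the
# paper): 128 8KB pages per MB, 64 random accesses/sec per drive,
# $2000 per drive, $15 per MB of RAM.
print(round(break_even_seconds(128, 64, 2000, 15)))  # → 267
```

That is roughly 4.5 minutes – "five minutes" – and the point of the methodology is to watch which of the two ratios shifts over time, since a big shift changes the rule and creates a research opportunity.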
Access patterns in a DBMS (detail courtesy of Chou/DeWitt):
- Straight Sequential: LocSet size = 1. Replace that page as soon as needed.
- Clustered Sequential: LocSet size = (# of tuples in largest cluster) / (# of tuples per page). FIFO or LRU replacement works.
- Looping Sequential: LocSet size = size of relation ("file"). MRU is best.
- Independent Random: odds of revisit are low, so the LocSet size is either 1 or the magic number b from Yao's formula.
- Clustered Random: just like Clustered Sequential, but tuples are not packed onto pages, so it's just the # of tuples in the largest cluster. LRU or FIFO replacement.
- Straight Hierarchical, Hierarchical/Straight Sequential: like Straight Sequential.
- Hierarchical/Clustered Sequential: like Clustered Sequential, but replace tuples with (key, ptr) pairs.
- Looping Hierarchical: at each level h of an index, you have random access among pages. Use Yao's formula to figure out how many pages you'll access at each level in k lookups.
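The "magic number b" above comes from Yao's formula: the expected number of distinct pages touched when k records are drawn at random from a relation. A sketch, under the assumption that the n records are spread uniformly, n/m per page:

```python
def yao(n, m, k):
    # Expected number of distinct pages referenced when k records are
    # selected at random, without replacement, from n records stored
    # uniformly on m pages (n/m records per page).
    per_page = n / m
    miss_prob = 1.0  # probability a given page holds none of the k records
    for i in range(k):
        miss_prob *= (n - per_page - i) / (n - i)
    return m * (1.0 - miss_prob)
```

For example, with 4 records on 2 pages, picking 2 records at random touches 5/3 pages in expectation, and picking all 4 touches both pages. For Looping Hierarchical access, apply the formula level by level with that level's page count.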
Fully-specified models a la DBMIN? Simple stochastic models a la LRU, LRU-k? Something in between ("gray-box systems")?