Some of the old hierarchical and network DBs had pretty clever systems for handling reliability. BUT -- they had no formalisms to describe the semantics clearly, and hence few lessons to transfer to other systems.
System R's RSS team, led by Jim Gray, codified the formal notions of transactions and serializability, and System R delivered a working (though inefficient) implementation.
This work led to the 1998 Turing Award for Gray.
Our own Christos Papadimitriou did early work formalizing theoretical results on transactions (and has a book on the topic!)
Various companies slogged through the details of transactions over the years, especially the complex issues of logging and recovery. In the 80's and into the 90's, IBM published papers on ARIES, which is the classic discussion of the dirty details of logging and recovery.
Many industry vets "already knew" this stuff and had done it in commercial systems.
There is an entire industry of Transaction Processing that exists outside of DBs and is related to distributed applications and services.
More on this later in the semester.
Generally a field populated by industrial practitioners!
Since the early days, academic work has focused largely on esoteric issues of extensions to the flat transaction model. Little of this has found practical use.
Background
Transactions are concepts that allow a system to guarantee certain semantic properties.
These guarantees must be rigorously defined so that people can build correct systems above them.
Theory meets practice here in a nice way.
Kinds of Actions
Unprotected actions: you cannot count on these. It is the responsibility of higher-level actions to check on them if they want to be sure.
Protected actions: actions enclosed inside of transactions.
Real actions: actions that are visible outside the computer.
Print to screen, send a web page, output money, drill hole, fire missile, etc.
getting these right is very tricky!
easier if idempotent (a la "drill hole")
doable but tricky if you can check the state of the real world somehow.
we'll return to this later if we get to talk about transactional networking.
otherwise impossible to handle!
A.C.I.D.
A transaction should enjoy the following guarantees:
Atomicity: the "all or nothing" property. Programmer needn't worry about partial states persisting.
Consistency: the database should start out "consistent", and at the end of the transaction remain "consistent". The definition of "consistent" is up to the database administrator to define to the system; other notions of consistency must be handled by the application.
Gray & Reuter are very confusing on this point.
Isolation: a transaction should not see the effects of other uncommitted transactions.
Durability: once committed, a transaction's effects should not disappear (though they may be overwritten by subsequent committed transactions).
Roughly speaking: A and D are guaranteed by recovery (usually implemented via logging). C and I are guaranteed by concurrency control (usually implemented via locking).
It's worth noting that this is not a theoretical formalism, just a mnemonic. Don't expect it to be a perfect factoring of the issues -- there is overlap of concerns among the four.
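The atomicity guarantee is easy to see in action. The sketch below is illustrative only -- the bank-account table and `transfer` function are assumptions, not from the lecture -- using Python's stdlib sqlite3 simply because it is a handy transactional store:

```python
import sqlite3

# Hypothetical toy schema: two accounts with balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct (name TEXT PRIMARY KEY, bal INT)")
conn.executemany("INSERT INTO acct VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(frm, to, amt, fail_midway=False):
    """Move money inside one transaction: both writes commit, or neither."""
    with conn:  # BEGIN; COMMIT on success, ROLLBACK if an exception escapes
        conn.execute("UPDATE acct SET bal = bal - ? WHERE name = ?", (amt, frm))
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute("UPDATE acct SET bal = bal + ? WHERE name = ?", (amt, to))
```

A `transfer("alice", "bob", 100, fail_midway=True)` raises and the debit is rolled back, so alice still has 100: that is A. The invariant that money is neither created nor destroyed is an application-level notion of C.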
Concurrency Control & Serializability
CS186 material you might not know.
We want to allow multiple transactions to operate concurrently.
(Why?)
Need a crisp definition of acceptable and unacceptable concurrency
build that on a notion of acceptable orders of operations
DB only understands simple data-oriented operations: read, write, begin, commit, abort
order of operations captured in a transaction schedule:
transactions (Ti) and data objects (a...z)
all operations of a transaction occur in the order specified by the transaction
schedule captures the interleaving of the operations in time
e.g. R1(a), R2(a), W2(a), W1(a)
One reasonable definition of correct order of operations: serial schedules
rough defn: those schedules that have no interleaving of operations across xactions
note: in this definition, any serial schedule is correct -- order of transactions isn't guaranteed in any way!
What we'd like: the effect of serial schedules, but still lots of concurrency
serializable schedules are those schedules that are "equivalent" to serial schedules
definition of equivalence: produces the same final state as some serial schedule
What could mess up serializability?
conflicts: two transactions share a data item, and at least one writes it. In time order, we have RW (unrepeatable read), WR (dirty read), and WW (lost update) conflicts.
These conflicts can be OK if you're careful!
Can draw a serializability graph (directed), where nodes are transactions, and edges are conflicts, ordered by time.
Definition: 2 schedules are conflict equivalent if they have the same actions, and each pair of conflicting actions is ordered the same way.
Definition: a schedule is conflict serializable if it is conflict equivalent to a serial schedule.
Note: some serializable schedules are NOT conflict serializable!
Theorem: A schedule is conflict serializable if and only if its serializability graph is acyclic.
Do you see the proof?
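The theorem is easy to check mechanically. Here is a minimal sketch (function names are my own, not a real DBMS API) that builds the precedence graph from a schedule of (txn, op, item) tuples and tests it for cycles with a DFS:

```python
from collections import defaultdict

def precedence_graph(schedule):
    """Edge Ti -> Tj when an op of Ti conflicts with a LATER op of Tj.
    A conflict: different txns, same item, at least one write."""
    edges = defaultdict(set)
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges[ti].add(tj)
    return edges

def is_conflict_serializable(schedule):
    """Conflict serializable iff the precedence graph is acyclic (DFS check)."""
    edges = precedence_graph(schedule)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # all nodes start WHITE
    def has_cycle(u):
        color[u] = GRAY
        for v in edges[u]:
            if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False
    return not any(color[t] == WHITE and has_cycle(t) for t in list(edges))

# The example schedule from above: R1(a), R2(a), W2(a), W1(a)
bad = [(1, "R", "a"), (2, "R", "a"), (2, "W", "a"), (1, "W", "a")]
good = [(1, "R", "a"), (1, "W", "a"), (2, "R", "a"), (2, "W", "a")]
```

On `bad`, R1(a) precedes W2(a) (edge T1->T2) while R2(a) and W2(a) precede W1(a) (edge T2->T1): a cycle, so not conflict serializable. `good` yields only T1->T2, which is acyclic.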
Now we have a definition, and a way to be sure we're serializable.
We need a mechanism to enforce serializability.
Two-Phase Locking (2PL): based on 3 rules
before reading an object, a transaction must get a shared lock on it
before writing an object, a xaction must get an exclusive lock on it
once a xaction releases one lock, it cannot request any more locks (hence the name 2PL)
Theorem: 2PL ensures that the precedence graph generated by transactions will be acyclic -- so serializability is enforced.
Can you see the proof here?
Strict 2PL is a variant of 2PL in which a xaction holds all its locks until it ends (commit or abort), then drops them all at once.
has nice features: avoids "cascading aborts", and guarantees "recoverable" schedules.
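The locking rules above can be sketched as a toy lock table. The class and method names are hypothetical, and a conflicting request simply raises here; a real lock manager would instead block the requester and worry about deadlock:

```python
class LockManager:
    """Toy strict-2PL lock table: S/X locks, all released together at txn end."""

    def __init__(self):
        self.shared = {}     # item -> set of txn ids holding S locks
        self.exclusive = {}  # item -> txn id holding the X lock

    def lock_shared(self, txn, item):
        # Rule 1: acquire an S lock before reading.
        holder = self.exclusive.get(item)
        if holder is not None and holder != txn:
            raise RuntimeError(f"T{txn} blocked: {item} is X-locked by T{holder}")
        self.shared.setdefault(item, set()).add(txn)

    def lock_exclusive(self, txn, item):
        # Rule 2: acquire an X lock before writing (allows S->X upgrade).
        others = self.shared.get(item, set()) - {txn}
        holder = self.exclusive.get(item)
        if others or (holder is not None and holder != txn):
            raise RuntimeError(f"T{txn} blocked: {item} is locked by others")
        self.exclusive[item] = txn

    def release_all(self, txn):
        # Strict 2PL: no lock-by-lock release; everything drops at commit/abort,
        # which trivially satisfies rule 3 (no locking after any release).
        for holders in self.shared.values():
            holders.discard(txn)
        self.exclusive = {k: v for k, v in self.exclusive.items() if v != txn}
```

For example, if T1 X-locks item a, T2's S-lock request on a fails until T1 commits and calls `release_all`.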
Recovery
Just you wait! We will dig (deep) into this subject in subsequent
reading on ARIES.