12 Logging Protocols
2018年2月28日
13:19
Logging Schemes(redo, no undo for in-memory database)
Recovery algorithms are techniques to ensure database consistency, txn atomicity and durability despite failures.
Recovery algorithms have two parts:
→ Actions during normal txn processing to ensure that the DBMS can recover from a failure.
→ Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability
Logging scheme
Physical Logging(what happens to tuple)
→ Record the changes made to a specific record in the
database.
→ Example: Store the original value and after value for an
attribute that is changed by a query.
Logical Logging(what cause the tuple's change)
→ Record the high-level operations executed by txns.
→ Example: The UPDATE, DELETE, and INSERT queries
invoked by a txn.
Logical logging writes less data in each log record
than physical logging.
Difficult to implement recovery with logical
logging if you have concurrent txns.
→ Hard to determine which parts of the database may have
been modified by a query before crash.
→ Also takes longer to recover because you must re-execute
every txn all over again.
Crash Course on ARIES
DISK-ORIENTED LOGGING & RECOVERY
The “gold standard” for physical logging &
recovery in a disk-oriented DBMS is ARIES.
→ Algorithms for Recovery and Isolation Exploiting
Semantics
→ Invented by IBM Research in the early 1990s.
Relies on STEAL and NO-FORCE buffer pool
management policies.
ARIES MAIN IDEAS
Write-Ahead Logging:
→ Any change is recorded in log on stable storage before the
database change is written to disk.
→ Each log record is assigned a unique identifier (LSN).
Repeating History During Redo:
→ On restart, retrace actions and restore database to exact
state before crash.
Logging Changes During Undo:
→ Record undo actions to log to ensure action is not
repeated in the event of repeated failures.
Phase #1: Analysis
→ Read the WAL to identify dirty pages in the buffer pool
and active txns at the time of the crash.
Phase #2: Redo
→ Repeat all actions starting from an appropriate point in
the log.
→ Log redo steps in case of crash during recovery.
Phase #3: Undo
→ Reverse the actions of txns that did not commit before
the crash.
Every log record has a globally unique log sequence
number (LSN) that is used to determine the serial
order of those records.
The DBMS keeps track of various LSNs in both
volatile and non-volatile storage to determine the
order of almost everything in the system…
Often the slowest part of the txn is waiting for the
DBMS to flush the log records to disk.
Have to wait until the records are safely written
before the DBMS can return the
acknowledgement to the client.
So here comes group commit
Batch together log records from multiple txns and
flush them together with a single fsync.
→ Logs are flushed either after a timeout or when the buffer
gets full.
→ Originally developed in IBM IMS FastPath in the 1980s
This amortizes the cost of I/O over several txns.
EARLY LOCK RELEASE(improvement with group commit)
A txn’s locks can be released before its commit record is written to disk as long as it does not return results to the client before becoming durable. Other txns that read data updated by a precommitted txn become dependent on it and also have to wait for their predecessor’s log records to reach disk.
IN-MEMORY DATABASE RECOVERY
Recovery is slightly easier because the DBMS does
not have to worry about tracking dirty pages in
case of a crash during recovery.
An in-memory DBMS also does not need to store
undo records.
But the DBMS is still stymied by the slow sync
time of non-volatile storage.
siloR(this is an example.)
SiloR uses the epoch-based OCC that we discussed previously. It achieves high performance by parallelizing all aspects of logging, checkpointing, and recovery.
The DBMS assumes that there is one storage device per CPU socket.
→ Assigns one logger thread per device.
→ Worker threads are grouped per CPU socket. As the worker executes a txn, it creates new log records that contain the values that were written to the database (i.e., REDO).
Each logger thread maintains a pool of log buffers
that are given to its worker threads.
When a worker’s buffer is full, it gives it back to
the logger thread to flush to disk and attempts to
acquire a new one.
→ If there are no available buffers, then it stalls.
The logger threads write buffers out to files
→ After 100 epochs, it creates a new file.
→ The old file is renamed with a marker indicating the max
epoch of records that it contains.
Log record format:
→ Id of the txn that modified the record (TID).
→ A set of value log triplets (Table, Key, Value).
→ The value can be a list of attribute + value pairs.
Phase #1: Load Last Checkpoint
→ Install the contents of the last checkpoint that was saved
into the database.
→ All indexes have to be rebuilt.
Phase #2: Replay Log
→ Process logs in reverse order to reconcile the latest
version of each tuple.
Recorvery
Phase #1: Load Last Checkpoint
→ Install the contents of the last checkpoint that was saved
into the database.
→ All indexes have to be rebuilt.
Phase #2: Replay Log
→ Process logs in reverse order to reconcile the latest
version of each tuple.
First check the pepoch file to determine the most
recent persistent epoch.
→ Any log record from after the pepoch is ignored.
Log files are processed from newest to oldest.
→ Value logging is able to be replayed in any order.
→ For each log record, the thread checks to see whether the
tuple already exists.
→ If it does not, then it is created with the value.
→ If it does, then the tuple’s value is overwritten only if the
log TID is newer than tuple’s TID.
Design decision:
Node failures in OLTP databases are rare.
→ OLTP databases are not that big.
→ They don’t need to run on hundreds of machines.
It’s better to optimize the system for runtime
operations rather than failure cases.
Physical Logging
Command Logging
Logical logging scheme where the DBMS only
records the stored procedure invocation
→ Stored Procedure Name
→ Input Parameters
→ Additional safety checks
Command Logging = Transaction Logging
logging
The DBMS logs the txn command before it starts
executing once a txn has been assigned its serial
order.
The node with the txn’s “base partition” is
responsible for writing the log record.
→ Remote partitions do not log anything.
→ Replica nodes have to log just like their master.
recovery
The DBMS loads in the last complete checkpoint
from disk.
Nodes then re-execute all of the txns in the log
that arrived after the checkpoint started.
→ The amount of time elapsed since the last checkpoint in
the log determines how long recovery will take.
→ Txns that are aborted the first still have to be executed.
Replication
Executing a deterministic txn on the multiple
copies of the same database in the same order
provides strongly consistent replicas.
→ DBMS does not need to use Two-Phase Commit
PROBLEMS WITH COMMAND LOGGING
If the log contains multi-node txns, then if one
node goes down and there are no more replicas,
then the entire DBMS has to restart.