
@YZLRYO
Last active April 18, 2018 16:18
Notes for db impl
Feb 19, Week 7
Review: extendible hashing
global depth gd = 2
directory slot = h(key) % 2^gd
directory holds (hash value, ptr) entries; each page records its local depth ld
h(k) | ptr (page address)  page contents | ld
  0  | -> page 0           4 12          |  2
  1  | -> page 1           2 17          |  2
  2  | -> page 3           11 15 19      |  2
  3  | -> page 2           10 14         |  2
=> split doubles the directory when?? when the overflowing page's ld equals gd
how many directory entries can point at one page in the hash file?? 2^(gd-ld)
Note: still need to do something like chaining / linear probing
(an over-full page becomes a linked chain)
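The review above can be sketched in code. This is a toy sketch, assuming a page capacity of 2 and Python's built-in hash; the class and variable names are illustrative, not the course's implementation:

```python
# Minimal extendible-hashing sketch. Directory has 2^gd slots; each bucket
# stores its local depth ld, and 2^(gd - ld) directory slots share one bucket.
PAGE_CAP = 2  # assumed tiny page capacity for illustration

class Bucket:
    def __init__(self, ld):
        self.ld = ld
        self.keys = []

class ExtHash:
    def __init__(self):
        self.gd = 1
        self.dir = [Bucket(1), Bucket(1)]

    def _slot(self, k):
        return hash(k) % (2 ** self.gd)

    def insert(self, k):
        b = self.dir[self._slot(k)]
        if len(b.keys) < PAGE_CAP:
            b.keys.append(k)
            return
        if b.ld == self.gd:              # split needs a bigger directory
            self.dir = self.dir + self.dir   # double: gd -> gd + 1
            self.gd += 1
        # split bucket b: bump ld, hand its slots to two new buckets
        b0, b1 = Bucket(b.ld + 1), Bucket(b.ld + 1)
        for i, p in enumerate(self.dir):
            if p is b:
                self.dir[i] = b1 if (i >> b.ld) & 1 else b0
        for k2 in b.keys + [k]:
            self.insert(k2)              # re-insert (may cascade)

    def find(self, k):
        return k in self.dir[self._slot(k)].keys
```

A page with ld < gd is shared by 2^(gd-ld) slots, so splitting it needs no directory doubling; only splitting a page with ld == gd doubles the directory first.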
Why?
1. Repeated keys are an issue; in general, hashing does not like skew
Pros:
1. No downtime to increase table size
2. Just 1 I/O for point finds (if the directory is pinned in memory)
Cons:
1. range queries are terrible
2. Need to pin the directory in memory
eg. 1M pages
best case: 20B per entry => 20MB of directory
but the directory can grow much faster than the data (it doubles on each gd bump)
3. hashing is vulnerable to skew; nasty cases (Zipf's law: ~80% of the data landing in ~20% of the buckets) can end with one overflow chain filling the disk
4. (an alternative that avoids the directory: linear hashing, below)
B-tree + space-filling curve: a B-tree for use with multidimensional (MD) data
# How to handle keys composed of 2+ attributes?
A: 1. "Arbitrary": say one attribute goes "first"... use the second to break ties (concatenate them)
Problem: can only search efficiently on the first attribute; "scanning on the second is not efficient"
2. Use a so-called "spatial index" to organize the data
Designed to handle point/range queries on any subset of the key attributes
# What is a space filling curve?
A: A fractal curve that hits every point in the plane (or a higher-dimensional space)
# What curves are commonly used in DB system?
A: Z-ORDER curve
(sketch: small "Z" shapes, recursively composed into a larger "Z")
make a small z, then a large z out of small z's
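Computing a Z-order key is just bit interleaving; a minimal sketch (the function name is illustrative):

```python
# Z-order (Morton) encoding for 2-D keys: interleave the bits of x and y,
# so points close in 2-D tend to be close in the resulting 1-D order.
def z_order(x, y, bits=16):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return z
```

For example, the four cells of a 2x2 grid get codes 0..3 in the small-"z" order.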
Hilbert Curve
(sketch: rotated "U" shapes connected end to end)
connect between the U's
@ Note: Z-ORDER and Hilbert curves are defined in > 2 dimensions too
# So, how do we use a curve to store multidimensional data?
A : 1. k = sfc(key)
2. store (k, record) pair in B-tree
(sketch: a 4x4 grid, coordinates 0..3 on each axis, traversed by the curve;
a rectangular 2-D query range covers several disjoint 1-D intervals of the curve)
To perform a range find:
1) Break query into a set of 1-D ranges in curve
2) Query each range using B-tree
# When does this work well?
A: When the query results in few finds on the underlying B-tree
In other words, the curve should preserve "spatial locality" (points close in the high-d space stay close in the 1-d order)
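A toy version of the two-step range find, using a sorted Python list as a stand-in for the B-tree. The naive z_ranges here enumerates every cell in the query rectangle and merges consecutive codes into runs; a real system would derive the runs without enumeration:

```python
import bisect

def z_order(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_ranges(x0, x1, y0, y1):
    """Step 1: break the 2-D query box into 1-D ranges along the curve."""
    zs = sorted(z_order(x, y) for x in range(x0, x1 + 1)
                              for y in range(y0, y1 + 1))
    ranges, lo = [], zs[0]
    for a, b in zip(zs, zs[1:]):
        if b != a + 1:                 # gap: close the current run
            ranges.append((lo, a))
            lo = b
    ranges.append((lo, zs[-1]))
    return ranges

def range_find(sorted_keys, x0, x1, y0, y1):
    """Step 2: query each 1-D range against the (stand-in) B-tree."""
    out = []
    for lo, hi in z_ranges(x0, x1, y0, y1):
        i = bisect.bisect_left(sorted_keys, lo)
        while i < len(sorted_keys) and sorted_keys[i] <= hi:
            out.append(sorted_keys[i])
            i += 1
    return out
```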
# PRO:
1. Easy - it's just a B-tree; nice for point data
CON:
1. Does not handle volume data well
volume data: ex. add a house - an object with extent (its attributes span ranges => a volume), not a point
volume objects and range queries pull the design in quite opposite directions
Q3 Monday Btree, extendible hashing
Linear Hashing
top
------
0 \
1 /  hashed using gd+1 bits (already split this round)
2 <- cursor (next bucket to split)
3    hashed using gd bits
------
4 \
5 /  hashed using gd+1 bits (split images of buckets 0 and 1)
------
# Do I need to handle over-full pages?
A: YES, need chaining / linear probing.
# How does this handle "hot spots"?
A: It doesn't. (extendible hashing naturally handles hot spots: it splits exactly the hot pages)
* PRO
1. hashing with dynamic growth, and no directory
* CON
1. splits go round-robin rather than where the overflow is, so chains grow - bad for hot spots
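A minimal linear-hashing sketch matching the picture above (bucket capacity and names are illustrative; over-full buckets simply chain in place until the cursor reaches them):

```python
# Linear hashing: grows one bucket at a time with a split cursor, no directory.
# Buckets below the cursor have been split and use d+1 bits; the rest use d.
class LinHash:
    def __init__(self, cap=2):
        self.d, self.cursor, self.cap = 1, 0, cap
        self.buckets = [[], []]               # 2^d initial buckets

    def _addr(self, k):
        a = hash(k) % (2 ** self.d)
        if a < self.cursor:                   # already split: use d+1 bits
            a = hash(k) % (2 ** (self.d + 1))
        return a

    def insert(self, k):
        self.buckets[self._addr(k)].append(k)     # chain if over-full
        if len(self.buckets[self._addr(k)]) > self.cap:
            self._split()

    def _split(self):
        # always split the bucket at the cursor (NOT necessarily the full one)
        old = self.buckets[self.cursor]
        self.buckets[self.cursor] = []
        self.buckets.append([])
        for k in old:
            self.buckets[hash(k) % (2 ** (self.d + 1))].append(k)
        self.cursor += 1
        if self.cursor == 2 ** self.d:        # one full round: bump d
            self.d += 1
            self.cursor = 0

    def find(self, k):
        return k in self.buckets[self._addr(k)]
```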
Q4 Friday
For volume data.. the R-tree: a file organization for spatial data with volume
# Looks like a B-tree, but rather than (key, ptr) pairs, it has (MBR, ptr) pairs // MBR: Minimum Bounding Rectangle
# R-tree range queries
- recurse down the tree
- whenever an MBR intersects Q (the query range), recurse into the child via the corresponding ptr
# what makes a 'good' R-tree?
A: 1) all nodes quite full
2) we also want MBRs with small perimeter (don't want overlap)
* 1) and 2) are often in conflict
# Insert into an R-tree:
1) Start at the root
2) while not at a leaf:
a) descend into the child whose MBR needs the smallest perimeter increase to cover the object
b) extend MBRs as needed on the way down
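Step 2a can be sketched as follows, with MBRs as (xlo, ylo, xhi, yhi) tuples and points being inserted (function names are illustrative):

```python
# R-tree subtree choice: pick the child whose MBR needs the smallest
# perimeter increase to cover the new object, then extend that MBR.
def perimeter(r):
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

def extend(r, p):
    """Smallest MBR covering rectangle r and point p = (x, y)."""
    return (min(r[0], p[0]), min(r[1], p[1]),
            max(r[2], p[0]), max(r[3], p[1]))

def choose_child(children, p):
    """Index of the child MBR with the least perimeter increase for p."""
    return min(range(len(children)),
               key=lambda i: perimeter(extend(children[i], p))
                             - perimeter(children[i]))
```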
# How to handle splits?
PRO:
1. R-tree is the only way to nicely index volume data
CON:
1. unusable in >= 10 dims (not to worry: truly high-dimensional data is infrequent)
- Indexing vs. storing data in a primary file organization
* Issue: can only organize the data one way
* But we may query it many ways
* How to handle? Indexing
- Indexing
* Rather than storing, store (key, ptr) pairs
* concepts & defs
1) sparse index
- only one (key, ptr) pair per data page in the index
key,ptr: (2,0) (19,1) (32,2) (49,3)
ptr: 0 1 2 3
| 2 4 6 12 | | 19 20 27 30 | | 32 36 40 42 | | 49 57 75 80 |
Note: we have seen these before (B-tree directory levels are a sparse index)
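The sparse-index lookup above, sketched with Python's bisect standing in for the index search (pages and index entries follow the example):

```python
import bisect

# Sparse index: one (key, page#) pair per data page. To find a key, locate
# the last index entry with key <= search key, then scan only that page.
pages = [[2, 4, 6, 12], [19, 20, 27, 30], [32, 36, 40, 42], [49, 57, 75, 80]]
index = [(2, 0), (19, 1), (32, 2), (49, 3)]   # first key of each page

def find(key):
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, key) - 1    # last index key <= key
    if i < 0:
        return False                          # smaller than every key
    return key in pages[index[i][1]]
```

Note this only works because the data pages are in key order, which is why sparse indexes go with a primary organization.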
2) Dense index
- (key,ptr) pair for every record in the data file
3) Primary index
- Index is used to organize data in pages
(Index is same as primary file organization)
4) Secondary index
- Data file and index are independent
ex: data in a B-tree keyed on NetID, with a secondary index on GPA
Notes:
1) All secondary indices are dense
2) Primary indices may be dense or sparse
3) Usually a primary index is sparse, but a dense index saves going to a data page when the key value alone is enough
4) Taken to the extreme, "dense index on everything" gives us a "column-oriented" DB.
5) Limited to one primary index (without replication)
6) Can have unlimited secondary indices
pro: queries can be faster
con: update more expensive
7) Even if an index is available, question "should I use it?" is hard to answer
==========================================================
idea:
a bunch of data pages
| 2 4 6 12 | | 14 14 23 | | 48 78 98 |
      0           1            2
want to find things quickly
extra map for indexing:
(2,0) (14,1) (48,2) => sparse (one entry per page; data must be stored in key order)
| 6 4 7 | | 2 3 8 | | 1 9 10 |   data not in key order
(1,2) (2,1) ... => dense (one entry per record; works as a secondary index)
queries are faster with a secondary index, updates are slower
Query compilation & Optimization
Overview:
* Given an SQL query
* Assume an abstract Relational Algebra (RA) "machine": join, projection, selection...
* Assume the machine has multiple implementations of each RA operation
Goal: * Compile SQL into an RA expression annotated with an implementation for each RA op
# There are many ways to compile SQL into annotated RA
* Which one to choose?
- The one with the best "performance" (wall-clock time, throughput (common), memory usage, amount of network traffic)
# There are really 2 sub-problems in optimization
* logical opt'n: choose the best RA
* physical opt'n: choose the best annotation
1) Compilation - build initial RA for the SQL
Today: 1)
a) scan / tokenize the SQL
b) parse the SQL
c) translate into RA
2) Logical opt - improve the RA
3) Physical opt - annotate the RA with the best physical ops
Compilation - usually two steps: scanning and parsing
scan: break the program into tokens, the terminals of a formal grammar
parse: reverse-engineer which production rules of the formal grammar were used to produce the token sequence
Virtually all programs in modern PLs correspond to strings generated by a formal grammar - a CFG
* CFG (context-free grammar): basically a machine that can produce all strings in a particular language
* A CFG is a set of rules of the form
Left-Hand Side => Right-Hand Side
non-terminal      a string of terminals & non-terminals
* terminal: a symbol that appears in the program
* non-terminal: an intermediate symbol, not expected in the program
EX: CFG for all binary strings with the same # of 0's & 1's
start -> 0 start 1 |
         1 start 0 |
         start start |
         epsilon (empty)
EX: all binary palindromes
start -> 0 start 0 |
1 start 1 |
0 |
1 |
epsilon(empty)
Issue: we typically don't write CFGs where the terminals are individual characters
- such CFGs are complicated & slow to parse
Instead, we use a regular expression to specify each terminal
- ex FP numbers: (a regular expression for a number)
[+-]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
CFGs are strictly more powerful than regular expressions
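The FP-number pattern can be tried directly (anchored with fullmatch so the whole token must match):

```python
import re

# The floating-point token pattern from the notes.
FP = re.compile(r'[+-]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?')

def is_fp(s):
    """True iff the entire string is one FP-number token."""
    return FP.fullmatch(s) is not None
```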
Issue: during compilation we are not producing strings, we are reversing the production of a string - this is called "parsing"
- How does parsing work? Don't care: in modern times no one writes a parser by hand, use a "compiler compiler" - ex YACC
How to search the plan graph?
1) Exhaustive - works up to 5 or 6 joins
2) Greedy - walk the graph, always traverse to the best adjacent node
What are the various transitions?
(1) On selections
sigma_B            pi_A
   |                 |
 pi_A      =>     sigma_B
   |                 |
   R                 R
always applies if A is simply a list of attrs;
if A computes non-identity functions over attrs of R whose results are referenced by B, it won't apply
 pi_A             sigma_B
   |                 |
sigma_B    =>      pi_A
   |                 |
   R                 R
always valid when B only refers to attrs actually returned by A
sigma_B             xjoin_{B2 ^ B'}
   |                  /         \
xjoin_B'    =>   sigma_B1    sigma_B3
  /   \              |           |
 R     S             R           S
assuming B is in CNF, this is valid when:
B1: the clauses of B referring only to atts in R
B3: the clauses of B referring only to atts in S
B2: the clauses of B referring to atts from both R and S
2) On joins
 xjoin_B                 xjoin_B2
  /    \                 /      \
 R   xjoin_B'    =>  xjoin_B1    T
      /    \          /    \
     S      T        R      S
applies with B1 in CNF holding the clauses that refer only to attrs of R and S
Parsing: the task of analyzing a string to determine which CFG rules were used to produce it
No one writes parsers by hand
* instead use a compiler compiler
* classic tools: lex/flex, yacc/bison
compiler compiler:
* Write down a CFG
* Annotate the CFG with code to execute when each rule is recognized
* then the comp-comp generates a parser
In the case of SQL, the comp-comp is used to obtain:
1) the expressions in the SELECT clause
2) the list of tables + aliases in the FROM clause
3) the boolean expression from the WHERE clause
Most modern compilers just use the parser to extract and store data in C/Java/C++ data structs
What do I need to do with those data structs?
1) analysis & verification of the program (ex. semantic checking)
2) emit target code (RA, in a DB compiler)
In the case of SQL, how do I build my output RA exp?
SELECT E1, E2           pi_{E1, E2}
FROM T1, T2, ...   =>        |
WHERE B                  sigma_B
                             |
                             x
                            / \
                           x   T4
                         .....
Parsing: how does it work? Don't care
But we need to know enough to tell when a comp-comp will fail
- When will it fail?
1) Grammar is ambiguous (ex: one string can be derived in multiple ways)
2) Grammar requires too much lookahead
Logical Optimization
takes RA as input, converts it to "better" RA
cost: add up the total # of bytes produced while evaluating the RA,
or simply add up the # of tuples produced
One view of logical opt:
* We have a large graph
* Each node in the graph is a different RA expr (a "plan") that is equivalent to the original SQL
* There is an edge from plan A to plan B if there is an algebraic transform from A to B guaranteed not to alter semantics
xjoin_B(R, xjoin_B'(S, T)) => xjoin_B(xjoin_B'(S, T), R)
- we have covered pi, sigma, join - the entire core of RA
- didn't cover antijoin, semijoin (these appear when you have correlated subqueries) or outerjoin, intersection, union
Costing RA expressions
- always based on the ability to estimate the # of tuples returned from an operation
ex: sigma_B2   <- guess the # of tuples coming out here
       |
    xjoin_B1
     /    \
    R      S
2 methods:
1) sampling: run the query over samples of the input relations => not used; reasons: 1) hard to maintain samples 2) slow 3) poor accuracy with small samples
2) statistical synopses (counts, histograms, etc)
basics: T(R) = the number of tuples in relation R; V(R, A) = the number of distinct values of attribute A in R
What we want: each RA op should take T(R), V(R,A1), V(R,A2), ... as input and produce the same stats for its output
1) selection
R' <- sigma_{A=v} (R)
T(R') = T(R) / V(R,A)
V(R',A') = 1                     if A' == A
         = min(T(R'), V(R,A'))   if A' != A
R' <- sigma_{A!=v} (R)
T(R') = T(R) - T(R)/V(R,A)
V(R',A') = V(R,A) - 1            if A' == A
         = min(T(R'), V(R,A'))   if A' != A
R' <- sigma_{A>v} (R)
T(R') = T(R) / 3
V(R',A') = V(R,A) / 3            if A' == A
         = min(T(R'), V(R,A'))   if A' != A
R' <- sigma_{B1 or B2} (R)
T(R') = T(sigma_B1(R)) + T(sigma_B2(R))   problem: double-counts the intersection
=> fix: T(R') = min( T(sigma_B1(R)) + T(sigma_B2(R)), T(R) )
V(R',A) = min( V(sigma_B1(R),A) + V(sigma_B2(R),A), V(R,A) )
R' <- sigma_{B1 and B2} (R)
T(R') = T(sigma_B1(sigma_B2(R)))          assuming independence
V(R',A) = V(sigma_B1(sigma_B2(R)), A)
under correlation this may over- OR under-estimate
R' <- sigma_{!B} (R)
T(R') = T(R) - T(sigma_B(R))
V(R',A) = V(R,A)
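The selection estimates above, collected as tiny functions (T = tuple count, V = distinct-value count; the AND case here assumes two equality predicates on independent attributes):

```python
# Textbook selectivity estimates from T(R) and V(R, A).
def est_eq(T, V):            # sigma_{A = v}(R)
    return T / V

def est_neq(T, V):           # sigma_{A != v}(R)
    return T - T / V

def est_gt(T):               # sigma_{A > v}(R): the 1/3 rule of thumb
    return T / 3

def est_or(T, t1, t2):       # sigma_{B1 OR B2}: cap to avoid double-counting
    return min(t1 + t2, T)

def est_and(T, V1, V2):      # sigma_{A=v AND B=w}: independence assumption
    return T / (V1 * V2)
```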
2) Joins
R' <- R xjoin_{R.A=S.A} S   (R join S on R.A = S.A)
picture it this way:
1. sort R and S on A (group into buckets of equal A-values)
2. every bucket in the left (R) has a match in the right (S)
3. assume every bucket in the relation with the smaller # of join keys matches a bucket in the other relation
=> this "containment of value sets" will hold in the case of a foreign-key join
given this, the output size is
SUM_i n_i * f(i), where n_i is the size of bucket i in the smaller relation and f(i) is the size of the matching bucket in the other relation
to approximate with T(), V(), use
T(R') = T(R)/V(R,A) * T(S)/V(S,A) * min(V(R,A), V(S,A))
V(R',A') = min(V(R,A), V(S,A))   if A' == A
         = min(T(R'), V(R,A'))   if A' from R, A' != A
         = min(T(R'), V(S,A'))   if A' from S, A' != A
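The T(R') join estimate in one line; note it is algebraically the same as T(R) * T(S) / max(V(R,A), V(S,A)):

```python
# Equi-join size under the containment-of-value-sets assumption:
# T(R join S) = T(R)/V(R,A) * T(S)/V(S,A) * min(V(R,A), V(S,A))
def est_join(TR, VR, TS, VS):
    return TR / VR * TS / VS * min(VR, VS)
```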
How to handle more complex join predicates?
=> Two reasonable ways
1. Treat like: R' <- sigma_{R.B==S.B} (R xjoin_{R.A==S.A} S)
benefit: easy, if we already have costing for single-attribute joins and selections
2. Treat like: R' <- R xjoin_{(R.A concat R.B) == (S.A concat S.B)} S
if we choose 2, need to estimate V(R, R.A concat R.B); a reasonable estimate is min(T(R), V(R,R.A) * V(R,R.B))
Physical optimization
- annotating an RA expression with implementation details.
What are we concerned about when choosing an implementation?
1) The implementation of each RA operation
2) Side effects of RA implementations, ex: output is sorted, or hashed to machines via some key, etc.
3) "Pipelining" - the consumer of an op takes its data directly from the producer as it is ready... avoids trips through the memory hierarchy
Implementing RA Ops
Let's cover them in order from fast to slow...
1) Selection with an index.
2) Selection without an index: just apply the predicate to each tuple.
3) "Scan" join: applicable to equi-joins where one input relation fits in RAM
alg: R' <- R xjoin_{R.A==S.A} S
-1- Hash S (the smaller relation) on S.A - store the result in an in-RAM hash table
-2- Scan R... for each r in R, probe the hash table for matches, send them to the output stream.
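The scan join in code - a minimal in-memory sketch, where key_r and key_s extract the join attribute from a tuple:

```python
# "Scan" (in-memory hash) join: build a hash table on the smaller relation S,
# then probe it once per tuple while scanning R.
def hash_join(R, S, key_r, key_s):
    table = {}
    for s in S:                        # 1) build: hash S on S.A
        table.setdefault(key_s(s), []).append(s)
    out = []
    for r in R:                        # 2) probe: scan R once
        for s in table.get(key_r(r), []):
            out.append((r, s))
    return out
```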
4) Scan based aggregation
- Hash table, one entry per group (ex: SELECT SUM(att1) FROM TB GROUP BY att2)
- each entry is keyed on the grouping attributes
- to process a tuple:
1. hash it to a group key
2. update that entry in the table
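A sketch for the SUM / GROUP BY example above, with rows as (att1, att2) pairs:

```python
# Scan-based aggregation for: SELECT att2, SUM(att1) FROM TB GROUP BY att2
# One hash-table entry per group; each tuple updates its group in place.
def hash_agg(rows):
    groups = {}
    for att1, att2 in rows:
        groups[att2] = groups.get(att2, 0) + att1
    return groups
```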
5) Union
- two types of union (UNION and UNION ALL: set-based and bag-based)
- bag-based is easy: scan R, then scan S
- set-based: (assume the hash table fits in RAM)
1) scan R: for each r in R
- hash r
- add to the hash table if not already there
2) do the same scan on S
3) scan the hash table and send its contents to the output stream
6) Intersection and Difference
-
10) 2-pass, hash-based set ops (intersection, union, difference)
- If R is big, one-pass hash-based won't work
- Can use algorithm similar to hash join
11) Sort-merge join
first, simple "sort join" :
a) TPMMS over R (sort key is join key)
b) TPMMS over S
c) Merge
cost: 5(|R|+|S|)
but wasteful: the final merge can be combined with the TPMMS merge
- Sort-merge join idea: never materialize sorted R,S
- Instead, do merge directly after sort phase of TPMMS
=> cost 3(|R|+|S|)
Note: often R or S is already sorted <- can avoid 2 passes over the sorted relation
Why might it be sorted? may be sorted on disk (e.g. a primary organization sorted on the key)
- or - may be output from prior join
12) sort-based duplicate removal - like TPMMS, dups removed in the merge phase
cost: 3|R|
| 1 2 4 | | 6 7 9 | | 10 11 12 |     | 2 4 5 | | 6 7 9 | | 13 14 15 |
            Run 1                                Run 2
PQueue emits: 1 2 2 4 4 5 ...
keep polling while we get the same value
key difference from a plain merge:
after removing the low value, repeatedly peek at the front of the PQ and discard duplicates
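The merge phase with duplicate discarding, sketched with heapq as the priority queue (runs are assumed already sorted):

```python
import heapq

# Merge phase of sort-based duplicate removal: pull from a priority queue
# over the sorted runs; skip a value if it equals the one just emitted.
def merge_dedup(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out, last = [], None
    while heap:
        v, i, j = heapq.heappop(heap)
        if v != last:                  # discard duplicates of last output
            out.append(v)
            last = v
        if j + 1 < len(runs[i]):       # advance within run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```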
13) hash-based duplicate removal
cost: 3|R|
after the hash pass, the records are partitioned into buckets on disk
(sketch: records scattered over buckets 0..4, all copies of a value in one bucket)
in the merge phase: do in-RAM duplicate removal on each bucket
Compare sort- vs hash-based dup removal
=> hash-based is better:
- O(1) time per record to add to the hash table, compared to O(lg n) for sort
- early duplicate removal: dups already removed in the add (partition) stage
=> sort-based is better:
- output is sorted
CCR - concurrency control & recovery
Recovery from failure.. what types are there?
(1) Bad data:
- to handle (from prog. point of view):
* constraints
* triggers
(2) Media failures
- what tools do we have as system designers to handle?
* RAID and similar schemes
* backups (off-site in the extreme case)
(3) System & user-initiated failures, handled via transactions
Transactions in SQL: BEGIN TRANSACTION; code; COMMIT or ROLLBACK
Transaction provides "all or nothing" semantics "atomicity"
- if commit, all actions commit
- if fail/rollback nothing happened
Why is atomicity hard?
- failures can happen at any time
- some changes may be on disk, some may still be in the buffer
- typical mechanism to recover from system and user-initiated failures:
LOGGING!
- idea: have a log used to store info about update... enough info to recover from failure
What is a "log"?
- A sequence of special records...
- Writes to the log are append-only
- No updates; deletion only via "truncation"
- What types of records are written to a log?
depends on the type of logging, but all logging schemes use: <START T>, <COMMIT T>, <ABORT T>
LOG:
| r r r r ... r r |   with a tail (oldest record) and a head (where new records are appended)
We need some basic DB operations
INPUT(X) - loads X from disk into the buffer
READ(X,t) - loads X from buffer into variable t
WRITE(X,v) - put v into X, store in buffer
OUTPUT(X) - write contents of X to disk from buffer
What are we writing/reading? That is what is X?
* Relations
* pages
* records
* atts
(1) Undo logging
2 key rules
- If WRITE(X, t) by transaction T, then <T, X, v> must be on disk before OUTPUT(X)... v is the OLD value of X
- If T commits, then <COMMIT T> goes to disk AFTER all of T's OUTPUT(X)
Say we fail... recall <START T> is written when the transaction begins
1) look at the log... find all trans. with a START but no COMMIT
2) for each such trans, use its <T, X, v> recs to restore value v to X... going from the head of the log back to the tail
Issue: say we have
<START T1>
<START T2>
<T1, X, v1>
<T2, X, v2>
<COMMIT T2> ... crash...
=> T1 did not commit, so it is rolled back - but then T2's committed write to X is undone as well
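The undo-recovery scan over a toy in-memory log (the tuple-shaped records are illustrative; an "UPDATE" record carries the OLD value v). Run on the schedule above, it restores X to its value before T1's write, wiping out T2's committed write, which is exactly the issue noted:

```python
# Undo recovery: find transactions with a START but no COMMIT/ABORT ("losers"),
# then scan the log backwards restoring old values for their updates.
def undo_recover(log, db):
    finished = {rec[1] for rec in log if rec[0] in ("COMMIT", "ABORT")}
    started = {rec[1] for rec in log if rec[0] == "START"}
    losers = started - finished
    for rec in reversed(log):          # head of the log back to the tail
        if rec[0] == "UPDATE" and rec[1] in losers:
            _, _, x, old = rec
            db[x] = old                # restore the old value v
    return db
```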
How to truncate log in undo logging?
Easy way to "checkpoint":
1) Stop new transactions
2) wait for all trans to commit/abort
3) Set tail = head
Problem: stops new transactions for a (long?) time
better solution: "Nonquiescent" checkpointing
1) Write <CHECKPT T1, T2, ..., Tk> where T1...Tk are all active trans
2) Wait until T1...Tk have all committed or aborted
3) Write <END CHECKPT> to the log
Recovery:
1) Locate the last <CHECKPT> <END CHECKPT> pair... run the recovery alg only on the log after that <CHECKPT>
Implies: we can truncate the log at the <CHECKPT> record once we have written the <END CHECKPT> record
tail  ST1  ST2  CHECKPT(T1,T2)  ST3  CT1  CT2  END-CHECKPT  ST4  CHECKPT(T3,T4)  crash
- Why is undo logging not really used?
Undo rules:
* force a flush of written DB objects before commit
* this hurts the utility of buffering
1) If WRITE(X, t) by T, then <T, X, v> goes to the log before OUTPUT(X)
2) If T commits, <COMMIT T> goes to the log after all OUTPUT(X)
- Redo logging
- Idea: the log stores info to redo committed trans
- Rules:
1) Write <START T> before T begins: helps with checkpointing
2) Before OUTPUT(X), the corresponding <T, X, v> (v is the NEW value) must be in the log, and <COMMIT T> on disk
3) <COMMIT T> must follow all of T's <T, X, v> records in the log
- Why does this work?
1) a transaction will write a bunch of stuff
2) before any written data reaches disk, the redo info is in the log
3) also, if the transaction successfully committed, all of its redo info is in the log
Recovery:
1) Identify all of the committed transactions in the log
2) Scan the log from the beginning; for each <T, X, v> with T committed, write v to X
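The redo-recovery scan on a toy log (records are illustrative tuples; here v in an "UPDATE" record is the NEW value):

```python
# Redo recovery: identify committed transactions, then scan the log from
# the beginning, re-applying their new values in order.
def redo_recover(log, db):
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    for rec in log:                    # scan from the start of the log
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _, x, new = rec
            db[x] = new                # v is the NEW value here
    return db
```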
Nonquiescent checkpointing with redo logging:
st T1   st T2   end T1   CHECKPT(T2)   st T3
--------|-------|--------|-------------|------------------------------
1) Write <CHECKPT T1, ..., Tk> listing the still-active trans
2) Once we have done OUTPUT(X) for every X written by trans that had committed by the <CHECKPT> (or those trans abort), write <END CHECKPT> and truncate the log at the <CHECKPT>
Recovery in a checkpointed redo log:
- Scan from back to front... if we hit <END CHECKPT>, just run recovery from the corresponding <CHECKPT>
- If we hit <CHECKPT T1, T2, ...> first, then:
T1, T2, ... (the trans with data in the buffer at the start of that CHECKPT) have already begun; run recovery from their earliest <START>
Drawbacks of redo logging:
- Still messes with the BM... must pin written pages in the buffer until COMMIT
Bad: this can require more RAM than is available
- Also hurts LRU performance
Undo/redo logging
Rule:
- Before OUTPUT(X), after a WRITE(X, v) due to T, must have <T, X, oldval, v> on disk
Recovery:
1) From the start of the log, redo all committed trans
2) From the end of the log, undo all trans with no COMMIT/ABORT
Checkpointing in undo/redo logging:
1) Write <CHECKPT T1, T2, ...> where T1, T2, ... are all trans with data in the buffer
2) Wait until T1, T2, ... have no more dirty data in the buffer (they need NOT have committed or aborted)
3) Write <END CHECKPT>
Concurrency Control
- Provides each user the illusion that no one else is in the DB concurrently
- The gold standard for isolation is serializability (SR)
- SR: a schedule is SR if the state of the DB after running the schedule is the same as after running the transactions in some serial order
* schedule: a list of reads and writes
* serial: no overlap among transactions
- issue: SR is really hard to reason about
- So in practice we strive for CSR (C = "conflict")
- CSR: we say two DB ops conflict if they involve the same data object and at least one is a write.
Then a schedule is CSR if it can be transformed into a serial schedule by flipping the order of adjacent non-conflicting ops.
- We want CSR; how do we get it?
- Classic mechanism: locking
* When using locking, a transaction must lock any object before reading/writing it
* locks are granted as follows:
        S   X
    S   Y   N
    X   N   N
- The classic locking alg to guarantee CSR is 2PL (two-phase locking)
* Each trans must obtain all of its locks before it releases any
* In practice, this means a trans has a growing phase and a shrinking phase
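The S/X compatibility matrix as code: a request is granted only if it is compatible with every lock other transactions already hold on the object (names are illustrative):

```python
# Shared/exclusive lock compatibility: S-S is the only compatible pair.
COMPAT = {("S", "S"): True, ("S", "X"): False,
          ("X", "S"): False, ("X", "X"): False}

def can_grant(requested, held):
    """held: list of lock modes other transactions hold on the object."""
    return all(COMPAT[(h, requested)] for h in held)
```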