E.G.M. Petrakis Hashing 1
Hashing on the Disk
Keys are stored in “disk pages” (“buckets”)
  several records fit within one page
Retrieval:
  find the address of the page
  bring the page into main memory
  searching within the page comes for free

[Figure: a hash function maps the key space to data pages 0, 1, 2, ..., m-1, each of size b]
page size b: maximum number of records in a page
space utilization u: measure of the use of space
$$u = \frac{\#\text{records stored}}{\#\text{pages} \cdot b}$$

Collisions
Keys that hash to the same address are stored within the same page
If the page is full:
  i. page splits: allocate a new page and split the page content between the old and the new page, or
  ii. overflows: attach a list of overflow pages
[Figure: a full page chained to an overflow page]

Access Time
Goal: find a key in one disk access
Access time ~ number of disk accesses
Large u: good space utilization but many overflows or splits => more disk accesses
Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses

Categories of Methods
Static: require file reorganization
  open addressing, separate chaining
Dynamic: dynamic file growth, adapt to file size
  dynamic hashing, extendible hashing, linear hashing, spiral storage…

Dynamic Hashing Schemes
File size adapts to data size without total reorganization
Typically 1-3 disk accesses to access a key
Access time and u are a typical trade-off
u between 50-100% (typically 69%)
Complicated implementation

Schemes With Index
Two disk accesses: one to access the index, one to access the data
  with the index in main memory => one disk access
Problem: the index may become too large
Dynamic hashing (Larson 1978)
Extendible hashing (Fagin et al. 1979)
[Figure: an index pointing to data pages]

Schemes Without Index
Ideally, less space and fewer disk accesses (at least one)
Linear Hashing (Litwin 1980)
Linear Hashing with Partial Expansions (Larson 1980)
Spiral Storage (Martin 1979)
[Figure: the address space maps directly to the data space]

Hash Functions
Support for a shrinking or growing file
  shrinking or growing address space; the hash function adapts to these changes
  hash functions use the first (last) bits of key = $b_{n-1} b_{n-2} \ldots b_i b_{i-1} \ldots b_2 b_1 b_0$
  $h_i(key) = b_{i-1} \ldots b_2 b_1 b_0$ supports $2^i$ addresses
  $h_i$: one more bit than $h_{i-1}$ to address larger files
$$h_i(key) = \begin{cases} h_{i-1}(key) & \text{if } b_{i-1} = 0 \\ h_{i-1}(key) + 2^{i-1} & \text{if } b_{i-1} = 1 \end{cases}$$
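A minimal sketch of the family h_i, assuming non-negative integer keys and the "last bits" variant (the i low-order bits of the key). The names `h` and `h_next` are illustrative only and do not come from the slides.

```python
def h(key: int, i: int) -> int:
    """Return the i low-order bits of key: an address in [0, 2**i)."""
    return key & ((1 << i) - 1)


def h_next(key: int, i: int) -> int:
    """Compute h_i from h_{i-1} by inspecting one extra bit, b_{i-1}."""
    extra_bit = (key >> (i - 1)) & 1
    return h(key, i - 1) + extra_bit * (1 << (i - 1))


key = 0b101101  # 45 in decimal
for i in range(1, 5):
    # Both ways of computing h_i agree for every i.
    print(i, h(key, i), h_next(key, i))
```

Growing the address space by one bit either leaves a key at its old address or moves it exactly 2^(i-1) positions further, which is what makes page splitting local.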

Dynamic Hashing (Larson 1978)
Two level index
  primary h1(key): accesses a hash table
  secondary h2(key): accesses a binary tree
Index: binary tree
[Figure: h1(k) selects an entry of the 1st-level hash table; h2(k) follows a path in the 2nd-level binary tree to reach one of the data pages 1-4, each of size b]

Index
Fixed (static): h1(key) = key mod m
Dynamic behavior on the secondary index
  h2(key) uses i bits of the key
  the bit sequence of h2 = $b_{i-1} \ldots b_2 b_1 b_0$ denotes which path of the binary tree index to follow in order to access the data page
  scan h2 from left to right (bit 1: follow right path, bit 0: follow left path)
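The two-level lookup can be sketched as follows. This is an illustration under stated assumptions, not Larson's implementation: the classes `Page` and `Node`, the function `find_page`, and the choice to pass h2 as a ready-made bit string are all hypothetical, and how the bits of h2 are derived from the key is left open.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Page:
    records: List[int] = field(default_factory=list)


@dataclass
class Node:
    left: Union["Node", Page]   # followed on bit 0
    right: Union["Node", Page]  # followed on bit 1


def find_page(primary: list, key: int, h2_bits: str) -> Optional[Page]:
    """Locate the data page for key: h1 picks a primary entry, h2 walks its tree."""
    node = primary[key % len(primary)]        # h1(key) = key mod m
    for bit in h2_bits:                       # scan h2 from left to right
        if not isinstance(node, Node):        # reached a page (or an empty entry)
            break
        node = node.right if bit == "1" else node.left
    if isinstance(node, Node):
        raise ValueError("h2 has too few bits to reach a data page")
    return node


# Hypothetical index with m = 6: entry 1 holds a tree of depth 2, entry 5 a single page.
page_a, page_b, page_c, page_d = Page([7]), Page([13]), Page([19]), Page([5, 11])
primary = [None, Node(page_a, Node(page_b, page_c)), None, None, None, page_d]

print(find_page(primary, 13, "10").records)   # right, then left
print(find_page(primary, 11, "01").records)   # no tree under entry 5: h2 is ignored
```

When a primary entry points straight at a page, any value of h2 reaches it, which corresponds to the "h2 = any" labels in the slides' figures.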

[Figure: example index with h1(key) = key mod 6 (1st-level entries 0-5) and h2(key) = “10…” (depth of binary tree <= 2); the data pages of size b are labelled (h1 = 1, h2 = “0”), (h1 = 1, h2 = “01”), (h1 = 1, h2 = “11”) and (h1 = 5, h2 = any)]

Insertions
Initially: fixed-size primary index and no data
  insert the record in a new page under its h1 address
  if the page is full, allocate one extra page
  split the keys between the old and the new page
  use one extra bit in h2 for addressing
[Figure: primary index with entries 0-3; the page of size b under (h1 = 1, h2 = any) splits into the pages (h1 = 1, h2 = 0) and (h1 = 1, h2 = 1)]
E.G.M. Petrakis Hashing 14
0
321
0
321
0
321
0
321
1
2
h1
=0,
h2
=any
h1
=3,
h2
=any
1
2
3
10
h1
=0,
h2
=0
h1
=0,
h2
=1
h1
=3,
h2
=any
01
01
13425
h1
=3,
h2
=0
h1
=0,
h2
=0
h1
=3,
h2
=1
h1
=0,
h2
=01h1
=0,
h2
=11
b
index storage
01

Deletions
Find the record to be deleted using h1, h2
Delete the record
Check the “sibling” page:
  fewer than b records in the two pages combined?
  if yes, merge the two pages
  delete the one empty page
  shrink the binary tree index by one level and reduce h2 by one bit

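[Figure: deletion with merging. A record is deleted from one of two sibling pages under a primary index with entries 0-3, and the two siblings are merged into a single page]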

Extendible Hashing (Fagin et al. 1979)
Dynamic hashing without index
Primary hashing is omitted
Only secondary hashing with all binary trees at the same level
The index shrinks and grows according to file size
Data pages attached to the index

[Figure: a dynamic hashing binary tree index next to the same index with all binary trees at the same level, i.e. a table with entries 00, 01, 10, 11 pointing to data pages; each page is annotated with its number of address bits]

Insertions
Initially 1 index entry and 1 data page
  0 address bits
  insert records in the data page
global depth d: size of index is $2^d$
local depth l: number of address bits
[Figure: index storage with a single entry pointing to one data page of size b; d = 0, l = 0]

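Page “0” Overflows
d: global depth = 1
l: local depth = 1
[Figure: index storage before the overflow (one entry, one page of size b, d = 0, l = 0) and after it (entries 0 and 1, d = 1, each pointing to a page with l = 1)]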

Page “0” Overflows (cont.)
1 more key bit for addressing and 1 extra page => index doubles !!
Split the contents of the previous page between 2 pages according to the next bit of the key
Global depth d: number of index bits => index size $2^d$
Local depth l: number of bits for record addressing

Page “0” Overflows (again)
[Figure: index with entries 00, 01, 10, 11 (d = 2); two pages with l = 2 contain records with the same 2 bits of key, and one page with l = 1 contains records with the same 1st bit of key]
$l \le d$

Page “01” Overflows
[Figure: the index doubles to entries 000, 001, ..., 111 (d = 3): 1 more key bit for addressing; the pages have local depths 1, 2, 3, 3]
$2^{d-l}$: number of pointers to a page

Page “100” Overflows
no need to double the index
page 100 splits into two (1 new page)
local depth l is increased by 1
[Figure: index with entries 000, 001, ..., 111 (d = 3); page 100 had l = 2 and, after the split, its local depth is 2+1 = 3]

Insertion Algorithm
If l < d, split the overflowed page (1 extra page)
If l = d, double the index, split the page, and
  d is increased by 1 => 1 more bit for addressing
  update pointers (either way):
  a) if d prefix bits are used for addressing:
     d = d+1;
     for (i = 2^d - 1; i >= 0; i--) index[i] = index[i/2];   /* integer division */
  b) if d suffix bits are used:
     for (i = 0; i <= 2^d - 1; i++) index[i + 2^d] = index[i];
     d = d+1;
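The insertion rules can be sketched in a few lines. This is an illustrative in-memory model, not the authors' code: it assumes integer keys that are already well-distributed hash values, uses the d suffix (low-order) bits for addressing, and the names `Bucket`, `ExtendibleHash` and `MAX_DEPTH` are hypothetical. The depth limit guards against the failure case of more than b identical keys, which would otherwise split forever.

```python
MAX_DEPTH = 32  # assumed safety limit, not part of the slides


class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth   # l: number of key bits shared by all records
        self.keys = []


class ExtendibleHash:
    def __init__(self, b: int):
        self.b = b                       # page capacity
        self.d = 0                       # global depth: the index has 2**d entries
        self.index = [Bucket(0)]

    def _slot(self, key: int) -> int:
        return key & ((1 << self.d) - 1)             # d suffix bits of the key

    def contains(self, key: int) -> bool:
        return key in self.index[self._slot(key)].keys

    def insert(self, key: int) -> None:
        while True:
            page = self.index[self._slot(key)]
            if len(page.keys) < self.b:
                page.keys.append(key)
                return
            if page.local_depth >= MAX_DEPTH:
                raise OverflowError("too many keys share the same bits")
            if page.local_depth == self.d:           # l = d: double the index
                self.index = self.index + self.index  # index[i + 2**d] = index[i]
                self.d += 1
            self._split(page)                        # then retry the insertion

    def _split(self, page: Bucket) -> None:
        bit = 1 << page.local_depth                  # next key bit used for addressing
        page.local_depth += 1
        new_page = Bucket(page.local_depth)
        old_keys, page.keys = page.keys, []
        for k in old_keys:
            (new_page if k & bit else page).keys.append(k)
        for i, p in enumerate(self.index):           # redirect half of the pointers
            if p is page and i & bit:
                self.index[i] = new_page


table = ExtendibleHash(b=2)
for k in range(6):
    table.insert(k)
print(table.d, [p.keys for p in table.index])
```

A page with local depth l is referenced by 2^(d-l) index entries, so a split with l < d only redirects half of those pointers and leaves the index size unchanged.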

Deletion Algorithm
Find and delete the record
Check the sibling page
If fewer than b records in the two pages combined:
  merge the pages and free the empty page
  decrease local depth l by 1 (records in the merged page have 1 less common bit)
  if l < d everywhere => reduce the index (half size)
  update pointers

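[Figure: delete with merging. Index with entries 000, 001, ..., 111 (d = 3) and pages with local depths 2, 3, 3, 2; the two pages with l = 3 merge into one page with l = 2; since now l < d for every page, the index halves to entries 00, 01, 10, 11 (d = 2)]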

Observations
A page splits and there are more than b keys with the same next bit
  take one more bit for addressing (increase l)
  if d = l the index doubles again !!
Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with the same value)
  if the distribution is known, transform it to uniform
Dynamic hashing performs better for non-uniform distributions (affected locally)

Performance
For n records and page size b, expected size of index (Flajolet):
$$\frac{e}{b \ln 2}\, n^{\left(1 + \frac{1}{b}\right)} \approx \frac{3.92}{b}\, n^{\left(1 + \frac{1}{b}\right)}$$
1 disk access/retrieval when the index is in main memory
2 disk accesses when the index is on disk
overflows increase the number of disk accesses

Storage Utilization with Page Splitting
In general 50% < u < 100%
On the average u ~ ln2 ~ 69% (no overflows)
[Figure: one full page of size b before splitting; two half-full pages of size b after splitting]
After splitting: $u = \frac{b}{2b} = 50\%$

Storage Utilization with Overflows
Achieves higher u and avoids index doubling (d = l)
  higher u is achieved for small overflow pages
  u = 2b/3b ~ 66% after splitting
  small overflow pages (e.g., b/2) => u = (b+b/2)/2b ~ 75%
double the index only if the overflow overflows !!
[Figure: a page of size b with an overflow page of size b]
Linear Hashing (Litwin 1980)
Dynamic scheme without index
Indices refer to page addresses
Overflows are allowed
The file grows one page at a time
The page which splits is not always the one which overflowed
The pages split in a predetermined order

Linear Hashing (cont.)
Initially n empty pages
p points to the page that splits
Overflows are allowed
[Figure: a file of pages of size b with pointer p marking the next page to split]

File Growing
A page splits whenever the “splitting criterion” is satisfied
  a page is added at the end of the file
  pointer p points to the next page
  split the contents of the old page between the old and the new page based on key values
[Figure: pointer p advances to the next page]

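b = b_page = 4, b_overflow = 1
initially n = 5 pages
hash function h0 = k mod 5
splitting criterion u > A%
alternatively split when overflow overflows, etc.
[Figure: pages 0-4 filled according to h0, e.g. page 0 holds 125, 320, 90, 435, page 2 holds 402, 27, 737, 712 and page 3 holds 613, 303; the new elements 215, 522, 438 arrive; p points to page 0; annotation: u = 17/22, criterion 80% → split]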

Page 5 is added at the end of the file
The contents of page 0 are split between pages 0 and 5 based on hash function h1 = key mod 10
p points to the next page
[Figure: pages 0-5 after the split; page 0 keeps 320, 90 and page 5 receives 125, 435, 215; 438 is stored in page 3 and 522 in the overflow page of page 2; pages 0 and 5 are addressed with h1, pages 1-4 with h0; p points to page 1; u = 18/25 < 80%]

Hash Functions
Initially h0 = key mod n
As new pages are added at the end of the file, h0 alone becomes insufficient
The file will eventually double its size
In that case use h1 = key mod 2n
In the meantime
  use h0 for pages not yet split
  use h1 for pages that have already split
Split the contents of the page pointed to by p based on h1

Hash Functions (cont.)
When the file has doubled its size, h0 is no longer needed
  set h0 = h1 and continue (e.g., h0 = k mod 10)
The file will eventually double its size again
Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)

Hash Functions
Initially n pages and 0 <= h0(k) <= n-1
Series of hash functions:
$$h_{i+1}(k) = \begin{cases} h_i(k) \\ h_i(k) + n 2^i \end{cases}$$
Selection of hash function:
  if $h_i(k) \ge p$ then use $h_i(k)$ else use $h_{i+1}(k)$
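The address computation and the split step of linear hashing can be sketched as below. This is a simplified in-memory model under stated assumptions: overflow chains are folded into the page's list, the splitting criterion is left to the caller, and the class name `LinearHashFile` is hypothetical. The keys reuse the slides' example with n = 5, where page 0 holds 125, 320, 90, 435 and 215.

```python
class LinearHashFile:
    def __init__(self, n: int):
        self.n = n                              # pages at the start of this expansion
        self.p = 0                              # next page to split
        self.pages = [[] for _ in range(n)]     # overflow records kept in the same list

    def address(self, key: int) -> int:
        a = key % self.n                        # h0
        if a < self.p:                          # page a has already been split
            a = key % (2 * self.n)              # h1
        return a

    def insert(self, key: int) -> None:
        self.pages[self.address(key)].append(key)

    def split(self) -> None:
        """Split page p; the new page n + p is appended at the end of the file."""
        old = self.pages[self.p]
        stay = [k for k in old if k % (2 * self.n) == self.p]
        move = [k for k in old if k % (2 * self.n) != self.p]
        self.pages[self.p] = stay
        self.pages.append(move)
        self.p += 1
        if self.p == self.n:                    # file has doubled: set h0 = h1
            self.n *= 2
            self.p = 0


f = LinearHashFile(n=5)
for k in [125, 320, 90, 435, 215, 522, 438]:
    f.insert(k)
f.split()                                       # page 0 splits using h1 = k mod 10
print(f.pages[0], f.pages[5], f.address(215), f.address(438))
```

After the split, keys whose h0 address is below p are rehashed with h1, so 215 is found in page 5 while 438, whose page has not split yet, is still addressed with h0.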

Linear Hashing with Partial Expansions (Larson 1980)
Problem with Linear Hashing: pages to the right of p are split late
  long chains of overflows on the rightmost pages
Solution: do not wait that long to split a page
  k partial expansions: take pages in groups of k
  all k pages of a group split together
  the file grows at a lower rate

Two Partial Expansions
Initially 2n pages, n groups, 2 pages/group
  groups: (0, n) (1, 1+n) … (i, i+n) … (n-1, 2n-1)
Pages in the same group split together => some records go to a new page at the end of the file (position: 2n)
[Figure: a file with page positions 0, 1, n, 2n marked; 2 pointers to pages of the same group]

1st Expansion
After n splits, all pages are split
  the file has 3n pages (1.5 times larger)
  the file grows at a lower rate
after the 1st expansion take pages in groups of 3 pages: (j, j+n, j+2n), 0 <= j <= n-1
[Figure: the file before the 1st expansion (positions 0, n, 2n) and after it (positions 0, n, 2n, 3n)]
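The grouping of pages in the two partial expansions can be listed explicitly. A minimal sketch, where the function names are hypothetical and n is the number of groups:

```python
def groups_first_expansion(n: int) -> list:
    """Groups of 2 pages, (j, j+n), for a file of 2n pages."""
    return [(j, j + n) for j in range(n)]


def groups_second_expansion(n: int) -> list:
    """Groups of 3 pages, (j, j+n, j+2n), once the file has grown to 3n pages."""
    return [(j, j + n, j + 2 * n) for j in range(n)]


n = 3
print(groups_first_expansion(n))    # pages 0..5 in 3 groups of 2
print(groups_second_expansion(n))   # pages 0..8 in 3 groups of 3
```

Each group contributes one new page per partial expansion, so the file grows from 2n to 3n and then to 4n pages instead of doubling in a single pass.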

2nd Expansion
After n splits the file has size 4n
  repeat the same process having initially 4n pages in 2n groups
[Figure: a file with page positions 0, 1, 2n, 4n marked; 2 pointers to pages of the same group]

[Figure: disk accesses per retrieval (y-axis, 1 to 1.6) versus relative file size (x-axis, 1 to 2) for Linear Hashing and for Linear Hashing with 2 partial expansions]
b = 5, b’ = 5, u = 0.85

             Linear Hashing   Linear Hashing 2 part. exp.   Linear Hashing 3 part. exp.
retrieval    1.17             1.12                          1.09
insertion    3.57             3.21                          3.31
deletion     4.04             3.53                          3.56

Dynamic Hashing Schemes
Very good performance on membership, insert, delete operations
Suitable for both main memory and disk
  b = 1-3 records for main memory
  b = 1-4 Kbytes for disk
Critical parameter: space utilization u
  large u => more overflows, bad performance
  small u => fewer overflows, better performance
Suitable for direct access queries (random accesses) but not for range queries