E.G.M. PetrakisHashing1 Hashing on the Disk Keys are stored in “disk pages” (“buckets”) ...

E.G.M. Petrakis Hashing 1

Hashing on the Disk

Keys are stored in “disk pages” (“buckets”); several records fit within one page.

Retrieval:
  find the address of the page
  bring the page into main memory
  searching within the page comes for free


[Figure: a hash function maps the key space Σ to data pages 0, 1, 2, ..., m-1, each of size b]

page size b: maximum number of records in a page
space utilization u: measure of the use of space

$u = \frac{\#\,\text{stored records}}{\#\,\text{pages} \times b}$
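As a small illustration of the definition of u, the following sketch computes the space utilization of a file. The function name `space_utilization` and the sample numbers are assumptions made for this example only.

```python
def space_utilization(num_records: int, num_pages: int, b: int) -> float:
    """Return u = (# stored records) / (# pages * b)."""
    if num_pages <= 0 or b <= 0:
        raise ValueError("num_pages and b must be positive")
    return num_records / (num_pages * b)


# 14 records stored in 5 pages that hold up to b = 4 records each
print(space_utilization(14, 5, 4))
```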


Collisions

Keys that hash to the same address are stored within the same page.

If the page is full:
  i. page splits: allocate a new page and split the page content between the old and the new page, or
  ii. overflows: keep a list of overflow pages

[Figure: a full page linked to an overflow page]


Access Time

Goal: find a key in one disk access
Access time ~ number of disk accesses
Large u: good space utilization, but many overflows or splits => more disk accesses
Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses


Categories of Methods

Static: require file reorganization
  open addressing, separate chaining
Dynamic: dynamic file growth, adapt to file size
  dynamic hashing, extendible hashing, linear hashing, spiral storage, …


Dynamic Hashing Schemes

File size adapts to data size without total reorganization

Typically 1-3 disk accesses to access a key

Access time and u are a typical trade-off

u between 50% and 100% (typically 69%)

Complicated implementation


Schemes With Index

Two disk accesses: one to access the index, one to access the data
  with the index in main memory => one disk access
Problem: the index may become too large

Dynamic hashing (Larson 1978)
Extendible hashing (Fagin et al. 1979)

[Figure: an index pointing to data pages]


Schemes Without Index

Ideally, less space and fewer disk accesses (at least one)

Linear Hashing (Litwin 1980)
Linear Hashing with Partial Expansions (Larson 1980)
Spiral Storage (Martin 1979)

[Figure: the address space maps directly to the data space]


Hash Functions

Support for a shrinking or growing file
  shrinking or growing address space; the hash function adapts to these changes
  hash functions use the first (last) bits of the key: $key = b_{n-1} b_{n-2} \ldots b_i b_{i-1} \ldots b_2 b_1 b_0$
  $h_i(key) = b_{i-1} \ldots b_2 b_1 b_0$ supports $2^i$ addresses
  $h_i$ uses one more bit than $h_{i-1}$ to address larger files

$h_i(key) = h_{i-1}(key)$ or $h_i(key) = h_{i-1}(key) + 2^{i-1}$
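The family of hash functions described here can be sketched in a few lines. This is a minimal illustration, assuming integer keys and the use of the last (low-order) bits; the helper name `h` is hypothetical. The loop checks that taking one more bit either keeps the address or adds 2^(i-1) to it.

```python
def h(i: int, key: int) -> int:
    """Hypothetical helper: the i low-order bits of key, i.e. b_{i-1} ... b_1 b_0."""
    return key & ((1 << i) - 1)


key = 0b101101  # 45
for i in range(1, 5):
    prev, cur = h(i - 1, key), h(i, key)
    # One more address bit either keeps the address or adds 2^(i-1) to it.
    assert cur in (prev, prev + (1 << (i - 1)))
    print(i, format(cur, f"0{i}b"))
```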


Dynamic Hashing (Larson 1978)

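Two-level index
  primary h1(key): accesses a hash table
  secondary h2(key): accesses a binary tree

[Figure: the index is a binary tree. h1(k) selects an entry of the first-level hash table, h2(k) selects a path in the second-level binary tree, and the leaves point to data pages 1-4 of size b.]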


Index

Fixed (static) primary index: h1(key) = key mod m
Dynamic behavior on the secondary index
  h2(key) uses i bits of the key
  the bit sequence of $h_2 = b_{i-1} \ldots b_2 b_1 b_0$ denotes which path of the binary tree index to follow in order to access the data page
  scan h2 from right to left (bit 1: follow the right path, bit 0: follow the left path)
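The two-level lookup can be sketched as follows. This is an illustration under stated assumptions, not Larson's original formulation: the classes `Leaf` and `Node`, the function `find_page` and the sample keys are hypothetical, and the h2 bits are taken from `key // M` so that they are independent of h1.

```python
from typing import List, Optional, Union


class Leaf:
    """A data page holding up to b keys."""
    def __init__(self, keys: List[int]):
        self.keys = keys


class Node:
    """Internal node of the binary tree index."""
    def __init__(self, left: "Tree", right: "Tree"):
        self.left = left    # followed when the next h2 bit is 0
        self.right = right  # followed when the next h2 bit is 1


Tree = Union[Leaf, Node]
M = 6  # size of the static primary index: h1(key) = key mod M


def find_page(index: List[Optional[Tree]], key: int) -> Optional[Leaf]:
    tree = index[key % M]                 # first level: h1
    bits = key // M                       # assumption: source of the h2 bits
    while isinstance(tree, Node):         # second level: scan h2 right to left
        tree = tree.right if bits & 1 else tree.left
        bits >>= 1
    return tree


# Entry 1 has a tree of depth 2 with leaves "0", "01", "11"; entry 5 has one page.
page_0, page_01, page_11, page_any = Leaf([13]), Leaf([7]), Leaf([19]), Leaf([5])
index: List[Optional[Tree]] = [None] * M
index[1] = Node(page_0, Node(page_01, page_11))
index[5] = page_any
print(find_page(index, 7).keys)
```

With the index in main memory, the walk costs no disk access; only the final data page has to be read.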


[Figure: example index with h1(key) = key mod 6 (entries 0-5) and a binary tree of depth 2 under entry 1. The data pages are addressed by h1=1, h2=“0”; h1=1, h2=“01”; h1=1, h2=“11”; and h1=5, h2=any. h2(key) = “01” uses at most 2 bits because the depth of the binary tree is 2.]


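Insertions

Initially: fixed-size primary index and no data
  insert the record in a new page under its h1 address
  if the page is full, allocate one extra page
  split the keys between the old and the new page
  use one extra bit in h2 for addressing

[Figure: a primary index with entries 0-3. A single page of size b under entry 1 (h1=1, h2=any) splits into two pages addressed by h1=1, h2=0 and h1=1, h2=1.]

[Figure: index storage as data pages 1-5 are allocated. The pages start as h1=0, h2=any and h1=3, h2=any; the page under entry 0 splits into h1=0, h2=0 and h1=0, h2=1; in the last step the pages are addressed by h1=0, h2=0; h1=0, h2=01; h1=0, h2=11; h1=3, h2=0; and h1=3, h2=1.]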
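The insertion steps of dynamic hashing can be sketched as below. This is a minimal in-memory illustration, assuming distinct integer keys; the names `Leaf`, `Node`, `h2_bit` and `insert`, the tiny values of `B` and `M`, and the choice of `key // M` as the source of the h2 bits are all assumptions made for the example.

```python
from typing import List, Optional, Union

B = 2  # page size b, kept tiny so that splits happen quickly
M = 4  # size of the fixed primary index: h1(key) = key mod M


class Leaf:
    def __init__(self):
        self.keys: List[int] = []


class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right


def h2_bit(key: int, level: int) -> int:
    """Bit number `level` of h2, counted from the right (assumed source: key // M)."""
    return ((key // M) >> level) & 1


def insert(tree: Optional[Union[Leaf, Node]], key: int, level: int = 0):
    """Insert key below `tree` and return the (possibly replaced) subtree."""
    if tree is None:
        tree = Leaf()                       # first record under this h1 address
    if isinstance(tree, Node):
        if h2_bit(key, level):
            tree.right = insert(tree.right, key, level + 1)
        else:
            tree.left = insert(tree.left, key, level + 1)
        return tree
    tree.keys.append(key)
    if len(tree.keys) <= B:
        return tree
    # Page is full: redistribute its keys over two pages, using one more h2 bit.
    node = Node(Leaf(), Leaf())
    for k in tree.keys:
        insert(node, k, level)
    return node


index: List[Optional[Union[Leaf, Node]]] = [None] * M
for key in (1, 5, 9, 13):
    index[key % M] = insert(index[key % M], key)
root = index[1]
print(root.left.keys, root.right.keys)
```

Only the tree under one primary entry changes on a split, which is why a skewed key distribution affects the index locally.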


Deletions

Find the record to be deleted using h1, h2
Delete the record
Check the “sibling” page: do both pages together hold fewer than b records?
  if yes, merge the two pages
  delete the one empty page
  shrink the binary tree index by one level and reduce h2 by one bit


[Figure: the index (entries 0-3) with data pages 1-4, before and after a delete that triggers merging]


Extendible Hashing (Fagin et al. 1979)

Dynamic hashing without the primary index
  primary hashing is omitted
  only secondary hashing, with all binary trees at the same level
The index shrinks and grows according to the file size
Data pages are attached to the index


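[Figure: a dynamic hashing index (entries 0-4) next to the same index with all binary trees at the same level, which becomes a flat table with entries 00, 01, 10, 11. The numbers attached to the table and to the pages give the number of address bits.]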


Insertions

Initially: 1 index entry and 1 data page
  0 address bits
  insert records in the data page

global depth d: the size of the index is $2^d$
local depth l: number of address bits

[Figure: index storage with a single entry pointing to one page of size b; d = 0 and l = 0]


Page “0” Overflows

[Figure: the index now has two entries, 0 and 1, each pointing to its own page of size b; global depth d = 1, local depth l = 1]


Page “0” Overflows (cont.)

1 more key bit for addressing and 1 extra page => the index doubles!
Split the contents of the previous page between 2 pages according to the next bit of the key
Global depth d: number of index bits => index size $2^d$
Local depth l: number of bits used for record addressing


Page “0” Overflows (again)

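[Figure: index with entries 00, 01, 10, 11 and d = 2. The pages with l = 2 contain records with the same 2 bits of the key; the page with l = 1 contains records with the same 1st bit of the key.]

[Figure: index with entries 000, 001, 010, 011, 100, 101, 110, 111 and d = 3 (1 more key bit for addressing); the pages have local depths 1, 2, 3 and 3.]

$2^{d-l}$: number of index pointers to a page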



No need to double the index
  page 100 splits into two (1 new page)
  the local depth l is increased by 1

[Figure: index with entries 000-111 and d = 3; the page with l = 2 splits and the local depth becomes 2+1 = 3]


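Insertion Algorithm

If l < d: split the overflowed page (1 extra page)
If l = d: the index is doubled and the page is split
  d is increased by 1 => 1 more bit for addressing
Update the pointers (either way):
  a) if d prefix bits are used for addressing:
       d = d + 1;
       /* go downwards so that index[i/2] is read before it is overwritten */
       for (i = (1 << d) - 1; i >= 0; i--) index[i] = index[i/2];
  b) if d suffix bits are used:
       /* the new upper half of the index is a copy of the lower half */
       for (i = 0; i <= (1 << d) - 1; i++) index[i + (1 << d)] = index[i];
       d = d + 1;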
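The insertion algorithm can be tried out with the following in-memory sketch. It assumes suffix-bit addressing, distinct integer keys and no overflow pages; the class names `Page` and `ExtendibleHash` and the method names are hypothetical and chosen only for this illustration.

```python
class Page:
    def __init__(self, local_depth: int):
        self.l = local_depth      # local depth
        self.keys = []


class ExtendibleHash:
    def __init__(self, b: int = 2):
        self.b = b                # page size
        self.d = 0                # global depth
        self.index = [Page(0)]    # 2^d pointers to pages

    def _slot(self, key: int) -> int:
        return key & ((1 << self.d) - 1)      # the d suffix bits of the key

    def find(self, key: int) -> bool:
        return key in self.index[self._slot(key)].keys

    def insert(self, key: int) -> None:
        page = self.index[self._slot(key)]
        page.keys.append(key)
        while len(page.keys) > self.b:
            if page.l == self.d:
                # l = d: double the index, index[i + 2^d] = index[i]
                self.index = self.index + self.index
                self.d += 1
            # now l < d: split the page on the next key bit
            bit = 1 << page.l
            page.l += 1
            new = Page(page.l)
            old_keys, page.keys = page.keys, []
            for k in old_keys:
                (new if k & bit else page).keys.append(k)
            # half of the pointers to the old page move to the new page
            for i in range(len(self.index)):
                if self.index[i] is page and i & bit:
                    self.index[i] = new
            # keep splitting whichever page is still over capacity
            page = new if len(new.keys) > self.b else page


eh = ExtendibleHash(b=2)
keys = [4, 12, 20, 1, 3, 7, 8]
for k in keys:
    eh.insert(k)
print(eh.d, len(eh.index))
```

The keys 4, 12 and 20 share their three lowest bits, so inserting them forces the index to double repeatedly even though the file is tiny.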


Deletion Algorithm

Find and delete the record
Check the sibling page
If both pages together hold fewer than b records:
  merge the pages and free the empty page
  decrease the local depth l by 1 (records in the merged page have 1 less common bit)
  if l < d everywhere => reduce the index to half its size
  update the pointers
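The last step of the deletion algorithm, halving the index when l < d everywhere, can be sketched as follows. The sketch assumes suffix-bit addressing and represents pages by hypothetical string identifiers; `shrink_index` and the sample index are illustrative names and data only.

```python
from typing import Dict, List


def shrink_index(index: List[str], local_depth: Dict[str, int]) -> List[str]:
    """Halve the index while every page has local depth l < global depth d."""
    d = len(index).bit_length() - 1          # the index size is 2^d
    while d > 0 and all(local_depth[p] < d for p in set(index)):
        # With suffix bits (an assumption), entries i and i + 2^(d-1) point
        # to the same page, so the first half carries all the information.
        index = index[: len(index) // 2]
        d -= 1
    return index


# d = 3, but after merging no page uses more than 2 address bits
index = ["A", "B", "C", "B", "A", "B", "C", "B"]
depths = {"A": 2, "B": 1, "C": 2}
print(shrink_index(index, depths))
```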


[Figure: delete with merging. Index with entries 000-111 and d = 3; two sibling pages with l = 3 merge into one page with l = 2. Then l < d for every page, so the index is halved to entries 00, 01, 10, 11 with d = 2.]


Observations

A page splits and there are more than b keys with the same next bit
  take one more bit for addressing (increase l)
  if d = l, the index doubles again!
Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with the same value)
  if the distribution is known, transform it to uniform
Dynamic hashing performs better for non-uniform distributions (it is affected only locally)


Performance

For n records and page size b, the expected size of the index (Flajolet) is

$\frac{e}{b \ln 2}\, n^{(1 + \frac{1}{b})} \approx \frac{3.92}{b}\, n^{(1 + \frac{1}{b})}$

1 disk access per retrieval when the index is in main memory
2 disk accesses when the index is on disk
Overflows increase the number of disk accesses
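To get a feeling for how fast the index grows, the estimate can be evaluated numerically. The sketch assumes that the estimate has the form (3.92 / b) · n^(1 + 1/b); the function name `expected_index_size` and the chosen values of n and b are hypothetical.

```python
def expected_index_size(n: int, b: int) -> float:
    """Approximate expected number of index entries: (3.92 / b) * n^(1 + 1/b)."""
    return (3.92 / b) * n ** (1 + 1 / b)


# One million records with different page sizes
for b in (10, 50, 100):
    print(b, round(expected_index_size(1_000_000, b)))
```

The growth is slightly faster than linear in n, and larger pages make the index much smaller.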


Storage Utilization with Page Splitting

In general 50% < u < 100%
On the average u ~ ln 2 ~ 69% (no overflows)

[Figure: one full page of size b before splitting; two pages of size b after splitting]

After splitting: $u = \frac{b}{2b} = 50\%$


Storage Utilization with Overflows

Achieves higher u and avoids index doubling (d = l)
  higher u is achieved for small overflow pages
  u = 2b/3b ~ 66% after splitting
  small overflow pages (e.g., b/2) => u = (b + b/2)/2b ~ 75%
  double the index only if the overflow page overflows!

[Figure: a page of size b with an overflow page of size b]


Linear Hashing (Litwin 1980)

Dynamic scheme without index
Indices refer to page addresses
Overflows are allowed
The file grows one page at a time
The page which splits is not always the one which overflowed
The pages split in a predetermined order


Linear Hashing (cont.)

Initially n empty pages
p points to the page that splits
Overflows are allowed

[Figure: a file of pages of size b with the split pointer p]


File Growing

A page splits whenever the “splitting criterion” is satisfied
  a page is added at the end of the file
  pointer p points to the next page
  split the contents of the old page between the old and the new page based on key values

[Figure: pointer p advances to the next page]


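$b = b_{page} = 4$, $b_{overflow} = 1$
Initially n = 5 pages
Hash function $h_0 = k \bmod 5$
Splitting criterion: u > A%
  alternatively, split when an overflow page overflows, etc.

[Figure: pages 0-4 filled with keys according to h0, with pointer p at page 0 and the new elements 215, 522 and 438; the utilization u is compared with the 80% splitting threshold]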


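Page 5 is added at the end of the file
The contents of page 0 are split between pages 0 and 5 based on the hash function $h_1 = key \bmod 10$
p points to the next page

[Figure: pages 0-5 after the split, with pointer p at page 1. Page 0 keeps 320 and 90; the new page 5 receives 125, 435 and 215; 438 is stored in page 3 and 522 in an overflow page. Pages 0 and 5 are addressed with h1, pages 1-4 with h0; u is again compared with the 80% threshold.]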
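The split of page 0 in this example can be reproduced with a few lines of code. This is only a sketch: the function `split_page` is hypothetical, pages are plain lists, overflow records are kept in the same list as the page they belong to, and the other pages are left empty because only the contents of page 0 matter here.

```python
def split_page(pages, p, n):
    """Split page p of a file that started with n pages, using h1 = key mod 2n."""
    old, new = [], []
    for key in pages[p]:
        (old if key % (2 * n) == p else new).append(key)
    pages[p] = old
    pages.append(new)          # the new page goes to the end of the file
    return p + 1               # p now points to the next page


# Page 0 with its four records plus the overflow record 215
pages = [[125, 320, 90, 435, 215], [], [], [], []]
p = split_page(pages, p=0, n=5)
print(pages[0], pages[5], p)
```

The page that splits is the one p points to, regardless of which page actually overflowed.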


Hash Functions

Initially $h_0 = key \bmod n$
As new pages are added at the end of the file, $h_0$ alone becomes insufficient
The file will eventually double its size
  in that case use $h_1 = key \bmod 2n$
In the meantime
  use $h_0$ for pages not yet split
  use $h_1$ for pages that have already split
Split the contents of the page pointed to by p based on $h_1$


Hash Functions (cont.)

When the file has doubled its size, $h_0$ is no longer needed
  set $h_0 = h_1$ and continue (e.g., $h_0 = k \bmod 10$)
The file will eventually double its size again
Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)


Hash Functions

Initially n pages and $0 \le h_0(k) < n$
Series of hash functions:

$h_{i+1}(k) = h_i(k)$ or $h_{i+1}(k) = h_i(k) + n 2^i$

Selection of hash function:
  if $h_i(k) \ge p$ then use $h_i(k)$
  else use $h_{i+1}(k)$
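The selection rule can be written as a small address function. The sketch assumes the series h_i(k) = k mod (n · 2^i), which generalizes h0 = k mod n and h1 = k mod 2n; the function name `address` and its parameters are hypothetical.

```python
def address(k: int, n: int, i: int, p: int) -> int:
    """Page address of key k during expansion i, with split pointer p."""
    h_i = k % (n * 2 ** i)
    if h_i >= p:
        return h_i                      # this page has not been split yet
    return k % (n * 2 ** (i + 1))       # page already split: use h_{i+1}


# n = 5 initial pages; page 0 has been split, so p = 1 and i = 0
for k in (320, 125, 522):
    print(k, address(k, n=5, i=0, p=1))
```

No index is consulted: the address follows from the key, the expansion number i and the pointer p alone.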


Linear Hashing with Partial Expansions (Larson 1980)

Problem with Linear Hashing: pages to the right of p are split late
  large chains of overflows on the rightmost pages
Solution: do not wait that long to split a page
  k partial expansions: take pages in groups of k
  all k pages of a group split together
  the file grows at lower rates


Two Partial Expansions

Initially 2n pages, n groups, 2 pages/group
  groups: (0, n), (1, 1+n), …, (i, i+n), …, (n-1, 2n-1)
Pages in the same group split together => some records go to a new page at the end of the file (position: 2n)

[Figure: a file with positions 0, 1, n and 2n marked; 2 pointers to the pages of the same group]


1st Expansion

After n splits, all pages are split
  the file has 3n pages (1.5 times larger)
  the file grows at a lower rate
After the 1st expansion take pages in groups of 3 pages: (j, j+n, j+2n), 0 <= j < n

[Figure: the file with positions 0, n, 2n and 3n marked, before and after the 1st expansion]
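The grouping of pages before and after the 1st expansion can be listed with a short helper. This is a sketch with a hypothetical function name `groups`; it only enumerates page numbers and does not move any records.

```python
def groups(n: int, pages_per_group: int):
    """Group j holds pages j, j + n, j + 2n, ... (pages_per_group of them)."""
    return [tuple(j + m * n for m in range(pages_per_group)) for j in range(n)]


print(groups(3, 2))   # before the 1st expansion: 2n = 6 pages in n = 3 groups
print(groups(3, 3))   # after the 1st expansion: 3n = 9 pages in n = 3 groups
```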


2nd Expansion

After n splits the file has size 4n
Repeat the same process, having initially 4n pages in 2n groups

[Figure: a file with positions 0, 1, 2n and 4n marked; 2 pointers to the pages of the same group]


[Figure: disk accesses per retrieval (axis from 1 to 1.6) versus relative file size (axis from 1 to 2) for Linear Hashing and for Linear Hashing with 2 partial expansions]

b = 5, b’ = 5, u = 0.85

            Linear Hashing   Linear Hashing,    Linear Hashing,
                             2 part. exp.       3 part. exp.
retrieval   1.17             1.12               1.09
insertion   3.57             3.21               3.31
deletion    4.04             3.53               3.56


Dynamic Hashing Schemes

Very good performance on membership, insert, delete operations

Suitable for both main memory and disk
  b = 1-3 records for main memory
  b = 1-4 Kbytes for disk

Critical parameter: space utilization u
  large u => more overflows, bad performance
  small u => fewer overflows, better performance

Suitable for direct access queries (random accesses) but not for range queries
