Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M....

45
E.G.M. Petrakis Hashing 1 Hashing on the Disk Keys are stored in “disk pages(“buckets”) several records fit within one page Retrieval: find address of page bring page into main memory searching within the page comes for free

Transcript of Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M....

Page 1: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 1

Hashing on the Disk

Keys are stored in “disk pages”(“buckets”)

several records fit within one pageRetrieval:

find address of pagebring page into main memorysearching within the page comes for free

Page 2: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 2

Σkey

spacehash

function....

data pages0

12

m-1b

page size b: maximum number of records in pagespace utilization u: measure of the use of space

bpages#recordsstored#u⋅

=

Page 3: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 3

Collisions

Keys that hash to the same address are stored within the same pageIf the page is full:

i.

page splits: allocate a new page and split page content between the old and the new page or

ii. overflows: list of overflow pages

x x x x x xoverflow

Page 4: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 4

Access Time

Goal: find key in one disk access Access time ~ number of accessesLarge u: good space utilization but many overflows or splits => more disk accesses Non-uniform key distribution => many keys map to the same addresses =>overflows or splits => more accesses

Page 5: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 5

Categories of Methods

Static: require file reorganizationopen addressing, separate chaining

Dynamic: dynamic file growth, adapt to file size

dynamic hashing, extendible hashing, linear hashing, spiral storage…

Page 6: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 6

Dynamic Hashing Schemes

File size adapts to data size without total reorganizationTypically 1-3 disk accesses to access a keyAccess time and u are a typical trade-offu between 50-100% (typically 69%)Complicated implementation

Page 7: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 7

Two disk accesses:one to access the index, one to access the datawith index in main memory => one disk access Problem: the index may become too large

Dynamic hashing (Larson 1978)Extendible hashing (Fagin et.al. 1979)

index data pages

Schemes With Index

Page 8: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 8

Ideally, less space and less disk accesses (at least one)

Linear Hashing (Litwin 1980)Linear Hashing with Partial Expansions (Larson 1980)Spiral Storage (Martin 1979)

address space

data space

Schemes Without Index

Page 9: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 9

Support for shrinking or growing file shrinking or growing address space, the hash function adapts to these changeshash functions using first (last) bits of key = bn-1bn-2….bi b i-1…b2b1b0hi(key)=bi-1…b2b1b0 supports 2i addresseshi: one more bit than hi-1 to address larger files

⎩⎨⎧

+=

−i

1i

1ii 2(key)h

(key)h(key)h

Hash Functions

Page 10: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 10

Dynamic Hashing (Larson

1978)Two level index

primary h1(key): accesses a hash tablesecondary h2(key): accesses a binary tree Index: binary tree

h1

(k)1st

levelh2

(k)2nd

leveldata pages

b

2

3

4

1

Page 11: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 11

Index

Fixed (static): h1(key) = key mod mDynamic behavior on secondary index

h2(key) uses i bits of keythe bit sequence of h2=bi-1…b2b1b0denotes which path on the binary tree index to follow in order to access the data pagescan h2 from left to right (bit 1: follow right path, bit 0: follow left path)

Page 12: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 12

index

h1

(k)1st

levelh2

(k)2nd

leveldata pages

b

2

3

4

1012345

0

11

0h1

=1, h2

=“0”h1

=1, h2

=“01”h1

=1, h2

=“11”h1

=5, h2

= any

h1

(key) = key mod 6h2

(key) = “10…

“<= depth of binary tree = 2

0

Page 13: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 13

Initially fixed size primary index and no data

insert record in new page under h1addressif page is full, allocate one extra pagesplit keys between old and new pageuse one extra bit in h2 for addressing

h1

=1, h2

=0h1

=1, h2

=10123

0

1

Insertions

0123

0123

bh1

=1,h2

=any

Page 14: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 14

0

321

0

321

0

321

0

321

1

2

h1

=0,

h2

=any

h1

=3,

h2

=any

1

2

3

10

h1

=0,

h2

=0

h1

=0,

h2

=1

h1

=3,

h2

=any

01

01

13425

h1

=3,

h2

=0

h1

=0,

h2

=0

h1

=3,

h2

=1

h1

=0,

h2

=01h1

=0,

h2

=11

b

index storage

01

Page 15: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 15

Deletions

Find record to be deleted using h1, h2Delete record Check “sibling” page:

less than b records in both pages ?if yes merge the two pages delete one empty pageshrink binary tree index by one level and reduce h2 by one bit

Page 16: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 16

merging0

321

1

3

2

4

0

321

1

3

2

4

delete

Page 17: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 17

Extendible Hashing (Fagin et.al. 1979)

Dynamic hashing without indexPrimary hashing is omittedOnly secondary hashing with all binary trees at the same levelThe index shrinks and grows according to file size Data pages attached to the index

Page 18: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 18

dynamichashing withall binary treesat same level

100

1234

1

0

0

100

01

10

11

01234

2

2

1

dynamichashing

number ofaddress bits

Page 19: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 19

Initially 1 index and 1 data page 0 address bitsinsert records in data page

index storage

0

b

0

Insertions

global depth d:size of index 2d

local depth l :Number of address bits

Page 20: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 20

11

01

d: global depth = 1l : local depth = 1

d

1

l

index storage

0

b

0

d l

Page “0”

Overflows

Page 21: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 21

Page “0”

Overflows (cont.)

1 more key bit for addressing and 1 extra page => index doubles !!Split contents of previous page between 2 pages according to next bit of keyGlobal depth d: number of index bits => 2d

index sizeLocal depth l : number of bits for record addressing

Page 22: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 22

Page “0”

Overflows (again)

00011011

2 2

2

1

contains recordswith same 1st

bit of key

dl ≤

contain recordswith same 2 bits of key

d

Page 23: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 23

Page “01”

Overflows3

000001010011100101110111

1

2

3

3

d

1 more key bitfor addressing

2d-l: number of pointers to page

Page 24: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 24

Page “100”

Overflows

no need to double indexpage 100 splits into two (1 new page)local depth l is increased by 1

000001010011100101110111

23

2

3

3

2+1

Page 25: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 25

If l < d, split overflowed page (1 extra page)If l = d double index, split page and

d is increased by 1=>1 more bit for addressingupdate pointers (either way):a)

if d prefix bits

are used for addressingd=d+1;for (i=2d-1, i>=0,i--) index[i]=index[i/2];b)

if d suffix bits

are usedfor (i=0; i <= 2d-1; i++) index[i]=index[i]+2d-1;d=d+1

Insertion Algorithm

Page 26: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 26

Deletion Algorithm

Find and delete recordCheck sibling page If less than b records in both pages

merge pages and free empty pagedecrease local depth l by 1 (records in merged page have 1 less common bit)if l < d everywhere => reduce index (half size)update pointers

Page 27: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 27

000001010011100101110111

23

2

3

3

2

delete withmerging

000001010011100101110111

23

2

2

2

l < d 00011011

22

2

2

2

Page 28: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 28

A page splits and there are more than bkeys with same next bit

take one more bit for addressing (increase l) if d=l the index doubles again !!

Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with same value)

if distribution is known, transform it to uniformDynamic hashing performs better for non-uniform distributions (affected locally)

Observations

Page 29: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 29

For n: records and page size bexpected size of index (Flajolet)

1 disk access/retrieval when index in main memory2 disk accesses when index is on diskoverflows increase number of disk accesses

)b1(1)

b1(1

nb

3.92nblog2

l ++≈

Performance

Page 30: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 30

Storage Utilization with Page Splitting

In general 50% < u < 100%On the average u ~ ln2 ~ 69% (no overflows)

bb

before splittingafter splitting

50%2bbu == After splitting

Page 31: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 31

Storage Utilization with Overflows

Achieves higher u and avoids page doubling (d=l)

higher u is achieved for small overflow pagesu=2b/3b~66% after splittingsmall overflow pages (e.g., b/2) => u = (b+b/2)/2b ~ 75%

double index only if the overflow overflows!!

bb

Page 32: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 32

Linear Hashing (Litwin

1980)

Dynamic scheme without indexIndices refer to page addresses Overflows are allowedThe file grows one page at a timeThe page which splits is not always the one which overflowedThe pages split in a predetermined order

Page 33: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 33

Linear Hashing (cont.)

Initially n empty pages p points to the page that splits

Overflows are allowed

bp

bp

Page 34: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 34

File Growing

A page splits whenever the “splitting criterion” is satisfied

a page is added at the end of the filepointer p points to the next pagesplit contents of old page between old and new page based on key values

p

Page 35: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 35

b=bpage=4, boverflow=1initially n=5 pageshash function h0=k mod 5splitting criterion u > A%alternatively split when overflow overflows, etc.

4319

613303

40227

737712

16711

125320

90435

p

215 522 438 new element

0 1 2 3 4

split80%2217u →>=

Page 36: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 36

Page 5 is added at end of fileThe contents of page 0 are split between pages 0 and 5 based on hash function h1 = key mod 10p points to the next page

p

4319

613303438

40227

737712

16711

32090

522

125435215

0 1 2 3 4 5

1h 1h0h 0h 0h 0h

%802518

<=u

Page 37: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 37

Initially h0=key mod nAs new pages are added at end of file, h0alone becomes insufficient The file will eventually double its size In that case use h1=key mod 2nIn the meantime

use h0 for pages not yet splituse h1 for pages that have already split

Split contents of page pointed to by pbased on h1

Hash Functions

Page 38: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 38

When the file has doubled its size, h0is no longer needed

set h0=h1 and continue (e.g., h0=k mod 10)The file will eventually double its size againDeletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%)

Hash Functions (cont.)

Page 39: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 39

Initially n pages and 0 <= h0(k) <= nSeries of hash functions

Selection of hash function:if hi

(k) >= p then use hi

(k) else use hi+1

(k)

⎩⎨⎧

+=+ i

i

i1i n2(k)h

(k)h(k)h

Hash Functions

Page 40: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 40

Linear Hashing with Partial Expansions (Larson 1980)

Problem with Linear Hashing: pages to the right of p delay to split

large chains of overflows on rightmost pagesSolution: do not wait that much to split a page

k partial expansions: take pages in groups of kall k pages of a group split togetherthe file grows at lower rates

Page 41: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 41

Two Partial Expansions

Initially 2n pages, n groups, 2 pages/groupgroups: (0, n) (1, 1+n)…(i, i+n) … (n-1, 2n-1)

Pages in same group spit together => some records go to a new page at end of file (position: 2n)

2 pointers to pages of

the same group0 1 n 2n

Page 42: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 42

1st

Expansion

After n splits, all pages are splitthe file has 3n pages (1.5 time larger) the file grows at lower rate

after 1st expansion take pages in groups of 3 pages: (j, j+n, j+2n), 0 <= j <= n

0 n

2n

3n

0 n 2n 3n

Page 43: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 43

2nd

Expansion

After n splits the file has size 4nrepeat the same process having initially 4n pages in 2n groups

2 pointers to pages ofthe same group

0 1 2n 4n

Page 44: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 44

1

1,1

1,2

1,3

1,4

1,5

1,6

1 1,2 1,4 1,6 1,8 2relative file size

disk

acc

ess/

retr

ieva

l LinearHashing

LinearHashing2 partialexpansions

3.563.534.04deletion3.313.213.57insertion1.091.121.17retrieval

Linear Hashing3 part. Exp.

Linear

Hashing

2 part. Exp.

Linear Hashing

b = 5b’

= 5

u = 0.85

Page 45: Hashing on the Disk - INTELLIGENCE: Lab Descriptionpetrakis/courses/datastructures... · E.G.M. Petrakis Hashing 1 Hashing on the Disk ... Complicated implementation. ... hash functions

E.G.M. Petrakis Hashing 45

Dynamic Hashing SchemesVery good performance on membership, insert, delete operationsSuitable for both main memory and disk

b=1-3 records for main memoryb=1-4 Kbytes for disk

Critical parameter: space utilization ularge u => more overflows, bad performancesmall u => less overflows, better performance

Suitable for direct access queries (random accesses) but not for range queries