
p2p 2006 1

CAN, CHORD, BATON (extensions)

p2p 2006 2

Additional issues:

Fault tolerance, load balancing, network awareness, concurrency

Replicate & cache

Performance evaluation

CHORD

BATON

Structured P2P

p2p 2006 3

Summary: Design parameters and performance (CAN)

Metrics considered: path length, neighbor state, total path latency, per-hop latency, per-node volume, multiple routes, replicas.

Dimensions (d): path length O(d·n^(1/d)), neighbor state O(d)

Realities (r): neighbor state O(r), multiple routes O(r), replicas O(r)

MAXPEERS (p): path length reduced by O(1/p), neighbor state O(p), multiple routes O(p)*, replicas O(p)*

Hash functions (k): multiple routes O(k), replicas O(k)

RTT-weighted routing: reduces per-hop latency (and thus total path latency)

Uniform partitioning heuristic: reduces the variance of path length, neighbor state, and per-node volume

* Only on replicated data

p2p 2006 4

Additional issues discussed for CHORD:

Fault tolerance

Concurrency

Replication

(Data) Load balancing

CHORD

p2p 2006 5

CHORD Stabilization

Need to keep the finger tables consistent in the case of concurrent inserts

Keep the successors of the nodes up-to-date

Then, these successors can be used to verify and correct the finger tables

p2p 2006 6

CHORD Stabilization

What about similar problems in CAN?

If we forget performance, in general, it suffices to keep the network connected

Thus:

Connect to the network by finding your successor

Periodically run stabilize (to fix successors) and, less often, run fix_fingers (to fix the finger tables)
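As a rough illustration of this schedule (not from the original slides or the Chord code), a node could drive both routines with timers; the Node object, its method names, and the two period values below are assumptions made for the sketch:

import threading

STABILIZE_PERIOD = 1.0      # seconds; assumed value, run often to keep successors correct
FIX_FINGERS_PERIOD = 15.0   # seconds; assumed value, finger repair can run less frequently

def start_maintenance(node):
    # Periodically run stabilize and, less often, fix_fingers on a hypothetical Node object.
    def stabilize_loop():
        node.stabilize()
        threading.Timer(STABILIZE_PERIOD, stabilize_loop).start()

    def fix_fingers_loop():
        node.fix_fingers()
        threading.Timer(FIX_FINGERS_PERIOD, fix_fingers_loop).start()

    stabilize_loop()
    fix_fingers_loop()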

p2p 2006 7

CHORD Stabilization

A lookup before stabilization

All finger tables are “reasonably” current: an entry is found in O(log N) steps

Successors are correct but finger tables are inaccurate: lookups are correct but may be slow

Incorrect successor pointers, or keys that have not moved yet: the lookup may fail and needs to be retried

p2p 2006 8

CHORD Stabilization

Works

Concurrent joins

Lost and reordered messages

May not work (?)

When the system is split into multiple disjoint cycles

Or forms a single cycle that loops around the identifier space more than once

Caused by failures, network partitioning, etc. – in general, left unclear

p2p 2006 9

CHORD Stabilization

n.join(n’)

predecessor = nil;

successor = n’.find_successor(n);

Upon joining, node n calls a known node n’ and asks n’ to locate its successor

Note: the rest of the network does not know n yet – this is achieved by running stabilize

Join

p2p 2006 10

n.stabilize()

x = successor.predecessor;

if (x ∈ (n, successor))

successor = x;

successor.notify(n);

CHORD Stabilization

Every node n runs stabilize periodically

n asks its successor for its predecessor, say p

n checks whether p should be its successor instead

(that is, if p has joined the network in between) – this is how nodes learn about new nodes

Stabilize

p2p 2006 11

n.notify(n’)

if (predecessor is nil

or n’ ∈ (predecessor, n))

predecessor = n’;

CHORD Stabilization

successor(n) is also notified and checks whether it should make n its predecessor
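Taken together, join, stabilize, and notify can be read as the following Python-style sketch; the in_interval helper, the Node class, and find_successor (elided here) are illustrative assumptions, not Chord's literal code:

def in_interval(x, a, b):
    # True if identifier x lies in the open interval (a, b) on the circular identifier space.
    if a < b:
        return a < x < b
    return x > a or x < b   # the interval wraps around zero

class Node:
    def __init__(self, ident):
        self.id = ident
        self.predecessor = None
        self.successor = self   # a single node is its own successor

    def join(self, known):
        # n.join(n'): learn our successor from a known node; the rest of the ring
        # learns about us only through later stabilize/notify rounds.
        self.predecessor = None
        self.successor = known.find_successor(self.id)   # find_successor: assumed to exist

    def stabilize(self):
        # n.stabilize(): adopt our successor's predecessor if it lies between us and the successor.
        x = self.successor.predecessor
        if x is not None and in_interval(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, n):
        # n.notify(n'): accept n as our predecessor if it is closer than the current one.
        if self.predecessor is None or in_interval(n.id, self.predecessor.id, self.id):
            self.predecessor = n

Running stabilize first on n and then on np reproduces the example traced on the following slides.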

Notify

p2p 2006 12

CHORD Stabilization

Example – node n joins between np and ns

Before the join: np.successor = ns, ns.predecessor = np

After n.join: n.predecessor = nil, n.successor = ns (np and ns do not yet know about n)

p2p 2006 13

Example – n.stabilize

n runs stabilize: it asks its successor ns for ns's predecessor (still np); since np is not in (n, ns), n keeps ns as its successor and calls ns.notify(n); ns then sets ns.predecessor = n

State afterwards: np.successor = ns, ns.predecessor = n, n.predecessor = nil, n.successor = ns

p2p 2006 14

Example – np.stabilize

np runs stabilize: it asks its successor ns for ns's predecessor, which is now n; since n is in (np, ns), np sets np.successor = n

State afterwards: np.successor = n, ns.predecessor = n, n.predecessor = nil, n.successor = ns

p2p 2006 15

Example – np.stabilize (continued)

np then calls n.notify(np), and n sets n.predecessor = np

Final state: np.successor = n, n.predecessor = np, n.successor = ns, ns.predecessor = n – the ring now correctly includes n

p2p 2006 16

Chord Stabilization: Fix fingers

n.fix_fingers()

i = random_index > 1 into finger[];

finger[i].node = find_successor(finger[i].start);

Finger tables are not updated immediately

Thus, lookups may be slow, but find_predecessor and find_successor still work

Periodically run fix_fingers:

pick a random entry in the finger table and update it
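Continuing the same hypothetical Node sketch from above, fix_fingers could look like the following; M, RING_SIZE, finger, and find_successor are assumed names, not the paper's exact code:

import random

M = 160                      # assumed identifier length in bits
RING_SIZE = 2 ** M

def fix_fingers(node):
    # Refresh one randomly chosen finger-table entry; over time all entries get corrected.
    # Assumes a 1-indexed finger table of size M+1 on the hypothetical Node object.
    i = random.randint(2, M)                          # random index > 1 into finger[]
    start = (node.id + 2 ** (i - 1)) % RING_SIZE      # finger[i].start
    node.finger[i] = node.find_successor(start)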

p2p 2006 17

Stabilization

Finger tables

Must have a finger at each interval, then the distance halving argument still holds

Lookup finds the predecessor of the target t, say p, then p finds the target (it is its successor)

Problem only when a new node enters between p and t

Then we may need to scan these nodes linearly, which is fine if there are O(log N) such nodes

p2p 2006 18

Stabilization

Lookups eventually succeed

Invariant: Once a node can reach a node r via successor pointers, it always can

Termination argument:

Assume two nodes (n1, n2) that both think that they have the same successor s

Both attempt to notify s

s will eventually choose the closer of the two as its predecessor, say n1

The farther of the two, n2, will then learn from s of a better successor than s (namely n1)

Thus, there is progress to a better successor each time

p2p 2006 19

When a node n fails

Nodes that have n as a successor in their finger tables must be informed and must find the successor of n to replace it in their finger tables

Lookups in progress must continue

Maintain correct successor pointers

Failures

p2p 2006 20

Replication

Each node maintains a successor list of its r nearest successors

Upon failure, use the next successor in the list

Modify stabilize to fix the list
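A minimal sketch of the successor-list idea, reusing the hypothetical Node objects from the earlier sketch; the successor_list attribute, the constant R, and the is_alive check are assumptions for illustration:

R = 3   # assumed list length; the paper suggests r = O(log N)

def update_successor_list(node):
    # Rebuild the list of the r nearest successors by following successor pointers
    # (in practice this bookkeeping is folded into stabilize).
    lst, cur = [], node.successor
    while len(lst) < R:
        lst.append(cur)
        cur = cur.successor
    node.successor_list = lst

def next_live_successor(node, is_alive):
    # On failure of the immediate successor, fall back to the next live entry in the list.
    for candidate in node.successor_list:
        if is_alive(candidate):
            return candidate
    raise RuntimeError("all r successors failed")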

Failures

p2p 2006 21

Other nodes may attempt to send requests through the failed node

Use alternate nodes found in the routing table of preceding nodes or in the successor list

Failures

p2p 2006 22

Example with r = 3 (figure): a ring of nodes whose successor lists are, going around the circle, [5, 6, 9], [6, 9, 12], [9, 12, 14], [12, 14, 15], [14, 15, 3], [15, 3, 5], [3, 5, 6]; when a node's immediate successor fails, the next live entry in its list is used

Failures

p2p 2006 23

A lookup fails if all r nodes in the successor list fail. With independent failures of probability 1/2, this happens with probability 2^(-r), which equals 1/N when r = log2(N)

Theorem: If we use a successor list of length r = O(log N) in an initially stable network and then every node fails with probability 1/2, then

with high probability, find_successor returns the closest living successor

the expected time to execute find_successor in the failed network is O(log N)
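A quick numerical check of that arithmetic (purely illustrative; the network size N below is an assumed value):

import math

N = 2 ** 10                 # assumed network size
r = int(math.log2(N))       # successor list of length r = log2(N)
p_all_fail = 0.5 ** r       # each of the r entries fails independently with probability 1/2
print(p_all_fail, 1 / N)    # both are 0.0009765625, i.e. 1/N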

Failures

p2p 2006 24

Store replicas of a key at the k nodes succeeding the key

Successor list helps to keep the number of replicas per item known

Other approach: store a copy per region

Replication

p2p 2006 25

K keys, N nodes; the ideal allocation gives each node K/N keys

10^4 nodes, 10^5 to 10^6 keys, in steps of 10^5

Mean, 1st, and 99th percentiles (the fraction of the distribution that is less than or equal to a value) of the number of keys per node

Large variation which increases with the number of keys

Load balance

p2p 2006 26

K keys, N nodes; the ideal allocation gives each node K/N keys

10^4 nodes, 5 × 10^5 keys

Load balance

Probability Density Function

p2p 2006 27

Node identifiers do not uniformly cover the entire identifier space

Assume N keys and N nodes

If we divide the space in N equal-sized bins, then we would like to see one key per node

The probability that a particular bin is empty is (1 – 1/N)^N

For large N, this approaches e^(-1) ≈ 0.368
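A quick check of this limit (illustrative only; N is an assumed value):

N = 10 ** 4
p_empty = (1 - 1 / N) ** N   # probability that a given bin receives none of the N keys
print(round(p_empty, 3))     # about 0.368, i.e. roughly e^-1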

Load balance

p2p 2006 28

Introduce virtual nodes

Map multiple virtual nodes with unrelated identifiers to each real node

Each key is hashed into a virtual node which is next mapped to an actual node

Increase the number of virtual nodes from N to N·log N

Worst-case path length O(log(N·log N)), which is still O(log N)

Each actual node needs r times as much space for the finger tables of its virtual nodes.

If r = log N, then log²N finger entries; for N = 10^6, about 400 entries

The routing messages per node also increase
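A minimal sketch of virtual nodes on a consistent-hashing ring; the helper names, the "#i" labeling of virtual identifiers, and the choice of v are assumptions for illustration, not Chord's actual implementation:

import hashlib

def h(s):
    # Map a string to a point on the identifier circle (SHA-1, as Chord uses).
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

def build_ring(physical_nodes, v):
    # Give every physical node v virtual identifiers with unrelated positions on the ring.
    return sorted((h(f"{node}#{i}"), node) for node in physical_nodes for i in range(v))

def lookup(ring, key):
    # A key is stored at the first virtual node clockwise from its hash (its successor),
    # which is then mapped back to the owning physical node.
    k = h(key)
    for point, node in ring:
        if point >= k:
            return node
    return ring[0][1]   # wrap around the circle

# e.g. build_ring(["A", "B", "C"], v=20) spreads keys far more evenly than v = 1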

Load balance: Virtual Nodes

p2p 2006 29

Load balance

p2p 2006 30

CAN

One virtual node (zone) -> Many physical nodes

reduce virtual nodes -> decrease path length

physical network awareness

Many virtual nodes -> One physical node

increase virtual nodes -> increase path length

data load balance

Load balance

p2p 2006 31

Performance Evaluation

CAN: simulation; “knobs-on full” vs. “bare bones” configurations; uses network topology generators

CHORD

Simulation (runs in iterative style; note how this affects network proximity)

Also a small-scale distributed experiment that reports latency measurements

10 sites in the USA; more than 10 nodes are simulated by running more than one copy of CHORD at each site

p2p 2006 32

Metrics for System Performance

Path length: overlay hops to route between two nodes in the CAN space

Latency: end-to-end latency between two nodes

Per-hop latency: end-to-end latency divided by the path length

Neighbor-state

Volume per node (indicative of data and query load)

Lookup load per node

p2p 2006 33

Metrics for System Performance

Routing fault tolerance

Data fault tolerance

Maintenance cost: cost of join/leave, replication etc

How? Either separately or as the overall network traffic

Churn (dynamic behavior)

p2p 2006 34

Period of stabilize

Period of fix_finger

Maximum number of virtual nodes m

Number of virtual nodes per physical node (O(logN))

Size of the successor list (O(logN))

Hash function

CHORD Magic Constants

p2p 2006 35

Path Length

2^12 nodes; the measured path length is actually about ½·log N

Intuitively, only about half of the log N bits need to be followed

p2p 2006 36

Simultaneous Node Failures

Randomly select a fraction p of nodes that fail

The network stabilizes

Lookup failure rate is p (could be worse if, say, the network partitions)

Lookups during stabilization

CHORD more

p2p 2006 37

Detect malicious participants

e.g., nodes that take a false position in the ring to “steal” data

Physical network awareness

CHORD future

p2p 2006 38

Additional issues:

Fault tolerance

Other ways to restructure/balance the tree

(Workload) load balance

BATON

p2p 2006 39

Failures

Upon node departure or failure, the parent can reconstruct the entries

Assume node x fails; any detected failures of x are reported to its parent y

y regenerates the routing tables of x – Theorem 2

Messages are routed:

Sideways (redundancy similar to CHORD)

Up-down (a node can find its parent through its neighbors)

There is routing redundancy

p2p 2006 40

AVL-like Restructuring

The network may be restructured using AVL-like rotations

No data movement is needed, but some routing tables need to be updated
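For intuition only, here is a generic AVL-style right rotation (this is not BATON's actual routing-table maintenance): only links change, and the data stored at each node stays where it is:

class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(y):
    # Standard right rotation; assumes y has a left child x, which becomes the subtree root.
    x = y.left
    y.left = x.right
    x.right = y
    return x   # in BATON, the routing tables touching x and y would now be refreshed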

p2p 2006 41

Load Balance

Each node keeps statistics about the number of queries or messages it receives

Adjust the data range to equalize the workload between adjacent nodes

For leaves: find another (less loaded) leaf, say v; have v transfer its load to its parent; and make v join again as the overloaded node's child (see the sketch below)
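A highly schematic sketch of that leaf rebalancing step; every name used here (load, parent, absorb_range, rejoin_under) is an assumption introduced only to mirror the description above:

def rebalance_overloaded_leaf(overloaded, leaves):
    # Pick a lightly loaded leaf v, let its parent absorb v's data range,
    # then have v rejoin the tree as a child of the overloaded node so the
    # overloaded range can be split between them.
    v = min(leaves, key=lambda leaf: leaf.load)
    v.parent.absorb_range(v)
    v.rejoin_under(overloaded)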

Performance results (simulation) are reported – no surprises

p2p 2006 42

Replication - Beehive

Proactive – model-driven replication

Passive (demand-driven) replication such as caching objects along a lookup path

Hint for BATON

Beehive

The length of the average query path is reduced by one when an object is proactively replicated at all nodes logically preceding its home node on all query paths

BATON

Range queries

Many paths to data