p2p 2006 1
CAN, CHORD, BATON (extensions)
p2p 2006 2
Additional issues:
Fault tolerance, load balancing, network awareness, concurrency
Replicate & cache
Performance evaluation
CHORD
BATON
Structured P2P
p2p 2006 3
Summary: Design parameters and performance (CAN)
Effect of each design parameter on path length, neighbor state, total path latency, per-hop latency, volume per node, and multiple routes/replicas:
Dimensions (d): path length O(dn^(1/d)), neighbor state O(d)
Realities (r): neighbor state O(r), multiple routes O(r), replicas O(r)
MAXPEERS (p): path length O(1/p), neighbor state O(p), multiple routes and replicas O(p)*
Hash functions (k): latency O(k) (query the closest of k replicas), replicas O(k)
RTT-weighted routing: reduces per-hop latency
Uniform partitioning heuristic: reduces variance (of per-node volume, and hence of data and query load)
* Only on replicated data
p2p 2006 4
Additional issues discussed for CHORD:
Fault tolerance
Concurrency
Replication
(Data) Load balancing
CHORD
p2p 2006 5
CHORD Stabilization
Need to keep the finger tables consistent in the presence of concurrent joins
Keep the successors of the nodes up-to-date
Then, these successors can be used to verify and correct the finger tables
p2p 2006 6
CHORD Stabilization
What about similar problems in CAN?
If we set performance aside, in general it suffices to keep the network connected
Thus:
Connect to the network by finding your successor
Periodically run stabilize (to fix successors) and, less often, fix_fingers (to fix the finger table)
p2p 2006 7
CHORD Stabilization
A lookup before stabilization completes falls into one of three cases:
If all finger tables are “reasonably” current, the entry is found in O(logN) steps
If successors are correct but finger tables are inaccurate, lookups are correct but possibly slow
If successor pointers are incorrect, or keys have not yet moved to newly joined nodes, the lookup may fail and must be retried
p2p 2006 8
CHORD Stabilization
Works under:
Concurrent joins
Lost and reordered messages
May not work (?) when:
The system is split into multiple disjoint cycles
A single cycle loops more than once around the identifier space
Caused by failures, network partitioning, etc. – in general, left unclear
p2p 2006 9
CHORD Stabilization
n.join(n’)
predecessor = nil;
successor = n’.find_successor(n);
Upon joining, node n calls a known node n’ and asks n’ to locate its successor
Note, the rest of the network does not know n yet – this is fixed by running stabilize
Join
p2p 2006 10
n.stabilize()
x = successor.predecessor;
if (x ∈ (n, successor))
successor = x;
successor.notify(n);
CHORD Stabilization
Every node n runs stabilize periodically
n asks its successor for its predecessor, say p
n checks whether p should be its successor instead
(that is, if p has joined the network in between) – this is how nodes find out about new nodes
Stabilize
p2p 2006 11
n.notify(n’)
if (predecessor is nil
or n’ ∈ (predecessor, n))
predecessor = n’;
CHORD Stabilization
successor(n) is also notified and checks whether it should make n its predecessor
Notify
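Taken together, join, stabilize, and notify can be traced in a minimal single-process sketch. This is a hypothetical Python mock-up, not Chord's actual implementation: plain objects stand in for remote nodes, and a linear ring walk stands in for finger-table routing; the identifiers 3, 6, 9 are chosen to reproduce the example on the following slides.

```python
def between(x, a, b):
    """True if identifier x lies on the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # the interval wraps past zero

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def find_successor(self, ident):
        # Walk the ring linearly (real Chord uses fingers for O(logN)).
        node = self
        while ident != node.successor.id and not between(
                ident, node.id, node.successor.id):
            node = node.successor
        return node.successor

    def join(self, known):
        # n.join(n'): learn only your successor; the rest of the
        # network does not know n until stabilize runs.
        self.predecessor = None
        self.successor = known.find_successor(self.id)

    def stabilize(self):
        # Ask the successor for its predecessor x; adopt x as the new
        # successor if it sits between us; then notify the successor.
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, n):
        # Accept n as predecessor if it is closer than the current one.
        if self.predecessor is None or between(
                n.id, self.predecessor.id, self.id):
            self.predecessor = n

# Two-node ring np(3) <-> ns(9); then n(6) joins, as on the next slides.
np, ns = Node(3), Node(9)
np.successor = np.predecessor = ns
ns.successor = ns.predecessor = np

n = Node(6)
n.join(np)       # n.successor = ns
n.stabilize()    # ns.notify(n): ns.predecessor = n
np.stabilize()   # np sees ns.predecessor = n: np.successor = n,
                 # and n.notify(np): n.predecessor = np
print(np.successor.id, n.predecessor.id, ns.predecessor.id)  # 6 3 6
```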
p2p 2006 12
CHORD Stabilization
Example – node n joins between np and ns
np.successor = ns
ns.predecessor = np
n.predecessor = nil
n.successor = ns
p2p 2006 13
Example – n.stabilize
n runs stabilize: it sees no better successor, and calls ns.notify(n), which sets ns.predecessor = n
np.successor = ns
ns.predecessor = n
n.predecessor = nil
n.successor = ns
p2p 2006 14
Example - np.stabilize
np runs stabilize: it sees ns.predecessor = n, which lies between np and ns, so np.successor = n
np.successor = n
ns.predecessor = n
n.predecessor = nil
n.successor = ns
p2p 2006 15
Example - np.stabilize (continued)
np.stabilize() also calls n.notify(np), which sets n.predecessor = np
np.successor = n
ns.predecessor = n
n.predecessor = np
n.successor = ns
p2p 2006 16
Chord Stabilization: Fix fingers
n.fix_fingers()
i = random_index > 1 into finger[];
finger[i].node = find_successor(finger[i].start);
Finger tables are not updated immediately
Thus, lookups may be slow, but find_predecessor and find_successor still work
Periodically run fix_fingers:
pick a random entry in the finger table and update it
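A Python version of the same routine, consistent with the Node sketch above (the M-bit space and the finger dict are assumptions of the sketch, not the paper's exact structures):

```python
import random

M = 4  # identifier bits, so the ring has 2**M = 16 positions (hypothetical)

def fix_fingers(node):
    # Pick a random finger index i > 1 and repair that one entry;
    # finger[i] should point at successor(n + 2**(i-1)).
    if not hasattr(node, "finger"):
        node.finger = {}
    i = random.randrange(2, M + 1)
    start = (node.id + 2 ** (i - 1)) % (2 ** M)
    node.finger[i] = node.find_successor(start)
```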
p2p 2006 17
Stabilization
Finger tables
As long as there is a finger at each interval, the distance-halving argument still holds
Lookup finds the predecessor of the target t, say p, then p finds the target (it is its successor)
Problem only when a new node enters between p and t
Then we may need to scan these nodes linearly; this is fine if there are O(logN) such nodes
p2p 2006 18
Stabilization
Lookups eventually succeed
Invariant: Once a node can reach a node r via successor pointers, it always can
Termination argument:
Assume two nodes n1 and n2 that both think they have the same successor s
Both attempt to notify s
s will eventually choose the closer of the two as its predecessor, say n1
The farther of the two, n2, will then learn from s of a better successor than s, namely n1
Thus, there is progress to a better successor each time
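Using the Node sketch from the earlier slides, this progress step can be traced directly (the identifiers 5, 7, 9 are hypothetical):

```python
# n1(7) and n2(5) both believe s(9) is their successor.
s, n1, n2 = Node(9), Node(7), Node(5)
n1.successor = n2.successor = s
s.successor = n2              # the ring wraps 9 -> 5

n2.stabilize()                # s has no predecessor yet: s.predecessor = n2
n1.stabilize()                # s keeps the closer node: s.predecessor = n1
n2.stabilize()                # n2 learns of n1 from s: n2.successor = n1
print(n2.successor.id)        # 7 -- progress to a better successor
```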
p2p 2006 19
When a node n fails
Nodes that have n as a successor in their finger tables must detect the failure and find the successor of n to replace it in their finger tables
Lookups in progress must continue
Maintain correct successor pointers
Failures
p2p 2006 20
Replication
Each node maintains a successor list of its r nearest successors
Upon failure, use the next successor in the list
Modify stabilize to fix the list
Failures
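A minimal sketch of maintaining and using the successor list, building on the Node sketch above (R, succ_list, and the alive probe are hypothetical names; real Chord folds this into a modified stabilize):

```python
R = 3  # successor-list length; the paper suggests r = O(logN)

def refresh_successor_list(node):
    # Walk the ring to collect the R nearest successors.
    lst, cur = [], node
    for _ in range(R):
        cur = cur.successor
        lst.append(cur)
    node.succ_list = lst

def live_successor(node, alive):
    # Upon failure of the successor, fall back to the next live
    # entry in the list (alive is a liveness-probe callback).
    for s in node.succ_list:
        if alive(s):
            return s
    raise RuntimeError("all R successors failed")
```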
p2p 2006 21
Other nodes may attempt to send requests through the failed node
Use alternate nodes found in the routing table of preceding nodes or in the successor list
Failures
p2p 2006 22
Example r=3
Nodes 3, 5, 6, 9, 12, 14, 15 on a 4-bit identifier ring; each node keeps its 3 nearest successors:
node 3: [5, 6, 9]
node 5: [6, 9, 12]
node 6: [9, 12, 14]
node 9: [12, 14, 15]
node 12: [14, 15, 3]
node 14: [15, 3, 5]
node 15: [3, 5, 6]
Failures
p2p 2006 23
A lookup fails only if all r nodes in the successor list fail; with independent failures (probability 1/2 each), this happens with probability 2^(-r), which equals 1/N for r = log2(N)
Theorem: If we use a successor list with r = O(logN) in an initially stable network and then every node fails with probability 1/2, then
with high probability, find_successor returns the closest living successor
the expected time to execute find_successor in the failed network is O(logN)
Failures
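A quick numeric check of the 2^(-r) = 1/N claim (the network size N is arbitrary):

```python
import math

N = 2 ** 10                 # network size (hypothetical)
r = int(math.log2(N))       # successor-list length r = log2(N) = 10
p_fail = 0.5 ** r           # all r successors fail independently
print(p_fail == 1 / N)      # True: 2**(-r) = 1/N
```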
p2p 2006 24
Store replicas of a key at the k nodes succeeding the key
Successor list helps to keep the number of replicas per item known
Other approach: store a copy per region
Replication
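A sketch of storing a key at its successor plus the next K-1 nodes, reusing the Node sketch and successor-list helper above (K and the per-node data dict are hypothetical):

```python
K = 3  # replication factor (hypothetical)

def store(entry_node, key, value):
    # Place the key at its successor and at the K-1 nodes after it.
    home = entry_node.find_successor(key)
    refresh_successor_list(home)
    for t in [home] + home.succ_list[:K - 1]:
        if not hasattr(t, "data"):
            t.data = {}
        t.data[key] = value
```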
p2p 2006 25
K keys, N nodes; in the ideal allocation each node gets K/N keys
10^4 nodes, 10^5 – 10^6 keys, in steps of 10^5
Mean, 1st, and 99th percentiles (the fraction of the distribution that is less than or equal to a value) of the number of keys per node
Large variation, which increases with the number of keys
Load balance
p2p 2006 26
K keys, N nodes; in the ideal allocation each node gets K/N keys
10^4 nodes, 5 × 10^5 keys
Load balance
Probability Density Function
p2p 2006 27
Node identifiers do not uniformly cover the entire identifier space
Assume N keys and N nodes
If we divide the space in N equal-sized bins, then we would like to see one key per node
The probability that a particular bin is empty is (1 – 1/N)^N
For large N, this approaches e^(-1) ≈ 0.368
Load balance
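Checking the (1 – 1/N)^N ≈ e^(-1) claim numerically (N here is arbitrary):

```python
import math

N = 10_000
p_empty = (1 - 1 / N) ** N
print(round(p_empty, 3), round(math.exp(-1), 3))  # 0.368 0.368
```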
p2p 2006 28
Introduce virtual nodes
Map multiple virtual nodes with unrelated identifiers to each real node
Each key is hashed into a virtual node which is next mapped to an actual node
Increase the number of nodes from N to N logN virtual nodes
Worst-case path length becomes O(log(N logN))
Each actual node needs r times as much space for the finger tables of its virtual nodes
If r = logN, this gives log^2 N entries; for N = 10^6, about 400 entries
The routing messages per node also increase
Load balance: Virtual Nodes
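A minimal consistent-hashing sketch of virtual nodes (SHA-1 and the V virtual identifiers per real node are assumptions for illustration; real Chord also gives each virtual node its own finger table):

```python
import hashlib

def h(s):
    # Hash a string into the identifier space.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

V = 8  # virtual nodes per real node (the slide suggests logN)

def build_ring(real_nodes):
    # Each real node appears V times under unrelated identifiers.
    return {h(f"{node}#{i}"): node for node in real_nodes for i in range(V)}

def lookup(ring, key):
    # A key belongs to the first virtual identifier clockwise from it,
    # which is then mapped back to an actual node.
    ids = sorted(ring)
    kid = h(key)
    for vid in ids:
        if vid >= kid:
            return ring[vid]
    return ring[ids[0]]  # wrap around the ring

print(lookup(build_ring(["A", "B", "C"]), "some-key"))
```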
p2p 2006 30
CAN
One virtual node (zone) -> many physical nodes (zone overloading):
fewer zones -> shorter path lengths
physical network awareness
Many virtual nodes -> one physical node:
more zones -> longer path lengths
data load balance
Load balance
p2p 2006 31
Performance Evaluation
CAN
Simulation: “knobs-on” full design vs. bare bones; uses network topology generators
CHORD
Simulation (runs in iterative style; note how this affects network proximity)
Also a small-scale distributed experiment that reports latency measurements
10 sites in the USA; more than 10 nodes are simulated by running multiple copies of CHORD at each site
p2p 2006 32
Metrics for System Performance
Path length: overlay hops to route between two nodes in the CAN space
Latency: end-to-end latency between two nodes
Per-hop latency: end-to-end latency divided by the path length
Neighbor state
Volume per node (indicative of data and query load)
Lookup load per node
p2p 2006 33
Metrics for System Performance
Routing fault tolerance
Data fault tolerance
Maintenance cost: cost of join/leave, replication etc
How? Either separately or as the overall network traffic
Churn (dynamic behavior)
p2p 2006 34
Period of stabilize
Period of fix_finger
Maximum number of virtual nodes m
Number of virtual nodes per physical node (O(logN))
Size of the successor list (O(logN))
Hash function
CHORD Magic Constants
p2p 2006 36
Simultaneous Node Failures
Randomly select a fraction p of nodes that fail
The network stabilizes
Lookup failure rate is p (could be worse under, say, a network partition)
Lookups during stabilization
CHORD more
p2p 2006 37
Detect malicious participants
E.g., ones that take a false position in the ring to “steal” data
Physical network awareness
CHORD future
p2p 2006 38
Additional issues:
Fault tolerance
Other ways to restructure/balance the tree
(Workload) load balance
BATON
p2p 2006 39
Failures
Upon node departure or failure, the parent can reconstruct the routing entries
Assume node x fails; any detected failure of x is reported to its parent y
y regenerates the routing tables of x – Theorem 2
Messages can be routed sideways (redundancy similar to CHORD) or up-down (a node can find its parent through its neighbors)
There is routing redundancy
p2p 2006 40
AVL-like Restructuring
The network may be restructured using AVL-like rotations
No data movement is needed, but some routing tables need to be updated
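To illustrate the point, a textbook left rotation (this is generic AVL code, not BATON's actual protocol): only parent/child links, i.e. routing state, are rewired, and no stored data moves.

```python
class TreeNode:
    def __init__(self, ident):
        self.id = ident
        self.left = self.right = self.parent = None

def rotate_left(x):
    # y replaces x as subtree root; only links are rewired.
    y = x.right
    x.right, y.left = y.left, x
    if x.right is not None:
        x.right.parent = x
    y.parent, x.parent = x.parent, y
    if y.parent is not None:
        if y.parent.left is x:
            y.parent.left = y
        else:
            y.parent.right = y
    return y  # the new subtree root
```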
p2p 2006 41
Load Balance
Each node keeps statistics about the number of queries or messages it receives
Adjust the data range to equalize the workload between adjacent nodes
For leaves: find another (less loaded) leaf, say v; have v transfer its load to its parent; then make v join again as the overloaded node’s child
Performance results (simulation) are reported – no surprises
p2p 2006 42
Replication - Beehive
Proactive – model-driven replication
Passive (demand-driven) replication such as caching objects along a lookup path
Hint for BATON
Beehive
The length of the average query path is reduced by one when an object is proactively replicated at all nodes logically preceding that node on all query paths
BATON
Range queries
Many paths to data