Partition-Tolerant Distributed Publish/Subscribe System

Introduction to the design choices behind a partition-tolerant distributed pub/sub system

Transcript of Partition-Tolerant Distributed Publish/Subscribe System

Page 1: Partition-Tolerant Distributed Publish/Subscribe System
Page 2: Partition-Tolerant Distributed Publish/Subscribe System
Page 3: Partition-Tolerant Distributed Publish/Subscribe System
Page 9: Partition-Tolerant Distributed Publish/Subscribe System

Focus on:
◦ fault tolerance
◦ reliability

Based on:
◦ tree overlay
◦ neighborhood knowledge
◦ δ, a configuration parameter

[Figure labels: 1-neighborhood, 2-neighborhood, 3-neighborhood]
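The neighborhood labels above suggest that a broker's δ-neighborhood is the set of brokers within δ overlay hops of it. A minimal sketch under that assumption follows; the function name `neighborhood`, the adjacency-dictionary representation, and the A to F chain are illustrative choices, not taken from the slides.

```python
from collections import deque


def neighborhood(adjacency, broker, delta):
    """Return the brokers within `delta` overlay hops of `broker` (itself excluded)."""
    seen = {broker}
    result = set()
    frontier = deque([(broker, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == delta:
            continue  # do not expand beyond the configured distance
        for peer in adjacency[node]:
            if peer not in seen:
                seen.add(peer)
                result.add(peer)
                frontier.append((peer, dist + 1))
    return result


# Hypothetical overlay: a simple chain A-B-C-D-E-F (a degenerate tree).
chain = "ABCDEF"
adjacency = {b: set() for b in chain}
for left, right in zip(chain, chain[1:]):
    adjacency[left].add(right)
    adjacency[right].add(left)

one_hop = neighborhood(adjacency, "C", 1)
two_hop = neighborhood(adjacency, "C", 2)
```

Because the overlay is a tree, a breadth-first walk bounded by δ visits each broker at most once, so the cost is proportional to the size of the neighborhood itself.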

Page 10: Partition-Tolerant Distributed Publish/Subscribe System
Page 11: Partition-Tolerant Distributed Publish/Subscribe System
Page 12: Partition-Tolerant Distributed Publish/Subscribe System
Page 14: Partition-Tolerant Distributed Publish/Subscribe System

◦ An “island”:
◦ A “barrier”:
◦ Partition identifier (PID) = (pd, i, pnodes)

[Figure: broker chains A B C D E F with source and destination labels, illustrating the island and the barrier]

Page 20: Partition-Tolerant Distributed Publish/Subscribe System

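A subscription is accepted when it is added to the routing tables.

That requires acknowledgments from the whole outgoing set.

[Figure: chain of brokers A, B, C, D, E with P and S at its ends. Subscription s is forwarded hop by hop (“Subscriptions”), and each hop is answered with a confirmation, ☑ conf (“Confirmations”).]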
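The acceptance rule above can be sketched as a small bookkeeping object: a subscription stays pending until every broker in its outgoing set has confirmed it. The class and method names here are hypothetical, and "outgoing set" is assumed to mean the brokers the subscription was forwarded to.

```python
class PendingSubscription:
    """Tracks which outgoing brokers still owe a confirmation for one subscription."""

    def __init__(self, sub_id, outgoing):
        self.sub_id = sub_id
        self.waiting_for = set(outgoing)  # brokers that have not confirmed yet

    @property
    def accepted(self):
        # Accepted only once the whole outgoing set has acknowledged.
        return not self.waiting_for

    def on_confirmation(self, broker):
        """Record a confirmation; return True if the subscription is now accepted."""
        self.waiting_for.discard(broker)
        return self.accepted


pending = PendingSubscription("s", {"B", "C"})
first = pending.on_confirmation("B")   # one broker still missing
second = pending.on_confirmation("C")  # whole outgoing set has confirmed
```

A subscription with an empty outgoing set is accepted immediately in this sketch, which is one plausible reading of the rule for a broker with no further hops.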

Page 25: Partition-Tolerant Distributed Publish/Subscribe System

Broker B’s failure detector (FD) detects the partition, and B connects to the first live broker along the path.

B removes the identified nodes from its Outs list and sends a confirmation, tagged with the PID of the partition, to its upstream brokers.

The subscription is accepted when ACK messages have been received from all brokers in the Outs list.

[Figure: chain of brokers A, B, C, D, E with P and S at its ends, showing subscriptions s, confirmations (☑ conf), and brokers B, C, D around the partition.]

* The conf is tagged with the pid.

* The pid tag is also stored along with s.
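The partition case can be sketched as follows: unreachable brokers are dropped from the Outs list, the partition's PID is remembered with the subscription, and the confirmation sent upstream carries that tag. This is a sketch under assumptions: the PID is treated as an opaque value, and `SubscriptionState` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SubscriptionState:
    sub_id: str
    outs: set                                # brokers whose ACK is still awaited
    pids: set = field(default_factory=set)   # partition tags stored along with s

    def __post_init__(self):
        self.outs = set(self.outs)  # private copy, so callers' sets are not mutated

    def on_partition(self, pid, unreachable):
        """Failure detector reported a partition: stop waiting for unreachable brokers."""
        self.outs -= set(unreachable)
        self.pids.add(pid)

    def on_ack(self, broker, pids=()):
        """An ACK arrived; it may itself carry PID tags from further downstream."""
        self.outs.discard(broker)
        self.pids.update(pids)

    def confirmation(self):
        """Return the tagged conf for upstream once accepted, else None."""
        if self.outs:
            return None
        return (self.sub_id, frozenset(self.pids))


state = SubscriptionState("s", {"C", "D"})
state.on_partition("pid-1", {"C"})   # C is behind the partition
before = state.confirmation()        # still waiting for D
state.on_ack("D")
after = state.confirmation()
```

Storing the tag with the subscription lets a later step tell which accepted subscriptions were confirmed without reaching every broker.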

Page 26: Partition-Tolerant Distributed Publish/Subscribe System

Forwarding comprises five steps:
◦ Queuing
◦ Barrier checking
◦ Matching
◦ Routing
◦ Cleanup

Page 27: Partition-Tolerant Distributed Publish/Subscribe System

Forwarding uses only subscriptions that brokers have accepted. Steps in forwarding a publication p:
◦ Identify the brokers of accepted subscriptions that match p
◦ Determine the active connections towards the matching subscriptions’ brokers
◦ Send p on those active connections and wait for confirmations
◦ If there are local matching subscribers, deliver to them
◦ If no downstream matching subscriber exists, issue a confirmation towards P
◦ Once confirmations arrive, discard p and send a conf towards P

[Figure: publication p travels from P across brokers A, B, C, D, E towards S along accepted subscriptions (☑); p is delivered to local subscribers, and conf messages flow back towards P.]
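The forwarding steps above can be sketched with a single in-memory broker. Everything here is illustrative: the `Broker` class, the use of Python predicates for matching, and the `sent` and `confs_upstream` lists standing in for network sends are assumptions, and each next hop is assumed to already be an active connection.

```python
class Broker:
    def __init__(self, name):
        self.name = name
        self.accepted = []        # (predicate, next_hop); next_hop None means a local subscriber
        self.pending = {}         # pub_id -> next hops whose conf is still awaited
        self.delivered = []       # publications handed to local subscribers
        self.sent = []            # (next_hop, publication), stand-in for network sends
        self.confs_upstream = []  # pub_ids confirmed towards the publisher P

    def forward(self, pub_id, publication):
        # Match p against accepted subscriptions only.
        matches = [hop for pred, hop in self.accepted if pred(publication)]
        if None in matches:
            self.delivered.append(publication)  # local matching subscribers
        hops = {hop for hop in matches if hop is not None}
        for hop in sorted(hops):
            self.sent.append((hop, publication))
        if hops:
            self.pending[pub_id] = hops          # keep p until all confs arrive
        else:
            self.confs_upstream.append(pub_id)   # nothing downstream: confirm at once

    def on_conf(self, pub_id, hop):
        waiting = self.pending.get(pub_id)
        if waiting is None:
            return
        waiting.discard(hop)
        if not waiting:
            del self.pending[pub_id]             # discard p
            self.confs_upstream.append(pub_id)   # and confirm towards P


broker = Broker("B")
broker.accepted = [
    (lambda p: p["topic"] == "x", "C"),   # remote subscriber reached via C
    (lambda p: p["topic"] == "x", None),  # local subscriber
]
broker.forward(1, {"topic": "x"})
broker.on_conf(1, "C")
broker.forward(2, {"topic": "y"})  # matches nothing
```

Keeping p in `pending` until every downstream confirmation arrives is what lets a broker retransmit after a failure; retransmission itself is left out of this sketch.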

Page 33: Partition-Tolerant Distributed Publish/Subscribe System

Key forwarding invariant to ensure reliability: no stream of publications is delivered to a subscriber after being forwarded by brokers that have not accepted its subscription.

[Figure: publication p and conf messages on the chain A, B, C, D, E between P and S, with accepted subscriptions (☑, one tagged *) and a link involving brokers B, C, D.]

Depending on when this link was established, either recovery or subscription propagation ensures that C accepts s prior to receiving p.

Page 40: Partition-Tolerant Distributed Publish/Subscribe System

Recovery is initiated upon activation of a new session.

It has five steps:
◦ Notify about the active session
◦ Reply by sending a summary of subscriptions
◦ The summary is compared to the local list; missing subscriptions are transferred too
◦ Subscriptions are accepted by R and sent to its downstream network
◦ Partition information is updated within distance 2δ

[Figure: new session between X and R on the chain A, B, C, D, E, with si and csi messages and the label “Ack messages”.]
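The summary comparison in the middle steps can be sketched as a set difference. The slides do not say which endpoint sends the summary, so this sketch assumes that R summarizes the subscription identifiers it already holds and that its session peer X transfers whatever is missing; the function name `recover` and the dictionary representation are hypothetical.

```python
def recover(r_subs, x_subs):
    """Bring R's subscriptions up to date with those of its session peer X.

    Both arguments map subscription id -> subscription. Returns the sorted ids
    that R newly accepted and would send on to its downstream network.
    """
    summary = set(r_subs)  # R replies with a summary of what it holds
    missing = {sid: sub for sid, sub in x_subs.items() if sid not in summary}
    r_subs.update(missing)  # R accepts the transferred subscriptions
    return sorted(missing)


r_subs = {"s1": "topic = a"}
x_subs = {"s1": "topic = a", "s2": "topic = b", "s3": "topic = c"}
newly_accepted = recover(r_subs, x_subs)
second_round = recover(r_subs, x_subs)  # nothing left to transfer
```

Exchanging identifiers first keeps the session start cheap when the two brokers already agree, since only the difference is transferred.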

Page 41: Partition-Tolerant Distributed Publish/Subscribe System

Required for crashed brokers that have been restarted.

A restarted node should be able to:
◦ Restore its (δ+1)-neighborhood from stable storage
◦ Query a network management service that is aware of neighborhood information

Further steps:
◦ Activate links with neighbors
◦ Initiate partial recovery

Page 42: Partition-Tolerant Distributed Publish/Subscribe System

Size of brokers’ neighborhoods as a function of ∆

[Plot: curves for ∆=1, ∆=2, ∆=3, ∆=4]

• Network size of 1000

• Broker fanout of 3

Page 44: Partition-Tolerant Distributed Publish/Subscribe System

Impact of failures on end-to-end broker reachability

[Plot: curves for ∆=1, ∆=2, ∆=3, ∆=4]

– Overlay setup:
• Network size: 1000 brokers with fanout = 3

– Failure injection:
• Failures: up to 100 brokers
• We randomly marked a given number of nodes as failed

– Measurements:
• We counted the end-to-end broker pairs whose intermediate primary tree path contains ∆ consecutive failed brokers in a chain.
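The measurement above reduces to one check per broker pair: does the path between the two endpoints contain a run of at least ∆ consecutive failed brokers? A minimal sketch of that check follows; the function name and the list-based path representation are assumptions, and computing the tree paths themselves is left out.

```python
def has_failed_run(path, failed, delta):
    """Return True if `path` contains at least `delta` consecutive failed brokers.

    `path` lists the intermediate brokers between two endpoints, in order;
    `failed` is the set of brokers marked as failed.
    """
    run = 0
    for broker in path:
        run = run + 1 if broker in failed else 0
        if run >= delta:
            return True
    return False


# Hypothetical intermediate path between two endpoint brokers.
path = ["B", "C", "D", "E"]
blocked_at_2 = has_failed_run(path, {"C", "D"}, 2)
blocked_at_3 = has_failed_run(path, {"C", "D"}, 3)
```

In this example two adjacent failures form a chain of length 2 but not 3, so the pair counts as affected for ∆=2 and as unaffected for ∆=3.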

Page 45: Partition-Tolerant Distributed Publish/Subscribe System

Impact of failures on publication delivery

500 brokers deployed on 8-core machines in a cluster:
• Network setup: overlay fanout = 3.
• We measured the aggregate publication delivery count over an interval of 120 s.
• The “Expected” bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers).

Page 47: Partition-Tolerant Distributed Publish/Subscribe System

Snoeren – publications are forwarded redundantly on multiple disjoint paths between subscribers and publishers

XNET – provides a crash/failover scheme similar to this work’s scheme when δ=1

Gryphon – based on a replication scheme in which routing information is replicated across multiple physical machines

Page 48: Partition-Tolerant Distributed Publish/Subscribe System

Developed a reliable P/S system that tolerates concurrent broker and link failures:

◦ Configuration parameter δ determines the level of resiliency against failures (in the worst case).

◦ Dissemination trees are augmented with neighborhood knowledge.

◦ Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures.

Page 49: Partition-Tolerant Distributed Publish/Subscribe System