Partition-Tolerant Distributed Publish/Subscribe System

Post on 08-Jul-2015

description

Introduction to the design choices behind a partition-tolerant distributed pub-sub system

Transcript of Partition-Tolerant Distributed Publish/Subscribe System

Focus on:
◦ fault tolerance
◦ reliability

Based on:
◦ tree overlay
◦ neighborhood knowledge
◦ δ: configuration parameter

[Figure: 1-neighborhood, 2-neighborhood, 3-neighborhood]
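The neighborhood idea can be illustrated with a minimal sketch. It assumes that a broker's δ-neighborhood is the set of brokers within δ hops in the tree overlay; the slides show this only as a figure, so that reading is an assumption. The adjacency-dictionary representation, the function name `neighborhood`, and the example chain are hypothetical.

```python
from collections import deque

def neighborhood(tree, broker, delta):
    """Return the brokers within `delta` hops of `broker`, excluding itself."""
    seen = {broker}
    frontier = deque([(broker, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == delta:
            continue  # do not expand beyond the delta-hop boundary
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    seen.discard(broker)
    return seen

# Hypothetical chain overlay: A - B - C - D - E
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D"]}
```

Under this assumption, a larger δ gives each broker knowledge of more distant brokers, at the cost of more state to maintain.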

◦ An “island”
◦ A “barrier”
◦ Partition identifier (PID) = (pd, i, pnodes)

[Figure: brokers A, B, C, D, E, F with S and P; "source" and "destination" labels]

A subscription is accepted once it has been added to the routing tables.

This requires acknowledgments from the whole outgoing set.

[Figure: subscriptions (s) propagate along the broker chain A, B, C, D, E between P and S; confirmations (conf) flow back]
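The acceptance rule can be sketched as a small bookkeeping class. This is only an illustration under stated assumptions: the class name `PendingSubscription`, its fields, and the choice to accept immediately when the outgoing set is empty are hypothetical and not taken from the slides.

```python
class PendingSubscription:
    """Tracks confirmations still owed for one subscription at one broker."""

    def __init__(self, sub_id, outgoing):
        self.sub_id = sub_id
        self.waiting = set(outgoing)      # brokers that must still confirm
        # Assumption: with nothing to wait for, the subscription is accepted.
        self.accepted = not self.waiting

    def on_conf(self, broker):
        """Record a confirmation; return True once the subscription is accepted."""
        self.waiting.discard(broker)
        if not self.waiting:
            self.accepted = True
        return self.accepted
```

A broker would create one such record per subscription it forwards and treat the subscription as accepted only after `on_conf` has returned True.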

Broker B's failure detector (FD) detects the partition, and B connects to the first alive broker along the path.

B removes the identified nodes from its Outs list and sends a confirmation to the upper brokers that includes the PID of the partition.

A subscription is accepted when ACK messages have been received from all brokers in the Outs list.

[Figure: subscriptions (s) and confirmations (conf) along brokers A, B, C, D, E between P and S, with C and D marked next to B; confirmations marked * are tagged with the pid]

* The conf is tagged with the pid; the pid tag is also stored along with s.
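The partition case can be sketched as an extension of the acknowledgment bookkeeping. This is a hedged illustration, not the authors' implementation: the class `SubscriptionState`, its method names, and the dictionary used as a confirmation message are hypothetical, and the PID is treated as an opaque tuple (pd, i, pnodes) whose field meanings are not specified here.

```python
from dataclasses import dataclass, field

@dataclass
class SubscriptionState:
    sub_id: str
    outs: set                                # brokers whose ACK is still required
    pids: set = field(default_factory=set)   # pid tags stored along with s

    def on_partition(self, pid, unreachable):
        """Drop unreachable brokers from Outs and remember the partition's PID."""
        self.outs -= set(unreachable)
        self.pids.add(pid)

    def on_ack(self, broker, pids=()):
        """Record an ACK and keep any PID tags carried by that confirmation."""
        self.outs.discard(broker)
        self.pids.update(pids)

    def confirmation(self):
        """Return the conf to send upstream, or None while ACKs are pending."""
        if self.outs:
            return None
        return {"sub": self.sub_id, "pids": sorted(self.pids)}
```

Keeping the PID with the stored subscription lets later steps know which partition the acceptance did not cover.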

Forwarding comprises five steps:
◦ Queuing
◦ Barrier checking
◦ Matching
◦ Routing
◦ Cleanup

Forwarding uses only subscriptions that brokers have accepted. Steps in forwarding a publication p:

◦ Identify the brokers of accepted subscriptions that match p
◦ Determine active connections towards the matching subscriptions’ brokers
◦ Send p on those active connections and wait for confirmations
◦ If there are local matching subscribers, deliver to them
◦ If no downstream matching subscriber exists, issue a confirmation towards P
◦ Once confirmations arrive, discard p and send a conf towards P

[Figure: publication p is forwarded along brokers A, B, C, D, E between P and S, delivered to local subscribers, and confirmed with conf messages; C and E are marked]
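The matching and routing decisions in the forwarding steps can be sketched as a single function. This is a simplified illustration under assumptions: subscriptions are modeled as predicate functions, `active_links` is a hypothetical map from a destination broker to the neighbor currently used to reach it, and waiting for confirmations is left out.

```python
def forward(publication, accepted_subs, active_links, local_subscribers):
    """Return (next_hops, local_deliveries, confirm_now) for one publication.

    accepted_subs: list of (predicate, broker) pairs for accepted subscriptions.
    active_links: dict mapping a destination broker to the neighbor used to reach it.
    local_subscribers: list of (predicate, subscriber_name) pairs.
    """
    # Brokers of accepted subscriptions that match the publication.
    targets = {broker for pred, broker in accepted_subs if pred(publication)}
    # Active connections towards those brokers.
    next_hops = {active_links[b] for b in targets if b in active_links}
    # Local matching subscribers receive the publication directly.
    local = [name for pred, name in local_subscribers if pred(publication)]
    # With no downstream match, a confirmation can be issued right away.
    confirm_now = not next_hops
    return next_hops, local, confirm_now
```

A caller would send the publication on each returned next hop, deliver to the local subscribers, and either confirm immediately or wait for downstream confirmations before discarding the publication.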

Key forwarding invariant to ensure reliability: no stream of publications is delivered to a subscriber after being forwarded by brokers that have not accepted its subscription.

[Figure: publication p and conf messages along brokers A, B, C, D, E between P and S, with C and D marked next to B; the last subscription mark is tagged *]

Depending on when this link has been established, either recovery or subscription propagation ensures that C accepts s prior to receiving p.

Recovery is initiated upon activation of a new session.

It has five steps:
◦ Notify about the active session
◦ Reply by sending a summary of subscriptions
◦ The summary is compared to the local list; missing subscriptions are transferred too
◦ Subscriptions are accepted by R and sent to its downstream network
◦ Partition information is updated within distance 2δ

[Figure: a new session involving brokers A, B, C, D, E, X and R; labels si, csi, ☑* and "Ack messages"]
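The comparison step of recovery can be sketched as a set difference. This is a minimal illustration under assumptions: the summary is modeled as a set of subscription identifiers, and the function name `diff_summary` is hypothetical. The slides do not specify the summary format or which side transfers the missing subscriptions, so the sketch returns both directions.

```python
def diff_summary(local_ids, summary_ids):
    """Compare a received summary with the local subscription list.

    Returns (to_request, to_send): identifiers present only in the summary,
    and identifiers present only locally.
    """
    local_ids, summary_ids = set(local_ids), set(summary_ids)
    to_request = summary_ids - local_ids
    to_send = local_ids - summary_ids
    return to_request, to_send
```

Exchanging a summary first keeps the transfer limited to the subscriptions that actually differ between the two brokers.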

Required for a crashed broker that has been restarted.

A restarted node should be capable of:
◦ Restoring its (δ+1)-neighborhood from stable storage
◦ Querying a network management service aware of neighborhood information

Further steps:
◦ Activating links with neighbors
◦ Partial recovery initiation

Size of brokers’ neighborhoods as a function of ∆

• Network size of 1000
• Broker fanout of 3

[Figure: curves for ∆=1, ∆=2, ∆=3, ∆=4]

Impact of failures on end-to-end broker reachability

– Overlay setup:
• Network size: 1000 brokers with fanout = 3
– Failure injection:
• Failures: up to 100 brokers
• We randomly marked a given number of nodes as failed
– Measurements:
• We counted the end-to-end brokers whose intermediate primary tree path contains ∆ consecutive failed brokers in a chain.

[Figure: curves for ∆=1, ∆=2, ∆=3, ∆=4]
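The measurement criterion can be sketched as a check on a single path. This is an illustration under assumptions: the path is given as an ordered list of brokers from one endpoint to the other, only the brokers strictly between the endpoints count as intermediate, and `delta` is at least 1. The function name `has_failed_chain` is hypothetical.

```python
def has_failed_chain(path, failed, delta):
    """True if `path` has at least `delta` consecutive failed intermediate brokers."""
    run = 0
    for broker in path[1:-1]:          # endpoints are not intermediate brokers
        run = run + 1 if broker in failed else 0
        if run >= delta:
            return True
    return False
```

Applying this check to the tree path between each pair of endpoints, for a randomly chosen set of failed brokers, would give a count of the kind described in the measurement.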

Impact of failures on publication delivery

500 brokers deployed on 8-core machines in a cluster:
• Network setup: overlay fanout = 3.
• We measured the aggregate publication delivery count in an interval of 120 s.
• The "Expected" bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers).

Snoeren – publications are forwarded redundantly on multiple disjoint paths between subscribers and publishers

XNET – provides a crash/failover scheme similar to this work when δ=1

Gryphon – based on a replication scheme in which routing information is replicated across multiple physical machines

Developed a reliable P/S system that tolerates concurrent broker and link failures:

◦ Configuration parameter δ determines the level of resiliency against failures (in the worst case).

◦ Dissemination trees are augmented with neighborhood knowledge.

◦ Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures.