OceanStore: An Architecture for Global - Scale Persistent Storage

John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao

Ελευθερία Φιλτζαντζίδη, 2002


Page 1: OceanStore: An Architecture for Global - Scale Persistent Storage

OceanStore: An Architecture for Global - Scale Persistent Storage

John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao

Ελευθερία Φιλτζαντζίδη, 2002

Page 2: OceanStore: An Architecture for Global - Scale Persistent Storage

OceanStore

Ubiquitous computing: cars, clothing, books, houses.

Computing devices must have high performance.

Computing devices should consume low power.

Computing devices must be transparent to the user.

Persistent information is necessary for transparency.

Where does persistent information reside?

Page 3: OceanStore: An Architecture for Global - Scale Persistent Storage

OceanStore

Requirements for a persistent infrastructure:

Connectivity through cable modems, DSL, cell phones, and wireless data services.

Information must be kept secure.

Information must be extremely durable.

Archiving of information should be automatic and reliable.

Information must be divorced from location.

OceanStore is a utility infrastructure for persistent storage.

Page 4: OceanStore: An Architecture for Global - Scale Persistent Storage

OceanStore

As a rough estimate, OceanStore will provide service to roughly 10^10 users, each with at least 10,000 files.

OceanStore must therefore support over 10^14 files (10^10 users × 10^4 files each).

Consumers will pay a monthly fee in exchange for access to persistent storage services.

Companies buy and sell capacity from each other.

The core of the system is composed of a multitude of highly connected “pools”.

Page 5: OceanStore: An Architecture for Global - Scale Persistent Storage

The OceanStore system

Page 6: OceanStore: An Architecture for Global - Scale Persistent Storage

Two Unique Goals

Untrusted Infrastructure:

Servers may crash without warning or leak information to third parties.

Only clients can be trusted.

All the information that enters the OceanStore is encrypted.

Nomadic Data: data that is allowed to flow freely.

Promiscuous Caching: data can be cached anywhere, anytime.

Page 7: OceanStore: An Architecture for Global - Scale Persistent Storage

Applications

Groupware and personal information management tools (calendars, email, contact lists, and distributed design tools).

OceanStore can be used to create large digital libraries and repositories for scientific data.

OceanStore provides an ideal platform for new streaming applications, such as sensor data aggregation and dissemination.

Page 8: OceanStore: An Architecture for Global - Scale Persistent Storage

System Architecture

System Overview

Naming

Access Control

Data Location and Routing

Update Model

Deep Archival Storage

Introspection

Page 9: OceanStore: An Architecture for Global - Scale Persistent Storage

System Overview

The fundamental unit is the persistent object. Objects exist in both active and archival forms.

Active Object: the latest version of the object’s data together with a handle for update.

Archival Object: Permanent read-only version of the object. Archival Objects are encoded with an erasure code.

The OceanStore API provides: sessions, session guarantees, updates and callbacks.

OceanStore provides an array of familiar interfaces, such as a Unix file system interface and a transactional interface.

Page 10: OceanStore: An Architecture for Global - Scale Persistent Storage

Naming

Objects are identified by a GUID, a pseudo-random, fixed-length bit string.

An object GUID is the secure hash of the owner’s key and some human-readable name (Self-certifying path).

Certain objects act as directories, mapping human-readable names to GUIDs (SDSI).

A user can choose several directories as “roots”. The system as a whole has no “roots”.
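The GUID construction described on this slide can be sketched in a few lines. This is only an illustration: the slide says "secure hash" without naming one, so SHA-1, the byte layout (key followed by name), and the example key and names are all assumptions.

```python
import hashlib


def object_guid(owner_public_key: bytes, name: str) -> str:
    """Return a fixed-length GUID as a hex string.

    Assumption: SHA-1 over the key bytes followed by the UTF-8 name.
    The slides only say "secure hash"; the hash function and the byte
    layout are illustrative choices.
    """
    digest = hashlib.sha1()
    digest.update(owner_public_key)
    digest.update(name.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical owner key and names, for illustration only.
owner_key = b"example-owner-public-key"
guid_a = object_guid(owner_key, "/home/alice/notes.txt")
guid_b = object_guid(owner_key, "/home/alice/todo.txt")
```

Because the GUID depends only on the owner's key and the name, anyone holding both can recompute it and check that a name really maps to the claimed object.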

Page 11: OceanStore: An Architecture for Global - Scale Persistent Storage

Access Control

Reader restriction:

All data that is not completely public is encrypted. The encryption key is distributed to those users with read permission.

Writer restriction:

All writes can be verified against an access control list (ACL).

The owner of an object can choose the ACL x for an object foo by providing a signed certificate.

Page 12: OceanStore: An Architecture for Global - Scale Persistent Storage

Data Location and Routing

OceanStore messages are labeled with a destination GUID, a random number, and a small predicate.

OceanStore combines data location and routing:

The task of routing is handled by the aggregate resources of many different nodes.

Messages route directly to their destinations.

The underlying infrastructure has more up-to-date routing information.

The routing mechanism is a two-tiered approach: a probabilistic algorithm and a deterministic algorithm.

Page 13: OceanStore: An Architecture for Global - Scale Persistent Storage

Probabilistic Algorithm

It is fully distributed and uses a constant amount of storage per server.

It uses the attenuated Bloom filter: an array of D normal Bloom filters.

The first filter contains the objects that are stored locally on the node.

The i-th Bloom filter is the union of the Bloom filters of all nodes at distance i, through any path, from the current node.

Bloom filter: a method for representing a set A.

It consists of a vector B of m bits and k hash functions h1, h2, ..., hk, each with range {1..m}.

For each element a of the set A, the bits at positions h1(a), h2(a), ..., hk(a) of the vector B are set to 1.

Given a query for b, we check the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, b is certainly not in A; if all are 1, b is probably in A, with some chance of a false positive.
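The Bloom filter and the attenuated array described above can be sketched as follows. This is a minimal illustration, not the OceanStore implementation: the salted SHA-256 hashing, the class and function names, and the sizes m = 1024 and k = 3 are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Set summary: a vector of m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        # (Illustrative choice; the slides do not name the hash functions.)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # The union of two filters is the bitwise OR of their vectors.
        assert (self.m, self.k) == (other.m, other.k)
        merged = BloomFilter(self.m, self.k)
        merged.bits = [a or b for a, b in zip(self.bits, other.bits)]
        return merged


def first_matching_level(levels, guid):
    """Return the smallest index i whose filter may contain guid, else None.

    `levels` stands in for an attenuated Bloom filter: levels[i] summarizes
    the objects at distance i from the current node.
    """
    for i, bloom in enumerate(levels):
        if bloom.might_contain(guid):
            return i
    return None


# Hypothetical example: the object is not local but is one hop away.
local = BloomFilter(m=1024, k=3)
one_hop = BloomFilter(m=1024, k=3)
one_hop.add("guid-62942")
level = first_matching_level([local, one_hop], "guid-62942")
```

A query that matches at some level is only a hint, since Bloom filters admit false positives; a query that matches at no level can be handed to the deterministic algorithm.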

Page 14: OceanStore: An Architecture for Global - Scale Persistent Storage

The Probabilistic Query Process

Page 15: OceanStore: An Architecture for Global - Scale Persistent Storage

The Global Algorithm

OceanStore uses a variation on Plaxton et al.’s hierarchical distributed data structure.

The basic Plaxton scheme: every server in the system is assigned a random node-ID. Each link is labeled with a level number.

In OceanStore each object is mapped to a single node whose node-ID matches the object’s GUID in the most bits. This node is called the object’s root.

The location of a replica is “published” in the infrastructure.

This process requires O(log n) hops.
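The choice of an object's root can be sketched as a longest-match search over node-IDs. The sketch below assumes that matching starts from the least significant bit and that ties go to the smaller node-ID; both are assumptions for the example, as are the function names and the 8-bit IDs. It also scans a global list of nodes, which the real scheme avoids by routing hop by hop.

```python
def shared_suffix_bits(a: int, b: int, width: int) -> int:
    """Count matching bits of a and b, starting from the least significant."""
    count = 0
    for i in range(width):
        if (a >> i) & 1 != (b >> i) & 1:
            break
        count += 1
    return count


def object_root(guid: int, node_ids, width: int = 8) -> int:
    """Pick the node whose ID matches the GUID in the most low-order bits.

    Ties are broken in favour of the smaller node-ID (hypothetical rule).
    """
    return max(node_ids, key=lambda n: (shared_suffix_bits(guid, n, width), -n))


# Hypothetical 8-bit IDs: 0xF6 shares six low-order bits with GUID 0xB6.
root = object_root(0xB6, [0x06, 0xF6, 0x37], width=8)
```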

Page 16: OceanStore: An Architecture for Global - Scale Persistent Storage

The Global Algorithm

Page 17: OceanStore: An Architecture for Global - Scale Persistent Storage

The Global Algorithm

[Figure: inserting and searching for object #62942. Labels: Root Node, Search Client, Object Location. Node IDs shown: 116, 479, 529, 629, 675, 109.]

Page 18: OceanStore: An Architecture for Global - Scale Persistent Storage

Update Model

The update model is based on conflict resolution, as in the Bayou system.

Update: a list of predicates associated with actions. An update either commits or aborts.

Possible predicates:

compare-version, compare-size: applied to the unencrypted meta-data of an object.

compare-block: possible because the encryption technology is a position-dependent block cipher.

search: performed directly on the ciphertext.
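The predicate-and-action form of an update can be sketched as below. The sketch assumes that the first predicate that holds selects the actions to apply and that an update with no true predicate aborts; the slides state only that an update commits or aborts. The type and function names are hypothetical, and the code works on plain byte blocks rather than ciphertext.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DataObject:
    version: int
    blocks: List[bytes]


Predicate = Callable[[DataObject], bool]
Action = Callable[[DataObject], None]


def compare_version(expected: int) -> Predicate:
    # Predicate on unencrypted meta-data: the version must match.
    return lambda obj: obj.version == expected


def append_block(block: bytes) -> Action:
    return lambda obj: obj.blocks.append(block)


def apply_update(obj: DataObject, clauses: List[Tuple[Predicate, List[Action]]]) -> bool:
    """Apply the actions of the first true predicate; return True on commit."""
    for predicate, actions in clauses:
        if predicate(obj):
            for action in actions:
                action(obj)
            obj.version += 1
            return True   # commit
    return False          # abort: no predicate held, object unchanged


# Hypothetical update: append a block only if the object is still at version 3.
doc = DataObject(version=3, blocks=[b"a"])
update = [(compare_version(3), [append_block(b"b")])]
first = apply_update(doc, update)    # commits
second = apply_update(doc, update)   # aborts: the version is now 4
```

Replaying the same update a second time aborts, which is how a version check turns a conflicting write into a detectable failure instead of a silent overwrite.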

Page 19: OceanStore: An Architecture for Global - Scale Persistent Storage

Update Model

Operations applied to ciphertext:

replace-block, append-block: possible because the encryption uses a position-dependent block cipher.

insert-block, delete-block: the object is divided into two sets of blocks, index blocks and data blocks.

Page 20: OceanStore: An Architecture for Global - Scale Persistent Storage

Update Model

Someone must choose the final commit order of updates.

OceanStore uses two tiers of replicas.

A primary tier of replicas: Byzantine agreement protocol.

A small number of replicas located in high-bandwidth, high-connectivity regions of the network.

Stronger consistency guarantees.

A secondary tier of replicas: epidemic algorithm.

They are organized into dissemination trees.

They contain both tentative and committed data.

Secondary replicas order tentative updates in timestamp order.

Lesser degree of consistency.

Page 21: OceanStore: An Architecture for Global - Scale Persistent Storage

Update Model

After generating an update, a client sends it to the object’s primary tier, as well as to several random replicas for that object.

The primary tier performs the Byzantine protocol. The secondary replicas propagate the update among themselves epidemically.

The result is multicast down the dissemination tree.
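The state kept by a secondary replica can be sketched as a committed log plus a tentative log ordered by timestamp. This is a simplified illustration with hypothetical names: it models only the ordering, not the epidemic exchange between replicas or the Byzantine agreement in the primary tier.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SecondaryReplica:
    committed: List[str] = field(default_factory=list)
    tentative: List[Tuple[int, str]] = field(default_factory=list)  # (timestamp, update id)

    def receive_tentative(self, timestamp: int, update_id: str) -> None:
        # Ignore updates already known; keep the rest in timestamp order.
        known = update_id in self.committed or any(u == update_id for _, u in self.tentative)
        if not known:
            self.tentative.append((timestamp, update_id))
            self.tentative.sort()

    def receive_commit(self, update_id: str) -> None:
        # A result multicast from the primary tier fixes the final order.
        self.tentative = [(t, u) for t, u in self.tentative if u != update_id]
        if update_id not in self.committed:
            self.committed.append(update_id)

    def view(self) -> List[str]:
        # Committed updates first, then tentative ones in timestamp order.
        return self.committed + [u for _, u in self.tentative]


# Hypothetical run: u2 arrives first but carries the later timestamp.
replica = SecondaryReplica()
replica.receive_tentative(20, "u2")
replica.receive_tentative(10, "u1")
before = replica.view()
replica.receive_commit("u2")   # the primary tier committed u2 first
after = replica.view()
```

The example shows why the secondary tier offers a lesser degree of consistency: the tentative order (u1, u2) differs from the order the primary tier finally commits.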

Page 22: OceanStore: An Architecture for Global - Scale Persistent Storage

Deep Archival Storage

The archival mechanism employs erasure codes (interleaved Reed-Solomon, Tornado codes).

Erasure coding treats data as a series of fragments (say n) and transforms these fragments into a greater number of coded fragments.

The fragments are spread widely. Any n of the coded fragments are sufficient to construct the original data.

Fragmentation increases reliability and survivability.
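The "any n fragments suffice" property can be illustrated with the simplest possible erasure code: n = 2 data fragments plus one XOR parity fragment, so that any 2 of the 3 fragments rebuild the data. This is a stand-in chosen for brevity, not Reed-Solomon or Tornado coding, and the function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes) -> list:
    """Split data into n = 2 fragments and add one XOR parity fragment."""
    half = (len(data) + 1) // 2
    first = data[:half]
    second = data[half:].ljust(half, b"\0")  # pad to equal length
    return [first, second, xor_bytes(first, second)]


def decode(fragments: dict, length: int) -> bytes:
    """Rebuild the data from any 2 of the 3 fragments, keyed by index 0..2."""
    if len(fragments) < 2:
        raise ValueError("need at least 2 fragments")
    if 0 in fragments and 1 in fragments:
        first, second = fragments[0], fragments[1]
    elif 0 in fragments:
        first = fragments[0]
        second = xor_bytes(first, fragments[2])   # recover the lost second half
    else:
        second = fragments[1]
        first = xor_bytes(second, fragments[2])   # recover the lost first half
    return (first + second)[:length]


# Lose any single fragment and the original data is still recoverable.
data = b"OceanStore"
coded = encode(data)
recovered = [
    decode({i: frag for i, frag in enumerate(coded) if i != lost}, len(data))
    for lost in range(3)
]
```

Real archival codes generalize this idea to many more fragments, so that the data survives the loss of a large fraction of the servers holding it.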

Page 23: OceanStore: An Architecture for Global - Scale Persistent Storage

Introspection

Introspection augments a system’s normal operation (computation) with observation and optimization.

The Cycle of Introspection

An architecture for introspective systems in OceanStore

Page 24: OceanStore: An Architecture for Global - Scale Persistent Storage

Status

The first implementation is deployed in Java.

They use the Unix file system interface and a read-only proxy for the WWW.

They have explored the security guarantees that are required for the OceanStore.

Included components:

A prototype for the probabilistic algorithm.

Prototype archival systems that use Reed-Solomon and Tornado codes for redundancy encoding.

An introspective prefetching mechanism for a local file system.