
A Data Transformation Service in Cloud Infrastructures

Κατρής Δημήτριος

1 February 2010

National and Kapodistrian University of Athens


Outline

• Introduction
  – Data transformation
  – gDTS
• Transformation Model
• Core Functionality
• System Model
  – Architecture
  – Designating the number of workers
• Evaluation


Introduction: DT Usefulness

• Digital Libraries
  – Digital data preservation
    • old data to new specifications
  – Content Security
    • watermarking
• Adaptive content delivery
  – bandwidth limitations + special characteristics of internet devices
    • require transformations of the source data to different quality and/or formats
• Content visualization
  – Data representation can be different from its visualization
    • presenting content requires a sort of transformation
• Text extraction
• Others
  – data migration, database wrappers, ontology mappings, etc.
• We focus on per-document transformations (transcoding)


Introduction: gDTS features

• Generic transformation framework
  – based on pluggable components (transformation programs)
    • reveal the transformation capabilities of the framework
    • we are able to furnish domain- and application-specific data transformations
• Automatic transformation discovery
  – content type of a source object + target content type
    • appropriate transformation is automatically selected
    • a chain of transformations may be performed
• Operates in several environments
    • WSRF-compliant service
    • stand-alone executable
• Workload distribution
  – harnesses computational resources from a cloud infrastructure


Transformation Model

• Content type identification
  – MIME type specification
    • media type + subtype + set of parameters "attribute=value", e.g.
      – text/html; charset="iso-8859-7"
      – image/jpeg; width="1024", height="768"
    • provides compliance with mainstream applications
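The three-part structure of a content type (media type, subtype, parameters) can be illustrated with a small parser. This is only a sketch: the names `ContentType` and `parse_content_type` are made up for illustration and are not part of gDTS, and the parser accepts both `;` and `,` between parameters because the examples on the slide use both.

```python
import re
from typing import NamedTuple


class ContentType(NamedTuple):
    media_type: str
    subtype: str
    params: dict


def parse_content_type(text: str) -> ContentType:
    """Split 'type/subtype; a="x", b="y"' into its three parts."""
    head, _, tail = text.partition(";")
    media_type, _, subtype = head.strip().partition("/")
    params = {}
    # Assumption: parameters may be separated by ';' or ','.
    for item in re.split(r"[;,]", tail):
        if "=" in item:
            name, _, value = item.partition("=")
            params[name.strip().lower()] = value.strip().strip('"')
    return ContentType(media_type.lower(), subtype.lower(), params)


if __name__ == "__main__":
    print(parse_content_type('text/html; charset="iso-8859-7"'))
    print(parse_content_type('image/jpeg; width="1024", height="768"'))
```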


Transformation Model

• Programs
  – the software used to perform the conversion
• Transformation Programs
  – references one program
  – describes its transformation capabilities
    • contains one or more transformation capabilities (Transformation Units)


Transformation Model

• Transformation Program Example

<Name>ImageMagickWrapper</Name>
<Program>
  <Software>
    <Package>
      <ID>dts_programs_bundle</ID>
      <Location>http://repo.di.uoa.gr/programs/dts_programs_bundle.tar.gz</Location>
    </Package>
    <Package>
      <ID>package_apache_poi</ID>
      <Location>http://repo.di.uoa.gr/programs/imagemagick.tar.gz</Location>
    </Package>
  </Software>
  <Class>org.gcube.datatransformation.datatransformationlibrary.programs.applications.ImageMagickWrapper</Class>
</Program>
<TransformationUnits>
  .
  .
  .


Transformation Model

• Transformation Unit
  – Describes
    • one program capability
    • the way the program is to be used in order to perform a transformation
      – sets program parameters
  – Contains
    • one or more source content types
    • a single target content type
    • proper program parameters
  – Can be composite
    • references other transformation units
    • performs consecutive transformations over a source object
  – Can have multiple sources
    • in order to combine documents
    • handling multipart documents
      – cleaner approach
  – Other features
    • wildcards in the content types of transformation units
      – image/jpeg -> image/jpeg; width="*", height="*"
      – */* -> application/zip
    • wildcards in the program parameters
      – the '-' enforces the presence of a program parameter value


Transformation Model

• Transformation Unit Example
  – Source content type (image/tiff)
  – Target content type (image/tiff; security="watermarking")
  – Program parameters
    • name="method" value="composite" isOptional="false"
    • name="dissolve" value="15" isOptional="false"
    • name="tile" value="-" isOptional="false"
• Transformation Graph
  – nodes
    • content types
  – edges
    • transformation units
  – usage
    • finds transformation units so as to perform an object transformation from its content type (source) to a target content type


Core Functionality

• Data Handlers
  – Data Sources
    • supply gDTS with input data
  – Data Sinks
    • store the resulting transformed data
  – Data element
    • Envelope of a data object
    • Contains
      – content type of the object
      – reference to the content or the raw content itself
    • single or multipart
      – Multipart
        » contains nested envelopes
        » Content types: multipart/mixed, multipart/alternative
  – Data element buffers
    • Are both sources and sinks
    • Internal use
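A data element as described above could be modelled roughly as follows. This is a minimal sketch, not the gDTS class: the name `DataElement`, its fields, and the `leaves` helper are assumptions chosen to mirror the bullets (content type, raw content or reference, nested envelopes for multipart objects).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class DataElement:
    content_type: str
    content: Optional[bytes] = None     # the raw content itself, or
    reference: Optional[str] = None     # a reference to where the content lives
    parts: List["DataElement"] = field(default_factory=list)  # nested envelopes

    @property
    def is_multipart(self) -> bool:
        return self.content_type.startswith("multipart/")

    def leaves(self) -> Iterator["DataElement"]:
        """Yield the single (non-multipart) elements in document order."""
        if not self.parts:
            yield self
        for part in self.parts:
            yield from part.leaves()


if __name__ == "__main__":
    doc = DataElement(
        "multipart/mixed",
        parts=[
            DataElement("text/html", content=b"<p>hi</p>"),
            DataElement("image/png", reference="http://example.org/a.png"),
        ],
    )
    print([e.content_type for e in doc.leaves()])
```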


Core Functionality

• Data Handlers:
  – Initialization:
    • Caller specifies
      – transfer mechanism/protocol
        » e.g. ftp, http
      – parameters for I/O
        » e.g. hostname, port, password
  – Function:
    • Data source
      – sequential access of data elements
    • Data sink
      – write the transformed data elements to the destination
  – Advantage
    • abstraction over the original data source and destination
    • uniform means to process data
      – transfer protocols can be different.
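The abstraction can be pictured as two small interfaces that hide the transfer protocol from the transformation loop. The sketch below is an assumption about the shape of such interfaces, not the gDTS API; `ListSource` and `ListSink` are in-memory stand-ins for real ftp or http handlers.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator, List, Tuple

Element = Tuple[str, bytes]  # (content type, raw content); simplified on purpose


class DataSource(ABC):
    @abstractmethod
    def elements(self) -> Iterator[Element]:
        """Give sequential access to the data elements."""


class DataSink(ABC):
    @abstractmethod
    def write(self, element: Element) -> None:
        """Store one transformed data element at the destination."""


class ListSource(DataSource):
    def __init__(self, items: Iterable[Element]) -> None:
        self.items = list(items)

    def elements(self) -> Iterator[Element]:
        return iter(self.items)


class ListSink(DataSink):
    def __init__(self) -> None:
        self.stored: List[Element] = []

    def write(self, element: Element) -> None:
        self.stored.append(element)


def transform_all(source: DataSource, sink: DataSink,
                  convert: Callable[[Element], Element]) -> None:
    # The loop never sees which protocol sits behind source or sink.
    for element in source.elements():
        sink.write(convert(element))


if __name__ == "__main__":
    sink = ListSink()
    transform_all(ListSource([("text/plain", b"abc")]), sink,
                  lambda e: (e[0], e[1].upper()))
    print(sink.stored)
```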


Core Functionality

• Transformation Graph
  – Usage
    • finds applicable transformation units from source to target content types
  – Result
    • transformation unit or path
  – Exact match
    » media type of Cs matches the media type of Ctu-s,
    » subtype of Cs matches the subtype of Ctu-s,
    » number of CTP of Cs equals the number of CTP of Ctu-s
    » each CTP in Cs matches a CTP in Ctu-s
  – Approximate match
    » #CTP of Cs can be greater than #CTP of Ctu-s
  – Same conditions must exist between Ct and Ctu-t

Cs     Source content type
Ct     Target content type
Ctu-s  Source content type of TU
Ctu-t  Target content type of TU
CTP    Content type parameters
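The exact and approximate matching conditions can be written down as a single predicate. The following is a sketch under assumptions: content types are plain tuples, `*` in a transformation unit is treated as a wildcard, and the function name `matches` is illustrative rather than taken from gDTS.

```python
from typing import Dict, Tuple

# (media type, subtype, content type parameters)
ContentType = Tuple[str, str, Dict[str, str]]


def _fits(value: str, pattern: str) -> bool:
    # Assumption: '*' on the transformation-unit side matches any value.
    return pattern == "*" or value == pattern


def matches(cs: ContentType, ctu: ContentType, approximate: bool = False) -> bool:
    """Check Cs against Ctu-s; the same test applies to Ct against Ctu-t."""
    (media, sub, params), (tu_media, tu_sub, tu_params) = cs, ctu
    if not (_fits(media, tu_media) and _fits(sub, tu_sub)):
        return False
    if approximate:
        # Cs may carry more parameters than the unit declares.
        if len(params) < len(tu_params):
            return False
    elif len(params) != len(tu_params):
        return False
    # Every parameter declared by the unit must be matched by the object.
    return all(k in params and _fits(params[k], v) for k, v in tu_params.items())


if __name__ == "__main__":
    png = ("image", "png", {"width": "1024px", "height": "1024px"})
    unit_source = ("image", "png", {})
    print(matches(png, unit_source), matches(png, unit_source, approximate=True))
```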


Core Functionality

• Transformation Graph
  – Overall steps
    • Search for an existing transformation unit with exact match
    • Search for paths in the graph with exact match
      – composite transformation unit created
        » references the transformation units that comprise the path
        » registered in the transformation program registry
    • Perform the same steps with approximate match instead of exact.
  – Approximate matches are common
    • Transformation may not be affected by CT parameters, e.g.
      – Source object: image/png; width="1024px", height="1024px"
      – TU: image/png -> image/jpeg
      – Target Content Type: image/jpeg
  – Maintenance
    • transformation program registry
      – contains the transformation programs
    • periodic update
      – if a change happens in the registry
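The search steps can be sketched as a breadth-first search over the graph, run first on the exact source content type and then on a relaxed one. This is a simplification and not the gDTS algorithm: content types are plain strings, approximate matching is reduced to dropping the source parameters, and the names `find_path` and `resolve` are hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

Unit = Tuple[str, str, str]  # (unit name, source content type, target content type)


def find_path(units: List[Unit], source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search: a direct unit is found before any longer path."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for name, src, dst in units:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    return None


def resolve(units: List[Unit], source: str, target: str) -> Optional[List[str]]:
    path = find_path(units, source, target)          # exact pass
    if path is None:
        bare = source.split(";")[0].strip()          # crude approximate pass
        path = find_path(units, bare, target)
    return path


if __name__ == "__main__":
    units = [
        ("png2jpeg", "image/png", "image/jpeg"),
        ("jpeg2zip", "image/jpeg", "application/zip"),
    ]
    print(resolve(units, 'image/png; width="1024px", height="1024px"', "image/jpeg"))
    print(resolve(units, "image/png", "application/zip"))
```

A path of more than one unit corresponds to what the slide calls a composite transformation unit.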


Core Functionality

• Program Execution
  – Deployment
    • transformation program includes the location of the software
      – download and install this software
    • loads all the deployed libraries
  – Invocation
    • entry class is specified in the transformation program


Core Functionality

• Internal operation


Core Functionality

• Comment
  – download, conversion, and storing time periods overlap
    • can improve the performance
    • one procedure can be a bottleneck
  – buffers accept a limited number of objects
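The overlap of download, conversion, and storing can be illustrated with three threads connected by bounded queues. This is a generic producer-consumer sketch, not gDTS code: the function names and the `capacity` parameter are assumptions. A full buffer blocks the faster stage, which is how a slow stage becomes the bottleneck.

```python
import queue
import threading
from typing import Callable, Iterable, List

DONE = object()  # sentinel marking the end of the stream


def stage(work: Callable, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Apply work to every item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(work(item))


def run_pipeline(items: Iterable, download: Callable, convert: Callable,
                 store: Callable, capacity: int = 4) -> List:
    to_download: queue.Queue = queue.Queue()
    to_convert: queue.Queue = queue.Queue(maxsize=capacity)  # bounded buffer
    to_store: queue.Queue = queue.Queue(maxsize=capacity)    # bounded buffer
    results: queue.Queue = queue.Queue()                     # unbounded, drained at the end
    wiring = [
        (download, to_download, to_convert),
        (convert, to_convert, to_store),
        (store, to_store, results),
    ]
    threads = [threading.Thread(target=stage, args=w) for w in wiring]
    for t in threads:
        t.start()
    for item in items:
        to_download.put(item)
    to_download.put(DONE)
    for t in threads:
        t.join()
    out = []
    while True:
        result = results.get()
        if result is DONE:
            return out
        out.append(result)


if __name__ == "__main__":
    print(run_pipeline(["a", "b", "c"], lambda x: x + "!", str.upper, lambda x: x))
```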


System Model

• gDTS targets
  – high transformation rates
  – effective utilization of hardware resources
• Why cloud
  – Usability
    • due to virtualization technologies
      – OS or any pre-installed programs are specified by the user
      – root access to VM (facilitates program deployment)
  – On-demand resource provisioning
    • VMs easily created and destroyed on demand
    • other job submission frameworks (Torque in clusters or grid)
      – jobs may wait in queues for hours or even days until other jobs end
      – we need to control and adjust the number of workers participating in each transformation


System Model

• Architecture
  – Master-Worker pattern
  – Master (Coordinator)
    • Supplies workers with objects to transform
    • Designates the number of workers


System Model


System Model

• Comments
  – gDTS implemented as a stateful WSRF service
    • WSRF: Factory design pattern
  – Data elements are requested by the workers
    • underlying infrastructure may pose network restrictions on the hosting nodes
      – Firewalls, NAT
    • outbound http calls issued by the workers towards the coordinator service are generally permitted
  – Transformation graph
    • implemented as a remote web service
      – workers are not overloaded with graph maintenance


System Model

• Designating the number of workers
  – Problem: How many workers to use?
    • Considerations
      – Performance
      – Cost (resources and money)
  – Performance
    • Using "many" workers may increase performance but
    • Bottlenecks may appear
      – Possible causes
        » Bandwidth or CPU limitations of sources or sinks
        » Lack of resources in the cloud
      – Result
        » Under-utilization in workers
  – Cost
    • Under-utilization has two drawbacks
      – Resources are occupied without using them
        » Other apps may need to use these resources
      – In commercial clouds we pay to occupy resources


System Model

• Designating the number of workers
  – Solutions
    • Client specifies the number of workers
      – Is supported
      – But clients are agnostic to the runtime conditions of the infrastructure
    • gDTS estimates its needs before the transformation starts
      – i.e. calculates bandwidth and computing requirements
      – Drawbacks
        » The amount and nature of data stored in the sources may not be known in advance
        » The transformations are not known in advance
        » Complicated
    • Adaptive approach
      – The number of worker nodes is managed at runtime.
        » Based on the transformation rate


System Model

• Adaptive approach
  – iterative procedure
    • monitor the transformation rate for a period of time
    • alter the number of workers (workers step)
      – a policy module determines the workers step based on:
        » transformation rate
        » the number of workers used during each measurement
  – Policy modules
    • any policy that fits a specific deployment environment can be plugged in
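The iterative procedure with a pluggable policy might look like the loop below. It is a sketch only: `adapt`, the `Policy` signature, the toy policy `grow_while_improving`, and its 5% threshold are all assumptions for illustration, and the rate is simulated by a function instead of being measured on real workers.

```python
from typing import Callable, List, Tuple

# A policy maps the measurement history [(workers, rate), ...] to a workers step.
Policy = Callable[[List[Tuple[int, float]]], int]


def adapt(measure_rate: Callable[[int], float], policy: Policy,
          start_workers: int = 1, rounds: int = 10) -> int:
    """Monitor the rate, ask the policy for a step, alter the worker count."""
    workers = start_workers
    history: List[Tuple[int, float]] = []
    for _ in range(rounds):
        history.append((workers, measure_rate(workers)))  # monitor
        step = policy(history)                            # decide
        if step == 0:
            break
        workers = max(1, workers + step)                  # alter
    return workers


def grow_while_improving(history: List[Tuple[int, float]]) -> int:
    """Toy policy: add one worker while the last change raised the rate by more than 5%."""
    if len(history) < 2:
        return 1
    (_, prev_rate), (_, current_rate) = history[-2], history[-1]
    return 1 if current_rate > prev_rate * 1.05 else 0


if __name__ == "__main__":
    # Simulated source bottleneck: the rate stops growing at 35 objects per period.
    print(adapt(lambda w: min(w * 10.0, 35.0), grow_while_improving))
```

Because the policy is just a callable, a different deployment environment can supply its own function without touching the loop.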


System Model

• Simple policy used by gDTS
  – We define the variable ratio as:
    • ratio = (current_rate - prev_rate) / ((prev_rate * (prev_workers_num + prev_workers_step) / prev_workers_num) - prev_rate);
    • If ratio > ratio_of_efficiency (value set by the client)
      – Continue adding workers
    • If ratio = ratio_of_efficiency
      – We do not change the number of workers
    • If ratio < ratio_of_efficiency
      – We remove workers
  – Example
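The ratio can be read as the observed gain in transformation rate divided by the gain that linear scaling would have produced. The sketch below follows that reading, which is an interpretation of the slide's formula; the function names, the fixed `step` size, and the numbers in the usage lines are made up for illustration and are not the example from the original slide.

```python
def efficiency_ratio(current_rate: float, prev_rate: float,
                     prev_workers_num: int, prev_workers_step: int) -> float:
    """Observed rate gain divided by the gain expected under linear scaling."""
    expected_rate = prev_rate * (prev_workers_num + prev_workers_step) / prev_workers_num
    return (current_rate - prev_rate) / (expected_rate - prev_rate)


def next_step(ratio: float, ratio_of_efficiency: float, step: int = 1) -> int:
    """Return the workers step: positive adds workers, negative removes them."""
    if ratio > ratio_of_efficiency:
        return step       # continue adding workers
    if ratio < ratio_of_efficiency:
        return -step      # remove workers
    return 0              # keep the current number of workers


if __name__ == "__main__":
    # Hypothetical measurement: 4 workers gave 100 objects/min, 6 workers give 140.
    ratio = efficiency_ratio(current_rate=140, prev_rate=100,
                             prev_workers_num=4, prev_workers_step=2)
    print(ratio, next_step(ratio, ratio_of_efficiency=0.5))
```

Note that the ratio is undefined when the previous step was zero, so a caller would need to handle that case separately.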


Evaluation

• CPU Intensive transformation