Semantics-based Distributed I/O for mpiBLAST

download Semantics-based Distributed I/O for  mpiBLAST

If you can't read please download the document

  • date post

    04-Feb-2016
  • Category

    Documents

  • view

    26
  • download

    0

Embed Size (px)

description

mpiBLAST Working Model. s. Search. Gather. Scatter. Query. Output. Database. Workers. Applications. mpiBLAST. Communication Profiling. Remote Visualization. I/O Servers hosting file system. ParaMEDIC Data Tools. ParaMEDIC API PMAPI). Generate Temp Database. Data - PowerPoint PPT Presentation

Transcript of Semantics-based Distributed I/O for mpiBLAST

PowerPoint Presentation

Semantics-based Distributed I/O for mpiBLAST

P. Balaji, W. Feng, J. Archuleta, H. Lin, R. Kettimuthu, R. Thakur and X. Ma

Argonne National Laboratory Virginia Tech North Carolina State UniversityIssues with Distributed I/ONSF TeraGridTraditional Distributed I/ODistributed I/O in mpiBLASTSequence Search with mpiBLASTParaMEDIC ArchitectureParaMEDIC-powered mpiBLASTParaMEDIC FrameworkmpiBLASTDataEncryptionParaMEDIC Data ToolsDataIntegrityCommunication ServicesDirectNetworkGlobalFile-SystemApplication PluginsBasicCompressionmpiBLASTpluginCommunicationProfiling PluginCommunicationProfilingRemoteVisualizationApplicationsParaMEDIC APIPMAPI)

mpiBLAST Working ModelsScatterSearchGatherOutputWorkersDatabaseQueryEstimated Output of an All-to-All NT SearchCompute MasterI/O MastermpiBLAST MastermpiBLASTWorkermpiBLASTWorkermpiBLASTWorkermpiBLAST MastermpiBLASTWorkermpiBLASTWorkerQueryRaw MetadataQueryWrite ResultsGenerate TempDatabaseRead TempDatabaseI/O WorkersCompute WorkersI/O Servershosting filesystemProcessed MetadataImpact of Network LatencyNetwork Delay: Performance BreakupTrading Computation and IOImpact of Encrypted File-SystemsArgonne-VT Distributed SetupPerformance on TeraGridOther Applications: MPE Communication ProfilerConclusion

Understanding ParaMEDICHigh LatencySynchronization operations heavily effectedLow BandwidthHigh-bandwidth links are not accessible to everyoneData EncryptionDistributed I/O over the Internet might need to be encrypted in some environmentsAnd yet distributed I/O is essentialLarge-scale computations might require resources that are not available at a single siteScientists may need to access remote large-scale supercomputers which are not available locallyPrimary Idea:Data transformation, but not at the byte-level (unlike compression techniques)Application specifies the appropriate description of the data allows ParaMEDIC to process output as high-level objects, and not a stream of bytesWith respect to mpiBLASTOverall output is just a concatenation of matches and other descriptive information for each query sequenceMatch scores, alignment information, etc.Finding the database matches for each sequence is the compute intensive portion needs to be storedRest can be discarded and regeneratedDistributed I/O is a necessary evilLarge applications need resources from multiple centersScientists might need to use remote large-scale resourcesImpacted by several issuesHigh Latency, Low Bandwidth, Encryption requirementsWe propose the ParaMEDIC frameworkSemantics-based Distributed I/O mechanismParaMEDIC uses application-specific plugins to understand what the output data means and process itOutput data converted to application-specific metadata, transported, and then converted backTrades a small amount of additional computation for potentially large benefits in I/OScientific applications have periodic communication profilesParaMEDIC uses FFT to find periodicity and process outputPreliminary results show 2-5X reduction in outputQuery Size(KB)0-55-5050-150150-200200-500>500TotalNumber of Queries3,305,17087,50625,92026,5249,5922483,455,000Estimated Output (GB)1,13959323,5553,995??>29,282

1Chart168033822740293689752520442096153711675287224690876371839505130609277337678120242234163802597230485928425860958730552938125829051321620678714653932745111010662881584992143828507990166365533684853898934256544575745176560377344626901929070583874179730

Series 1Growth in GenBank Database Size

Sheet1Series 1Series 2Series 319826803382.42198322740294.42198436897521.83198552044202.851986961537119871675287219882469087619893718395019905130609219917733767819921202422341993163802597199423048592819954258609581996730552938199712582905131998216206787119994653932745200011101066288200115849921438200228507990166200336553368485200438989342565200444575745176200556037734462200669019290705200783874179730To resize chart data range, drag lower right corner of range.