Semantics-based Distributed I/O for mpiBLAST

Click here to load reader

download Semantics-based Distributed I/O for  mpiBLAST

of 1

  • date post

  • Category


  • view

  • download


Embed Size (px)


mpiBLAST Working Model. s. Search. Gather. Scatter. Query. Output. Database. Workers. Applications. mpiBLAST. Communication Profiling. Remote Visualization. I/O Servers hosting file system. ParaMEDIC Data Tools. ParaMEDIC API PMAPI). Generate Temp Database. Data - PowerPoint PPT Presentation

Transcript of Semantics-based Distributed I/O for mpiBLAST

PowerPoint Presentation

Semantics-based Distributed I/O for mpiBLAST

P. Balaji, W. Feng, J. Archuleta, H. Lin, R. Kettimuthu, R. Thakur and X. Ma

Argonne National Laboratory Virginia Tech North Carolina State UniversityIssues with Distributed I/ONSF TeraGridTraditional Distributed I/ODistributed I/O in mpiBLASTSequence Search with mpiBLASTParaMEDIC ArchitectureParaMEDIC-powered mpiBLASTParaMEDIC FrameworkmpiBLASTDataEncryptionParaMEDIC Data ToolsDataIntegrityCommunication ServicesDirectNetworkGlobalFile-SystemApplication PluginsBasicCompressionmpiBLASTpluginCommunicationProfiling PluginCommunicationProfilingRemoteVisualizationApplicationsParaMEDIC APIPMAPI)

mpiBLAST Working ModelsScatterSearchGatherOutputWorkersDatabaseQueryEstimated Output of an All-to-All NT SearchCompute MasterI/O MastermpiBLAST MastermpiBLASTWorkermpiBLASTWorkermpiBLASTWorkermpiBLAST MastermpiBLASTWorkermpiBLASTWorkerQueryRaw MetadataQueryWrite ResultsGenerate TempDatabaseRead TempDatabaseI/O WorkersCompute WorkersI/O Servershosting filesystemProcessed MetadataImpact of Network LatencyNetwork Delay: Performance BreakupTrading Computation and IOImpact of Encrypted File-SystemsArgonne-VT Distributed SetupPerformance on TeraGridOther Applications: MPE Communication ProfilerConclusion

Understanding ParaMEDICHigh LatencySynchronization operations heavily effectedLow BandwidthHigh-bandwidth links are not accessible to everyoneData EncryptionDistributed I/O over the Internet might need to be encrypted in some environmentsAnd yet distributed I/O is essentialLarge-scale computations might require resources that are not available at a single siteScientists may need to access remote large-scale supercomputers which are not available locallyPrimary Idea:Data transformation, but not at the byte-level (unlike compression techniques)Application specifies the appropriate description of the data allows ParaMEDIC to process output as high-level objects, and not a stream of bytesWith respect to mpiBLASTOverall output is just a concatenation of matches and other descriptive information for each query sequenceMatch scores, alignment information, etc.Finding the database matches for each sequence is the compute intensive portion needs to be storedRest can be discarded and regeneratedDistributed I/O is a necessary evilLarge applications need resources from multiple centersScientists might need to use remote large-scale resourcesImpacted by several issuesHigh Latency, Low Bandwidth, Encryption requirementsWe propose the ParaMEDIC frameworkSemantics-based Distributed I/O mechanismParaMEDIC uses application-specific plugins to understand what the output data means and process itOutput data converted to application-specific metadata, transported, and then converted backTrades a small amount of additional computation for potentially large benefits in I/OScientific applications have periodic communication profilesParaMEDIC uses FFT to find periodicity and process outputPreliminary results show 2-5X reduction in outputQuery Size(KB)0-55-5050-150150-200200-500>500TotalNumber of Queries3,305,17087,50625,92026,5249,5922483,455,000Estimated Output (GB)1,13959323,5553,995??>29,282


Series 1Growth in GenBank Database Size

Sheet1Series 1Series 2Series 319826803382.42198322740294.42198436897521.83198552044202.851986961537119871675287219882469087619893718395019905130609219917733767819921202422341993163802597199423048592819954258609581996730552938199712582905131998216206787119994653932745200011101066288200115849921438200228507990166200336553368485200438989342565200444575745176200556037734462200669019290705200783874179730To resize chart data range, drag lower right corner of range.