Text Mining and Environmental Metadata Suggestion

  • Date posted: 31-Jul-2015
  • Category: Science

Transcript of Text Mining and Environmental Metadata Suggestion

1. Text Mining and Environmental Metadata Suggestion. Evangelos Pafilis, Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklio, Crete, Greece. pafilis@hcmr.gr, http://epafilis.info. ENA, 1st Dec 2014, EBI, UK
2. Species Environments
3. Comparative analysis: Location, Environment, Time Period. ? Coral Reefs. Image from http://theresilientearth.com/
4. Not Trivial
5. Slide by Dr. P. Yilmaz, http://www.arb-silva.de/projects/contextual-data/
6. Essential Context Information: Metadata. Meta- = (after) => data after data => data describing data
7. a clear definition, that can be interpreted in many, sometimes conflicting, ways
8. a clear definition, that can be interpreted in many, sometimes conflicting, ways. Essential Context Information
9. Community Standards (http://gensc.org/)
  • Standards (such as MIxS, MIMARKS; see http://gensc.org/gc_wiki/index.php/GSC_Publications for a comprehensive list of publications) capture genomic/metagenomic and other types of sequence contextual information
  • Including detailed guidelines on how to annotate a sample (e.g. Yilmaz P et al. (2011) The ISME Journal 5: 1565-1567)
10. P. Yilmaz et al., Nat Biotech 29, 415-420 (2011)
11. source: http://wiki.gensc.org/index.php?title=MIMARKS
12. http://www.tomorrowstarted.com/2013/01/how-a-key-works/.html
13. Sources of environment descriptions:
  • Project descriptions
  • Scientific-content web pages
  • Full text scientific articles
  • Literature abstracts
  • In-house documents
14. Microbes are key players in both healthy and degraded coral reefs.
A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific. Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (Project Description)
15. Looking up terms: intensive, learning curve
16. Literature Mining
17. processing text to extract facts of interest
18. ENVIRONMENTS
19. ENVIRONMENTS: ENVO term identification in text: terrestrial, aquatic, marine, lagoon, coral reef, sediment, freshwater, soil
20. ENVIRONMENTS: ENVO term identification in text. Input: the project description shown on slide 14.
21-22. ENVIRONMENTS: ENVO term identification in text. Match in the project description shown on slide 14: ID: ENVO:00000150, Name: coral reef.
23.
ENVIRONMENTS (http://environments.hcmr.gr, http://environments-eol.blogspot.gr/)
  • Dictionary based
  • Open source
  • Environment Ontology
  • Fast performance: 4000 PubMed abstracts/second *
  • Based on the SPECIES name recognition tagger (Pafilis et al., PLOS ONE)
  • E600 gold standard: ENVO-based corpus of EOL Species pages
  • Recognition accuracy, mention level: F1: 82.0%; 87.1% of the TPs: exact id among predicted ones
  • Submitted preprint: http://biorxiv.org/content/early/2014/11/13/011403
  Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390
  *: based on a single-thread run on an Intel 2.27 GHz machine with 24 GB RAM, processing a set of 536,052 abstracts
24. ENVO: source of environment descriptor names and synonyms (http://environmentontology.org)
  • ~1600 terms, June 2013
  • biome, environmental feature, environmental material, environmental condition, habitat
  Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany
25. ENVIRONMENTS: Improving Accuracy
  • Increasing matches in text
    • Orthographic variation supported, e.g. freshwater, fresh water, and fresh-water
    • Case-insensitive matching
    • Synonym generation to reflect the way environment descriptive terms are mentioned in text (both generic and ENVO specific)
      Action: add a variant in which non-informative words have been removed. Example: epipelagic zone → epipelagic; estuarine biome → estuarine
      Action: plural form addition. Example: sediment → sediments
      Action: adjective form addition. Example: lagoon → lagoonal
  • Preventing overmatching (i.e. avoiding increased FP)
    • Stopword list (e.g. spring, well, range)
26. Scope
  • ENVO parts not included: species, tissues, foods
  • Limitations, known issues: negation not supported; conflicts with anatomy terms (e.g. mouth, blowhole)
27.
ENVIRONMENTS Sample Output
  File Name                        Start coord  End coord  Match text  ENVO ID
  eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000192
  eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00002297
  eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000043
  eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000000
  eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000012
  eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:01000001
  eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:00010483
  eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000180
  eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000191
  eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00002297
  eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000176
  eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000000
  eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000477
  Tags corresponding to Habitat text data object http://eol.org/data_objects/31415353 of EOL Taxon Phoenicopterus ruber (Greater Flamingo): http://eol.org/pages/913221
28.
ENVIRONMENTS Sample Output (rows, column headers and source links as on the previous slide), with the added note: Traversing all IS_A, PART_OF relationships in ENVO
29. Download
  • ENVIRONMENTS Home Page: http://environments.hcmr.gr/
  • Tagger Software: http://download.jensenlab.org/environments_tagger.tar.gz
30. other forms of access
31. http://eol.org/info/discover_what
32. Interactive Curation. ENVIRONMENTS: ID: ENVO:00000150, Name: coral reef. ACTION ES1103. http://www.ncbi.nlm.nih.gov/pubmed/18301735
33-35. Interactive Curation. http://www.ncbi.nlm.nih.gov/pubmed/18301735
36.
Interactive Curation. http://www.ncbi.nlm.nih.gov/pubmed/18301735
37. Not only ENVO terms
38. http://www.ncbi.nlm.nih.gov/pubmed/18301735
39. What else is being identified? Ready for you to discover!
41. Summary
  • Importance of standardized metadata and annotations
  • ENVO: standardized, hierarchically organized descriptions of environment types
  • Literature, project and other scientific content web pages may describe the environment context of a metagenomics sample
  • ENVIRONMENTS:
    • Dictionary-based environment descriptive term identification
    • Ontological community standards, e.g. ENVO: name source
    • Command line application
    • Browser extensions, a user-friendly interface
    • Highly interactive
    • Can be used while browsing the web
    • Extract ENVO from a selected part of a web page
    • Extended for: organism, disease, and tissue mention identification
42. D
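
The dictionary-based matching summarized in the slides (case-insensitive matching, orthographic variants such as freshwater / fresh water / fresh-water, plural and shortened forms, a stopword list) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ENVIRONMENTS tagger's actual implementation: the function names (`variants`, `build_matcher`, `tag`) are hypothetical, the plural rule is a naive "add s", and only ENVO:00000150 (coral reef) is an identifier taken from the slides; the `EXAMPLE:*` identifiers are placeholders. The end coordinate is reported as inclusive, which is what the offsets in the sample output suggest.

```python
import re

# Toy dictionary. Only ENVO:00000150 ("coral reef") is taken from the slides;
# the EXAMPLE:* identifiers are hypothetical placeholders, not real ENVO IDs.
DICTIONARY = {
    "ENVO:00000150": "coral reef",
    "EXAMPLE:0001": "fresh water",
    "EXAMPLE:0002": "epipelagic zone",
    "EXAMPLE:0003": "sediment",
    "EXAMPLE:0004": "spring",
}
STOPWORDS = {"spring", "well", "range"}  # ambiguous words that are never tagged
NON_INFORMATIVE = {"zone", "biome"}      # words dropped to create a shorter variant


def variants(name):
    """Return lower-cased forms of a name: full, shortened, and naive plurals."""
    words = name.lower().split()
    forms = {" ".join(words)}
    trimmed = [w for w in words if w not in NON_INFORMATIVE]
    if trimmed:
        forms.add(" ".join(trimmed))
    forms = {f for f in forms if f not in STOPWORDS}
    return forms | {f + "s" for f in forms}


def build_matcher(dictionary):
    """Compile one case-insensitive regex over all variants, longest first."""
    lookup = {}
    for term_id, name in dictionary.items():
        for form in variants(name):
            lookup.setdefault(form, set()).add(term_id)
    # Longest forms first, so "coral reefs" is tried before "coral reef".
    forms = sorted(lookup, key=lambda f: (-len(f), f))

    def to_regex(form):
        # A space in the dictionary form may appear in text as a space,
        # a hyphen, or nothing at all ("fresh water", "fresh-water", "freshwater").
        return r"[\s-]?".join(re.escape(w) for w in form.split())

    pattern = re.compile(
        r"\b(?:" + "|".join("(" + to_regex(f) + ")" for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return pattern, forms, lookup


def tag(text, matcher):
    """Return (start, inclusive end, matched text, term id) tuples."""
    pattern, forms, lookup = matcher
    hits = []
    for m in pattern.finditer(text):
        form = forms[m.lastindex - 1]  # the capture group tells which variant matched
        for term_id in sorted(lookup[form]):
            hits.append((m.start(), m.end() - 1, m.group(0), term_id))
    return hits


TEXT = (
    "Microbes are key players in both healthy and degraded coral reefs. "
    "Fresh-water sediments near a spring were sampled in the epipelagic."
)
hits = tag(TEXT, build_matcher(DICTIONARY))
for start, end, matched, term_id in hits:
    print(start, end, matched, term_id, sep="\t")
```

The first sentence of `TEXT` is the project description quoted in the slides; the second is made up to exercise the variant rules. The stopword "spring" is left untagged even though it is in the toy dictionary, which mirrors the overmatching safeguard described on the "Improving Accuracy" slide. Expanding each hit to its ancestors via IS_A and PART_OF relationships, as the sample output does, would need the ontology graph and is not attempted here.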