Big Data Analytics in Healthcare: Promise and Potential
Prof. M. Tsiknakis, Dept. of Informatics Engineering, TEI of Crete
& Visiting Professor, CBML, ICS-FORTH
Contents & Structure
Hypothesis vs Data driven Science.
Contemporary Data Explosion.
The Data-Driven Discovery Science.
Promises and Potential:
  Clinical Care
  Understanding disease etiology
  Pharmaceutical industry
Challenges & Pitfalls.
Opportunities for Computer and Data Scientists.
Conclusions.
Saturday, June 11, 2016 1st Data Workshop "Big Data World", Heraklion, Crete
Hypothesis Driven Research
As scientists, we have been raised with the view that the best research is hypothesis-driven.
This belief, part of our scientific culture, is passed from one generation of scientists to the next and has become deeply ingrained.
When training graduate students and fellows, evaluating dissertations and overseeing the writing of scientific papers, we strive to communicate the importance of enunciating a clear hypothesis, defining its scientific antecedents, describing the best path for testing it and reporting on the results in terms of the original hypothesis.
Hypothesis vs Data driven science
The question is not whether hypothesis-driven research should be one way to conduct science but whether it should be the only way.
Data driven Science*.
Data-driven science is not necessarily new.
A compelling argument can be made that the astronomer Tycho Brahe and his assistant Johannes Kepler were doing data-driven science, at least by the scale of their time.
Kepler published the Rudolphine Tables in 1627, some twenty-six years after Brahe's death.
The tables, a catalog of stars and planets, were largely based on Brahe's observations, which were considered to be the most accurate and detailed of the time.
*Ref.: Peter Murray-Rust, "Data-Driven Science: A Scientist's View," NSF/JISC Repositories Workshop position paper, April 10, 2007, <http://www.sis.pitt.edu/~repwkshop/papers/murray.html>.
Some figures
Every 60 seconds there are:
72 hours of footage uploaded to YouTube
216,000 Instagram posts
204,000,000 emails sent
80% of data growth is videos, images and documents.
90% of data generated is unstructured, including tweets, photos and customer purchase orders.
Data Volume
Every day, we create 2.5 quintillion (2.5 × 10^18) bytes of data.
A full 90% of all the data in the world has been generated over the last two years.
This data comes from everywhere: sensors used to gather climate information, posts to social media, digital pictures and videos, and cell phone GPS signals to name a few.
This data is big data.
Dimensions of Big Data
A collection of large and complex data sets which are difficult to process using common database management tools or traditional data processing applications.
The challenges include capturing, storing, searching, sharing & analyzing.
“Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities” – according to zdnet.com
The four dimensions (V’s) of Big Data: Volume, Velocity, Variety, Veracity.
The 5th is VALUE.
Exponential growth of data (size & importance)
Parallels between the growth in size and decay in value of large heterogeneous datasets.
The horizontal axis represents time, whereas the vertical axis shows the value of data.
As we acquire more data at an ever faster rate, its size and value exponentially increase (black curve).
The color curves indicate the exponential decay of the value of data from the point of its fixation (becoming static).
Volume of Biomedical Data
The healthcare industry historically has generated large amounts of data.
Reports say that data from the U.S. healthcare system alone reached, in 2011, 150 exabytes.
At this rate of growth, big data for U.S. healthcare will soon reach the zettabyte scale and, not long after, the yottabyte.
Note: 1 ZB = 1000 exabytes = 1 million petabytes = 1 billion terabytes = 1 trillion gigabytes.
Kaiser Permanente, the California-based health network, which has more than 9 million members, is believed to have between 26.5 and 44 petabytes of potentially rich data from EHRs, including images and annotations.
Ref.: IHTT: Transforming Health Care through Big Data: Strategies for leveraging big data in the health care industry; 2013. http://ihealthtran.com/wordpress/2013/03/iht%C2%B2-releases-big-data-research-reportdownload-today/
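The unit note above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where each step is a factor of 1000; the helper names `to_bytes` and `convert` are illustrative and not part of the talk.

```python
# Decimal (SI) byte units: each step up the list is a factor of 1000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value expressed in the given SI unit to bytes."""
    return value * 1000 ** UNITS.index(unit)

def convert(value, from_unit, to_unit):
    """Convert a value between two SI byte units."""
    return to_bytes(value, from_unit) / 1000 ** UNITS.index(to_unit)

print(convert(1, "ZB", "EB"))    # exabytes in one zettabyte
print(convert(150, "EB", "ZB"))  # the 2011 U.S. healthcare estimate, in zettabytes
print(convert(44, "PB", "EB"))   # the upper Kaiser Permanente estimate, in exabytes
```

On these figures the 2011 U.S. estimate is still well under one zettabyte, and the Kaiser Permanente estimate is a small fraction of an exabyte.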
Variety of Biomedical Data
Big data in healthcare is overwhelming not only because of its volume but also because of the diversity of data types and the speed at which it must be managed.
[Figure: levels of biomedical data: Molecular; Imaging; Organ; EHR (person) related data; Population]
Ref.: Frost & Sullivan: Drowning in Big Data? Reducing Information Technology Complexities and Costs for Healthcare Organizations.
EHR Data: Collection, Coding and Analysis
Coding of data.
Effectively integrating and efficiently analyzing various forms of healthcare data over a period of time can help address many of the impending healthcare problems.
Jensen, Peter B., Lars J. Jensen, and Soren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012).
Biomedical (Genomic) Data Volume
Genomics poses some of the same challenges as astronomy, atmospheric science, crop science & particle physics.
Storing and processing genome data will exceed the computing challenges of running YouTube and Twitter, biologists warn.
The report says that this outstrips YouTube’s projected annual storage needs of 1,2 exabytes of video by 2025 and Twitter’s projected 1,7 petabytes per year.
It even exceeds the 1 exabyte per year projected for what will be the world’s largest astronomy project, the Square Kilometer Array, to be sited in South Africa and Australia.
SOURCE: EMBL–EBI
Ref.: Frost & Sullivan: Drowning in Big Data? Reducing Information Technology Complexities and Costs for Healthcare Organizations.
Flagship projects: The 100,000 Genome project
Our genome contains all our 20,000 genes.
It is all 3.2 billion letters of our DNA.
One sequenced genome equals about 200 billion bytes, or 200 GBytes, of data.
It is estimated that half of all Britons will get some form of cancer at some point in their lives.
A rare disease is one that affects 1 in 2,000 people or fewer.
Source: http://www.genomicsengland.co.uk/the-100000-genomes-project-by-numbers/
Plan: 100,000 genomes project.
There are over 100 rare diseases included in the effort.
70,000 patients and their families.
21 petabytes of data.
1 petabyte of music would take 2,000 years to play on an MP3 player.
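The storage figures on this slide can be sanity-checked with simple arithmetic. The sketch below takes the 200 GB per genome figure quoted above and, as an assumption that is not on the slide, an MP3 bitrate of 128 kbit/s; the function name `playtime_years` is illustrative.

```python
GB = 10**9   # decimal gigabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes

# Figures quoted on the slide.
genomes = 100_000
bytes_per_genome = 200 * GB

total_pb = genomes * bytes_per_genome / PB
print(f"{genomes:,} genomes at 200 GB each: {total_pb:.0f} PB")

# Assumption (not from the slide): MP3 audio encoded at 128 kbit/s.
bitrate_bytes_per_s = 128_000 / 8
seconds_per_year = 365.25 * 24 * 3600

def playtime_years(n_bytes):
    """Years of continuous playback for n_bytes of audio at the assumed bitrate."""
    return n_bytes / bitrate_bytes_per_s / seconds_per_year

print(f"1 PB of MP3 audio: about {playtime_years(PB):.0f} years of playback")
```

Under these assumptions the raw genome data comes to roughly 20 PB, close to the 21 PB quoted, and the playback time is of the same order as the 2,000 years on the slide.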
Biomedical Text – mining and analysis
Medline currently contains 18 Million abstracts (scientific papers).
An average of about 40,000 to 50,000 abstracts are added every month.
Turning these data into medical or biological insight remains a key challenge.
Two IT fields have a lot to offer and play a promising role: Information Retrieval (IR) and Text Mining (TM).
IR is concerned with the automatic identification of relevant documents from large text collections.
TM is the application of techniques from machine learning in conjunction with natural language processing, and statistical/mathematical approaches to extract useful knowledge from text.
Both have been applied successfully* to various problems such as:
Intelligent Information Retrieval
Biomedical text sub-classification and clustering
Biomedical concept identification
Concept relation extraction, etc.
*Ref: P. Sfakianaki, M. Tsiknakis, et al, Semantic biomedical resource discovery: a Natural Language Processing framework, BMC Medical Informatics and Decision Making 2015, 15:77
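As an illustration of what IR means in practice, the sketch below ranks a handful of abstracts against a free-text query using TF-IDF weights and cosine similarity. This is one common baseline technique, not the method of the cited framework; the abstracts are invented and all function names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical abstracts, invented for illustration only.
ABSTRACTS = [
    "Aspirin reduces the risk of myocardial infarction in adults.",
    "Text mining methods extract gene and protein names from abstracts.",
    "Metformin, a diabetes drug, shows activity against breast cancer cells.",
]

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return one TF-IDF vector (term -> weight) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    # +1 so that a term present in every document still keeps a small weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs):
    """Rank documents against the query; returns (score, doc_index) pairs, best first."""
    vectors, idf = build_index(docs)
    # Query terms that never occur in the collection get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)

for score, i in search("drug for cancer", ABSTRACTS):
    print(f"{score:.2f}  {ABSTRACTS[i]}")
```

A production system over millions of Medline abstracts would add an inverted index, stemming and domain vocabularies; the sketch only shows the ranking idea.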
Natural Language Processing Analytics for Unstructured Data (IBM’s Watson)
Watson is a collection of overlapping reasoning algorithms that address specific portions of the pipeline used for problem understanding and problem-solving.
For example, a specific instance of Watson might have specialized algorithms for:
understanding the question
• Natural Language Processing, query expansion with synonyms, dictionaries, ontologies, language translation, speech translation, spelling correction.
making hypotheses
• Indexing a corpus of data, searching for relevant passages, concept annotations, passage expansion, passage filtering, passage scoring.
answer selection and scoring
• Deep parsing, semantic matching, answer similarity, lexical matching, temporal reasoning, geospatial reasoning, negation, knowledge graphs.
machine learning
• Logistic regression, Bayesian networks, similarity learning.
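The first three stages listed above can be mimicked in miniature. The sketch below is a toy illustration of query expansion, passage filtering and lexical scoring; it is not IBM's implementation, and the synonym dictionary, passages and function names are all hypothetical.

```python
import re

# Hypothetical synonym dictionary; a real system would draw on curated ontologies.
SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}

# Hypothetical corpus of passages.
PASSAGES = [
    "Hypertension is a major risk factor for myocardial infarction.",
    "Regular exercise lowers blood glucose in type 2 diabetes.",
    "Smoking cessation improves lung function within months.",
]

def words(text):
    """Set of lowercase alphanumeric tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def expand_query(question):
    """Understanding the question: add synonyms for any known phrase it contains."""
    q = question.lower()
    terms = words(q)
    for phrase, alternatives in SYNONYMS.items():
        if phrase in q:
            for alt in alternatives:
                terms |= words(alt)
    return terms

def candidate_passages(terms, passages):
    """Making hypotheses: keep passages sharing at least one term with the query."""
    return [p for p in passages if terms & words(p)]

def score(terms, passage):
    """Answer scoring: fraction of the passage's words matched by the expanded query."""
    pw = words(passage)
    return len(terms & pw) / len(pw)

def answer(question, passages=PASSAGES):
    """Return the best-scoring passage, or None if nothing matches."""
    terms = expand_query(question)
    ranked = sorted(candidate_passages(terms, passages),
                    key=lambda p: score(terms, p), reverse=True)
    return ranked[0] if ranked else None

print(answer("What causes a heart attack?"))
```

Without the synonym step the question would share only the word "a" with the relevant passage, which is why query expansion appears first in the list above.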
Pharma - Drug repositioning
Key application area of IR and TM.
Scope of Drug repositioning: identifying and developing new uses for existing drugs.
Despite enormous increases in spending on novel technologies over the last several years, R&D productivity has actually decreased since the mid-1990s, as measured by the number of new drugs approved per year.
Drug Repositioning: Methods
ICT have a key role to play
Exploitation Cases – BIOVISTA S.A.
A very innovative Greek biotechnology company.
Leading the drug repositioning market through exploitation of their literature mining software platform.
http://www.biovista.com/
Exploiting big social network data
Social Determinants of Health
Need to evaluate the feasibility and potential of exploring Social Determinants of Health (SDH) [1] and Community Vital Signs [2] (community and geographic determinants of health) to build new hypotheses and help predict and improve patient outcomes.
Methods for extracting information from Social Networks [3] can strengthen the analysis made with the more reliable and coherent cohort data and can provide relevant hypotheses to be further verified with the cohort datasets.
[1] https://www.healthypeople.gov/2020/topics-objectives/topic/social-determinants-of-health
[2] Andrew W Bazemore et al., “Community Vital Signs”: Incorporating geocoded social determinants into electronic records to promote patient and population health. http://dx.doi.org/10.1093/jamia/ocv088. First published online: 13 July 2015.
[3] Bian J, Topaloglu U, Yu F, Yu F: Towards Large-scale Twitter Mining for Drug related Adverse Events. Maui, Hawaii: SHB; 2012.
Multi-scale modelling
Biological systems span many orders of magnitude through the scales in a continuous way, from the smallest microscopic scales up to the largest macroscopic ones.
The sequence from the genome, proteome, metabolome, physiome to health comprises multi-scale, multi-science systems.
In many cases, we can select an appropriate scale at which we wish to study a natural system.
The history of science has shown how fruitful this approach has been.
In recent years the computational biology community has developed extremely powerful methods to model and simulate fundamental processes of a natural system on a multitude of separate scales.
The wealth of experimental data that has become available has made such in silico experimenting a viable methodology, which should allow for testing hypotheses and formulating predictions to be further tested in in vitro or in vivo studies.
Data-driven modelling of biological multi-scale processes
Predictive modeling of biological processes and drugs becomes significantly more sophisticated and widespread.
By leveraging the diversity of available molecular and clinical data, predictive modeling could help identify new potential candidate molecules with a high probability of being successfully developed into drugs that act on biological targets safely and effectively.
Individualize therapy.
ViroLab (http://www.virolab.org/) is a multi-scale modelling, simulation and data mining environment for infectious diseases.
Selection of well-known multi-scale models
Biological scales captured by the model, mathematical modelling approaches, topics and keywords.
Examples: Computational Horizons in Cancer (CHIC)
Computational Horizons In Cancer (CHIC): Developing Meta- and Hyper-Multiscale Models and Repositories for In Silico Oncology.
7th Framework Programme of the European Commission - ICT - Large-scale Integrating Project (IP)
http://chic-vph.eu/project/
Use of semantics for model retrieval, alignment and integration
Demanding computational issues
Marvin Schulz, et al, Retrieval, alignment, and clustering of computational models based on semantic annotations, Molecular Systems Biology 7; Article number 512; doi:10.1038/msb.2011.41
PM-04–2016: Networking and optimising the use of population and patient cohorts at EU level
Cohort autonomy – legal framework.
Efficient cohort exploration.
Easy cohort definition and sharing with visual interfaces.
Cohort refinement and expansion. Clinical domain experts should be able to easily constrain and/or expand cohorts based on discovered findings as part of their exploration.
Flexible visualization.
Flexible and advanced analytics.
Iterative analysis. The above requirements should be supported within an iterative process that allows refinement and exploration during an open-ended investigation.
Advanced computational infrastructure. The above complex requirements and analytical pipelines demand that an appropriate high-performing computational and storage infrastructure is in place.
Participating cohorts, population and disease registries
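The requirements for cohort definition, refinement and expansion can be made concrete with a small data structure. The sketch below is a minimal illustration under assumed names (`Cohort`, `refine`, `expand`, `select`) and invented patient records; it does not describe any particular cohort platform.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cohort:
    """A cohort definition held as groups of criteria.

    A patient belongs to the cohort if ANY group matches,
    and a group matches if ALL of its criteria hold.
    """
    groups: tuple = ()

    def refine(self, criterion):
        """Constrain the cohort: add the criterion to every group (logical AND)."""
        return Cohort(tuple(g + (criterion,) for g in self.groups))

    def expand(self, *criteria):
        """Expand the cohort: add an alternative group of criteria (logical OR)."""
        return Cohort(self.groups + (tuple(criteria),))

    def select(self, patients):
        """Return the patients that satisfy the cohort definition."""
        return [p for p in patients
                if any(all(c(p) for c in g) for g in self.groups)]

# Hypothetical patient records, invented for illustration.
patients = [
    {"id": 1, "age": 67, "diagnosis": "T2D"},
    {"id": 2, "age": 45, "diagnosis": "T2D"},
    {"id": 3, "age": 71, "diagnosis": "COPD"},
    {"id": 4, "age": 38, "diagnosis": "asthma"},
]

diabetics = Cohort().expand(lambda p: p["diagnosis"] == "T2D")
older_diabetics = diabetics.refine(lambda p: p["age"] >= 60)
with_copd = older_diabetics.expand(lambda p: p["diagnosis"] == "COPD")

for name, cohort in [("diabetics", diabetics),
                     ("older diabetics", older_diabetics),
                     ("plus COPD", with_copd)]:
    print(name, [p["id"] for p in cohort.select(patients)])
```

Because each step returns a new immutable definition, earlier cohorts stay available, which suits the iterative, open-ended exploration described above.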
Efficient exploration and visualization
Retrospective use of clinical data by clinical researchers covers a wide variety of research investigations, including:
assessment of clinical trial feasibility,
generation and retrospective validation of research hypotheses,
data exploration, quality assessment, etc.
Demand for advanced visualisation tools
Advanced, secure computational framework
The type and complexity of the analytical and knowledge discovery services pose huge demands on the computing platform.
Need to establish elastically scalable big data clusters that can respond to varying workload demands.
Integrating Hadoop (http://hadoop.apache.org/) and Spark (http://spark.apache.org) data processing on OpenStack (https://www.openstack.org/).
Ref.: DePristo MA, et al: A framework for variation discovery and genotyping using NGS data. Nat Genet 2011, 43:491–8.
OHDSI – Open Source Big Data Analytics in Healthcare
Variety of novel computational and analytical tools
International Flagship Initiatives - OHDSI
Healthcare Analytics in the Electronic Era
Old way: Data are expensive and small
  Input data are from clinical trials, which are small and costly
  Modeling effort is small since the data is limited
Big Data era: Data are cheap and large
  Broader patient population
  Noisy data
  Heterogeneous data
  Diverse scale
  Longitudinal records
Healthcare Institutions are moving to Big Data Architectures
Enterprise Big Data Architectures
Big Data Landscape
The big data landscape for computer and information engineers
The science of Big Data
Over the past century, scientific advances in medicine have generally been made using a “frequentist” approach to statistical analysis: samples of populations are studied and the results from the samples are extrapolated to estimate the effects of the intervention being studied.
For most types of experiment, sampling data is sufficient to build an effective picture of the entire dataset and, statistically, we can give high levels of accuracy to predictions based on relatively small samples.
Data collected in this way is often of very high quality.
To ensure the sample is representative and accurate, the data is collected and ‘cleaned’ with great care.
This extra care is often very expensive, however, and over the last few decades we have seen the costs of running large randomized controlled trials spiral upwards.
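The claim that relatively small samples give accurate estimates can be quantified with the standard margin-of-error formula for a proportion. The sketch below assumes simple random sampling from a large population and a 95% confidence level; the function names are illustrative.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence level

def margin_of_error(p, n, z=Z_95):
    """Half-width of the confidence interval for a proportion p estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(margin, p=0.5, z=Z_95):
    """Smallest n whose margin of error does not exceed the target (worst case at p = 0.5)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

for n in (100, 1_000, 10_000):
    print(f"n={n:6d}  margin of error = ±{margin_of_error(0.5, n):.3f}")

print("samples needed for ±0.03:", required_sample_size(0.03))
```

The margin shrinks with the square root of the sample size, so halving the error requires four times as many carefully collected observations, which is one reason trial costs rise so quickly.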
The science of Big Data
Big Data offers a potential solution to this issue.
Although data produced from such sources as social networking communities, EHR systems, and wearable devices are generally of much lower quality than data carefully collected by researchers looking to answer specific questions, the sheer volume of the data may outweigh their messiness.
In addition, there is also a trend to higher quality ‘big data’ collection, such as the data produced in genomic analysis and structured data that can be generated from standard-compliant EHR systems.
As the percentage of the population being sampled approaches 100%, messy data can have greater predictive power than highly cleaned and carefully collected data that might only be a sample of 1% of the target population for the researcher.
The quantity of data alters the approaches used to relate, utilize, and understand data.
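A small simulation can illustrate the trade-off between a clean 1% sample and messy data covering the whole population. Everything below is a sketch on synthetic numbers: the population, the noise levels and the bias are assumptions chosen for illustration, not figures from the talk.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

N = 100_000  # size of the hypothetical target population
population = [random.gauss(120, 15) for _ in range(N)]  # a synthetic clinical measurement
true_mean = statistics.fmean(population)

# Clean data: a 1% random sample, measured without error.
clean = random.sample(population, N // 100)

def messy_copy(values, noise_sd, bias=0.0):
    """Every member of the population, each value recorded with random noise
    and an optional systematic bias."""
    return [v + random.gauss(bias, noise_sd) for v in values]

messy_unbiased = messy_copy(population, noise_sd=45)
messy_biased = messy_copy(population, noise_sd=45, bias=5)

for label, data in [("clean 1% sample", clean),
                    ("messy, unbiased", messy_unbiased),
                    ("messy, biased", messy_biased)]:
    err = statistics.fmean(data) - true_mean
    print(f"{label:16s} n={len(data):6d}  error in mean = {err:+.3f}")
```

With zero-mean noise the expected error of the messy estimate is smaller than that of the clean sample, even though each record is three times noisier than the natural spread. A systematic bias, by contrast, does not average out however much data is collected, which is one of the pitfalls volume alone cannot fix.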
Benefits
While researchers are still debating the definitions and boundaries of Big Data in health, benefits of health-related Big Data have been demonstrated in several areas so far, namely:
designing new therapies and candidate drugs,
prevention of disease,
identification of modifiable risk factors for disease, and
designing interventions for health behavior change.
Conclusions
Big data denotes our capacity to gain insights from (in relative terms!) large amounts of data that we could not have had by just looking at samples.
Our difficulty in working with data has shaped our methods in the small data age.
As these limitations with respect to data diminish, we will have to rethink and adjust our scientific methods.
In return, we will gain a wealth of new insights, perhaps leading towards a new golden era of scientific discovery.
The power of Big Data demands, however, that we are also aware of its limitations and the significant dangers of abusing it.
Ref: Mayer-Schönberger V, Cukier K. Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt; 2013. Chapter 2.
Q&A