
Big Data Analytics in Healthcare: Promise and Potential

Prof. M. Tsiknakis, Department of Informatics Engineering, TEI of Crete

& Visiting Professor, CBML, ICS-FORTH

Contents & Structure

Hypothesis vs Data driven Science.

Contemporary Data Explosion.

The Data-Driven Discovery Science.

Promises and Potential.
• Clinical Care
• Understanding disease etiology
• Pharmaceutical industry

Challenges & Pitfalls.

Opportunities for Computer and Data Scientists.

Conclusions.

Saturday, June 11, 2016 1st Data Workshop "Big Data World", Heraklion, Crete

Hypothesis Driven Research

As scientists, we have been raised with the view that the best research is hypothesis-driven.

This belief, part of our scientific culture, is passed from one generation of scientists to the next and has become deeply ingrained.

When training graduate students and fellows, evaluating dissertations and overseeing the writing of scientific papers, we strive to communicate the importance of enunciating a clear hypothesis, defining its scientific antecedents, describing the best path for testing it and reporting on the results in terms of the original hypothesis.



Hypothesis vs Data driven science

The question is not whether hypothesis-driven research should be one way to conduct science but whether it should be the only way.

Data driven Science*.

Data-driven science is not necessarily new.

A compelling argument can be made that the astronomer Tycho Brahe and his assistant Johannes Kepler were doing data-driven science, at least by the scale of their time.

Kepler published the Rudolphine Tables in 1627, some twenty-six years after Brahe's death.

The tables, a catalog of stars and planets, were largely based on Brahe's observations, which were considered to be the most accurate and detailed of the time.

*Ref.: Peter Murray-Rust, "Data-Driven Science: A Scientist's View," NSF/JISC Repositories Workshop position paper, April 10, 2007, <http://www.sis.pitt.edu/~repwkshop/papers/murray.html>.



Some figures

Every 60 seconds there are:
• 72 hours of footage uploaded to YouTube
• 216,000 Instagram posts
• 204,000,000 emails sent

80% of data growth is videos, images and documents.

90% of data generated is unstructured, including tweets, photos and customer purchase orders.

Data Volume

Every day, we create 2.5 quintillion (2.5 × 10^18) bytes of data.

A full 90% of all the data in the world has been generated over the last two years.

This data comes from everywhere: sensors used to gather climate information, posts to social media, digital pictures and videos, and cell phone GPS signals to name a few.

This data is big data.


Dimensions of Big Data

A collection of large and complex data sets which are difficult to process using common database management tools or traditional data processing applications.

The challenges include capturing, storing, searching, sharing & analyzing.

“Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities” – according to zdnet.com

The four dimensions (V’s) of Big Data: Volume, Velocity, Variety, Veracity.

The 5th is Value.



Exponential growth of data (size & importance)

Parallels between the growth in size and decay in value of large heterogeneous datasets.

The horizontal axis represents time, whereas the vertical axis shows the value of data.

As we acquire more data at an ever faster rate, its size and value exponentially increase (black curve).

The color curves indicate the exponential decay of the value of data from the point of its fixation (becoming static).
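The two families of curves described above can be sketched numerically. This is only an illustration: the slide gives no rates, so `GROWTH_RATE`, `DECAY_RATE` and both function names are assumptions chosen to make the shape visible.

```python
import math

GROWTH_RATE = 0.5  # assumed rate of the black (growth) curve, per unit time
DECAY_RATE = 1.0   # assumed rate at which static data loses value, per unit time

def live_value(t):
    """Value at time t of data that keeps being acquired (black curve)."""
    return math.exp(GROWTH_RATE * t)

def fixed_value(t, t_fixed):
    """Value at time t of a dataset that became static at t_fixed (color curve)."""
    if t < t_fixed:
        return live_value(t)
    return live_value(t_fixed) * math.exp(-DECAY_RATE * (t - t_fixed))

# Compare live data with a dataset frozen at t = 2
for t in range(6):
    print(t, round(live_value(t), 2), round(fixed_value(t, t_fixed=2), 2))
```

Each frozen dataset starts on the growth curve at its fixation time and then falls away from it, which is the pattern the figure describes.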



Volume of Biomedical Data

The healthcare industry historically has generated large amounts of data.

Reports say that data from the U.S. healthcare system alone reached, in 2011, 150 exabytes.

At this rate of growth, big data for U.S. healthcare will soon reach the zettabyte scale and, not long after, the yottabyte.

Note: 1 ZB = 1000 exabytes = 1 million petabytes = 1 billion terabytes = 1 trillion gigabytes.
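The unit ladder in the note can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, as the note does; the helper name `convert` is hypothetical.

```python
# Decimal (SI) byte units; assumption: powers of 1000, not 1024.
UNITS = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
    "YB": 10**24,
}

def convert(value, src, dst):
    """Convert `value` expressed in unit `src` into unit `dst`."""
    return value * UNITS[src] / UNITS[dst]

# 1 ZB expressed in each of the smaller units listed in the note
for unit in ("EB", "PB", "TB", "GB"):
    print(f"1 ZB = {convert(1, 'ZB', unit):,.0f} {unit}")

# The 150 exabytes reported for 2011, as a fraction of a zettabyte
print(f"150 EB = {convert(150, 'EB', 'ZB')} ZB")
```

On these figures the 2011 volume is 0.15 ZB, so a further growth factor of roughly seven would take U.S. healthcare data to the zettabyte scale.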

Kaiser Permanente, the California-based health network, which has more than 9 million members, is believed to have between 26.5 and 44 petabytes of potentially rich data from EHRs, including images and annotations.

IHTT: Transforming Health Care through Big Data: Strategies for leveraging big data in the health care industry; 2013. http://ihealthtran.com/wordpress/2013/03/iht%C2%B2-releases-big-data-research-reportdownload-today/


Variety of Biomedical Data

Big data in healthcare is overwhelming not only because of its volume but also because of the diversity of data types and the speed at which it must be managed.

• Molecular
• Imaging
• Organ
• EHR (person) related data
• Population

Ref.: Frost & Sullivan: Drowning in Big Data? Reducing Information Technology Complexities and Costs for Healthcare Organizations.


EHR Data: Collection, Coding and Analysis

Coding of data.

Effectively integrating and efficiently analyzing various forms of healthcare data over a period of time can answer many of the impending healthcare problems.


Jensen, Peter B., Lars J. Jensen, and Soren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012).


Biomedical (Genomic) Data Volume

Genomics poses some of the same challenges as astronomy, atmospheric science, crop science & particle physics.

Biologists warn that storing and processing genome data will exceed the computing challenges of running YouTube and Twitter.

The report says that the storage demand of genome data outstrips YouTube’s projected annual storage needs of 1.2 exabytes of video by 2025 and Twitter’s projected 1.7 petabytes per year.

It even exceeds the 1 exabyte per year projected for what will be the world’s largest astronomy project, the Square Kilometer Array, to be sited in South Africa and Australia.

SOURCE: EMBL–EBI


Ref.: Frost & Sullivan: Drowning in Big Data? Reducing Information Technology Complexities and Costs for Healthcare Organizations.

Flagship projects: The 100,000 Genomes Project

Our genome contains all our 20,000 genes.

It is all 3.2 billion letters of our DNA.

One sequenced genome equals about 200 billion bytes, or 200 GBytes, of data.

It is estimated that half of all Britons will get some form of cancer at some point in their lives.

A rare disease is one that affects 1 in 2,000 people or fewer.


Source: http://www.genomicsengland.co.uk/the-100000-genomes-project-by-numbers/

Plan: 100,000 Genomes Project.

There are over 100 rare diseases included in the effort.

70,000 patients and their families.

21 petabytes of data. For scale, 1 petabyte of music would take 2,000 years to play on an MP3 player.
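The storage figures quoted for the project can be sanity-checked with simple arithmetic. The sketch below assumes decimal units and 200 GB per sequenced genome; the variable names are illustrative.

```python
GB = 10**9   # decimal units (assumption)
PB = 10**15

GENOMES = 100_000
BYTES_PER_GENOME = 200 * GB  # per-genome figure of 200 GB

total_bytes = GENOMES * BYTES_PER_GENOME
print(f"Raw sequence data: {total_bytes / PB:.0f} PB")

# Playback rate implied by "1 PB of music lasts 2,000 years"
SECONDS_PER_YEAR = 365.25 * 24 * 3600
bytes_per_second = PB / (2000 * SECONDS_PER_YEAR)
kbps = bytes_per_second * 8 / 1000
print(f"Implied MP3 bitrate: {kbps:.0f} kbit/s")
```

The product comes to 20 PB, close to the 21 PB quoted, and the music comparison implies a bitrate of roughly 127 kbit/s, which is in line with a common MP3 encoding rate.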


Biomedical Text – mining and analysis

Medline currently contains 18 million abstracts of scientific papers.

An average of about 40,000 to 50,000 abstracts are added every month.

Turning these data into medical or biological insight remains a key challenge.

Two IT fields have a lot to offer and play a promising role: Information Retrieval (IR) and Text Mining (TM).

IR is concerned with the automatic identification of relevant documents from large text collections.

TM is the application of techniques from machine learning, in conjunction with natural language processing and statistical/mathematical approaches, to extract useful knowledge from text.

Both have been applied successfully* to various problems such as:
• Intelligent information retrieval
• Biomedical text sub-classification and clustering
• Biomedical concept identification
• Concept relation extraction, etc.
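To make the IR task concrete, the following minimal sketch ranks a handful of abstracts against a query using TF-IDF weighting. It is not the framework cited on this slide: the scoring variant, the function names `tokenize` and `rank`, and the toy abstracts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, documents):
    """Return document indices sorted by descending TF-IDF score for the query."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term]) + 1.0  # +1 keeps common terms above zero
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))
    # Highest score first; ties keep the original document order
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]

abstracts = [  # toy abstracts invented for illustration
    "Metformin reduces tumour growth in breast cancer cell lines",
    "Statin therapy and cardiovascular risk in elderly patients",
    "Gene expression profiling of breast cancer subtypes",
]
print(rank("breast cancer drug", abstracts))  # -> [2, 0, 1]
```

The shorter third abstract ranks first because term frequency is normalized by document length. Systems working at Medline scale would add an inverted index, stemming and domain ontologies on top of this basic idea.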


*Ref.: P. Sfakianaki, M. Tsiknakis, et al., Semantic biomedical resource discovery: a Natural Language Processing framework, BMC Medical Informatics and Decision Making 2015, 15:77.

Natural Language Processing Analytics for Unstructured Data (IBM’s Watson)

Watson is a collection of overlapping reasoning algorithms that address specific portions of the pipeline used for problem understanding and problem-solving.

For example, a specific instance of Watson might have specialized algorithms for:

understanding the question
• Natural Language Processing, query expansion with synonyms, dictionaries, ontologies, language translation, speech translation, spelling correction.

making hypotheses
• Indexing a corpus of data, searching for relevant passages, concept annotations, passage expansion, passage filtering, passage scoring.

answer selection and scoring
• Deep parsing, semantic matching, answer similarity, lexical matching, temporal reasoning, geospatial reasoning, negation, knowledge graphs.

machine learning
• Logistic regression, Bayesian networks, similarity learning.



Pharma - Drug repositioning

Key application area of IR and TM.

Scope of Drug repositioning: identifying and developing new uses for existing drugs.

Despite enormous increases in spending on novel technologies over the last several years, R&D productivity has actually decreased since the mid-1990s, as measured by the number of new drugs approved per year.



Drug Repositioning: Methods


ICT have a key role to play

Exploitation Cases – BIOVISTA S.A.

A very innovative Greek biotechnology company.

Leading the drug repositioning market through exploitation of their literature mining software platform.


http://www.biovista.com/

Social Determinants of Health

Exploiting big social network data

Need to evaluate the feasibility and potential of exploring Social Determinants of Health (SDH) [1] and Community Vital Signs [2] (community and geographic determinants of health) to build new hypotheses and help predict and improve patient outcomes.

Information extracted from Social Networks [3] can strengthen the analysis made with the more reliable and coherent cohort data and can provide relevant hypotheses to be further verified with the cohort datasets.

[1] https://www.healthypeople.gov/2020/topics-objectives/topic/social-determinants-of-health

[2] Andrew W Bazemore et al., “Community Vital Signs”: Incorporating geocoded social determinants into electronic records to promote patient and population health. http://dx.doi.org/10.1093/jamia/ocv088. First published online: 13 July 2015.

[3] Bian J, Topaloglu U, Yu F, Yu F: Towards Large-scale Twitter Mining for Drug related Adverse Events. Maui, Hawaii: SHB; 2012.


Multi-scale modelling

Biological systems span many orders of magnitude through the scales in a continuous way, from the smallest microscopic scales up to the largest macroscopic ones.

The sequence from the genome, proteome, metabolome, physiome to health comprises multi-scale, multi-science systems.

In many cases, we can select an appropriate scale at which we wish to study a natural system.

The history of science has shown how fruitful this approach has been.

In recent years the computational biology community has developed extremely powerful methods to model and simulate fundamental processes of a natural system on a multitude of separate scales.

The wealth of experimental data that has become available has made such in silico experimenting a viable methodology, which should allow for testing hypotheses and formulating predictions to be further tested in in vitro or in vivo studies.



Data-driven modelling of biological multi-scale processes


Predictive modeling of biological processes and drugs becomes significantly more sophisticated and widespread.

By leveraging the diversity of available molecular and clinical data, predictive modeling could help identify new potential candidate molecules with a high probability of being successfully developed into drugs that act on biological targets safely and effectively.

Individualize therapy.

ViroLab (http://www.virolab.org/) is a multi-scale modelling, simulation and data mining environment for infectious diseases.


Selection of well-known multi-scale models

Biological scales captured by the model, mathematical modelling approaches, topics and keywords.


Examples: Computational Horizons in Cancer (CHIC)

Computational Horizons In Cancer (CHIC): Developing Meta- and Hyper-Multiscale Models and Repositories for In Silico Oncology.

7th Framework Programme of the European Commission - ICT - Large-scale Integrating Project (IP)

http://chic-vph.eu/project/







Use of semantics for model retrieval, alignment and integration

Demanding computational issues

Marvin Schulz, et al, Retrieval, alignment, and clustering of computational models based on semantic annotations, Molecular Systems Biology 7; Article number 512; doi:10.1038/msb.2011.41



PM-04–2016: Networking and optimising the use of population and patient cohorts at EU level

Cohort autonomy – legal framework.

Efficient cohort exploration.

Easy cohort definition and sharing with visual interfaces.

Cohort refinement and expansion. Clinical domain experts should be able to easily constrain and/or expand cohorts based on discovered findings as part of their exploration.

Flexible visualization.

Flexible and advanced analytics.

Iterative analysis. The above requirements should be supported within an iterative process that allows refinement and exploration during an open-ended investigation.

Advanced computational infrastructure. The above complex requirements and analytical pipelines demand that an appropriate high-performing computational and storage infrastructure is in place.

Participating cohorts, population and disease registries
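The cohort refinement and expansion requirement listed above can be pictured as a pair of operations that an analyst applies repeatedly. The sketch below is a hypothetical illustration, not part of the PM-04–2016 call: the `Patient` fields, the `constrain` and `expand` helpers and the sample records are all invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Patient:
    patient_id: str
    age: int
    diagnosis: str
    smoker: bool

def constrain(cohort, predicate):
    """Refinement: keep only the cohort members that satisfy the predicate."""
    return [p for p in cohort if predicate(p)]

def expand(cohort, population, predicate):
    """Expansion: add matching patients from the wider population, without duplicates."""
    present = {p.patient_id for p in cohort}
    extra = [p for p in population if predicate(p) and p.patient_id not in present]
    return cohort + extra

population = [  # invented records
    Patient("p1", 67, "COPD", True),
    Patient("p2", 54, "COPD", False),
    Patient("p3", 71, "asthma", True),
    Patient("p4", 45, "asthma", False),
]

# One iterative exploration: start broad, narrow down, then widen on a finding
cohort = constrain(population, lambda p: p.diagnosis == "COPD")          # p1, p2
cohort = constrain(cohort, lambda p: p.age >= 60)                        # p1
cohort = expand(cohort, population, lambda p: p.smoker and p.age >= 60)  # p1, p3
print([p.patient_id for p in cohort])  # -> ['p1', 'p3']
```

Keeping each step a pure function over a list makes the exploration easy to replay or share, which is one plausible way to support the iterative analysis described above.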



Efficient exploration and visualization


Retrospective use of clinical data by clinical researchers covers a wide variety of research investigations including assessment of

clinical trial feasibility,

generation and retrospective validation of research hypotheses,

data exploration, quality assessment, etc.

Demand for advanced visualisation tools

Advanced, secure computational framework

The type and complexity of the analytical and knowledge discovery services pose huge demands on the computing platform.

Need to establish elastically scalable big data clusters that can respond to varying workload demands.

Integrating Hadoop (http://hadoop.apache.org/) and Spark (http://spark.apache.org) data processing on OpenStack (https://www.openstack.org/).

Ref.: DePristo MA, et al: A framework for variation discovery and genotyping using NGS data. Nat Genet 2011, 43:491–8.
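Hadoop and Spark both build on the map/shuffle/reduce processing model. The sketch below imitates that model in plain Python on a single machine; it does not use the Hadoop or Spark APIs, and the record layout and function names are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce

def map_phase(record):
    """Emit one (diagnosis_code, 1) pair for every code in a patient record."""
    return [(code, 1) for code in record["codes"]]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values)

records = [  # invented records carrying ICD-10 style codes
    {"patient": "a", "codes": ["E11", "I10"]},
    {"patient": "b", "codes": ["I10"]},
    {"patient": "c", "codes": ["E11", "I10", "J44"]},
]

pairs = [pair for record in records for pair in map_phase(record)]
counts = {code: reduce_phase(values) for code, values in shuffle(pairs).items()}
print(counts)  # -> {'E11': 2, 'I10': 3, 'J44': 1}
```

On a real cluster the map and reduce calls run in parallel on different nodes, which is why workloads of this shape benefit from elastically adding or removing workers.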




OHDSI – Open Source Big Data Analytics in Healthcare

Variety of novel computational and analytical tools

International Flagship Initiatives - OHDSI


Healthcare Analytics in the Electronic Era

Old way: Data are expensive and small
• Input data are from clinical trials, which are small and costly
• Modeling effort is small since the data is limited

Big Data era: Data are cheap and large
• Broader patient population
• Noisy data
• Heterogeneous data
• Diverse scale
• Longitudinal records



Healthcare Institutions are moving to Big Data Architectures

Enterprise Big Data Architectures



Big Data Landscape


The big data landscape for computer and information engineers

The science of Big Data

Over the past century, scientific advances in medicine have generally been made using a “frequentist” approach to statistical analysis:

Samples of populations are studied and the results from the samples are extrapolated to estimate the effects of the intervention being studied.

For most types of experiment, sampling data is sufficient to build an effective picture of the entire dataset and, statistically, we can give high levels of accuracy to predictions based on relatively small samples.

Data collected in this way is often of very high quality.

To ensure the sample is representative and accurate, the data is collected and ‘cleaned’ with great care.

This extra care is often very expensive, however, and over the last few decades we have seen the costs of running large randomized controlled trials spiral upwards.


The science of Big Data

Big Data offers a potential solution to this issue.

Although data produced from such sources as social networking communities, EHR systems, and wearable devices are generally of much lower quality than data carefully collected by researchers looking to answer specific questions, the sheer volume of the data may outweigh their messiness.

In addition, there is also a trend towards higher-quality ‘big data’ collection, such as the data produced in genomic analysis and structured data that can be generated from standard-compliant EHR systems.

As the percentage of the population being sampled approaches 100%, messy data can have greater predictive power than highly cleaned and carefully collected data that might only be a sample of 1% of the target population for the researcher.

The quantity of data alters the approaches used to relate, utilize, and understand data.
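The trade-off between a small clean sample and near-complete messy data can be illustrated with a small simulation. This is a simplified sketch under strong assumptions: it estimates a population mean rather than building a predictive model, the noise on the messy data is zero-mean, and every parameter value and name is chosen for illustration only.

```python
import random
import statistics

def rmse(squared_errors):
    """Root mean squared error from a list of squared errors."""
    return statistics.fmean(squared_errors) ** 0.5

def simulate(pop_size=20_000, sample_frac=0.01, noise_sd=3.0, trials=50, seed=0):
    """Compare a clean 1% sample with noisy measurements of the whole population."""
    rng = random.Random(seed)
    n = int(pop_size * sample_frac)
    sample_sq_err, full_sq_err = [], []
    for _ in range(trials):
        truth = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
        true_mean = statistics.fmean(truth)
        # Clean but small: exact values for a random 1% of the population
        clean_sample = rng.sample(truth, n)
        # Messy but complete: every value observed with zero-mean noise
        noisy_full = [x + rng.gauss(0.0, noise_sd) for x in truth]
        sample_sq_err.append((statistics.fmean(clean_sample) - true_mean) ** 2)
        full_sq_err.append((statistics.fmean(noisy_full) - true_mean) ** 2)
    return rmse(sample_sq_err), rmse(full_sq_err)

rmse_sample, rmse_full = simulate()
print(f"clean 1% sample RMSE:  {rmse_sample:.4f}")
print(f"noisy full data RMSE:  {rmse_full:.4f}")
```

With these settings the noisy full dataset gives the smaller error, even though each measurement is three times noisier than the spread of the true values. The result depends on the noise being unbiased: a systematic error in the messy data would not average out, however large the dataset.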


Benefits

While researchers are still debating the definitions and boundaries of Big Data in health, benefits of health-related Big Data have been demonstrated in several areas so far, namely:
• design of new therapies and candidate drugs,
• prevention of disease,
• identification of modifiable risk factors for disease, and
• design of interventions for health behavior change.


Conclusions

Big data denotes our capacity to gain insights from (in relative terms!) large amounts of data that we could not have had by just looking at samples.

Our difficulty in working with data has shaped our methods in the small data age.

As these limitations with respect to data diminish, we will have to rethink and adjust our scientific methods.

In return, we will gain a wealth of new insights, perhaps leading towards a new golden era of scientific discovery.

The power of Big Data demands, however, that we also be aware of its limitations and of the significant dangers of abusing it.


Ref: Mayer-Schönberger V, Cukier K. Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt; 2013. Chapter 2.


Q&A
