USING GPU AND POWER8 TO EXPLORE HOW GENOMES FOLD -...

Post on 05-Aug-2020

USING GPU AND POWER8 TO EXPLORE HOW GENOMES FOLD

Ido Machol

Aiden Lab

Baylor College of Medicine

Rice University

GTC 2015

THE HUMAN GENOME IS LONG!

…CGTTTACGAAAATCGCAAAACTTTCGATACCCATAGGCTACTGATCATACGACCGTTTACGAAAATCGAAACCTTTCCGATCTAGGCTAC…

3 BILLION Letters

2 METERS

Nucleus Cell

6 μm

Genomic length scales: 10 bp, 100 bp, 1 Kb, 10 Kb, 100 Kb, 1 Mb, 10 Mb, 100 Mb

SAME GENOME, DIFFERENT FUNCTIONS

PART I: TECHNOLOGY

MICROSCOPY & FLUORESCENT IN SITU HYBRIDIZATION (FISH)

CONTACT MAPPING

Exploring structure via proximity

Times in the Same Photo

Always (same person)

4-11 (lives nearby)

0-3 (lives far away)

FACEBOOK CONTACT MAP

Simpsons' Contact Map (Homer)

# of Pictures Together (color scale: 0 to 16)

2 0 1 2 1 0 1 0 0
0 3 2 1 0 0 0 0 0
1 2 16 6 5 4 11 1 1
2 1 6 8 6 3 4 0 0
1 0 5 6 8 4 5 1 0
0 0 4 3 4 5 5 0 0
1 0 11 4 5 5 11 1 1
0 0 1 0 1 0 1 2 1
0 0 1 0 0 0 1 1 1
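The idea behind the photo contact map can be sketched in a few lines of Python. This is only an illustration: the function name `contact_map`, the people other than Homer, and the sample photos are all made up for the example and are not the data behind the matrix on the slide.

```python
from itertools import combinations

def contact_map(photos, people):
    """Count, for every pair of people, the photos in which both appear."""
    index = {name: k for k, name in enumerate(people)}
    n = len(people)
    counts = [[0] * n for _ in range(n)]
    for photo in photos:
        present = sorted({index[name] for name in photo if name in index})
        for i in present:
            counts[i][i] += 1          # diagonal: photos containing this person
        for i, j in combinations(present, 2):
            counts[i][j] += 1
            counts[j][i] += 1          # keep the map symmetric
    return counts

# Hypothetical sample data, not taken from the slide.
people = ["Homer", "Marge", "Bart", "Ned"]
photos = [
    {"Homer", "Marge", "Bart"},
    {"Homer", "Marge"},
    {"Homer", "Ned"},
    {"Bart"},
]
cmap = contact_map(photos, people)
```

High off-diagonal counts suggest two people live close together; Hi-C applies the same reasoning to pairs of genomic positions.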

Hi-C

3D Genome Sequencing

Hi-C: genome-wide Chromosome Conformation Capture

Erez Lieberman-Aiden, Nynke van Berkum et al. Science 2009

Computational Challenge I

Alignment, calculate contacts

…CTGCCTCCTCGCGG CCGCGTGGTGGCAG…

DNA Reference Sequence

Align to reference genome

Alignment is not trivial

Reference: …CTGCC_TCCTCGCGG…

Deletion: …CTGC__TCCTCGCGG…

Substitution: …CTGAA_TCCTCGCGG…

Insertion: …CTGCCCTCCTCGCGG…
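One way to see why these cases make alignment non-trivial is to count the edits that separate each read from the reference. The sketch below uses the classic Levenshtein edit distance as a stand-in; it is not the aligner used in this work, and the gap characters from the slide are dropped so the strings are plain sequences.

```python
def edit_distance(reference, read):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(read) + 1))
    for i, ref_base in enumerate(reference, start=1):
        curr = [i]
        for j, read_base in enumerate(read, start=1):
            cost = 0 if ref_base == read_base else 1
            curr.append(min(prev[j] + 1,          # reference base missing from the read (deletion)
                            curr[j - 1] + 1,      # extra base in the read (insertion)
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

# Sequences from the slide with the "_" gap markers removed.
reference = "CTGCCTCCTCGCGG"
reads = {
    "deletion": "CTGCTCCTCGCGG",
    "substitution": "CTGAATCCTCGCGG",
    "insertion": "CTGCCCTCCTCGCGG",
}
distances = {kind: edit_distance(reference, read) for kind, read in reads.items()}
```

A real aligner also has to find where in a 3-billion-letter reference each read belongs, which is what makes the problem expensive at scale.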

Computational HW and SW setup

8 x Power8 Servers

2 Sockets x 12 cores x 8 threads = 192 virtual cores each

Total of 1,536 virtual cores in cluster.

• 4 X 256GB RAM

• 2 X 1024GB RAM

• 2 X 256GB RAM with NVIDIA K40 Tesla

Model 8247-22L and 8247-42L

Byte order: Bi-endian

Rice RSCG PowerOmics hardware

GPUs

Tesla K40

Stream Processors: 2880

Core Clock: 745MHz

Boost Clock(s): 810MHz, 875MHz

Memory Clock: 6GHz GDDR5

VRAM: 12GB

Single Precision: 4.29 TFLOPS

Double Precision: 1.43 TFLOPS (1/3)

Storage

• IBM GPFS Storage Server (Model 24)

• 4 X JBOD

• Total of 361 TB fast scratch disk space

• (Up to 1.4 petabytes)

• FlashSystem 840 20TB Flash

Interconnect:

• 56 Gigabit 36-port FDR IB switch

• Mellanox Next gen Connect-IB FDR Host Channel Adapters

• 10-Gigabit Ethernet

• Internet 2

Rice RSCG PowerOmics software

Cluster management

• IBM Platform LSF, PPM, PAC, PowerKVM 2.1.0

Operating system

• Ubuntu 14.04 (little-endian) + Red Hat Enterprise Linux 7.0

Storage

• Mellanox OFED 2.4-1

• GPFS 4.1

Scientific

• BioBuilds 2014.11

Challenge: Alignment of billions of contacts

High Resolution Map: 13 billion reads forming 5 billion contacts in the map

IBM Power8 Cluster: 675 read alignments / second / CPU core

192 cores

About 27 hours
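The running time follows from the three numbers on this slide. A minimal check, assuming perfect scaling across cores (the function name `alignment_hours` is just for illustration):

```python
def alignment_hours(total_reads, reads_per_second_per_core, cores):
    """Wall-clock hours to align all reads, assuming ideal scaling across cores."""
    seconds = total_reads / (reads_per_second_per_core * cores)
    return seconds / 3600

# 13 billion reads, 675 reads/second/core, 192 cores (figures from the slide).
hours = alignment_hours(13e9, 675, 192)
```

This gives a little under 28 hours, consistent with the slide's figure of about 27 hours.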

…CTGCCTCCTCGCGG…

Chromosome

Hi-C GENERATES GENOME-WIDE CONTACT MAPS

[Figure: contact maps for the whole genome and for chromosome 8, with loci A and B marked on both axes. Color scale: 0 to 700 reads per (250 kb)² pixel.]
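Each pixel of such a map is a count of aligned contacts whose two ends fall into a given pair of genomic bins. The sketch below shows that binning step for a single chromosome at 250 kb resolution; the function name `bin_contacts` and the positions are hypothetical and only illustrate the idea.

```python
from collections import Counter

def bin_contacts(contacts, bin_size=250_000):
    """Count contacts per pixel, where a pixel is a pair of genomic bins."""
    counts = Counter()
    for pos_a, pos_b in contacts:
        # Sort the two bins so (i, j) and (j, i) land in the same pixel.
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1
    return counts

# Hypothetical contact positions (in base pairs) on one chromosome.
contacts = [(100_000, 900_000), (120_000, 800_000),
            (600_000, 610_000), (950_000, 50_000)]
pixels = bin_contacts(contacts)
```

Here three of the four contacts fall into the pixel joining bin 0 and bin 3, and one falls on the diagonal at bin 2.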

PART II: BIOLOGY

Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Erez Lieberman-Aiden, Nynke van Berkum et al. Science 2009

Genomic analysis of compartments

Genes

Chromosome 14, 1 Mb² pixels

The two compartments correlate strongly with open and closed chromatin

100 kb² pixels

The whole genome is plaid

[Figure: one panel per chromosome, 1 to 22 and X.]

A TOUR OF THE NUCLEUS

Organization observed at three distinct scales

NUCLEAR SCALE: 100Mb

CHROMOSOME SCALE: 10Mb

MEGABASE SCALE: 1Mb

A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping Suhas Rao*, Miriam Huntley*, Neva Durand, Elena Stamenova, Ivan Bochkov, James Robinson, Adrian Sanborn, Ido Machol, Arina Omer, Eric Lander, Erez Lieberman Aiden Cell 2014

More Contacts, Higher Resolution

30 million contacts

5 billion contacts

Detection of Chromatin Loops Genome-wide via Hi-C

[Figure: loci A and B with neighboring positions A-2ε, A-ε, A+ε, A+2ε and B-2ε, B-ε, B+ε, B+2ε.]

Into the loops

L3 L2 L1

L1 L2 L3

Computational Challenge III

Loop calling

Which one shows a loop?

3D Map Features

• Apply 4 filters for each pixel.

• 20-gigapixel image.

• Millions of parallel filters.

NVIDIA Tesla GPU 200x faster than previous CPU implementation – from 3 weeks to 3 hours.
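The slides do not spell out the four filters, so the sketch below shows only one simplified, hypothetical filter of the same general kind: it scores a pixel by comparing it with the mean of a ring of surrounding pixels. It is plain CPU Python rather than GPU code, and the function name `local_enrichment` and its `inner` and `outer` parameters are assumptions made for the example.

```python
def local_enrichment(matrix, row, col, inner=1, outer=3):
    """Ratio of a pixel to the mean of a square ring of pixels around it."""
    total, count = 0.0, 0
    for r in range(row - outer, row + outer + 1):
        for c in range(col - outer, col + outer + 1):
            if not (0 <= r < len(matrix) and 0 <= c < len(matrix[0])):
                continue                      # ring may run off the edge of the map
            if max(abs(r - row), abs(c - col)) <= inner:
                continue                      # skip the pixel and its immediate neighbours
            total += matrix[r][c]
            count += 1
    background = total / count if count else 0.0
    # Returns 0.0 when there is no background signal to compare against.
    return matrix[row][col] / background if background else 0.0

# Hypothetical 9x9 map: uniform background with one bright pixel in the centre.
size = 9
matrix = [[2.0] * size for _ in range(size)]
matrix[4][4] = 20.0
peak_score = local_enrichment(matrix, 4, 4)   # bright pixel against its surroundings
flat_score = local_enrichment(matrix, 4, 6)   # ordinary pixel near the bright one
```

Because each pixel's score depends only on a small neighborhood, every pixel can be evaluated independently, which is the property that lets this kind of work spread across thousands of GPU threads.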

10,000 Loops in the Human Genome

Loops turn genes on and off

Lung fibroblast cell Lymphoblastoid cell

SUMMARY OF COMPUTATIONAL EFFORTS

Sequence alignment proportions

Genome data production and analysis

• In about 36 months we produced the sequence equivalent of more than 2200x coverage of the human genome.

• For reference, the Human Genome Project produced 12.6x coverage, over the span of 4 years.

Storage

• We currently have 25 TB of raw sequenced data.

• We sequence 1 TB each month.

• After processing the raw sequenced data, we store 3 TB of raw and processed data.

Computational speed up

Cluster processing

• We produce 1 Billion reads per month.

• Power8 is capable of processing alignments at 675 reads/second per CPU core.

• 50% faster than the cluster system we were using before.

• At this speed, we consume about 17 “CPU days” per month.

• With the Power8 cluster having over 192 cores, the jobs complete processing in about 2 hours.

GPU processing

• Using an NVIDIA Tesla K40, we run our loop calling algorithm over a 20-gigapixel map 200x faster than the CPU implementation.

• Instead of 3 weeks we get the work done in only 3 hours.

aidenlab.org/juicebox

Aiden Lab

Erez Lieberman Aiden

Suhas Rao

Miriam Huntley

Neva C Durand

Elena Stamenova

Adrian Sanborn

Arina Omer

Ivan Bochkov

Olga Dudchenko

Robert Nnake

Su-Chen Huang

Muhammad Shamim

Chris Lui

Sarah Nyquist

Sanjit Batra

Ashok Cutkosky

Najeeb Tarazi

Jian Li

Broad Institute

Eric Lander

Jim Robinson

GREETINGS FROM ANOTHER DIMENSION