
ChIP-seq QC

Xiaole Shirley Liu

STAT115, STAT215

Initial QC

• FASTQC
• Mappability
• Uniquely mapped reads
• Uniquely mapped locations
• Uniquely mapped locations / Uniquely mapped reads
• Good to keep one read / location in peak calling
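The ratio of uniquely mapped locations to uniquely mapped reads, and the "one read per location" filter, can be sketched as follows. This is a minimal illustration, assuming each uniquely mapped read is represented as a hypothetical (chromosome, position, strand) tuple; the function names are not taken from any particular tool.

```python
def location_to_read_ratio(reads):
    """Uniquely mapped locations / uniquely mapped reads.

    reads: iterable of (chrom, position, strand) tuples, one per read.
    Values near 1 mean few reads share a location.
    """
    reads = list(reads)
    if not reads:
        raise ValueError("no reads given")
    return len(set(reads)) / len(reads)


def keep_one_read_per_location(reads):
    """Keep only the first read seen at each (chrom, position, strand)."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


# Toy input: two reads share chr1:100(+), so 3 locations for 4 reads.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"),
         ("chr1", 100, "-"), ("chr2", 50, "+")]
ratio = location_to_read_ratio(reads)          # 0.75
filtered = keep_one_read_per_location(reads)   # 3 reads remain
```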


Peak Calls

• Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size)
• ChIP-Seq show local biases in the genome
– Chromatin and sequencing bias
– 200-300bp control windows have too few tags
– But can look further

Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k)

[Figure: ChIP and Control tracks with 300bp, 1kb, 5kb, and 10kb windows]

http://liulab.dfci.harvard.edu/MACS/

Zhang et al, Genome Bio, 2008
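The dynamic local background above can be illustrated with a short sketch. It is not the MACS implementation: it assumes control tag counts have already been scaled to the ChIP sequencing depth, that `lambda_bg` is expressed as expected tags per candidate region, and that each wider window's count is rescaled to the candidate region width. All function and parameter names are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail.

    Fine for illustration; very small tails lose precision this way.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def dynamic_lambda(lambda_bg, control_window_counts, region_width):
    """max(lambda_BG, lambda_1k, lambda_5k, lambda_10k, ...).

    control_window_counts: {window_size_bp: control tags in that window}.
    Each count is rescaled to the expected tags in region_width bp.
    """
    rates = [lambda_bg]
    for size, count in control_window_counts.items():
        rates.append(count * region_width / size)
    return max(rates)


# Toy numbers: genome-wide rate of 2 tags per 200bp region,
# but the 5kb control window is locally much denser.
lam_local = dynamic_lambda(2.0, {1000: 5, 5000: 100, 10000: 120}, 200)  # 4.0
p_local = poisson_sf(15, lam_local)  # p-value for 15 ChIP tags, local rate
p_global = poisson_sf(15, 2.0)       # same count against lambda_BG only
```

Taking the maximum makes the test conservative: a region sitting in a locally tag-rich stretch of the control needs more ChIP tags to reach the same p-value.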

Peak Call Statistics

• P-value and FDR
• Simulation: random sampling of reads?
• FDR = A / B, BH correction or Qvalue
• P-value / FDR changes with sequencing depth
• Fold change does not


MAT: Quality Control

[Figure: Background vs. Enriched DNA (<1% enriched), with regions A and B marked]
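Two of the FDR options listed above can be sketched in a few lines. The first function is a plain Benjamini-Hochberg adjustment. The second is one possible reading of "FDR = A / B", assuming A is the number of control (background) calls and B the number of ChIP calls passing the same cutoff; that interpretation and both function names are assumptions for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def empirical_fdr(chip_scores, control_scores, cutoff):
    """A / B: control calls over ChIP calls at the same score cutoff."""
    b = sum(score >= cutoff for score in chip_scores)
    a = sum(score >= cutoff for score in control_scores)
    return a / b if b else 0.0


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
fdr = empirical_fdr([10, 20, 30, 40], [5, 25], cutoff=20)  # 1 / 3
```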

ChIP-seq QC

• Number of peaks with good FDR and fold change
• FRiP score:
– Fraction of reads in peaks
– Often higher for histone modifications than transcription factors
– Often increases slightly with increasing read depth
• Overlap with union of peaks in public DNase-seq data
– Peaks from a working ChIP-seq typically show > 70% overlap with the union DHS
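The FRiP score can be computed directly from read positions and peak intervals. The sketch below assumes reads are reduced to a single position (for example the 5' end), that peaks on each chromosome are non-overlapping half-open intervals, and that everything fits in memory; the data layout and function name are hypothetical.

```python
import bisect


def frip(read_positions, peaks):
    """Fraction of reads in peaks.

    read_positions: {chrom: [position, ...]}
    peaks: {chrom: [(start, end), ...]}, half-open and non-overlapping.
    """
    total = 0
    in_peaks = 0
    for chrom, positions in read_positions.items():
        intervals = sorted(peaks.get(chrom, []))
        starts = [start for start, _ in intervals]
        for pos in positions:
            total += 1
            # Rightmost peak whose start is <= pos, if any.
            i = bisect.bisect_right(starts, pos) - 1
            if i >= 0 and pos < intervals[i][1]:
                in_peaks += 1
    if total == 0:
        raise ValueError("no reads given")
    return in_peaks / total


reads = {"chr1": [5, 150, 250, 999], "chr2": [10]}
peaks = {"chr1": [(100, 200), (240, 260)]}
score = frip(reads, peaks)  # 2 of 5 reads fall in peaks -> 0.4
```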


DNase-seq

• Captures all regulatory sequences in the prostate genome


Sabo et al, Nat Methods 2006; Thurman et al, Nat 2012

ChIP-seq QC

• Evolutionary conservation
– Can be used for ChIP QC
• Conserved sites more functional?
– Majority of functional sites not conserved


Odom et al, Nat Genet 2007

Enrichment Distribution

• CEAS (Shin et al, Bioinfo, 2009)
– Meta-gene profiles: TF and histone marks
– % of peaks at promoter, exons, introns, and distal intergenic sequences
– SitePro of signal at specific sites
• Replicate agreement: > 60% peak overlap or correlation > 0.6


ChIP-seq Downstream Analysis


Target Gene Assignment


Yeast TF Regulatory Network

[Figure: Protein and Gene nodes connected by Regulate and Transcribe edges]

Human TF Binding Distribution

• Most TF binding sites are outside promoters
• How to assign targets?
• Nearest distance?
• Binding within 10KB?
• Number of binding sites?
• Other knowledge?


Higher Order Chromatin Interactions

Chromatin conformation capture

Hi-C

Interactions follow exponential decay with distance

Lieberman-Aiden et al, Science 2009

How to Assign Targets for Enhancer Binding Transcription Factors?

• Regulatory potential: sum of binding sites weighted by distance to TSS with exponential decay

• Decay modeled from Hi-C experiments
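A regulatory potential of this kind can be sketched as a distance-weighted sum over binding sites. The decay length (10 kb) and the distance cap (100 kb) below are illustrative assumptions, not values fitted from Hi-C data, and the function name is hypothetical.

```python
import math


def regulatory_potential(tss, peak_summits,
                         decay_distance=10_000.0, max_distance=100_000):
    """Sum of binding sites weighted by exp(-distance / decay_distance).

    tss: TSS coordinate of the gene (same chromosome as the peaks).
    peak_summits: summit coordinates of binding sites.
    Sites farther than max_distance from the TSS are ignored.
    """
    score = 0.0
    for summit in peak_summits:
        distance = abs(summit - tss)
        if distance <= max_distance:
            score += math.exp(-distance / decay_distance)
    return score


# One site at the TSS, one 10 kb away, one too far to count.
rp = regulatory_potential(1_000, [1_000, 11_000, 500_000])
# rp == 1 + exp(-1)
```

Unlike a nearest-gene or fixed-window rule, this score lets several weak distal sites add up while still favoring sites close to the TSS.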


Direct Target Identification

• Binary decision?
• Rank product of regulatory potential and differential expression

• BETA
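The rank product idea can be sketched as follows: rank genes by regulatory potential and by differential expression, then multiply the two normalized ranks so that small products indicate likely direct targets. This is a simplified illustration, not the BETA code; ties are broken by input order, and the expression score is assumed to be "larger means more significantly changed".

```python
def ranks_descending(values):
    """Rank 1 for the largest value; ties broken by input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def rank_product(potentials, expression_scores):
    """(binding rank / n) * (expression rank / n) for each gene."""
    if len(potentials) != len(expression_scores):
        raise ValueError("inputs must have the same length")
    n = len(potentials)
    binding_ranks = ranks_descending(potentials)
    expression_ranks = ranks_descending(expression_scores)
    return [(binding_ranks[i] / n) * (expression_ranks[i] / n)
            for i in range(n)]


# Three toy genes: the second is weak on both criteria.
products = rank_product([2.5, 0.1, 1.0], [3.0, 0.5, 4.0])
# products is approximately [2/9, 1.0, 2/9]
```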


Is My Factor an Activator, Repressor, or Both?

• Most labs have differential expression profiling of transcription factor together with TF ChIP-seq

• Do genes with higher regulatory potential show more up- or down-expression than all the genes in the genome?
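One simple way to ask this question is to compare the fraction of up- and down-regulated genes among the genes with the highest regulatory potential against the same fractions genome-wide. The sketch below is an assumption-laden illustration: the log fold-change threshold, the `top_n` cutoff, and the function names are hypothetical, and a real analysis would add a statistical test on top of the comparison.

```python
def direction_fractions(log_fold_changes, indices, threshold):
    """Return (fraction up, fraction down) among the given genes."""
    indices = list(indices)
    up = sum(log_fold_changes[i] >= threshold for i in indices)
    down = sum(log_fold_changes[i] <= -threshold for i in indices)
    return up / len(indices), down / len(indices)


def compare_top_targets(potentials, log_fold_changes, top_n, threshold=1.0):
    """Compare top-potential genes with all genes in the genome."""
    n = len(potentials)
    if n != len(log_fold_changes):
        raise ValueError("inputs must have the same length")
    if not 0 < top_n <= n:
        raise ValueError("top_n must be between 1 and the number of genes")
    order = sorted(range(n), key=lambda i: -potentials[i])
    return {
        "top": direction_fractions(log_fold_changes, order[:top_n], threshold),
        "all": direction_fractions(log_fold_changes, range(n), threshold),
    }


potentials = [5.0, 4.0, 0.2, 0.1, 0.0, 0.3]
log_fold_changes = [2.0, 1.5, -0.2, 0.1, -1.2, 0.0]
result = compare_top_targets(potentials, log_fold_changes, top_n=2)
# result["top"] == (1.0, 0.0): both high-potential genes go up,
# which would point toward an activator in this toy example.
```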


ChIP-chip/seq Motif Finding

• ChIP-chip gives 10-5000 binding regions ~200-1000bp long. Precise binding motif?
– Raw data is like perfect clustering, plus enrichment values
• MDscan
– High ChIP ranking => true targets, contain more sites

– Search TF motif from highest ranking targets first (high signal / background ratio)

– Refine candidate motifs with all targets


Similarity Defined by m-match

For a given w-mer and any other random w-mer

TGTAACGT 8-mer

TGTAACGT matched 8

AGTAACGT matched 7

TGCAACAT matched 6

TGACACGG matched 5

AATAACAG matched 4

m-matches for TGTAACGT

Pick a reasonable m to call two w-mers similar
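The m-match definition above amounts to counting identical positions between two w-mers of equal length. A minimal sketch, with hypothetical function names, reproducing the counts listed for TGTAACGT:

```python
def count_matches(wmer_a, wmer_b):
    """Number of positions at which two equal-length w-mers agree."""
    if len(wmer_a) != len(wmer_b):
        raise ValueError("w-mers must have the same length")
    return sum(a == b for a, b in zip(wmer_a, wmer_b))


def is_m_match(wmer_a, wmer_b, m):
    """True if the two w-mers agree at m or more positions."""
    return count_matches(wmer_a, wmer_b) >= m


reference = "TGTAACGT"
for other in ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]:
    print(other, count_matches(reference, other))
# prints 8, 7, 6, 5, 4 for the five w-mers, matching the list above
```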


MDscan Seeds

[Figure: ChIP-chip selected upstream sequences, ranked by enrichment (higher enrichment at top). A 9-mer seed (ATTGCAAAT) and its m-matches (TTTGCAAAT, TTTGCGAAT) form a seed motif pattern. Other w-mer groups shown: TTGCAAATC, TTGCGAATA, TTGCAAATT, TTGCCCATC; CAAATCCAA, CAAATCCAA, GAAATCCAC; GCAAATCCA, GCAAATTCG, GCAAATCCA, GGAAATCCA, GGAAATCCT; TGCAAATCC, TGCAAATTC; GCCACCGT, ACCACCGT, ACCACGGT, GCCACGGC, …]
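The seed step can be sketched as: take every w-mer in the top-ranked sequences as a seed, and collect all w-mers in those sequences that are m-matches to it. The sketch below is a simplification for illustration only. It scans one strand, and it ranks seeds by the number of m-matches as a stand-in for the motif scoring function that MDscan actually uses; all names are hypothetical.

```python
def count_matches(wmer_a, wmer_b):
    """Number of positions at which two equal-length w-mers agree."""
    return sum(a == b for a, b in zip(wmer_a, wmer_b))


def wmers(sequence, w):
    """All overlapping w-mers of a sequence, left to right."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def seed_groups(top_sequences, w, m):
    """Map each seed w-mer to the list of its m-matches."""
    candidates = [k for seq in top_sequences for k in wmers(seq, w)]
    groups = {}
    for seed in set(candidates):
        groups[seed] = [k for k in candidates if count_matches(seed, k) >= m]
    return groups


def best_seeds(groups, n_best):
    """Seeds with the most m-matches first (a stand-in for motif scoring)."""
    return sorted(groups, key=lambda seed: (-len(groups[seed]), seed))[:n_best]


# The three 9-mers from the first seed group on the slide.
top = ["ATTGCAAAT", "TTTGCAAAT", "TTTGCGAAT"]
groups = seed_groups(top, w=9, m=7)
# Each 9-mer is a 7-match of the other two, so every group has 3 members.
```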


Update Motifs With Remaining Seqs

[Figure: Seed1 m-matches; targets ordered from extreme high rank to all ChIP-selected targets]

Refine the Motifs

[Figure: Seed1 m-matches; targets ordered from extreme high rank to all ChIP-selected targets]

Further Refine Motifs

• Could also be used to examine known motif enrichment

• Is motif enrichment correlated with ChIP-seq enrichment?

• Is motif more enriched in peak summits than peak flanks?

• Motif analysis could identify transcription factor partners of ChIP-seq factors


Estrogen Receptor

• Carroll et al, Cell 2005
• Overactive in > 70% of breast cancers
• Where does it go in the genome?
• ChIP-chip on chr21/22, motif and expression analysis found its “pioneering factor” FoxA1

[Figure: unknown partner TF (TF??) shown next to ER]

Estrogen Receptor (ER) Cistrome in Breast Cancer

• Carroll et al, Nat Genet 2006

• ER may function far away (100-200KB) from genes

• Only 20% of ER sites have PhastCons > 0.2

• ER has different effect based on different collaborators

[Figure: ER with collaborators AP1 and NRIP]


Cell Type-Specific Binding

• The same TF binds to very different locations in different tissues and conditions. Why?
• TF concentration?
• Collaborating factors, especially pioneering factors
• Interesting observations about pioneering factors


Summary

• ChIP-seq identifies genome-wide in vivo protein-DNA interaction sites

• ChIP-seq peak calling needs to shift reads and calculate correct enrichment and FDR
• Functional analysis of ChIP-seq data:
– Strong vs weak binding, conserved vs non-conserved

– Target identification

– Motif analysis

• Cell type-specific binding → epigenetics