Signal processing
Example data – ChIP-Seq
T.N. Siegel, D.R. Hekstra, L.E. Kemp, L.M. Figueiredo, J.E. Lowell, D. Fenyö, X. Wang, S. Dewell, G.A. Cross, "Four histone variants mark the boundaries of polycistronic transcription units in Trypanosoma brucei", Genes Dev. 23 (2009) 1063-1076.
Example Data: Time-Resolved ChIP-chip
[Figure labels: α-factor release; Chromosome 16]
M.D. Sekedat, D. Fenyö, R.S. Rogers, A.J. Tackett, J.D. Aitchison, B.T. Chait, "GINS motion reveals replication fork progression is remarkably uniform throughout the yeast genome", Mol Syst Biol. 6 (2010) 353.
Example data – MALDI-TOF
[Figure: MALDI-TOF spectrum, intensity vs m/z, shown over m/z 1000-4500 and zoomed to m/z 2280-2400, 1300-1460, 1444.0-1458.0, and 2378.0-2394.0]
Peptide intensity vs m/z
Fragment intensity vs m/z
Example data – ESI-LC-MS/MS
[Figure: LC-MS map, time vs m/z]
[Figure: MS/MS spectrum, % relative abundance (0-100) vs m/z (250-1000), with the [M+2H]2+ ion and fragment peaks labeled]
Peptide intensity vs m/z vs time
Example Data: Super-Resolution Microscopy
Dylan Reid and Eli Rothenberg
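Sine
[Figure: sine wave with amplitude and wavelength marked; right triangle with legs a and b and hypotenuse c]
$\sin(\theta) = a/c$
Sine and Cosine
[Figure: right triangle with legs a and b and hypotenuse c]
$\sin(\theta) = a/c$, $\cos(\theta) = b/c$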
Two Frequencies
Fourier Transform
$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx$
$e^{2\pi i \theta} = \cos(2\pi\theta) + i \sin(2\pi\theta)$
Fourier Transform
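from numpy import *
x = 2.0*pi*arange(1000.0)/100000.0   # 1000 sample positions
sin1 = sin(1000.0*x)                 # low-frequency component
sin2 = 0.2*sin(10000.0*x)            # weaker component at ten times the frequency
sin12 = sin1 + sin2
fft12 = fft.rfft(sin12)              # Fourier transform of the real-valued signal
[Figure; x-axis: frequency]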
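To read the spectrum, each output bin of the transform needs a frequency. The sketch below is one possible way to do that for the two-sine example; the use of `np.fft.rfftfreq` and the names `dx`, `freqs`, `amplitude` and `top2` are choices made here, not taken from the slides.

```python
import numpy as np

# Same two-sine signal as on the slide, written with an explicit namespace.
n = 1000
x = 2.0 * np.pi * np.arange(n) / 100000.0
sin12 = np.sin(1000.0 * x) + 0.2 * np.sin(10000.0 * x)

fft12 = np.fft.rfft(sin12)
dx = x[1] - x[0]                      # sample spacing along x
freqs = np.fft.rfftfreq(n, d=dx)      # cycles per unit of x for each rfft bin
amplitude = 2.0 * np.abs(fft12) / n   # amplitude estimate, exact only for sines that fall on a bin

# Frequencies of the two largest bins, in increasing order.
top2 = np.sort(freqs[np.argsort(amplitude)[-2:]])
```

Both sines complete a whole number of periods in the 1000 samples, so each one lands in a single bin and the two largest bins should sit at 1000/(2π) and 10000/(2π).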
Inverse Fourier Transform
$f(x) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i x \xi}\, d\xi$
[Figure; x-axis: frequency]
Inverse Fourier Transform
from numpy import *
x = 2.0*pi*arange(1000.0)/100000.0
sin1 = sin(1000.0*x)
sin2 = 0.2*sin(10000.0*x)
sin12 = sin1 + sin2
fft12 = fft.rfft(sin12)
sin12_ = fft.irfft(fft12, len(sin12))   # inverse transform recovers the original signal
[Figure; x-axis: frequency]
Inverse Fourier Transform
[Figure; x-axis: frequency]
A Peak
[Figure: a peak (y-axis: intensity) annotated with its centroid, full width at half maximum (FWHM), area, height, and maximum]
Peak descriptors: mean, variance, skewness, kurtosis
Mean and variance
A peak is defined by $f(x)$ and $\sum_x f(x) = 1$
Mean: $\mu = \sum_x x\, f(x)$
Variance: $\sigma^2 = \sum_x (x - \mu)^2 f(x)$
Skewness and kurtosis
Skewness: $\sum_x (x - \mu)^3 f(x) / \sigma^3$
Kurtosis: $\sum_x (x - \mu)^4 f(x) / \sigma^4 - 3$
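The four moments can be computed directly from a sampled peak. This is a minimal sketch, assuming the peak is given as arrays `x` and `f` and is normalized inside the function; `peak_moments` is a hypothetical helper name, and the Gaussian test peak is only an illustration.

```python
import numpy as np

def peak_moments(x, f):
    """Mean, variance, skewness and excess kurtosis of a sampled peak f(x)."""
    p = f / f.sum()                      # normalize so that sum(p) == 1
    mean = np.sum(x * p)
    var = np.sum((x - mean) ** 2 * p)
    sigma = np.sqrt(var)
    skewness = np.sum((x - mean) ** 3 * p) / sigma ** 3
    kurtosis = np.sum((x - mean) ** 4 * p) / sigma ** 4 - 3.0
    return mean, var, skewness, kurtosis

x = np.linspace(-1.0, 1.0, 1001)
f = np.exp(-(x - 0.2) ** 2 / (2 * 0.1 ** 2))   # Gaussian test peak, x0 = 0.2, s = 0.1
mean, var, skewness, kurtosis = peak_moments(x, f)
```

For this test peak the mean should come out close to 0.2, the variance close to 0.01, and both skewness and kurtosis close to 0.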
A Gaussian Peak
def gaussian(x, x0, s):
    return exp(-(x-x0)**2/(2*s**2))
x = linspace(-1, 1, 1000)
y = gaussian(x, 0, 0.1)
ffty = fft.rfft(y)
[Figure; x-axis: frequency]
A Gaussian Peak
Skewness = 0
Kurtosis = 0
$\mathrm{FWHM} = 2\sqrt{2 \ln 2}\,\sigma$
$\mathrm{area} = \mathrm{height} \cdot \sigma \sqrt{2\pi}$
[Figure; x-axis: frequency]
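The FWHM and area relations for a Gaussian peak can be checked numerically. The sketch below assumes a finer grid than the slides use (20001 points, chosen here so that the measured width is accurate); the variable names are illustrative.

```python
import numpy as np

def gaussian(x, x0, s):
    return np.exp(-(x - x0) ** 2 / (2 * s ** 2))

s = 0.1
x = np.linspace(-1.0, 1.0, 20001)
y = gaussian(x, 0.0, s)

# FWHM: width of the region where the peak is at or above half its height.
above = x[y >= 0.5 * y.max()]
fwhm_measured = above[-1] - above[0]
fwhm_formula = 2.0 * np.sqrt(2.0 * np.log(2.0)) * s

# Area: simple Riemann sum, compared with height * s * sqrt(2*pi).
dx = x[1] - x[0]
area_measured = y.sum() * dx
area_formula = y.max() * s * np.sqrt(2.0 * np.pi)
```

The measured width can differ from the formula by up to about two grid steps, since it is read off the sampled points.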
Peak with a longer tail
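$\mathrm{FWHM} = 2\gamma$
$\mathrm{area} = \pi\gamma \cdot \mathrm{height}$
$f(x) = \frac{1}{\pi\gamma}\,\frac{1}{1 + \left(\frac{x - x_0}{\gamma}\right)^2}$
[Figure; x-axis: frequency]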
A skewed peak
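from scipy.special import erf   # erf is not provided by numpy
def pdf(x):                     # standard normal density
    return 1/sqrt(2*pi) * exp(-x**2/2)
def cdf(x):                     # standard normal cumulative distribution
    return (1 + erf(x/sqrt(2))) / 2
def skew(x, e=0, w=1, a=0):     # skew-normal density: location e, scale w, shape a
    t = (x-e) / w
    return 2 / w * pdf(t) * cdf(a*t)
[Figure; x-axis: frequency]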
Normal noise
x = linspace(-1, 1, 1000)
y = 0.2*random.normal(size=len(x))
If the noise is not normally distributed, try to find a transform that makes it normal.
[Figure; x-axis: frequency]
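As one example of such a transform, lognormal noise becomes normal after taking the logarithm. The sketch below uses the sample skewness as a rough normality check; the seed, the sample size, and the helper `sample_skewness` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = 0.2 * rng.lognormal(size=100000)   # lognormal noise, scaled by 0.2

def sample_skewness(v):
    """Third central moment divided by the 3/2 power of the second."""
    d = v - v.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

skew_raw = sample_skewness(noise)          # strongly right-skewed
skew_log = sample_skewness(np.log(noise))  # close to 0 after the log transform
```

A skewness near zero does not prove normality on its own, but a large value is a quick sign that a transform is worth trying.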
Lognormal noise
x = linspace(-1, 1, 1000)
y = 0.2*random.lognormal(size=len(x))
[Figure; x-axis: frequency]
Skewed noise
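n = len(x)
x_test = random.uniform(-1.0, 1.0, size=10*n)   # candidate noise values
y = random.uniform(0.0, 1.0, size=10*n)         # uniform heights for rejection sampling
yskew = skew(x_test, -0.1, 0.2, 10)
yskew = yskew / max(yskew)                      # scale the density to a maximum of 1
yn_skew = x_test[y < yskew][:n]                 # keep accepted candidates, at most n of them
[Figure; x-axis: frequency]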
Gaussian peak with normal noise
[Figure: three panels; x-axis: frequency]
Removing High Frequencies
[Figure; x-axis: frequency]
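One simple way to remove high frequencies is to zero the upper part of the spectrum and transform back. This is a sketch under assumptions: the cutoff index `keep=20`, the seed, and the helper name `lowpass_fft` are chosen here for illustration and are not from the slides.

```python
import numpy as np

def lowpass_fft(y, keep):
    """Zero every rfft coefficient with index >= keep and transform back."""
    spectrum = np.fft.rfft(y)
    spectrum[keep:] = 0.0
    return np.fft.irfft(spectrum, len(y))

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 1000)
peak = np.exp(-x ** 2 / (2 * 0.1 ** 2))          # Gaussian peak, s = 0.1
noisy = peak + 0.2 * rng.normal(size=len(x))     # peak with normal noise
smooth = lowpass_fft(noisy, keep=20)             # cutoff index is a hypothetical choice

# Root mean square error against the noise-free peak, before and after filtering.
err_before = np.sqrt(np.mean((noisy - peak) ** 2))
err_after = np.sqrt(np.mean((smooth - peak) ** 2))
```

The cutoff has to stay above the frequencies that make up the peak itself; a narrower peak would need a larger `keep`.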
Convolution
http://en.wikipedia.org/wiki/Convolution
$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$
Convolution describes the response of a linear and time-invariant system to an input signal.
It equals the inverse Fourier transform of the pointwise product in frequency space.
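The link between convolution and the pointwise product in frequency space can be verified on random data. A minimal sketch, assuming two short real-valued arrays; the zero-padding length `n` is needed so that the circular convolution implied by the FFT matches the linear one.

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.normal(size=64)
g = rng.normal(size=16)

# Direct linear convolution.
direct = np.convolve(f, g, mode="full")

# Same result via the frequency domain: zero-pad both inputs to the full
# output length, multiply the transforms pointwise, and transform back.
n = len(f) + len(g) - 1
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)
```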
Smoothing by convolution
Smoothing
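w = ones(2*width+1, 'd')                      # uniform window of 2*width+1 points
y_smooth = convolve(w/w.sum(), y, 'valid')    # moving average; 2*width points shorter than y
[Figure: three smoothing examples; axes: intensity, frequency]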
Adaptive Background Correction (unsharp masking)
$I'(l, d, w) = I(l) - \frac{d}{2w+1} \sum_{k=l-w}^{l+w} I(k)$
Unsharp masking
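[Figure: original signal]
wi = linspace(1, window_len, window_len)
w = 1 / (2*r_[wi[::-1], 0, wi] + 1)    # weights fall off as 1/(2|k|+1) from the window centre
# 'same' keeps the smoothed background the same length as x so it can be subtracted
x_ = x - d*convolve(w/w.sum(), x, 'same')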
Adaptive Background Correction
Smoothing and Adaptive Background Correction
Savitzky-Golay smoothing
[Figure panels: polynomial order = 3, 5, 7; bin size = 25, 75, 150]
Background
[Figure: two panels; x-axis: frequency]
Background Subtraction Using Smoothing
[Figure panels: bin size = 100, 200, 300; for each, the smoothing result and the background-subtracted signal]
Root Mean Square Deviation (RMSD)
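$\mathrm{RMSD}(l) = \sqrt{\sum_{|k-l| \le w/2} \left(I(k) - \bar{I}\right)^2 / w}$, where $\bar{I}$ is the mean intensity in the window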
The Root Mean Square Deviation (RMSD) is often constant for the noise and larger for the peak if the window size is approximately the size of the peak.
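A sliding-window RMSD can be computed from two moving averages. This sketch assumes the RMSD is taken about the local window mean; the helper name `rolling_rmsd`, the peak parameters, and the window of 40 points are illustrative choices.

```python
import numpy as np

def rolling_rmsd(y, w):
    """RMSD about the local mean in a sliding window of w points ('valid' region only)."""
    kernel = np.ones(w) / w
    mean = np.convolve(y, kernel, mode="valid")
    mean_sq = np.convolve(y ** 2, kernel, mode="valid")
    # Variance = E[y^2] - E[y]^2; clip tiny negative values caused by rounding.
    return np.sqrt(np.clip(mean_sq - mean ** 2, 0.0, None))

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 1000)
y = 0.2 * rng.normal(size=len(x)) + 3.0 * np.exp(-x ** 2 / (2 * 0.02 ** 2))

w = 40                       # window roughly the size of the peak (assumption)
rmsd = rolling_rmsd(y, w)
center = len(rmsd) // 2      # window centred on the peak at x = 0
```

Away from the peak the RMSD stays near the noise level of 0.2, while windows that cover the peak give a clearly larger value.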
Background Subtraction using RMSD
[Figure panels: bin size = 100, 200, 300; for each, RMSD and intensity]
Convolution, Cross-correlation, and Autocorrelation
http://en.wikipedia.org/wiki/Convolution
Convolution describes the response of a linear and time-invariant system to an input signal.
It equals the inverse Fourier transform of the pointwise product in frequency space.
Cross-correlation is a measure of similarity of two signals.
It can be used for finding a shift between two signals.
Auto-correlation is the cross-correlation of a signal with itself.
It can be used for finding periodic signals obscured by noise.
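Finding a shift between two signals with cross-correlation can be sketched as follows. The spike train, the shift of 17 samples, and the noise level are made up for the example; `np.correlate` in `"full"` mode returns one value per lag from -(n-1) to n-1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
signal = np.zeros(n)
signal[rng.choice(n - 100, size=10, replace=False) + 50] = 1.0   # ten random spikes

true_shift = 17
shifted = np.roll(signal, true_shift) + 0.05 * rng.normal(size=n)

# Full cross-correlation; output index i corresponds to lag i - (n - 1).
xcorr = np.correlate(shifted, signal, mode="full")
lags = np.arange(-(n - 1), n)
estimated_shift = lags[np.argmax(xcorr)]
```

With this argument order a positive lag means that the first signal is delayed relative to the second.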
Cross-correlation and autocorrelation
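Cross-correlation (real-valued signals): $(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t + \tau)\, d\tau$
http://en.wikipedia.org/wiki/Convolution
Autocorrelation: $(f \star f)(t) = \int_{-\infty}^{\infty} f(\tau)\, f(t + \tau)\, d\tau$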
Autocorrelation
[Figure: signal, same signal, and their autocorrelation]
Cross-correlation
[Figure: signal, shifted signal, and their cross-correlation]
Cross-correlation
[Figure: signal, signal with half of the peaks shifted, and their cross-correlation]
How similar are two signals?
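Dot product
$A = (a_1, a_2, \ldots, a_n)$
$B = (b_1, b_2, \ldots, b_n)$
$A \cdot B = \sum_i a_i b_i = |A|\,|B| \cos\theta$
Identical vectors: $\theta = 0$, $A \cdot B = 1$
Perpendicular vectors: $\theta = \pi/2$, $A \cdot B = 0$
(for vectors normalized to unit length)
The dot product is the same as the cross-correlation at zero:
$(f \star g)(0) = \int f(\tau)\, g(\tau)\, d\tau$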
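The two limiting cases and the zero-lag identity can be checked on small vectors. A minimal sketch; the vectors and the helper name `cosine_similarity` are invented for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = A.B / (|A| |B|); 1 for parallel, 0 for perpendicular vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([-2.0, 1.0, -4.0, 3.0])   # constructed to be perpendicular to a
c = np.array([0.5, -1.0, 2.0, 0.0])

same = cosine_similarity(a, a)
perp = cosine_similarity(a, b)

# The dot product equals the cross-correlation at zero lag,
# which is element len(a) - 1 of the "full" output.
zero_lag = np.correlate(a, c, mode="full")[len(a) - 1]
```

Dividing by the vector lengths makes the score independent of overall intensity, which is why the normalized form is the one bounded by 1.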
What are the characteristics of the dot product?
[Figure: dot product for signal+noise and for noise alone; S/N = 10, 3, 1, 0.3, 0.1; dimensions = 10, 100, 1000]
Autocorrelation
[Figure: signal, shifted signal, sum of signal and shifted signal, and the autocorrelation]
Coincidence – enhances the signal
The signal-to-noise ratio can be dramatically increased by measuring several independent signals of the same phenomenon and combining these signals.
[Figure: ideal signal, four measurements, and the product of the four measurements]
Coincidence – suppresses and transforms the noise
[Figure: original noise and noise in the product]
Coincidence – suppresses interference
[Figure: ideal signal, four measurements with interference, and the product of the four measurements]
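The effect of multiplying independent measurements can be simulated. This is a sketch under assumptions: the peak shape, the noise level of 0.3, the masks, and the helper `peak_to_background` are all chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 1000)
ideal = np.exp(-x ** 2 / (2 * 0.05 ** 2))            # ideal signal: one Gaussian peak

# Four independent measurements of the same signal; the noise level is an assumption.
measurements = ideal + 0.3 * rng.normal(size=(4, len(x)))
product = np.prod(measurements, axis=0)

def peak_to_background(y, peak_mask, background_mask):
    """Largest value inside the peak divided by the RMS of the background."""
    return y[peak_mask].max() / np.sqrt(np.mean(y[background_mask] ** 2))

in_peak = np.abs(x) < 0.05
background = np.abs(x) > 0.3                         # far from the peak
ratio_single = peak_to_background(measurements[0], in_peak, background)
ratio_product = peak_to_background(product, in_peak, background)
```

The background of the product is the product of four small independent values, so it shrinks much faster than the peak does. Its distribution is no longer normal, which is the "transforms the noise" part.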
Peak Finding
The derivative of a function is zero at its minima and maxima.
The second derivative is negative at maxima and positive at minima.
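For sampled data the derivative rules become sign changes of finite differences. A minimal sketch, assuming a smooth, noise-free test signal; `find_extrema` is a hypothetical helper, and noisy data would normally be smoothed first.

```python
import numpy as np

def find_extrema(y):
    """Indices of interior local maxima and minima from sign changes of the first difference."""
    d1 = np.diff(y)                             # discrete first derivative
    sign_change = np.diff(np.sign(d1))          # negative where the slope turns from + to -
    maxima = np.where(sign_change < 0)[0] + 1   # slope goes from + to -
    minima = np.where(sign_change > 0)[0] + 1   # slope goes from - to +
    return maxima, minima

x = np.linspace(0.0, 2.0 * np.pi, 1000)
y = np.sin(2.0 * x)          # maxima at pi/4 and 5*pi/4, minima at 3*pi/4 and 7*pi/4
maxima, minima = find_extrema(y)
```

The positions `x[maxima]` and `x[minima]` are accurate to within one grid step.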
Detection of steps
Motivation: To demonstrate a general strategy for separating signal from noise:
1. Characterize the signal and the noise
2. Make a model of the data
3. Select detection method
4. Select parameters using simulations
[Figure; y-axis: intensity]
Detection of steps: Characterization of noise
Remove signal by subtracting a moving average
Detection of steps: Model of data
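points = 1000
x = linspace(-1, 1, points)
y = noise*random.normal(size=len(x))   # noise: standard deviation of the noise
y[points//2:] += signal                # signal: step height; // keeps the index an integer
[Figure panels: S/N = 0.75, 1, 2]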
Detection of steps: Detection method
Steps can be converted into peaks by calculating the difference between the moving average in two windows.
[Figure panels: S/N = 0.75, 1, 2]
Detection of steps: Detection method
[Figure grid: S/N = 0.75, 1, 2; bin size = 10, 30, 100; y-axis: average intensity]
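The two-window difference can be sketched with a cumulative sum. The helper name `step_score`, the seed, and the bin size of 100 are assumptions for this illustration; the data model follows the slide's step model with S/N = 2.

```python
import numpy as np

def step_score(y, w):
    """For each position, mean of the next w points minus mean of the previous w points."""
    csum = np.concatenate(([0.0], np.cumsum(y)))
    means = (csum[w:] - csum[:-w]) / w         # means[i] = mean(y[i:i+w])
    return means[w:] - means[:-w]              # score[j] = mean(y[j+w:j+2w]) - mean(y[j:j+w])

rng = np.random.default_rng(6)
points = 1000
noise, signal = 1.0, 2.0                       # S/N = 2
y = noise * rng.normal(size=points)
y[points // 2:] += signal

w = 100                                        # bin size (assumption for this sketch)
score = step_score(y, w)
step_estimate = np.argmax(score) + w           # index where the level changes
```

The score forms a triangular peak of width 2w around the step, so a larger bin size reduces the noise in the score but makes the peak broader.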
Detection of steps: Simulations - peak location
[Figure grid: S/N = 0.05, 0.25, 1; bin size = 10, 30, 100]
Detection of steps: Simulations – correct peak
[Figure grid: S/N = 0.05, 0.25, 1; bin size = 10, 30, 100; axes: frequency vs score]
Detection of steps: Simulations - FDR and FNR
[Figure grid: S/N = 0.05, 0.25, 1; bin size = 10, 30, 100; false rate vs threshold, with curves for the false discovery rate and the false negative rate]
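Curves of this kind can be produced by simulating data with and without a step and scoring both. The sketch below is one possible setup, not the procedure used for the slides: it assumes equal numbers of step and no-step data sets, takes the maximum two-window difference as the score, and uses a made-up threshold grid.

```python
import numpy as np

def max_step_score(y, w):
    """Largest difference between the means of two adjacent windows of w points."""
    csum = np.concatenate(([0.0], np.cumsum(y)))
    means = (csum[w:] - csum[:-w]) / w
    return np.max(means[w:] - means[:-w])

rng = np.random.default_rng(7)
points, w, trials = 400, 30, 200
noise, signal = 1.0, 1.0                       # S/N = 1

with_step = np.empty(trials)
without_step = np.empty(trials)
for i in range(trials):
    y = noise * rng.normal(size=points)
    without_step[i] = max_step_score(y, w)     # noise only
    y[points // 2:] += signal
    with_step[i] = max_step_score(y, w)        # same noise plus a step

thresholds = np.linspace(0.0, 2.0, 41)
# False negative rate: fraction of step data sets scoring below the threshold.
fnr = np.array([(with_step < t).mean() for t in thresholds])
# False discovery rate: fraction of all calls that come from noise-only data.
false_pos = np.array([(without_step >= t).sum() for t in thresholds])
true_pos = np.array([(with_step >= t).sum() for t in thresholds])
fdr = false_pos / np.maximum(false_pos + true_pos, 1)
```

Raising the threshold trades false discoveries for false negatives; plotting `fdr` and `fnr` against `thresholds` for several bin sizes is one way to pick both parameters.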
Peak Finding
1. Characterize the signal and the noise
2. Make a model of the data
3. Select detection method
4. Select parameters using simulations
[Figure; y-axis: intensity]
Peak Finding: Characterizing the noise
[Figure; y-axis: intensity]
Let's first try without removing the peaks.
Peak Finding: Characterizing the noise
Remove the peaks by looking for outliers in the root mean square deviation (RMSD).
[Figure; axes: intensity, RMSD]
Peak Finding: Characterizing the peaks
[Figure; y-axis: intensity]
Peak Finding: Model of data
points = 1000
x = linspace(-1, 1, points)
y = noise*random.normal(size=len(x))
y += signal*gaussian(x, 0, 0.01)
[Figure panels: S/N = 1, 2, 4]
Peak Finding: Detection method
[Figure panels: S/N = 1, 2, 4]
Peaks can be detected by finding maxima in the moving average with a window size similar to the peak width.
$S(l) = \sum_{k=l-w}^{l+w} I(k)$
Peak Finding: Detection method – moving average
[Figure grid: S/N = 1, 2, 4; bin size = 5, 20, 80, and the signal]
Peak Finding: Detection method – RMSD
[Figure grid: S/N = 1, 2, 4; bin size = 5, 20, 80, and the signal]
Peak Finding: Information about the Peak
[Figure: a peak (y-axis: intensity) annotated with its centroid (mean), full width at half maximum (FWHM), area, height, and maximum]
Peak descriptors: mean, variance, skewness, kurtosis
Information about a Peak
A peak is defined by $f(x)$
Centroid or mean: $\sum_x x\, f(x) / \sum_x f(x)$
$\mathrm{area} = \sum_x f(x)$
$\mathrm{height} = \max(f(x))$
To calculate any of these measures we need to know where the peak starts and ends.
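Once the start and end of a peak are chosen, the three measures are short sums. A minimal sketch on a hand-made peak; `peak_summary` and the index bounds are hypothetical, and the area is in units of intensity per sample.

```python
import numpy as np

def peak_summary(x, f, start, end):
    """Area, height and centroid of f(x) between indices start (inclusive) and end (exclusive)."""
    xs, fs = x[start:end], f[start:end]
    area = fs.sum()
    height = fs.max()
    centroid = np.sum(xs * fs) / area
    return area, height, centroid

x = np.arange(10, dtype=float)
f = np.array([0.0, 0.0, 1.0, 3.0, 5.0, 3.0, 1.0, 0.0, 0.0, 0.0])
area, height, centroid = peak_summary(x, f, start=2, end=7)
```

For this symmetric example the area is 13, the height is 5, and the centroid is 4, the position of the maximum.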
Where does a peak start and end?
Estimating peptide quantity
[Figure: peak height, curve fitting, and peak area, shown on intensity vs m/z]
Time dimension
[Figure: intensity vs m/z vs time; m/z vs time]
Sampling
[Figure: intensity vs retention time]
Sampling
[Figure: thresholds (90%) vs # of points; x-axis: 1 to 10; y-axis: 0.5 to 1.1]
What is the best way to estimate quantity?
Peak height: resistant to interference; poor statistics
Peak area: better statistics; more sensitive to interference
Curve fitting: better statistics; needs to know the peak shape; slow
Homework: Background Subtraction Using Smoothing
Summary
Fourier transform - transformation to frequency space and back
Signal – how do we detect and characterize signals?
Noise – how do we characterize noise?
Modeling signal and noise
Simulation to select thresholds and parameters
Filters – filtering by low-pass (i.e. smoothing) and high-pass filters (e.g. adaptive background correction)
Detection methods based on moving average and RMSD
Convolution - describes the response of a linear and time-invariant system to an input signal
Cross-correlation is a measure of similarity of two signals
Autocorrelation can be used for finding periodic signals obscured by noise
The dot product can be used to determine how similar two signals are
Coincidence measurements enhance the signal and suppress noise
The quantity associated with a peak – height and area
Sampling – how often do we need to sample a peak to get a good estimate of its area?