Drum Transcription in the presence of pitched instruments...

_________________________________________________________ISSC 2003, Limerick. July 1-2

Drum Transcription in the presence of pitched instrumentsusing Prior Subspace Analysis

Derry FitzGeraldφ, Bob Lawlor*, and Eugene Coyleφ

φMusic Technology Centre,Dublin Institute of Technology,

Rathmines Road, Dublin,IRELAND

E-mail:φ[email protected] ,[email protected]

* Department of Electronic Engineering,National University of Ireland,

Maynooth,IRELAND

E-mail: *[email protected]

__________________________________________________________________________________________

Abstract -- This paper demonstrates the use of Prior SubspaceAnalysis (PSA) as a method for transcribing drums in thepresence of pitched instruments. PSA uses prior subspaces thatrepresent the sources to be transcribed to overcome some of theproblems associated with other subspace methods such asIndependent Subspace Analysis (ISA) or sub-band ISA. The use ofprior knowledge results in improved robustness for transcriptionpurposes and enables the method to work more readily in thepresence of pitched instruments than other subspace methods.The system presented in this paper attempts to extend the use ofPSA to transcribe drum sounds in the presence of interferingpitched instruments.

__________________________________________________________________________________________

I INTRODUCTION

In the past few years a number of subspace methodssuch as Independent Subspace Analysis (ISA) andPrior Subspace Analysis (PSA) have been proposedfor sound source separation in single channelmixtures [1,2]. The underlying assumptions of thesemethods make them particularly suited to attemptingthe task of transcribing drums from single channelaudio mixtures, as has been shown in [2,3]. Howeverthese methods have dealt solely with the case wheredrums only are present. This system presented in thispaper attempts to extend the PSA method totranscribe drums robustly in the presence of pitchedinstruments.

II SUBSPACE METHODS FOR SOUNDSOURCE SEPARATION

The subspace methods described in [1,2,3] attempt torepresent sound sources as low dimensionalindependent subspaces in the time-frequency plane.

These methods make a number of assumptionsabout the signal. An input signal containing anumber of sound sources is transformed to a time-frequency representation such as a magnitude

spectrogram. It is assumed that the mixture signalspectrogram Y can be decomposed into l statisticallyindependent spectrograms Yj. These spectrograms areassumed to be represented by the outer product of aninvariant frequency basis function fj, and acorresponding invariant amplitude basis function tjwhich describes the variations in amplitude of thefrequency basis function over time. This yields:

∑∑==

==l

j

Tjj

l

jj tfY

11

Y (1)

These independent basis functions representfeatures of the individual sources. Each source ismade up of a number of these basis functions whichform a low dimensional subspace that represents thesound source.Where the subspace methods differ is in howdecomposition of the original mixture spectrogram Yinto outer product basis functions is achieved. ISAachieves the decomposition by performing PrincipalComponent Analysis on the mixture spectrogram.Components of low variance are then discarded toachieve low dimensionality. Independent ComponentAnalysis (ICA) [4] is then performed on theremaining components to obtain independentsubspaces. The above decomposition is performed ina totally blind manner and makes no use ofinformation about sources known to be present in the

mailto:[email protected]

mailto:*[email protected]

mixture. A detailed description of the abovedecomposition can be found in [1].

Though an effective means of separating soundmixtures there are significant limitations to the ISAmethod. Firstly the assumption that the basisfunctions are invariant means no pitch changes areallowed in the overall spectrogram. However this isnot a problem when dealing with sources that can beconsidered stationary in pitch such as drum sounds,making ISA suited to dealing with drum sounds.

Secondly estimating the number of componentsto retain from PCA remains a problem. The numberof components required for separation varies with thefrequency and amplitude characteristics of the sourcesounds, and the threshold method proposed in [1]cannot adequately predict the required number ofcomponents. This results in the necessity of anobserver to decide the number of components toretain. An attempt to overcome this problem bymeans of sub-band preprocessing is described in [3].

Despite these limitations ISA provides a methodof overcoming the problem of identifying mixturesof drums encountered by Sillanpää et al when tryingto identify and transcribe mixtures of drums [5].

On the other hand PSA assumes that there existsknown prior frequency basis functions fp that aregood initial approximations to the actual basisfunctions of the sources of interest. Substitutingthese fp for the fj in equation 1 yields:

∑=

≈l

j

Tjptf

1Y (2)

Multiplying the overall spectrogram Y by thepseudoinverse of the prior frequency subspacesyields estimates of the amplitude basis functions, t̂ :

Yft pp=ˆ (3)where fpp is the pseudoinverse of fp . However theamplitude basis functions returned are notindependent and so ICA is carried out on t̂ to give

tWt ˆ= (4)

where W is the unmixing matrix obtained using ICA.Improved estimates of the frequency basis functionscan then be obtained from

( )TpYtf = (5)

PSA uses prior knowledge to obtain the mostimportant information specifically on the sources ofinterest and so overcomes the problem of estimatingthe amount of information needed for separation thatis associated with ISA. PSA also relaxes theassumption that no pitch changes are allowed in theoverall spectrogram. Instead it assumes that only thesources of interest are stationary in pitch. This makesPSA suitable for attempting to transcribe drums inthe presence of pitched instruments. PSA isdemonstrated in Figure 1, which shows the

amplitude envelopes obtained from analysing a drumloop using PSA.

Figure 1. Drum loop separation using PSA

The prior subspaces used in this example werecreated by analysing large numbers of each type ofdrum. An ISA-type analysis such as described in [6]was carried out on each example. As mentionedpreviously this amounts to carrying out PCAfollowed by ICA on the spectrogram of the example.The first three principal components retained fromthe PCA step were passed to the ICA algorithm andthe resulting independent frequency subspace withthe largest projected variance was taken to representthe example. K-means clustering was then carriedout on the frequency subspaces for a given drum typeto yield a single subspace that best characterised agiven drum type.

PSA was initially tested on 15 drum loopscontaining snares, kick drums and hi-hats. Itachieved an overall success rate of 92.5% insuccessfully identifying the drums present. Thisrepresents an improvement over the 89.5% successrate achieved using sub-band ISA on the samesignals [2]. PSA was found to be better than sub-band ISA in correctly identifying hi-hats and wasalso significantly faster than ISA or sub-band ISAdue to the fact that PSA does not require the use ofPCA. In tests on the same signals PSA was found tobe approximately ten times faster than sub-band ISAand five times faster than ISA.

III PSA IN THE PRESENCE OF PITCHEDINSTRUMENTS

It was previously noted that as the basis functionsobtained by ISA are invariant no pitch changes areallowed within the sources present. It was also notedthat PSA provides a relaxation of this assumption inthat this restriction now only applies to the sourcesbeing searched for. As already noted drum soundsmeet this criterion, making PSA a valuable tool fordrum transcription. As it is no longer required that allthe sources present be stationary in pitch, only thesources being searched for, it is possible to extendPSA to work in the presence of pitched instruments.

However a number of issues must be addressedbefore PSA can be used to transcribe drums in thepresence of pitched instruments.

The first of these is to note that the presence of alarge number of pitched instruments will cause apartial match with the prior subspace used to identifya given drum. This causes interference in therecovered amplitude envelope, which can in turnmake detection of the drums more difficult. Howeverit should be noted that pitched instruments haveharmonic spectra with resulting regions of lowintensity between partials. Furthermore due to therules of harmony used in popular music many of thepitches played simultaneously will be in harmonicrelation to each other and so will have manyoverlapping partials.

As a result every time pitched instruments occurthere will be regions in the frequency spectrumwhere little or no energy is present due to pitchedinstruments. It can therefore be seen that using ahigher frequency resolution reduces the interferencedue to the pitched instruments, and as a resultimproves the likelihood of recognition of the drums.

This is demonstrated in Figure 2, which showsthe snare amplitude envelopes obtained fromspectrograms of an excerpt from a pop song. Thespectrograms had FFT sizes of 512 and 4096respectively. The interference due to otherinstruments can be seen to be greatly reduced at thehigher frequency resolution, and as a result the snaredrum is more easily identified at the higherfrequency resolution. However the use of higherfrequency resolution comes at the price of areduction in the time resolution, which leads toinaccuracies in the detected onset times of the drumevents.

Figure 2. Snare envelopes at different frequencyresolutions

Despite the use of high frequency resolution theinterference present in the hi-hat subspace was insome cases found to be considerably greater than thatin the bass drum or snare subspaces. This causedproblems in trying to identify hi-hat events. Theextra interference appears to be as a result of the factthat the hi-hat prior subspace has its energy spread

out over a greater range of the spectrum than thesnare and kick drum, making it more sensitive to thepresence of pitched instruments.

However by noting that most of the energy ofpop songs is contained in the lower region of thespectrum, it is possible to overcome this problem.The power spectral density (PSD) of a signal givesan estimate of the average power at each point in thespectrum [6]. Dividing a spectrogram by the PSDwill emphasise those regions of the spectrum wherethere is less power, in this case the upper regions ofthe spectrum. This results in improvedrecognisability of the hi-hats. This is demonstrated inFigure 3 which shows the hi-hat amplitude envelopesobtained from an excerpt from a pop song both withand without PSD normalisation. The PSD wasobtained using an eigenvector method using a smallnumber of eigenvectors to capture only the broadregions where most of the energy occurs.

During testing of the modified PSA algorithm itwas discovered that while successful in many cases,in some cases the algorithm did not performcorrectly. Further analysis revealed that this was as aresult of the sensitivity of the ICA algorithm to theinterference or noise due to the presence of pitchedinstruments remaining in the snare and kick drumamplitude envelopes.

To overcome this problem all values in theamplitude envelope below a set threshold are set tozero. A normalised amplitude of 0.4 was found to bea suitable threshold for both the snare and kick drum.This operation is not carried out on the hi-hats as theinterference was found to have been sufficientlyeliminated by the PSD normalisation step.

Figure 3. Hi-hat amplitude envelopes with/withoutPSD step

However the thresholding operation was foundto have another consequence. The resulting snare andkick drum envelopes contained large areas of noactivity, with sudden and sharp peaks occurring whena snare or kick occurred. This contrasts with themore natural peaks and decays occurring in the hi-hat envelope. When these very different amplitudeenvelopes were input to an ICA algorithm the

resulting independent signals contained unusualartifacts such as numerous sudden large amplitudemodulations which were detected as events wherenone was present. To eliminate this problem it wasnecessary to carry out ICA on only the snare andbass drum amplitude envelopes, as they arecomparable in that they both contain sharp peaks andlarge areas of no activity. This resulted in the correctseparation of bass drums and snare drums in mostcases. The hi-hat envelope is passed directly to theonset detection algorithm. While this gives goodresults in general it can result in extra errors indetection of hi-hats. As the hi-hat amplitudeenvelope no longer undergoes ICA the algorithmloses the ability to distinguish between a snareoccurring on its own and a snare and hi-hat occurringsimultaneously. However in many cases a hi-hatdoes occur simultaneously with the snare, so thisonly results in a small reduction in the efficiency ofthe transcription algorithm.

IV DRUM TRANSCRIPTION IN THEPRESENCE OF PITCHED INSTRUMENTS

To test the ability of PSA to transcribe drums in thepresence of pitched instruments a drum transcriptionsystem was implemented in Matlab. The systemimplemented deals only with snares, bass drums andhi-hats. Due to the source signal ordering problem inthe ICA step it is assumed that the bass drum has alower spectral centroid than the snare. The systemwas tested on 20 excerpts taken at random from popsongs from as wide a range of styles as possibleranging from pop to disco and rock. The drumpatterns from these excerpts were transcribed by anexpert listener.

Because of the imperfect separation of the ICAstep the amplitude envelopes were normalised andonsets over a given threshold were taken to be adrum onset. The same threshold was used for bothsnare and kick drums while a lower threshold wasused for the hi-hats. This reflects the fact that theamplitude of the hi-hats in real world examples canvary widely depending on the style of drumming.The results obtained are outlined in Table 1. Thoughthe results demonstrate the effectiveness of PSA as amethod for transcribing drums in the presence ofpitched instruments a greater number of errors occurthan for PSA with drums only. Possible reasons forthis are discussed below.

Type Total Missing Incorrect %Snare 57 1 9 82.5Kick 84 4 7 86.9

Hi-hats 238 14 30 81.5Overall 379 19 46 82.8

Table 1. Drum Transcription Results

In the case of the bass drums, six snare eventswere incorrectly identified as bass drums. These

errors occurred in excerpts where a “disco” style ofdrumming was employed. In these excerpts the snaredrum is typically less bright than in the other genresof music, and so a greater chance of incorrectidentification is the result. Only one of the incorrectbass drum detections was as a result of a bass guitarnote being identified as a bass drum. The missingfour undetected bass drum events were visible on theamplitude envelope of the excerpts in question, butwere below the threshold for detection. The bassdrums at these points were audibly lower than theother bass drum events in the excerpts.

In the case of the snare drum, five of theincorrect snares were as a result of the combinationof a bass drum and a hi-hat occurring simultaneouslybeing mistaken for snares. This happened in twoexcerpts. The remaining errors occurred as a result ofnoise due to pitched instruments.

With regards to the hi-hats the majority ofincorrect identifications were as a result ofinterference that had not been eliminated in the PSDnormalisation step. In two cases an event with thecharacteristics of a hi-hat was clearly visible in boththe spectrogram and the recovered amplitudeenvelope, but no event of this type was audible to thelistener. These events may be genuine hi-hat eventsthat have been masked by other audio events, but asthere is no way of determining this for excerpts fromcommercial recordings, these onsets have beenclassed as incorrect detections. In the case of theundetected hi-hats the majority of the hats wereclearly visible in the amplitude envelopes, but belowthe threshold required for identification. Furtherimprovements may be possible by adjusting thethresholds for detection but there is a trade-offbetween reducing the number of incorrectidentifications and increasing the number of missedevents.

Due to the limitations in the time resolution ofthe STFT, the detection of onset times had anaverage error of 10ms. It should be noted that thiserror tended to be consistent across all the drums in agiven loop, so that inter-onset intervals remainedconsistent within a given loop. However it is stilldesirable to improve the accuracy of onset detectionin PSA.

It should be noted that these results wereobtained without the use of any form of rhythmicmodelling to predict when a given drum was mostlikely to occur.

V CONCLUSIONS & FUTURE WORK

Prior Subspace Analysis has been shown to be aviable approach for the transcription of drums in thepresence of pitched instruments, overcoming some ofthe problems associated with Independent SubspaceAnalysis. Further work needs to be done to improvethe correct identification of the drums and to increasethe accuracy of the onset times. It is also proposed to

generalise the method to deal with an increasednumber of drum types.

REFERENCES

[1] Casey, M. & Westner, A., “Separation of MixedAudio Sources By Independent SubspaceAnalysis” Proceedings Of ICMC 2000, pp. 154-161, Berlin, Germany, 2000.

[2] FitzGerald, D., Lawlor, B., Coyle, E., “PriorSubspace Analysis for Drum Transcription”,114th AES Conference Amsterdam March 22nd–25th 2003

[3] FitzGerald, D., Coyle E, Lawlor B., “Sub-bandIndependent Subspace Analysis for DrumTranscription”, Proceedings of the. DigitalAudio Effects Conference (DAFX02), Hamburg,pp. 65-69, 2002.

[4] Hyvärinen A. & Oja E., “IndependentComponent Analysis: Algorithms andApplications”. Neural Networks, 13(4-5): pp.411-430, 2000.

[5] Sillanpää J., Klapuri A., Seppänen J., VirtanenT., “Recognition of acoustic noise mixtures bycombining bottom-up and top-downprocessing”. In proc. European SignalProcessing Conference, EUSIPCO 2000.

[6] Casey, M., “Auditory Group Theory: withApplications to Statistical Basis Methods forStructured Audio”, Ph.D. Thesis, MIT MediaLab, February 1998.

[7] Vaseghi, Saeed V., Advanced Digital SignalProcessing and Noise Reduction, 2nd ed. JohnWiley & Sons Ltd. pp. 270-290.

Drum Transcription in the presence of pitched instruments...

Documents

Transcript of Drum Transcription in the presence of pitched instruments...