Music And Speech Analysis Using The 'Bach' Scale Filter-Bank

Ananthakrishnan, G

View/Open

G21044.pdf (1.562Mb)

Date

2009-08-13

Author

Ananthakrishnan, G

Metadata

Show full item record

Abstract

The aim of this thesis is to deﬁne a perceptual scale for the ‘Time-Frequency’ analysis of music signals. The equal tempered ‘Bach ’ scale is a suitable scale, since it covers most of the genres of music and the error is equally distributed for each semi-tone. However, it may be necessary to allow a tolerance of around 50 cents or half the interval of the Bach scale, so that the interval can accommodate other common intonation schemes. The thesis covers the formulation of the Bach scale ﬁlter-bank as a time-varying model. It makes a comparative study with other commonly used perceptual scales. Two applications for the Bach scale ﬁlter-bank are also proposed, namely automated segmentation of speech signals and transcription of singing voice for query-by-humming applications. Even though this ﬁlter-bank is suggested with a motivation from music, it could also be applied to speech. A method for automatically segmenting continuous speech into phonetic units is proposed. The results, obtained from the proposed method, show around 82% accuracy for the English and 85% accuracy for the Hindi databases. This is an improvement of around 2 -3% when the performance is compared with other popular methods in the literature. Interestingly, the Bach scale ﬁlters perform better than the ﬁlters designed for other common perceptual scales, such as Mel and Bark scales. ‘Musical transcription’ refers to the process of converting a musical rendering or performance into a set of symbols or notations. A query in a ‘query-by-humming system’ can be made in several ways, some of which are singing with words, or with arbitrary syllables, or whistling. Two algorithms are suggested to annotate a query. The algorithms are designed to be fairly robust for these various forms of queries. The ﬁrst algorithm is a frequency selection based method. It works on the basis of selecting the most likely frequency components at any given time instant. The second algorithm works on the basis of ﬁnding time-connected contours of high energy in the ‘Time-Frequency’ plane of the input signal. The time domain algorithm works better in terms of instantaneous pitch estimates. It results in an error of around 10 -15%, while the frequency domain method results in an error of around 12 -20%. A song rendered by two diﬀerent people will have quite a few diﬀerent properties. Their absolute pitches, rates of rendering, timbres based on voice quality and inaccuracies, may be diﬀerent. The thesis discusses a method to quantify the distance between two diﬀerent renderings of musical pieces. The distance function has been evaluated by attempting a search for a particular song from a database of a size of 315, made up of songs sung by both male and female singers and whistled queries. Around 90 % of the time, the correct song is found among the top ﬁve best choices picked. Thus, the Bach scale has been proposed as a suitable scale for representing the perception of music. It has been explored in two applications, namely automated segmentation of speech and transcription of singing voices. Using the transcription obtained, a measure of the distance between renderings of musical pieces has also been suggested.

URI

https://etd.iisc.ac.in/handle/2005/592

Collections

Electrical Engineering (EE) [423]