Feature transformation for speaker identification
Abstract
The gradual automation of services and transactions that previously required human handling has led to a growing interest in biometrics, unique and measurable human traits used for automatic identity verification. Among the many physiological characteristics that qualify as biometrics, voice has emerged as a natural choice in many applications, especially those involving telephone-based transactions.
Most text-independent speaker recognition systems attempt to model speaker individuality by employing features that carry information about the vocal tract configuration. These features, such as Mel-cepstrum and LPC-cepstrum, also contain significant phonetic information. In fact, these are the same features used by most speech recognition systems.
In this thesis, we present a linear feature transformation technique that maps the original feature vectors onto a more speaker-specific subspace. From the perspective of speaker identification, speaker information in the feature vectors is considered the signal, while phonemic information is treated as noise. The directions in the feature space along which the signal-to-noise ratio is maximized are obtained using generalized eigenvalue decomposition of the signal and noise covariance matrices.
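As a minimal sketch of this step, assuming the signal and noise covariance matrices have already been estimated (the names snr_transform, S_signal, and S_noise are illustrative, not taken from the thesis), the optimal directions can be computed with a standard generalized eigensolver:

    import numpy as np
    from scipy.linalg import eigh

    def snr_transform(S_signal, S_noise, n_dims):
        # Solve the generalized eigenproblem  S_signal v = lam * S_noise v.
        # Eigenvectors with the largest eigenvalues point along directions that
        # maximize speaker ("signal") variance relative to phonemic ("noise")
        # variance.
        lam, V = eigh(S_signal, S_noise)        # eigenvalues in ascending order
        order = np.argsort(lam)[::-1][:n_dims]  # keep the top n_dims directions
        return V[:, order]                      # columns form the transform matrix

Each feature vector x is then projected onto the more speaker-specific subspace as y = W.T @ x, where the columns of W are the retained eigenvectors.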
We study three different transformations:
Fisher’s Discriminant Analysis, which maximizes the ratio of inter-class scatter to intra-class scatter (a sketch of the scatter computation follows this list).
Divergence Maximization, which maximizes the information-theoretic distance between each pair of speakers.
Cepstral Difference, which captures formant variations between different phonemes and speakers.
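As a concrete illustration of the first transformation, here is a hedged sketch of how the inter-class and intra-class scatter matrices might be accumulated from labeled cepstral vectors (the array and function names are our assumptions); the resulting pair can be fed to the generalized eigensolver sketched above:

    import numpy as np

    def fisher_scatter(features, labels):
        # features: (n_frames, n_coeffs) cepstral vectors; labels: speaker ids.
        # Returns the inter-class (between-speaker) and intra-class
        # (within-speaker) scatter matrices.
        mu = features.mean(axis=0)
        d = features.shape[1]
        Sb = np.zeros((d, d))
        Sw = np.zeros((d, d))
        for c in np.unique(labels):
            Xc = features[labels == c]
            mc = Xc.mean(axis=0)
            Sb += len(Xc) * np.outer(mc - mu, mc - mu)
            Sw += (Xc - mc).T @ (Xc - mc)
        return Sb, Sw

    # W = snr_transform(Sb, Sw, n_dims)  # Fisher directions via the same solver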
Speaker identification experiments were conducted on two subsets of the TIMIT database. In addition to Mel-cepstrum and LPC-cepstrum, we used two variants of the Discrete Cosine Transformed Cepstrum, a new feature that alleviates the phase-unwrapping problem encountered in Fourier transform-based realizations of the complex cepstrum. Each speaker was modeled with a vector quantization codebook of 30 code vectors.
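The following sketch illustrates this modeling step under common assumptions (k-means training, Euclidean distortion, and the names train_codebook and identify are ours, not necessarily the exact procedure of the thesis):

    import numpy as np
    from sklearn.cluster import KMeans

    def train_codebook(features, size=30):
        # features: (n_frames, n_coeffs) cepstral vectors from one speaker
        return KMeans(n_clusters=size, n_init=10).fit(features).cluster_centers_

    def identify(test_features, codebooks):
        # codebooks: dict mapping speaker id -> (size, n_coeffs) codebook.
        # Assign the utterance to the speaker whose codebook yields the
        # lowest average quantization distortion over all frames.
        def avg_distortion(cb):
            d = np.linalg.norm(test_features[:, None, :] - cb[None, :, :], axis=2)
            return d.min(axis=1).mean()   # nearest code vector per frame
        return min(codebooks, key=lambda spk: avg_distortion(codebooks[spk]))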
Results show that speaker identification rates with transformed features are significantly higher than with original features. On a 25-speaker database, the average identification rate with original features was 62.8%, while the rate after transformation was 77.1%.
We also evaluated the generalization capability of the transformations by applying those derived from the 25-speaker set to a different set of 50 speakers. The improvement in identification rate after transformation was 16.7%. Among the three methods, the cepstral difference transformation, which captures perceptually significant formant information, outperformed the other two.
The feature transformation can be interpreted as taking the inner product of the feature vectors with the eigenvectors forming the columns of the transformation matrix. Alternatively, this inner product can be viewed as circular convolution followed by decimation, leading to the interpretation of the transformation matrix as a filter bank. Using the relation between LPC-cepstrum and log spectrum, we interpret the eigenvectors as filters that shape the log spectrum of the speech signal. This analysis shows that the low and high frequency regions of the speech spectrum convey more speaker information than the middle frequency regions.
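Concretely, using the cepstrum relation log|S(omega)| = sum_n c_n cos(n*omega), each eigenvector v can be evaluated as a weighting function over frequency; a small sketch, assuming v weights the cepstral coefficients c_1 through c_P with c_0 excluded:

    import numpy as np

    def eigenvector_as_filter(v, n_points=256):
        # Evaluate W(omega) = sum_n v[n] * cos(n * omega) on [0, pi], i.e. the
        # shaping that the eigenvector applies to the log spectrum.
        omega = np.linspace(0.0, np.pi, n_points)
        n = np.arange(1, len(v) + 1)      # cepstral indices 1..P (c0 dropped)
        return omega, np.cos(np.outer(omega, n)) @ v

Large magnitudes of the resulting curve near omega = 0 and omega = pi would reflect the low- and high-frequency emphasis described above.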
To further investigate the importance of different frequency bands, we study the subspace pattern recognition method, where each class is viewed as a subspace in the feature space and represented by a set of orthonormal basis vectors obtained via Principal Component Analysis (PCA). The interpretation of these principal components as filters confirms the significance of the low and high frequency regions of the speech spectrum for speaker identification.
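A minimal sketch of such a subspace classifier, assuming the classic decision rule that assigns a vector to the class whose subspace retains the largest share of its energy (the function names and subspace dimension k are illustrative):

    import numpy as np

    def class_subspace(features, k=8):
        # Orthonormal basis for the top-k principal directions of one class,
        # obtained from the SVD of the mean-removed feature matrix.
        X = features - features.mean(axis=0)
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        return vt[:k].T                   # (n_coeffs, k), columns orthonormal

    def classify(x, subspaces):
        # Pick the class whose subspace captures the most energy of x.
        return max(subspaces, key=lambda s: np.linalg.norm(subspaces[s].T @ x))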