Algorithms for Multilingual IR in Low Resource Languages using Weakly Aligned Corpora
Abstract
Multilingual information retrieval (MLIR) methods generally rely on linguistic resources such
as dictionaries, parallel corpora, etc., to overcome the language barrier. For low resource
languages without these resources, an alternative approach is to use cross-lingual topical
correspondence models learned from document-aligned multilingual corpora. However, there is
a large and growing corpus of "weakly aligned" text in the form of user-generated content (UGC)
data from social networks, commenting widgets, etc. that has been ignored by researchers so
far. Modeling such text is made more challenging by romanization, sparsity, and noise. Topic models
learned from such text are not readily usable for all applications. In particular, the size of the
textual units of interest has a strong bearing on the methods used for applying these models.
In this thesis, we develop a series of hierarchical Bayesian models for capturing topical
correspondence in multilingual news-comment corpora. The final model, called the Multi-glyphic
Correspondence Topic Model (MCTM), captures different kinds of topical relationships, and
effectively handles problems of data sparsity, noise and comment romanization.
From an application perspective, we consider three MLIR problems corresponding to different
levels with respect to the size of the textual units involved. The first MLIR problem (at
the level of single words) involves inducing translation correspondences in arbitrary language
pairs using Wikipedia data. We present an approach for translation induction that leverages
Wikipedia corpora in auxiliary languages using the notion of translingual themes. We propose
extensions that enable the computation of cross-lingual semantic relatedness between words and
apply the method to the task of cross-lingual Wikipedia title suggestion. The second MLIR
problem (at the level of sentences or paragraphs) involves the detection of topical relevance
between specific portions of articles and comments. Using MCTM, we devise methods for topical
categorization of comments with respect to the article. In the third MLIR problem (at the level
of document collections), the task is to generate keyword summaries for a cluster of documents.
We develop an architecture for performing Scatter/Gather on multilingual document collections
and a labeling procedure that generates keyword labels on-the-fly to summarize a multilingual
document cluster in any language.