Algorithms for Multilingual IR in Low Resource Languages using Weakly Aligned Corpora

Tholpadi, Goutham

dc.contributor.advisor	Bhattacharyya, Chiranjib
dc.contributor.advisor	Shevade, Shirish
dc.contributor.author	Tholpadi, Goutham
dc.date.accessioned	2021-09-29T09:36:27Z
dc.date.available	2021-09-29T09:36:27Z
dc.date.submitted	2018
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/5360
dc.description.abstract	Multilingual information retrieval (MLIR) methods generally rely on linguistic resources such as dictionaries, parallel corpora, etc., to overcome the language barrier. For low resource languages without these resources, an alternative approach is to use cross-lingual topical correspondence models learned from document-aligned multilingual corpora. However, there is a large and growing corpus of "weakly aligned" text in the form of user-generated content (UGC) data from social networks, commenting widgets, etc. that has been ignored by researchers so far. This task is made more challenging due to romanization, sparsity, and noise. Topic models learned from such text are not readily usable for all applications. In particular, the size of the textual units of interest has a strong bearing on the methods used for applying these models. In this thesis, we develop a series of hierarchical Bayesian models for capturing topical correspondence in multilingual news-comment corpora. The final model, called Multi-glyphic Correspondence Topic Model (MCTM), captures different kinds of topical relationships, and effectively handles problems of data sparsity, noise and comment romanization. From an application perspective, we consider three MLIR problems corresponding to different levels with respect to the size of the textual units involved. The first MLIR problem (at the level of single words) involves inducing translation correspondences in arbitrary language pairs using Wikipedia data. We present an approach for translation induction that leverages Wikipedia corpora in auxiliary languages using the notion of translingual themes. We propose extensions that enable the computation of cross-lingual semantic relatedness between words and apply the method to the task of cross-lingual Wikipedia title suggestion.
The second MLIR problem (at the level of sentences or paragraphs) involves the detection of topical relevance between specific portions of articles and comments. Using MCTM, we devise methods for topical categorization of comments with respect to the article. In the third MLIR problem (at the level of document collections), the task is to generate keyword summaries for a cluster of documents. We develop an architecture for performing Scatter/Gather on multilingual document collections and a labeling procedure that generates keyword labels on-the-fly to summarize a multilingual document cluster in any language.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	;G29412
dc.rights	I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.	en_US
dc.subject	Multilingual information retrieval	en_US
dc.subject	information retrieval	en_US
dc.subject	parallel corpora	en_US
dc.subject	hierarchical Bayesian models	en_US
dc.subject	Wikipedia	en_US
dc.subject	Multi-glyphic Correspondence Topic Model	en_US
dc.subject.classification	Research Subject Categories::TECHNOLOGY::Information technology::Computer science	en_US
dc.title	Algorithms for Multilingual IR in Low Resource Languages using Weakly Aligned Corpora	en_US
dc.type	Thesis	en_US
dc.degree.name	PhD	en_US
dc.degree.level	Doctoral	en_US
dc.degree.grantor	Indian Institute of Science	en_US
dc.degree.discipline	Engineering	en_US

Files in this item

Name: G29412.pdf
Size: 2.358Mb
Format: PDF
Description: Thesis full text

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [542]