Cluster Identification : Topic Models, Matrix Factorization And Concept Association Networks

Arun, R

dc.contributor.advisor	Veni Madhavan, C E
dc.contributor.author	Arun, R
dc.date.accessioned	2013-09-17T07:31:21Z
dc.date.accessioned	2018-07-31T04:38:24Z
dc.date.available	2013-09-17T07:31:21Z
dc.date.available	2018-07-31T04:38:24Z
dc.date.issued	2013-09-17
dc.date.submitted	2010
dc.identifier.uri	https://etd.iisc.ac.in/handle/2005/2247
dc.identifier.abstract	https://etd.iisc.ac.in/static/etd/abstracts/2861/G24683-Abs.pdf	en_US
dc.description.abstract	The problem of identifying clusters arising in the context of topic models and related approaches is important in the area of machine learning. The problem concerning traversals on Concept Association Networks is of great interest in the area of cognitive modelling. Cluster identification is the problem of finding the right number of clusters in a given set of points (or a dataset) in different settings, including topic models and matrix factorization algorithms. Traversals in Concept Association Networks provide useful insights into cognitive modelling and performance. First, we consider the problem of authorship attribution in stylometry and the problem of cluster identification for topic models. For the problem of authorship attribution, we show empirically that, using stop-words as stylistic features of an author, classification based on vectors obtained from Latent Dirichlet Allocation (LDA) outperforms other classifiers. Topics obtained by this method are generally abstract, and it may not be possible to judge the cohesiveness of words falling in the same topic by manual inspection alone. Hence it is difficult to determine whether the chosen number of topics is optimal. We next address this issue. We propose a new measure for topics arising out of LDA based on the divergence between the singular value distribution and the L1 norm distribution of the document-topic and topic-word matrices, respectively. It is shown that under certain assumptions, this measure can be used to find the right number of topics. Next we consider the Non-negative Matrix Factorization (NMF) approach for clustering documents. We propose entropy-based regularization for a variant of NMF with row-stochastic constraints on the component matrices. It is shown that when topic-splitting occurs (i.e., when an extra topic is required), an existing topic vector splits into two and the divergence term in the cost function decreases whereas the entropy term increases, leading to a regularization. Next we consider the problem of clustering in Concept Association Networks (CAN). CANs are generic graph models of relationships between abstract concepts. We propose a simple clustering algorithm which takes into account the complex network properties of CAN. The performance of the algorithm is compared with that of the graph-cut based spectral clustering algorithm. In addition, we study the properties of traversals by human participants on CAN. We obtain experimental results contrasting these traversals with those obtained from (i) random walk simulations and (ii) shortest path algorithms.	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	G24683	en_US
dc.subject	Machine Learning	en_US
dc.subject	Clustering (Concepts)	en_US
dc.subject	Association Networks	en_US
dc.subject	Concept Association Networks (CAN)	en_US
dc.subject	Latent Dirichlet Allocation (LDA)	en_US
dc.subject	Matrix Factorization	en_US
dc.subject	Concept Association Networks - Clustering	en_US
dc.subject	Concept Association Networks - Traversals	en_US
dc.subject	Entropy (Information Theory)	en_US
dc.subject	Cognitive Clustering	en_US
dc.subject	Cluster Identification	en_US
dc.subject	Non-negative Matrix Factorization (NMF)	en_US
dc.subject.classification	Computer Science	en_US
dc.title	Cluster Identification : Topic Models, Matrix Factorization And Concept Association Networks	en_US
dc.type	Thesis	en_US
dc.degree.name	MSc Engg	en_US
dc.degree.level	Masters	en_US
dc.degree.discipline	Faculty of Engineering	en_US

Files in this item

Name: G24683.pdf
Size: 1.675Mb
Format: PDF

This item appears in the following Collection(s)

Computer Science and Automation (CSA) [561]

Related items

Spatially Correlated Data Accuracy Estimation Models in Wireless Sensor Networks

Novel Concepts In Divisible Load Scheduling With Realistic System Constraints

Overcoming Challenges Associated with Hydrogen Storage Efficiency and Fuel Cell Catalysis : An Ab Initio Study