Statistical Network Analysis: Community Structure, Fairness Constraints, and Emergent Behavior
Abstract
Networks or graphs provide mathematical tools for describing and analyzing relational data. They are used in biology to model interactions between proteins, in economics to identify trade alliances among countries, in epidemiology to study the spread of diseases, and in computer science to rank webpages on a search engine, to name a few. Each application domain in this wide assortment encounters networks with diverse properties and imposes various constraints. For example, networks may be dynamic, heterogeneous, or attributed, and an application domain may require a fairness constraint on the communities (e.g. requiring communities in a social network to be balanced with respect to genders). However, most existing research is concerned with the simplest type of networks with a fixed set of nodes and edges and focuses on the canonical forms of tasks like community detection and link prediction. This thesis aims at bridging this gap between the simplistic problem settings considered in the literature and the complex requirements of real-world applications by proposing community detection and link prediction methods to analyze different types of networks from various perspectives.
Our first contribution includes two spectral algorithms for finding `fair' communities in a given network $\calG$. We define what it means for communities to be fair from the perspective of each node (a.k.a. individual fairness). This is done via an auxiliary `representation graph' $\calR$ that connects nodes if they can represent each other's interests in various communities. Informally speaking, a node finds communities fair if its neighbors from $\calR$ are propotionally distributed across all communities in $\calG$. The goal is to find communities that are considered fair by all nodes. We show that this fairness criterion \textbf{(i)} generalizes the well-explored idea of statistical fairness and \textbf{(ii)} is also applicable in cases where sensitive node attributes (like gender and race) are not observable but instead manifest themselves as intrinsic or latent features in $\calR$. We develop fair spectral clustering algorithms and prove that they are weakly consistent ($\#\text{mistakes} = o(N)$ with probability $1 - o(1)$) under a proposed variant of the stochastic block model.
Second, we propose a community-based statistical model for dynamic networks where edges appear and disappear over time. Many networks like social networks, citation networks, contact networks, etc., are dynamic in nature. Our model embeds the nodes and communities in a $d$-dimensional latent space and specifies a procedure for updating these embeddings over time to model the network's evolution. Given an observed dynamic network, we infer these latent quantities using variational inference and use them for link forecasting and community detection. Unlike existing approaches, our model supports the birth and death of communities. It also allows us to use powerful neural networks during inference. Experiments demonstrate that our model is better at link forecasting and community detection as compared to existing approaches. Moreover, it discovers \textit{stable} communities, as quantified by the normalized mutual information (NMI) score between communities discovered at successive time steps. This desirable quality is absent in methods that ignore the network dynamics.
Third, we propose a statistical model for heterogeneous dynamic networks where the nodes and relations additionally have a \textit{type} associated with them (e.g., knowledge graphs). Besides the latent node attributes, this model also encodes a set of \textit{interaction matrices} for each type of relation. These matrices specify the affinity between nodes based on their attribute values and can represent both homophyllic (like attracts like) and heterophyllic relationships (opposites attract). We develop a scalable neural network-based inference procedure for this model and demonstrate that it outperforms existing state-of-the-art approaches on several homogeneous and heterogeneous dynamic network datasets, particularly the temporal knowledge graphs.
Fourth, we develop a model for networks with node covariates to bring explainability to community detection. This model integrates node covariates into a stochastic block model using restricted Boltzmann machines. We subscribe to the view that a community can be explained by identifying the defining covariates of its member nodes. Our model provides the relative importance of various covariates in each community, thereby explaining its decision to group the members. Existing approaches for modeling networks with covariates lack this property, especially the ones that are based on deep neural networks. We also derive an efficient inference procedure that runs in linear time in the number of nodes and edges. Experiments confirm that our model's community detection performance is comparable with recent deep neural network-based approaches. However, it additionally offers the advantage of explainability.
The discussion till this point views communities as passive structures arising out of interactions between nodes. However, just like existing links in a network determine future links, communities also play a functional role in shaping the behavior of the nodes (for example, preference for a clothing brand). Our final contribution explores this functional view of communities and shows that they affect emergent communication in a networked multi-agent reinforcement learning setting.