Epistasis Detection and Phenotype Prediction in GWAS Using Machine Learning Methods
Abstract
Genome-wide association studies (GWAS) are used to find associations between genetic variants (single nucleotide polymorphisms, SNPs) and phenotypic traits or diseases in a population. The number of GWAS has increased exponentially over the past decade, as advances in sequencing technologies have made ample data available. This has led to the discovery of several SNPs associated with complex diseases, but these SNPs explain only a small fraction of the disease heritability, i.e., the proportion of observed phenotypic variation between individuals in a population that is attributable to genetic variation. In these cases, univariate studies based on single-locus analysis test each SNP independently for association with a particular disease or phenotype. Such approaches fail to capture the complete genetic risk of complex diseases, which originates from the synergistic effect of multiple genes. Moreover, disease risk prediction from genetic information lacks accuracy because the genetic architecture of complex diseases is only partly understood. Detecting interactions among SNPs (epistasis) is one way to recover part of this missing heritability and to evaluate disease risk more accurately.
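As a toy illustration of why single-locus tests can miss interacting SNPs, the following Python sketch builds a fully hypothetical, balanced cohort in which disease status is the XOR of two binary carrier indicators. The binary encoding, the XOR model, and all names (`subjects`, `case_rate`, and so on) are assumptions made for illustration and are not taken from the study.

```python
from itertools import product

# Hypothetical balanced cohort: every combination of two binary carrier
# indicators appears equally often (25 subjects per combination).
subjects = [(a, b) for a, b in product((0, 1), repeat=2) for _ in range(25)]

# Assumed purely epistatic model: a subject is a case only when exactly
# one of the two variants is carried (XOR).
status = [a ^ b for a, b in subjects]


def case_rate(rows):
    """Fraction of cases among the given (genotype, status) rows."""
    rows = list(rows)
    return sum(s for _, s in rows) / len(rows)


data = list(zip(subjects, status))

# Single-locus view: case rate by carrier status of each SNP alone.
marginal_a = {g: case_rate(r for r in data if r[0][0] == g) for g in (0, 1)}
marginal_b = {g: case_rate(r for r in data if r[0][1] == g) for g in (0, 1)}

# Two-locus view: case rate for each joint genotype.
joint = {ab: case_rate(r for r in data if r[0] == ab)
         for ab in product((0, 1), repeat=2)}

print("SNP A alone:", marginal_a)
print("SNP B alone:", marginal_b)
print("A and B jointly:", joint)
```

In this constructed example each SNP taken alone shows the same case rate in carriers and non-carriers, so a single-locus test has nothing to detect, whereas the joint genotype determines disease status completely.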
In this work, we propose a methodology, Learning Epistasis for Phenotype Prediction (LEPP), to discover the SNPs, and the epistatic interactions among them, that contribute to disease etiology. Using the SNPs and epistatic interactions together, we also predict the disease status of a subject. A GWAS typically genotypes hundreds of thousands of SNPs, and finding interactions in such high-dimensional data is challenging and raises issues such as the curse of dimensionality. To overcome this challenge, we first filter the data using multivariate feature selection methods such as ReliefF, Gradient Boosting, and Random Forests, so that interacting SNPs without marginal effects are selected along with SNPs that show marginal effects. We then use a combination of machine learning models, such as Gradient Boosting, Random Forests, and Support Vector Machines, to capture the non-linear relationships between features (SNPs) that are difficult for linear models to identify. We also address interpretability, using SHAP (SHapley Additive exPlanations) to understand the predictions and the reasons behind them.
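To make the multivariate filtering idea concrete, here is a minimal sketch of the basic Relief algorithm, a simpler relative of the ReliefF method named above (ReliefF additionally averages over k nearest neighbours and handles multi-class and incomplete data). It is not the LEPP implementation: the simulated data, the binary genotype encoding, the XOR interaction between SNPs 0 and 1, and the function name `relief_weights` are all assumptions made for illustration.

```python
import random


def relief_weights(X, y):
    """Basic Relief for binary labels: one nearest hit and miss per sample."""
    n, p = len(X), len(X[0])
    weights = [0.0] * p

    def distance(u, v):
        # Hamming distance between two genotype vectors.
        return sum(a != b for a, b in zip(u, v))

    for i in range(n):
        hit = miss = None
        hit_d = miss_d = p + 1
        for j in range(n):
            if j == i:
                continue
            d = distance(X[i], X[j])
            if y[j] == y[i] and d < hit_d:
                hit, hit_d = j, d        # nearest sample with the same label
            elif y[j] != y[i] and d < miss_d:
                miss, miss_d = j, d      # nearest sample with the other label
        for f in range(p):
            # Reward features that differ on the nearest miss,
            # penalise features that differ on the nearest hit.
            weights[f] += (X[i][f] != X[miss][f]) / n
            weights[f] -= (X[i][f] != X[hit][f]) / n
    return weights


# Simulated (hypothetical) data: 200 subjects, 8 binary SNP indicators.
rng = random.Random(0)
n_samples, n_snps = 200, 8
X = [[rng.randint(0, 1) for _ in range(n_snps)] for _ in range(n_samples)]

# Assumed ground truth: SNPs 0 and 1 interact (XOR); SNPs 2..7 are noise.
y = [row[0] ^ row[1] for row in X]

weights = relief_weights(X, y)
ranked = sorted(range(n_snps), key=lambda f: weights[f], reverse=True)
print("SNPs ranked by Relief weight:", ranked)
```

Because Relief scores each feature in the context of a sample's nearest neighbours across all features, the two interacting SNPs should rank first here even though neither has a marginal effect, which is the property that motivates a multivariate filter over a univariate one.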
Our methods are first evaluated on simulated data, where the ground truth is known a priori. The best-performing methods in the feature selection and prediction steps, ReliefF and XGBoost, respectively, are then chosen for the analysis of real data. Two real GWAS datasets, on breast cancer and schizophrenia, are used to demonstrate the efficacy of LEPP. Both are controlled-access datasets obtained through the dbGaP database. Redundancy of the feature set was checked using Principal Component Analysis (PCA), and each dataset was partitioned into training and testing subsets in a 70:30 ratio to ensure an unbiased evaluation. For the breast cancer dataset, we achieve a maximum accuracy of 78.34% with an Area Under the Curve (AUC) value of 0.85 using 1000 selected SNPs. This is better than the previously reported accuracy of 60.25% on the same dataset, which was obtained using the absolute mean difference of SNP values between cases and controls for feature selection, followed by the k-Nearest Neighbor (KNN) method for prediction. We attribute the improvement to two factors: ReliefF is a multivariate feature selection algorithm and outperforms the univariate algorithm used in the previous work, and Gradient Tree Boosting (as implemented in XGBoost) often outperforms KNN on classification problems. For the schizophrenia dataset, 1500 selected SNPs yielded a maximum accuracy of 76.82% and an AUC value of 0.84. This is again better than the AUC value of 0.60 previously reported on this dataset, which was obtained with a weighted Genetic Risk Score (wGRS) derived from Logistic Regression coefficients. That method cannot capture non-linear relationships between features, which results in poorer performance than XGBoost. These results suggest that LEPP can explain some portion of the missing heritability in these diseases that was previously not attainable.
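The evaluation protocol described above (a 70:30 split, then accuracy and AUC on the held-out part) can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names `split_70_30`, `accuracy`, and `roc_auc` are hypothetical, the 0.5 decision threshold is an assumption, and the labels and scores are toy values, not results from the study.

```python
import random


def split_70_30(indices, seed=0):
    """Shuffle sample indices and split them 70:30 into train and test."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random case outscores a random
    control, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: split ten sample indices, then score four held-out subjects.
train_idx, test_idx = split_70_30(range(10))
print("train:", sorted(train_idx), "test:", sorted(test_idx))

y_true = [0, 0, 1, 1]                        # hypothetical case/control labels
scores = [0.1, 0.4, 0.35, 0.8]               # hypothetical predicted risks
y_pred = [int(s >= 0.5) for s in scores]     # assumed 0.5 decision threshold

print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen decision threshold, while AUC summarises how well the scores rank cases above controls across all thresholds, which is one reason to report both.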
Our analysis identifies novel SNPs and interactions; these findings corroborate current knowledge and reveal new SNPs that may be of importance. We also perform a gene-level analysis by mapping these SNPs to their corresponding genes, and then explore how these genes and the pathways they are involved in may influence the diseases.