Comparative analysis of feature selection and classification methods for epigenetic methylation data

Epigenetics, the study of heritable changes in organisms that are not caused by mutations to DNA, holds tremendous promise for future medical applications. Although the field is still in its infancy, statistical feature selection plays an important role in correlating epigenetic changes with diseases and other health conditions. Feature selection may also be used to develop models that classify disease and healthy states. In epigenetic studies, the healthy and disease classes are often not equally represented, a situation known as imbalanced data. This study investigates whether a model's predictive accuracy can be increased by using a 'rebalancing' algorithm, the synthetic minority oversampling technique (SMOTE), in conjunction with various feature screening, feature selection, and classification methods. The study aims to ascertain whether combining feature selection with a rebalancing algorithm can help researchers address the problem of imbalanced data and thereby develop more accurate epigenetic classification models. Four feature screening methods (Beta regression, D3M, the Wilcoxon Rank Sum Test, and the Kolmogorov-Smirnov test) are used to detect differentially methylated CpG regions and to select the top 1000 CpG regions. The classification methods, namely Random Forest, Support Vector Machine, Extreme Gradient Boosting (XGBoost), Artificial Neural Network, Naive Bayes, and logistic regression models using the minimax concave penalty (MCP) and the least absolute shrinkage and selection operator (LASSO), are applied to both the full epigenetic data sets and the top 1000 CpG regions. Additionally, the SMOTE algorithm is employed to rebalance each of the data sets. Five-fold cross-validation is used to evaluate the feature selection and classification methods by comparing accuracy, balanced accuracy, sensitivity, specificity, and precision.
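To make the evaluation metrics concrete, the following is a minimal Python sketch of how the five quantities can be computed from true and predicted binary labels. It is an illustration only, not the code used in the study: the function name `classification_metrics`, the convention that label 1 marks the disease class, and the choice to report 0.0 when a ratio is undefined are all assumptions made here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute five binary classification metrics.

    Assumption: `positive` marks the disease class; undefined ratios
    (zero denominator) are reported as 0.0 for simplicity.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # positive predictive value
    accuracy = (tp + tn) / len(pairs)
    balanced_accuracy = (sensitivity + specificity) / 2

    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
    }


# Hypothetical imbalanced sample: 2 disease cases, 8 healthy controls,
# and a classifier that predicts "healthy" for everyone.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
metrics = classification_metrics(y_true, y_pred)
# metrics["accuracy"] is 0.8, while metrics["balanced_accuracy"] is 0.5
```

The toy example shows why balanced accuracy is reported alongside accuracy for imbalanced data: a classifier that never predicts the disease class still reaches 80% accuracy, but its balanced accuracy is no better than chance.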
Three studies based on HumanMethyl450K data from the National Center for Biotechnology Information's Gene Expression Omnibus (GEO) are analyzed for this project. Beta regression outperformed all other methods used to detect differentially methylated CpG locations, while the Wilcoxon Rank Sum Test, Student's t-test, and Kolmogorov-Smirnov test performed poorly in comparison. Furthermore, the results indicate that many of the parsimonious models developed using the various classification algorithms can detect the diseases studied with a high degree of accuracy. The use of SMOTE marginally increased classification metrics, while the MCP and LASSO algorithms consistently outperformed many commonly used classification algorithms.
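The rebalancing idea behind SMOTE can be sketched in a few lines of Python: a synthetic minority sample is created by interpolating between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration under stated assumptions, not the implementation used in the study. The function name `smote`, its parameters, and the toy methylation values are hypothetical, and base samples are drawn at random here rather than generating a fixed number of synthetic points per minority sample.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate `n_new` synthetic samples from a list of minority-class rows.

    Simplified sketch: each synthetic row lies on the line segment between
    a randomly chosen minority row and one of its k nearest minority
    neighbours (squared Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base row.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical methylation beta values (two CpG sites) for four disease samples.
disease = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.70], [0.30, 0.85]]
extra = smote(disease, n_new=6, k=2, seed=1)
balanced_minority = disease + extra  # 10 rows after oversampling
```

Because each synthetic value is an interpolation between two observed values, it stays within the observed range of each feature, which suits methylation beta values bounded between 0 and 1. To avoid optimistic estimates, oversampling of this kind is typically applied to the training folds only during cross-validation.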
