Risk Prediction with Longitudinal Gene Expression Data Using Statistical and Machine Learning Method

Advances in high-throughput sequencing technology have made it much easier to extract gene expression data and to discover gene-disease associations efficiently. Compared with cross-sectional data, longitudinal gene expression data offer more insight into the expression patterns of distinct patient groups. For instance, patients diagnosed with subclinical acute rejection (subAR) following a kidney transplant may exhibit signs of rejection at the gene level up to a year before diagnosis, with these signals potentially concealed within gene expression levels and patterns. This dissertation develops statistical and machine learning methodologies to improve disease diagnosis and prediction performance using longitudinal high-dimensional gene expression data.

In Chapter 2, we focus on developing diagnosis models. We introduce a two-stage random-effect estimator that quantifies longitudinal trajectories in order to identify patients with potential subAR. In the first stage, we model the complex dynamics among patients, genes, and time while accounting for within- and between-patient variation. To stabilize the estimation, we adopt an Empirical Bayes framework in which the parameters of the prior distribution are estimated using linear mixed-effect models. In the second stage, a binary classification model links the estimated gene expression trajectories to the disease outcomes. We further consider two model variations that cater to different data-generating processes. We conduct simulation studies to compare our models with the benchmark model. Moreover, through real data analysis, we demonstrate that the appropriate new model achieves a higher AUC (area under the ROC curve) while maintaining a satisfactory level of sensitivity and specificity.

In Chapter 3, we present an attentive Recurrent Neural Network (RNN) that directly models the intricate dynamics among patients, time, and genes. Additionally, we propose a novel data augmentation scheme that can enhance the predictive capacity for gene samples within the prediction time frame. The attentive RNN learns the relationship between visits and makes predictions after processing the entire sequence of data. A real data analysis demonstrates the model's potential for predicting disease outcomes from longitudinal high-dimensional genomics data. Although the proposed model outperforms the benchmark, there remains room for improvement, given the data-driven nature of machine learning algorithms.
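
The two-stage idea described for Chapter 2 can be illustrated with a minimal sketch. This is not the dissertation's estimator: it assumes a single slope per patient and gene, replaces the linear mixed-effect prior with a simple method-of-moments shrinkage, and uses plain gradient-descent logistic regression as the second-stage classifier. All function names (`simulate_cohort`, `shrunken_slopes`, `fit_logistic`, `predict_proba`) and all simulation settings are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_cohort(seed=0, n_per_group=20, n_visits=5, n_genes=3):
    """Toy data (assumed, not from the dissertation): cases have steeper slopes."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_group
    y = np.repeat([0.0, 1.0], n_per_group)                  # 0 = control, 1 = case
    times = np.tile(np.arange(n_visits, dtype=float), (n, 1))
    true_slopes = y[:, None] + rng.normal(0.0, 0.2, size=(n, n_genes))
    noise = rng.normal(0.0, 0.5, size=(n, n_visits, n_genes))
    expr = true_slopes[:, None, :] * times[:, :, None] + noise
    return times, expr, y

def shrunken_slopes(times, expr):
    """Stage 1 (simplified): per-patient OLS slope for one gene,
    shrunk toward the cohort mean slope in Empirical Bayes style.
    times, expr: arrays of shape (n_patients, n_visits)."""
    t_c = times - times.mean(axis=1, keepdims=True)         # centre visit times
    sxx = (t_c ** 2).sum(axis=1)
    slopes = (t_c * expr).sum(axis=1) / sxx                 # raw OLS slopes
    resid = expr - expr.mean(axis=1, keepdims=True) - slopes[:, None] * t_c
    dof = times.shape[0] * (times.shape[1] - 2)
    sigma2 = (resid ** 2).sum() / dof                       # pooled within-patient variance
    se2 = sigma2 / sxx                                      # sampling variance of each slope
    mu = slopes.mean()                                      # prior mean
    tau2 = max(slopes.var(ddof=1) - se2.mean(), 0.0)        # between-patient variance
    weight = tau2 / (tau2 + se2 + 1e-12)                    # shrinkage factor in [0, 1)
    return mu + weight * (slopes - mu)

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Stage 2: logistic regression fitted by batch gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

times, expr, y = simulate_cohort()
# One shrunken trajectory summary per patient and gene becomes a feature.
X = np.column_stack([shrunken_slopes(times, expr[:, :, g])
                     for g in range(expr.shape[2])])
w = fit_logistic(X, y)
risk = predict_proba(w, X)                                  # estimated case probability
```

In this sketch the shrinkage step pulls noisy per-patient slopes toward the cohort mean before classification, which is the stabilizing role the abstract attributes to the Empirical Bayes prior. A fuller treatment would estimate the prior with a proper linear mixed-effect model and evaluate on held-out patients.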
