Document: Tree-based statistical learning for modeling genetic risk scores and identifying gene–gene and gene–environment interactions

Title:

Tree-based statistical learning for modeling genetic risk scores and identifying gene–gene and gene–environment interactions

Bookmark URL:

https://docserv.uni-duesseldorf.de/servlets/DocumentServlet?id=66394

URN (NBN):

urn:nbn:de:hbz:061-20240729-101111-1

Collection:

Dissertations

Language:

English

Document type:

Academic theses » Dissertation

Media type:

Text

Author:

Lau, Michael [Author]

Files:

Adobe PDF
7.17 MB in one file
Files dated 11.07.2024 / modified 11.07.2024

Contributors:

Prof. Dr. Schwender, Holger [Reviewer]
PD Dr. Schikowski, Tamara [Reviewer]
Prof. Dr. Jung, Klaus [Reviewer]

Keywords:

interpretable machine learning, variable importance, ensemble prediction, decision trees, random forests, logic regression, polygenic risk scores, epistasis

Dewey Decimal Classification:

500 Natural sciences and mathematics » 510 Mathematics

Description:

Genetic risk scores (GRS) summarize (parts of) the genetic makeup of individuals with regard to a specific phenotype such as a disease status. GRS can be used for personal risk assessment or for deriving biological mechanisms involved in the development of the considered phenotype. It is well known that genetic variants need not influence the considered outcome independently; they may also interact with each other. GRS are commonly constructed using linear approaches such as the elastic net or by aggregating individual effect estimates of genetic variants. Linear approaches, however, do not incorporate such gene–gene interaction effects unless prior knowledge about which predictors might interact is available, which is typically not the case in genetic epidemiology.

Therefore, this thesis investigates tree-based statistical learning methods, which are able to autonomously detect and incorporate interaction effects, for their ability to construct GRS. More precisely, variants of random forests and logic regression are evaluated against the elastic net. Simulation studies as well as a real data application show that these tree-based methods are able to outperform the elastic net in terms of the predictive ability of the resulting GRS.

Genetic risk factors can also interact with environmental risk factors in the development of complex phenotypes. Standard statistical tests for the presence of such a gene–environment (GxE) interaction effect either do not properly model the genetic risk factors or suffer from reduced statistical power, because they avoid overfitting by splitting the available data into training data sets for constructing a GRS and test data sets for testing the GxE interaction effect. Therefore, a novel GxE interaction test is designed that utilizes bagging (bootstrap aggregating) and out-of-bag (OOB) predictions to both construct a GRS model and subsequently test the GxE interaction using the complete data set. Moreover, it is proposed to employ random forests as the GRS construction procedure, as random forests yielded high predictive performance in the first part of this dissertation by flexibly modeling arbitrary effects. Empirical evaluations show that the proposed GxE interaction test achieves high statistical power while controlling the type I error rate.
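
The out-of-bag idea described above can be illustrated with a minimal sketch. It is not the procedure from the dissertation: a one-split regression stump stands in for random forests, the data are simulated, and all names (`fit_stump`, `oob_predictions`, `grs`) are hypothetical. The point it shows is that each individual's score is averaged only over models whose bootstrap sample did not contain that individual, so every observation receives a score that was not fitted on its own outcome and the complete data set remains available for a subsequent interaction test.

```python
import random

def fit_stump(X, y, rows):
    """Fit a one-split regression stump on the given (bootstrap) rows.

    Assumption for illustration: genotypes are coded 0/1/2 and each
    candidate split separates carriers (> 0) from non-carriers (== 0).
    """
    overall = sum(y[i] for i in rows) / len(rows)
    best = None
    for j in range(len(X[0])):
        left = [y[i] for i in rows if X[i][j] == 0]
        right = [y[i] for i in rows if X[i][j] > 0]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, j, ml, mr)
    if best is None:
        return lambda x: overall
    _, j, ml, mr = best
    return lambda x: ml if x[j] == 0 else mr

def oob_predictions(X, y, n_bags=100, seed=0):
    """Average, for each observation, the predictions of all bagged
    models whose bootstrap sample did not include that observation."""
    rng = random.Random(seed)
    n = len(y)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_bags):
        boot = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_bag = set(boot)
        model = fit_stump(X, y, boot)
        for i in range(n):
            if i not in in_bag:  # observation i is out-of-bag for this model
                sums[i] += model(X[i])
                counts[i] += 1
    # None marks an observation that was never out-of-bag
    return [s / c if c else None for s, c in zip(sums, counts)]

# Simulated toy data: five variants, outcome driven by an interaction
# between the first two variants plus noise.
rng = random.Random(1)
n = 60
X = [[rng.choice([0, 1, 2]) for _ in range(5)] for _ in range(n)]
y = [float(x[0] > 0 and x[1] > 0) + rng.gauss(0, 0.1) for x in X]
grs = oob_predictions(X, y, n_bags=100, seed=0)
```

The resulting `grs` values would then enter the interaction test together with the environmental factor. The abstract does not specify the form of that test, so it is left out here.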

A notable shortcoming of the tree ensemble methods random forests and logic regression with bagging, which yield GRS with comparably strong associations with the outcome, is their lack of interpretability: in contrast to elastic net models, it is no longer easy to understand how predictions are composed and which predictors influence the outcome, in which interplay and with which magnitude. Hence, a novel statistical learning method is developed that constructs a single decision tree that can split on single predictors or on Boolean conjunctions/interactions of multiple predictors. This procedure therefore captures gene–gene interactions at the split level and, moreover, incorporates GxE interactions by fitting regression models in the decision tree leaves. This statistical learning method is accompanied by a framework for measuring the importance of predictors and of interactions between predictors. In simulation studies and real data applications, it is shown that this new method yields strongly predictive and interpretable models.
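
To make the notion of a split on a Boolean conjunction concrete, the following sketch searches exhaustively over conjunctions of up to two (possibly negated) binary predictors and picks the one with the lowest weighted Gini impurity. This is an illustrative assumption, not the search strategy or split criterion of the method developed in the dissertation. The names `gini` and `best_conjunction_split` are hypothetical, and the regression models in the leaves are omitted.

```python
from itertools import combinations, product

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_conjunction_split(X, y, max_terms=2):
    """Find the conjunction of up to `max_terms` literals that minimizes
    the weighted Gini impurity of the two child nodes.

    A literal is (column, sign): sign=True means "predictor == 1",
    sign=False means "predictor == 0" (a negated predictor).
    Returns (literals, impurity); literals is None if no split helps.
    """
    n, p = len(y), len(X[0])
    best = (None, gini(y))
    for k in range(1, max_terms + 1):
        for cols in combinations(range(p), k):
            for signs in product([True, False], repeat=k):
                literals = tuple(zip(cols, signs))
                mask = [all((row[j] == 1) == s for j, s in literals) for row in X]
                left = [yi for yi, m in zip(y, mask) if m]
                right = [yi for yi, m in zip(y, mask) if not m]
                if not left or not right:
                    continue
                score = (len(left) * gini(left) + len(right) * gini(right)) / n
                if score < best[1] - 1e-12:
                    best = (literals, score)
    return best

# Toy data: all combinations of three binary predictors; the outcome is
# 1 exactly when predictor 0 is present AND predictor 2 is absent.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [int(row[0] == 1 and row[2] == 0) for row in X]
literals, impurity = best_conjunction_split(X, y)
```

On this toy data no single predictor separates the outcome perfectly, while the conjunction "predictor 0 AND NOT predictor 2" does, which is the kind of interaction a split on single predictors alone would need two tree levels to express.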

License:

This work is licensed under a Creative Commons Attribution 4.0 International License

Department / Institution:

Mathematisch-Naturwissenschaftliche Fakultät » WE Mathematik » Mathematische Optimierung

Document created on:

29.07.2024

Files modified on:

29.07.2024

Doctoral application submitted on:

13.02.2024

Date of doctorate:

09.07.2024

Heinrich-Heine-Universität Düsseldorf