Supervised statistical inference for data of versatile dimensionality with application to GWAS studies

Xu, Sheng

Full metadata record

DC Field	Value	Language
dc.contributor	Department of Applied Mathematics	en_US
dc.contributor.advisor	Liu, Chunling Catherine (AMA)	en_US
dc.creator	Xu, Sheng	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/10651	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	en_US
dc.rights	All rights reserved	en_US
dc.title	Supervised statistical inference for data of versatile dimensionality with application to GWAS studies	en_US
dcterms.abstract	Genome-Wide Association Studies (GWAS) have, over the past two decades, been a successful strategy for gaining biological insight into diseases in epigenetics and epigenomics, by linking diseases or their traits with genomic variants, environmental confounders, and clinically relevant information. The accompanying data are often of versatile dimensionality and complex structure, posing exciting challenges and opportunities for new statistical methodology and inference, coupled with new modeling and effective computing implementation. The thesis consists of three parts and aims to address several important regression problems of estimation, hypothesis testing, and classification arising from the prevailing GWAS data pool, to meet the increasing need for statistical analytic toolsets. Part I focuses on regression with censored survival outcomes and is motivated by data on diffuse large B-cell lymphoma (DLBCL), which combine a large number of gene expression variants with censored patient survival times at a small sample size. This calls for efficient algorithms for feature screening and careful statistical inference for the subset of influential variables selected after dimensionality reduction. In Chapter 2, we present the non-monotone proximal gradient (NPG) algorithm to speed up sure joint screening for the ultrahigh-dimensional Cox proportional hazards model and prove its convergence with a LASSO initializer. The accompanying R package, coxnpgsjs, selects a designated number of influential gene variants from the DLBCL data quickly and efficiently. In Chapter 3, we investigate the impact of such a subset of genetic factors on survival time through the single-index hazard (SIH) semiparametric regression model. The SIH model is robust, but efficient statistical inference for it is challenging owing to the nested single-index structure. We propose a censored version of multiple local linear regression to obtain a uniformly consistent estimator of the nonparametric component and to attain the semiparametric efficiency bound for the profile likelihood estimator of the parametric component. Two classes of estimating equations are derived as practical alternatives to the score equation from the perspective of double robustness. The proposed methods and results are applied to estimate the gene effects and to test their significance for the aforementioned lymphoma. Part II focuses on regression with sparse longitudinal responses and is motivated by large-scale longitudinal GWAS for Alzheimer's Disease, which aim to detect single nucleotide polymorphism (SNP)-level genotype effects on the phenotype response. The wide community of researchers in this area urgently needs powerful test procedures that can detect significance at the GWAS P-value significance threshold. To compare multiple treatments, Chapters 4 and 5 present practical strategies for bootstrap procedures and apply them successfully to models with Gaussian and non-Gaussian phenotype responses and very large numbers of SNP-level genotypes. This unveils some interesting association discoveries of genetic effects on the disease at the GWAS significance level for the well-known Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Part III focuses on regression with binary outcomes and is motivated by the problem of precisely labeling multiple sclerosis in a population where the projection scores are skewed. In Chapter 6, we define a general distance that incorporates existing optimal functional classifiers and explain why our proposed quantile classifier is robust. The optimality property of near-perfect classification is derived. The accompanying classification procedure is fast and accurate. A Shiny app is built for the convenience of clinical practitioners.	en_US
dcterms.extent	xiii, 204 pages : color illustrations	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2020	en_US
dcterms.educationalLevel	Ph.D.	en_US
dcterms.educationalLevel	All Doctorate	en_US
dcterms.LCSH	Genomics -- Statistical methods	en_US
dcterms.LCSH	Genomics -- Data processing	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	en_US
dcterms.accessRights	open access	en_US

Files in This Item:

File	Description	Size	Format
5075.pdf	For All Users	1.74 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10651