Full metadata record
DC Field | Value | Language
dc.contributor | Department of Applied Mathematics | en_US
dc.contributor.advisor | Liu, Chunling Catherine (AMA) | en_US
dc.creator | Xu, Sheng | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/10651 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | en_US
dc.rights | All rights reserved | en_US
dc.title | Supervised statistical inference for data of versatile dimensionality with application to GWAS studies | en_US
dcterms.abstract | Genome-Wide Association Studies (GWAS) have been a successful strategy over the past two decades for providing biological insights into diseases in epigenetics and epigenomics, by linking diseases or their traits with genomic variants, environmental confounders, and clinically relevant information. The accompanying data are usually of versatile dimensionality and complex structure, posing exciting challenges and opportunities for new statistical methodology and inference, coupled with new modeling and effective computing implementation. The thesis consists of three parts and aims to address several important regression problems of estimation, hypothesis testing, and classification arising from the prevailing GWAS data pool, to meet the increasing need for statistical analytic toolsets. Part I focuses on regression with censored survival outcomes and is motivated by data on diffuse large B-cell lymphoma (DLBCL), which integrate a large number of gene expression variants with censored survival times of patients at a low sample size. This calls for efficient algorithms for feature screening and delicate statistical inference for the selected subset of influential variables after dimensionality reduction. In Chapter 2, we present the non-monotone proximal gradient (NPG) algorithm to speed up sure joint screening for the ultrahigh-dimensional Cox proportional hazards model and prove its convergence with a LASSO initiator. The accompanying R package, coxnpgsjs, quickly and efficiently selects a designated number of influential gene variants from the DLBCL data. In Chapter 3, we investigate the impact of such a subset of genetic factors on survival time through the single-index hazard (SIH) semiparametric regression model. The SIH model is robust but makes efficient statistical inference challenging owing to its nested single-index structure. We propose a censored version of multiple local linear regression to attain a uniformly consistent estimator of the nonparametric component and the semiparametric efficiency bound for the profile likelihood estimator of the parametric component. Two classes of estimating equations are derived as practical alternatives to the score equation from the perspective of double robustness. The proposed methods and results are applied to estimate the gene effects on the aforementioned lymphoma and to detect their significance. Part II focuses on regression with sparse longitudinal responses and is motivated by large-scale longitudinal GWAS for Alzheimer's Disease, which aim to detect Single Nucleotide Polymorphism (SNP) level genotype effects on the phenotype response. The wide community of researchers in this area urgently needs powerful test procedures that detect significance at the GWAS P-value significance threshold. To compare multiple treatments, Chapters 4 and 5 present practical strategies for bootstrap procedures and apply them successfully to models with Gaussian and non-Gaussian phenotype responses and gigantic SNP-level genotypes. This unveils some interesting association discoveries of genetic effects on the disease at the GWAS significance level for the well-known Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Part III focuses on regression with binary outcomes and is motivated by the need to label Multiple Sclerosis precisely in a population where the projection scores are skewed. In Chapter 6, we define a general distance that incorporates existing optimal functional classifiers and explain why our proposed quantile classifier is robust. Its optimality property of near-perfect classification is derived. The accompanying classification procedure is fast and accurate. A Shiny app is built for the convenience of clinical practitioners. | en_US
dcterms.extent | xiii, 204 pages : color illustrations | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2020 | en_US
dcterms.educationalLevel | Ph.D. | en_US
dcterms.educationalLevel | All Doctorate | en_US
dcterms.LCSH | Genomics -- Statistical methods | en_US
dcterms.LCSH | Genomics -- Data processing | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US
dcterms.accessRights | open access | en_US

Files in This Item:
File | Description | Size | Format
5075.pdf | For All Users | 1.74 MB | Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10651