Author: Qin, Huihui
Title: Statistical learning with empirical features and data of different types
Advisors: Huang, Jian (AMA)
Guo, Xin (AMA)
Degree: Ph.D.
Year: 2020
Subject: Mathematical statistics
Machine learning
Data mining
Hong Kong Polytechnic University -- Dissertations
Department: Department of Applied Mathematics
Pages: x, 67 pages : color illustrations
Language: English
Abstract: The thesis consists of three parts that cover different aspects of statistical learning for data mining. In the first part, we propose a new algorithm, LESS (Learning with Empirical feature-based Summary statistics from Semi-supervised data), which uses only summary statistics instead of raw data for regression learning. The extensive collection and analysis of data is raising widespread privacy concerns and is thereby increasing tensions between the potential sources of data and researchers. A privacy-friendly learning framework can help ease these tensions and free up more data for research. In LESS, the selection of empirical features serves as a trade-off between prediction precision and the protection of privacy. We show that LESS achieves the minimax optimal rate of convergence in terms of the size of the labeled sample. LESS extends naturally to applications where data are held separately by different sources. Compared with the existing literature on distributed learning, LESS removes the minimum sample size restriction on individual data sources. In the second part of the thesis, we study different approaches for analyzing topics in text data. Topic modeling has been an important field in natural language processing (NLP) and has recently witnessed great methodological advances. Yet the development of topic modeling is still, if not increasingly, challenged by two critical issues. First, despite intense efforts toward nonparametric/post-training methods, the search for the optimal number of topics K remains a fundamental question in topic modeling and warrants input from domain experts. Second, with the development of more sophisticated models, topic modeling is now, ironically, being treated as a black box, and it is becoming increasingly difficult to tell how research findings are informed by data, model specifications, or inference algorithms.
Based on about 120,000 articles retrieved from three major Canadian newspapers (Globe and Mail, Toronto Star, and National Post) since 1977, we employ five methods with different model specifications and inference algorithms (Latent Semantic Analysis, Latent Dirichlet Allocation, Principal Component Analysis, Factor Analysis, and Non-negative Matrix Factorization) to identify discussion topics. The optimal topics are then assessed using three measures: coherence statistics, held-out likelihood (loss), and graph-based dimensionality selection. Mixed findings from this research complement advances in topic modeling and provide insights into the choice of optimal topics in social science research. In the third part, we consider the generalized linear hurdle model with grouped and right-censored count data. This data type is widely used in demography, epidemiology, sociology, criminology, psychology, and many other branches of the social sciences. The corresponding generalized linear model and the zero-inflated model have recently drawn much attention. In this part, we study the hurdle model, which covers not only zero inflation but also zero deflation. We provide sufficient conditions for the consistency and asymptotic normality of the maximum likelihood estimator. We represent the Fisher information matrix of the hurdle model in terms of that of the vanilla grouped and right-censored model. We provide an elegant necessary and sufficient condition for the Fisher information matrix of the hurdle model to be strictly positive definite. The research complements recent developments in statistical inference with grouped and right-censored count data.
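Illustration (not part of the thesis record): the hurdle model with grouped and right-censored counts mentioned in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's method: it assumes a Poisson base distribution and constant parameters pi0 (the probability of a zero) and lam (the Poisson rate), whereas the thesis treats the generalized linear setting. The function names poisson_cdf, hurdle_interval_prob, and log_likelihood are hypothetical. Each observation is a group [lower, upper], and upper=None marks a right-censored group [lower, infinity).

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); returns 0.0 when k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for j in range(1, k + 1):
        term *= lam / j  # P(X = j) from P(X = j - 1)
        total += term
    return min(total, 1.0)


def hurdle_interval_prob(lower, upper, pi0, lam):
    """P(lower <= Y <= upper) under a Poisson hurdle model (assumed form).

    P(Y = 0) = pi0, and for k >= 1,
    P(Y = k) = (1 - pi0) * Poisson(k; lam) / (1 - exp(-lam)).
    upper=None marks a right-censored group [lower, infinity).
    """
    base_zero = math.exp(-lam)
    first_positive = max(lower, 1)
    upper_cdf = 1.0 if upper is None else poisson_cdf(upper, lam)
    # Mass that the untruncated Poisson puts on the positive part of the group.
    base_mass = max(upper_cdf - poisson_cdf(first_positive - 1, lam), 0.0)
    prob = (1.0 - pi0) * base_mass / (1.0 - base_zero)
    if lower == 0:
        prob += pi0  # the group contains the zero count
    return prob


def log_likelihood(groups, pi0, lam):
    """Log-likelihood of grouped counts given as (lower, upper) pairs."""
    return sum(
        math.log(hurdle_interval_prob(lo, hi, pi0, lam)) for lo, hi in groups
    )


# Hypothetical grouping scheme: {0}, [1, 2], [3, 5], and the censored [6, inf).
example_groups = [(0, 0), (1, 2), (3, 5), (6, None)]
example_loglik = log_likelihood(example_groups, pi0=0.3, lam=2.0)
```

In this sketch, pi0 greater than exp(-lam) corresponds to zero inflation and pi0 less than exp(-lam) to zero deflation; when pi0 equals exp(-lam), the group probabilities reduce to those of the plain grouped and right-censored Poisson model.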
Rights: All rights reserved
Access: open access

Files in This Item:
File      Description    Size       Format
5184.pdf  For All Users  691.39 kB  Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10732