Author: Zheng, Yangzi
Title: Model for zero-inflated proportion data analysis
Advisors: Zhao, Xingqiu (AMA)
Jiang, Binyan (AMA)
Degree: Ph.D.
Year: 2024
Subject: Mathematics -- Data processing
Regression analysis
Hong Kong Polytechnic University -- Dissertations
Department: Department of Applied Mathematics
Pages: xviii, 78 pages : illustrations
Language: English
Abstract: The examination and interpretation of datasets containing a substantial number of zeros have become increasingly relevant across various disciplines, including ecology and sociological studies. While there has been extensive research on zero-inflated count data, models specifically designed for proportion data with a high occurrence of zeros remain relatively limited. This thesis addresses this gap by focusing on zero-inflated proportion data and proposing a novel modeling approach to distinguish between two types of zeros present in the dataset. The primary objective is to develop a regression model that can effectively capture and differentiate these two types of zeros. The first type of zero, which corresponds to random absence, is modeled using a binomial sampling approach. This accounts for instances where the proportion value is zero due to random factors or chance. The second type of zero, arising from unsuitability, is handled using a general classification indicator. This indicator helps identify situations where the proportion value is zero due to the unsuitability of certain conditions or factors. To achieve our objective, we propose both parametric and semi-parametric models, providing flexibility and robustness in capturing the characteristics of the zero-inflated proportion data. By introducing these innovative models, we aim to enhance the understanding and analysis of datasets with a high occurrence of zeros. This research contributes to the development of methodologies specifically tailored for zero-inflated proportion data, addressing a significant gap in the existing literature.
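The two kinds of zeros described above can be illustrated with a small simulation. This is a minimal sketch and not the thesis's model specification: the function name, its arguments, and the choice to take suitability and success probabilities as given inputs are all assumptions made for illustration.

```python
import numpy as np

def simulate_zero_inflated_proportions(suitable, prob, n_trials, rng):
    """Draw proportions that are zero either by unsuitability or by chance.

    suitable : boolean array, hypothetical suitability indicator per unit
    prob     : success probability per unit when the unit is suitable
    n_trials : number of binomial trials per unit
    """
    suitable = np.asarray(suitable, dtype=bool)
    n_trials = np.asarray(n_trials)
    # Binomial sampling: a suitable unit can still yield a zero count by chance.
    counts = rng.binomial(n_trials, prob)
    # Unsuitable units are forced to zero regardless of the binomial draw.
    proportions = np.where(suitable, counts / n_trials, 0.0)
    structural_zero = ~suitable
    sampling_zero = suitable & (counts == 0)
    return proportions, structural_zero, sampling_zero


rng = np.random.default_rng(0)
y, structural, sampling = simulate_zero_inflated_proportions(
    suitable=[True, True, False, False],
    prob=[0.05, 0.6, 0.5, 0.9],
    n_trials=[20, 20, 20, 20],
    rng=rng,
)
```

Under binomial sampling, a suitable unit with success probability p and n trials produces a zero with probability (1 - p)^n, so observed zeros alone do not reveal which mechanism generated them.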
In the first section of our study, we focus on investigating a semi-parametric model. This model comprises two components: a regression component that incorporates weighted least squares to account for heterogeneity, and a classification component that benefits from an optimal decision rule derived from our model. To estimate the parameters based on the optimal decision rule, we employ the Nadaraya-Watson estimator. This estimator ensures the accuracy of our classification and contributes to the overall robustness of the model. The results of our investigation reveal that environmental features play a crucial role in understanding both types of zeros: those related to perfection and those resulting from random absence. By utilizing our proposed modeling approach, researchers can gain deeper insights into the factors that contribute to these different types of zeros, thereby improving their understanding of the underlying processes. Furthermore, our model demonstrates superior performance in both simulated and real-world scenarios when compared to traditional methods such as the Tobit model and the zero-inflated beta regression model. By significantly reducing prediction errors, our model is proven to be a valuable tool for accurate estimation and prediction in various applications. By presenting these findings, we highlight the effectiveness and practicality of our semi-parametric model, enabling researchers to make more informed decisions and gain a comprehensive understanding of the factors influencing both types of zeros and the positive percent rate.
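For readers unfamiliar with the Nadaraya-Watson estimator mentioned above, the following is a minimal one-dimensional sketch. It shows the generic estimator only; the Gaussian kernel, the fixed bandwidth, and the toy data are assumptions, and the abstract does not say how the thesis combines the estimator with its decision rule.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Kernel-weighted average of y_train at each query point (Gaussian kernel)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Scaled pairwise distances: rows index query points, columns training points.
    u = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * u ** 2)
    # Each estimate is the weighted mean of the responses.
    return (weights @ y_train) / weights.sum(axis=1)


# Toy example: estimate P(label = 1 | x) from binary labels, then threshold at 0.5.
x = np.array([0.0, 1.0, 2.0, 3.0])
labels = np.array([0, 0, 1, 1])
prob_hat = nadaraya_watson(x, labels, [0.0, 3.0], bandwidth=0.5)
predicted_class = (prob_hat > 0.5).astype(int)
```

With binary responses the estimate is a local class frequency, so it always lies between 0 and 1 and can be thresholded to classify a new point.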
In the second section, our main objective is to provide a precise interpretation of the factors that influence the defective rate. Particularly, we focus on the indicator part, which was left undefined in the first part but has garnered more attention due to its exploration of the covariates that distinguish the zero part from the non-zero part. In the original model assumption, the presence of the indicator part creates complexity in inferring the parameters. Taking inspiration from the smoothed maximum score estimator, we introduce a parametric model by replacing the indicator part with a smoothed kernel estimator. This substitution yields a continuously differentiable loss function, which greatly facilitates further analysis. Similar to the previous section, we take into account heterogeneity and utilize the weighted least squares method to estimate both sets of parameters. Subsequently, we establish consistency and asymptotic normality for both the regression and indicator estimators. These properties assure the reliability and validity of our estimators in capturing the underlying relationships and distinguishing between the zero and non-zero parts effectively.
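The smoothing idea described above can be sketched as follows: a hard indicator 1{z > 0} is replaced by a smooth, CDF-like function of z / h, which makes a squared-error loss differentiable in the parameters. This is an illustration only. The logistic smoother, the logistic mean function, the bandwidth h, and the names `gamma`, `beta`, and `smoothed_weighted_loss` are hypothetical and are not taken from the thesis.

```python
import numpy as np

def hard_indicator(z):
    """Discontinuous indicator 1{z > 0}."""
    return (np.asarray(z, dtype=float) > 0).astype(float)

def smoothed_indicator(z, h):
    """Logistic-CDF surrogate for 1{z > 0}; the bandwidth h > 0 controls sharpness."""
    z = np.asarray(z, dtype=float)
    # 0.5 * (1 + tanh(t / 2)) equals 1 / (1 + exp(-t)) and avoids overflow.
    return 0.5 * (1.0 + np.tanh(z / (2.0 * h)))

def smoothed_weighted_loss(gamma, beta, X, y, weights, h):
    """Hypothetical weighted squared-error loss with a smoothed suitability gate."""
    gate = smoothed_indicator(X @ gamma, h)               # smooth stand-in for 1{X gamma > 0}
    mean_if_suitable = smoothed_indicator(X @ beta, 1.0)  # logistic mean, an assumption
    return np.mean(weights * (y - gate * mean_if_suitable) ** 2)


# Toy data: an intercept column plus one covariate.
X = np.array([[1.0, 0.2], [1.0, -0.4], [1.0, 1.5]])
y = np.array([0.0, 0.0, 0.6])
loss = smoothed_weighted_loss(
    gamma=np.array([0.1, 1.0]), beta=np.array([0.0, 0.5]),
    X=X, y=y, weights=np.ones(3), h=0.2,
)
```

As h shrinks toward zero the smoothed gate approaches the hard indicator, so the bandwidth trades off closeness to the original model against smoothness of the objective.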
Rights: All rights reserved
Access: open access

Files in This Item:
File: 7640.pdf
Description: For All Users
Size: 980.85 kB
Format: Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13188