Crowdsourcing method in empirical linguistic research : Chinese studies using mechanical turk-based experimentation

Pao Yue-kong Library Electronic Theses Database

Author: Wang, Shichang
Title: Crowdsourcing method in empirical linguistic research : Chinese studies using mechanical turk-based experimentation
Degree: Ph.D.
Year: 2016
Subject: Computational linguistics -- Research.
Human computation.
Data mining.
Hong Kong Polytechnic University -- Dissertations
Department: Dept. of Chinese and Bilingual Studies
Pages: xv, 294 pages : illustrations
Language: English
Abstract: Empirical linguistic research is driven by linguistic data. However, linguistic data collection, whether corpus annotation, transcription of scripture and audio material, surveys, or psycholinguistic experiments, has proved to be very time- and resource-intensive. As a result, linguistic researchers frequently have to compromise on linguistic data: instead of large-scale data, they have to use small-scale data; when recruiting subjects for surveys or psycholinguistic experiments, instead of random sampling, they have to use convenience sampling (recruiting subjects on the basis of proximity, ease of access, and willingness to participate). They typically use only college students as the subject pool, which is rather homogeneous; and even with convenience sampling, they usually cannot obtain very large samples. Since linguistic data is the foundation of empirical linguistic research, compromises on linguistic data may undermine the whole research project. In short, linguistic data has become the bottleneck of empirical linguistic research. To solve this problem, we need a more efficient and economical data collection method. In recent years, crowdsourcing, which means outsourcing tasks to crowds in the form of an open call via the Internet, has become a promising new method of linguistic data collection that can break this bottleneck.
This dissertation reports our work on applying the crowdsourcing method, especially Mechanical Turk-based linguistic experimentation (Mechanical Turk is a major form of crowdsourcing), in empirical linguistic research. We have three related general goals, which concern methodology, language resources, and linguistic theory respectively: (1) to explore Mechanical Turk-based linguistic experimentation, (2) to build useful linguistic datasets using Mechanical Turk-based experiments, and (3) to investigate some theoretical linguistic issues using the data collected. This dissertation consists of three studies. Study one is a pilot study on Mechanical Turk-based linguistic experimentation, which lays the methodological foundation for our research. We reviewed the literature on Mechanical Turk-based experimentation, analyzed platform usability, conducted a pilot experiment, proposed a general framework for Mechanical Turk-based experiments, and discussed data quality control methods. Study two first created a very large semantic transparency dataset of Chinese nominal compounds using Mechanical Turk-based experiments. This dataset contains the overall and constituent semantic transparency ratings of about 1,200 disyllabic Chinese nominal compounds. We also conducted a semantic transparency rating experiment using the traditional laboratory-based method, which enabled us to further evaluate Mechanical Turk-based experimentation by comparing the data collected in the Mechanical Turk-based experiment with the data collected in the laboratory-based experiment. Based on the semantic transparency dataset we created, we explored the uncertainty of semantic transparency judgments among raters and the effect of a compound's semantic head on its semantic transparency rating. Study three first created a large manual Chinese word segmentation dataset using Mechanical Turk-based experiments. This dataset contains 152 long Chinese sentences selected mainly from the Sinica corpus; each sentence was segmented manually by more than 120 online subjects. This dataset was then used to investigate the effect of semantic transparency on word intuition and to measure the word intuition of Chinese speakers.

Files in this item

File: b29041417.pdf
Size: 6.776Mb
Format: PDF

