Author: Sun, Hongliang
Title: Domain-specific language model continue pretraining for Chinese Weibo
Advisors: Li, Jing Amelia (COMP)
Degree: M.Sc.
Year: 2021
Subject: Natural language processing (Computer science); Machine learning; Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: viii, 51 pages : color illustrations
Language: English
Abstract: As deep learning has matured, deep neural networks have been increasingly used in real-world natural language applications. Large pretrained language models such as BERT are very effective for many natural language processing (NLP) tasks. Many recent studies have shown that in-domain pretraining on domain-specific data improves performance on downstream tasks within the corresponding domain. This dissertation studies the effect of continued pretraining of BERT on Chinese social media text. We collected a large-scale Chinese social media text dataset from Sina Weibo and used it to continue pretraining the original BERT model, obtaining a Weibo version of BERT. We also built our own annotated evaluation corpus and used it to evaluate three downstream tasks: Chinese word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER). The experimental results show that model performance on these tasks improves after continued pretraining in the Chinese social media domain.
Rights: All rights reserved
Access: restricted access

Files in This Item:
File: 5860.pdf
Description: For All Users (off-campus access for PolyU Staff & Students only)
Size: 1.69 MB
Format: Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/11374