Author: | Sun, Hongliang |
Title: | Domain-specific language model continue pretraining for Chinese Weibo |
Advisors: | Li, Jing Amelia (COMP) |
Degree: | M.Sc. |
Year: | 2021 |
Subject: | Natural language processing (Computer science); Machine learning; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Computing |
Pages: | viii, 51 pages : color illustrations |
Language: | English |
Abstract: | With the development of deep learning, deep neural networks are increasingly used in real-world natural language applications. Large pretrained language models such as BERT are very effective for many natural language processing (NLP) tasks. Many recent studies have shown that in-domain pretraining on domain-specific data improves results on downstream tasks within the corresponding domain. This dissertation studies the effect of continued pretraining of BERT on Chinese social media text. We collected a large-scale Chinese social media text dataset from Sina Weibo and used this corpus to continue pretraining the original BERT model in-domain, obtaining a Weibo version of BERT. We also built our own annotated evaluation corpus and used it to evaluate three downstream tasks: Chinese word segmentation, POS tagging, and NER. The experimental results show that the model's results on these tasks improve after continued pretraining in the Chinese social media domain. |
Rights: | All rights reserved |
Access: | restricted access |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
5860.pdf | For All Users (off-campus access for PolyU Staff & Students only) | 1.69 MB | Adobe PDF |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/11374