Author: Liu, Shuaiqi
Title: Neural abstractive summarization for long documents
Advisors: Cao, Jiannong (COMP)
Degree: Ph.D.
Year: 2024
Subject: Computational intelligence
Automatic abstracting
Text processing (Computer science)
Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: xxi, 172 pages : color illustrations
Language: English
Abstract: Long documents, such as academic literature, financial reports, and legal instruments, are important information sources. Nowadays, people can access a massive number of long documents through the Internet. Reading through all of the documents they acquire to find the desired content is a heavy burden. High-quality summaries can help people quickly grasp the key information in the original documents. Automatic text summarization techniques can be employed to produce concise summaries of long documents. Abstractive summarization methods approximate how humans write summaries by capturing the salient content of input documents and generating novel sentences as summaries.
In this thesis, I study neural abstractive summarization for long documents. I aim to train neural network models to generate informative, fluent, and non-redundant summaries covering the multi-granularity, multi-document, and multimodal salient content in various long documents. Several challenges arise in pursuing this objective: 1) the scarcity of available datasets, 2) identifying the multi-granularity salient information scattered in long inputs, 3) incorporating multi-document and multimodal content when generating summaries, 4) evaluating the quality of the generated summaries, and 5) improving the efficiency of model training and inference. To tackle these challenges, I built multiple large-scale datasets and developed novel summarization methods and evaluation metrics, which are summarized below.
First, I built multiple large-scale long document summarization datasets for academic literature, financial reports, and legal instruments, which can serve as a foundation for long document summarization research. These datasets also support extending long document summarization research from unimodal to multimodal inputs, and from summarizing a limited number of documents to summarizing a large number of documents.
Second, I propose a series of techniques to identify the multi-granularity salient information scattered in long documents. This thesis introduces novel attention mechanisms, a category-based content alignment method, and a multistage content selection scheme for identifying and encoding phrase-level, sentence-level, and segment-level salient content.
In addition, my research validates the importance of jointly considering multimodal or multi-document content when summarizing long documents. This thesis proposes multiple methods that incorporate salient content from text and tables into summary generation. It also proposes methods that summarize multiple categories of salient content from a large number of documents and generate structured summaries.
To evaluate various summarization methods, my research not only employs commonly used automatic evaluation metrics but also proposes novel evaluation metrics. I also compare the summaries generated by different models through human evaluation.
Last but not least, my research leverages various techniques to improve the efficiency of model training and inference. This thesis not only proposes efficient summarization models but also adopts memory-efficient training methods. These techniques enable training large neural summarization models over long inputs on an off-the-shelf GPU.
I hope this thesis can promote research on long document summarization. Although this thesis presents novel datasets, methods, and evaluation metrics for this topic, many open problems remain. I list some future research directions at the end of this thesis.
Rights: All rights reserved
Access: open access

Files in This Item:
File        Description      Size       Format
7260.pdf    For All Users    2.35 MB    Adobe PDF



Please use this identifier to cite or link to this item: