Double-byte text compression

Pao Yue-kong Library Electronic Theses Database

Author: Cheng, Kwok-shing
Title: Double-byte text compression
Degree: Ph.D.
Year: 2001
Subject: Data compression (Computer science)
Hong Kong Polytechnic University -- Dissertations
Department: Dept. of Computing
Pages: xvii, 244 leaves : ill. ; 30 cm
Language: English
Abstract: Owing to the rapid growth of double-byte text, double-byte information processing has become an active research topic. Text such as newspapers, magazines, dictionaries and novels encoded in double-byte languages (Chinese, Japanese and Korean) is now used extensively, so there is great demand for high-performance double-byte text compression algorithms. However, the commonly used data compression algorithms were originally designed for single-byte text, and their compression performance is poor when they are applied to double-byte text. This thesis describes the inefficiency of conventional compression algorithms on double-byte text and proposes efficient algorithms that improve compression performance for double-byte languages in general rather than targeting a specific language. A survey of commonly used text compression algorithms is presented first. Then the characteristics of double-byte text and important terminology related to text compression are introduced. Because the properties of double-byte text differ from those of ASCII text, conventional compression algorithms achieve unsatisfactory compression performance on double-byte text; the weaknesses of these algorithms are explained in detail. To evaluate the conventional and proposed compression algorithms accurately, we have built representative corpora, with a total size of over 40 Mbytes, for Chinese, Japanese and Korean, which are well-known and commonly used double-byte languages in Asia. The corpora cover a wide range of areas and topics and will be released for public use. The thesis proceeds systematically from the character-based compression model, through the word-based compression model, to the cascading compression model.
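The mismatch between byte-oriented compressors and double-byte text can be illustrated with a small Python sketch. It is not taken from the thesis: the helper `read_items` and its lead-byte rule (any byte of 0x80 or above starts a two-byte character) are assumptions made for illustration. The rule holds for well-formed Big5 and GB2312 text but not for every double-byte encoding.

```python
from typing import List


def read_items(data: bytes) -> List[int]:
    """Split a byte string into items (hypothetical helper).

    ASCII bytes become single items; a byte with the high bit set is
    assumed to be a lead byte and is paired with the byte that follows.
    """
    items: List[int] = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >= 0x80 and i + 1 < len(data):
            # Assumed rule: high-bit byte starts a two-byte character.
            items.append((b << 8) | data[i + 1])
            i += 2
        else:
            items.append(b)
            i += 1
    return items


text = "中文 text 中文"
data = text.encode("big5")   # Big5 codec from the Python standard library
items = read_items(data)

# A byte-oriented model sees len(data) symbols; an item-oriented model
# sees one symbol per character.
print(len(text), len(data), len(items))
```

In this example each Chinese character occupies two bytes, so a byte-oriented model spends two symbols on what is logically one character, while the item view keeps one symbol per character.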
The main concern is to propose algorithms that improve compression ratios; execution time and memory consumption are also considered. All compressors are run on a Sun UltraSPARC 5 workstation with a 400-MHz UltraSPARC III processor and 64 Mbytes of memory, running Solaris. In the character-based model, a new algorithm called the indicator dependent Huffman coding scheme (IDC) is proposed. IDC uses multiple Huffman trees to encode the input text. It obtains the best average compression ratio, 1.91, among the variations of the Huffman coding scheme, and outperforms the UNIX Huffman-based compressor PACK by 43%. Arithmetic coding belongs to the same category as Huffman coding: both are statistical compression algorithms. The prediction by partial matching algorithm (PPM) is based on high-order arithmetic coding. A well-known PPM compressor is COMP-2, which achieves very high compression ratios on ASCII files. Applied to our corpora, COMP-2 obtains a compression ratio of 2.57. An extended PPM algorithm for double-byte characters, called DBPPMC+, is proposed; it obtains a compression ratio of 3.04. However, both COMP-2 and DBPPMC+ suffer from extremely slow compression and decompression rates (below 30 Kbytes per second). Their slow rates make them hard to put into practical use, but their good compression ratios serve as a benchmark for evaluating the effectiveness of other compression algorithms. In the word-based compression model, the well-known compression algorithms LZSS and LZW are chosen for experiments. A double-byte character identification routine, called ReadItem, is proposed and applied to both LZSS and LZW to form the new algorithms 16LZSS and 16LZW. The modified algorithms, 16LZSS and 16LZW, achieve average compression ratios of 2.26 and 2.33 respectively.
The results are clearly better than those of the UNIX LZW-based compressor COMPRESS, which has a compression ratio of only 1.93. 16LZSS and 16LZW are also comparable with the well-known UNIX cascading compressor GZIP, which has a compression ratio of 2.35. In the cascading compression model, we propose combining the advantages of the two models above to form a better compression model. The character-based model has the advantage of assigning variable-length codes to characters according to their frequencies, but each code represents only one character. The word-based model can represent several characters (a word) with a single code, but each code has an inefficient fixed length. The proposed cascading compression model therefore aims at a compressor in which a single code represents several characters and the length of each code varies with the frequency of the word. Among the variations of cascading algorithms, the proposed algorithm 16LZSSPDC achieves an average compression ratio of 2.58, which outperforms GZIP by 9.8%. If the input text is first segmented into words by matching against a dictionary and 16LZSSPDC is then applied to the segmented words, further improvement is expected. This is realized by modifying 16LZSSPDC to form a new algorithm called WLZSSPDC, which obtains a compression ratio of 2.69 and outperforms GZIP by 14%. The relationship between the size of the dictionary used for word segmentation and the compression ratio obtained by WLZSSPDC is also studied. The last part of this project studies a compression model for Hypertext Markup Language (HTML). Because HTML is widely available and has a certain level of structure, it is suitable for testing a rule-based grammatical compression model. Again, the focus is on double-byte HTML.
The benefit of compressing HTML is that it saves storage and reduces the loading time of a web page. The test web pages we collected are encoded in Chinese, Japanese and Korean. A specific compression algorithm called P16LZSSPDC is designed for the HTML tags and text content only. P16LZSSPDC is a hybrid algorithm consisting of a preprocessor that parses the HTML tags into symbols and 16LZSSPDC as the encoding engine. It exploits the grammar of HTML to represent the HTML tags efficiently. On average, it outperforms 16LZSS by 22.3%. The research results are not only applicable to HTML; they can also be applied to other web-based structured text or markup languages for which a grammar or rules are provided. The basic idea is to extend the algorithm to encode the rules of the grammar instead of characters or phrases.
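The idea of a preprocessor that turns HTML tags into symbols before an encoding engine runs can be sketched roughly as follows. This is an illustrative assumption, not the P16LZSSPDC preprocessor itself: the function names, the regular expression and the choice of symbol values above U+10FFFF are all hypothetical, and the regex does not handle comments or `>` inside attribute values.

```python
import re
from typing import Dict, List, Tuple

# Simplified tag pattern (assumption): anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")

# Hypothetical symbol space: tag symbols start above the largest Unicode
# code point so they cannot collide with character code points.
TAG_BASE = 0x110000


def tags_to_symbols(html: str) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct tag to one symbol; pass text through as code points."""
    table: Dict[str, int] = {}
    out: List[int] = []
    pos = 0
    for m in TAG_RE.finditer(html):
        out.extend(ord(c) for c in html[pos:m.start()])
        # Reuse the symbol if the tag was seen before, else assign the next one.
        out.append(table.setdefault(m.group(0), TAG_BASE + len(table)))
        pos = m.end()
    out.extend(ord(c) for c in html[pos:])
    return out, table


def symbols_to_html(symbols: List[int], table: Dict[str, int]) -> str:
    """Invert tags_to_symbols, showing that the mapping is lossless."""
    inverse = {sym: tag for tag, sym in table.items()}
    return "".join(inverse[s] if s in inverse else chr(s) for s in symbols)


html = "<p>中文</p><p>壓縮</p>"
symbols, table = tags_to_symbols(html)
print(len(html), len(symbols), len(table))
```

In this sketch a repeated tag costs one symbol per occurrence instead of several characters, which is the kind of structure a downstream encoder could then exploit; how the thesis actually encodes tags and grammar rules is not specified in the abstract.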

Files in this item

File            Size      Format
b15995537.pdf   7.843Mb   PDF
Copyright Undertaking
As a bona fide Library user, I declare that:
  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.
By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

