|Title:||Real-time scene text identification using mobile devices|
|Advisors:||Lun, Pak Kong Daniel (EIE)|
|Subject:||Signal processing -- Digital techniques. Cell phone systems. Hong Kong Polytechnic University -- Dissertations|
|Department:||Faculty of Engineering|
|Pages:||ix, 67 leaves : color illustrations ; 30 cm|
|Abstract:||Mobile devices have become ubiquitous in our daily lives. As part of an "Internet of Things" project, this study targets the detection and recognition of text in static images and live video streams on mobile devices. It is applied to a shopping mall environment for real-time identification of shop trademarks from images or videos captured with a mobile phone. To achieve this, we integrated state-of-the-art algorithms into our system and added novel features. For instance, we adopted the linear-time Maximally Stable Extremal Region (MSER) estimation algorithm to extract text candidates and designed a grouping classifier to build hypotheses about text lines. To recognize text content, we used Google's open-source OCR engine Tesseract and designed text similarity measurements for pattern matching against our database. As feedback, the OCR engine helps the algorithm further eliminate non-text candidates. Since a shop trademark can be a graphical logo, we extended our study to the identification of shop logos. We trained a boosting classifier for each logo template in the database using the HOG feature descriptor. The candidates are first verified by the difference of their color histograms. In addition, we designed a client-server architecture: the client uses the fast HOG classifier to extract candidates, and the server uses SIFT to verify them with high accuracy. Based on the motion model of the user, we adopted a frame-skipping strategy to satisfy the real-time requirements. The contribution of this thesis lies in three main aspects. First, we implemented the full functionality of the MSER extractor and MSER pruning methods running in linear time. Experiments show that our implementation runs faster and the accuracy is competitive. Second, our system is much more time-efficient and user-friendly. Traditional approaches such as Google Goggles capture images and then upload them to a server. Their success largely relies on the power of server clusters and big data, and users have to upload photos and wait for results, so data traffic can severely limit performance. In our system, by contrast, all localization and recognition can be done natively and automatically, thanks to prior knowledge of the shopping mall and the motion model of the camera. Third, we introduced a text similarity measurement that feeds back to improve the text localization and recognition results. State-of-the-art methods simply extract text regions aggressively, whereas our system uses the "meaningfulness" of the text to further filter out non-text candidates. This application therefore combines the advantages of machine learning and computer vision to benefit human users.|
|Rights:||All rights reserved|
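The abstract mentions text similarity measurements used both to match OCR output against a database of shops and to reject non-text candidates. The record does not say which measure the thesis uses, so the following is only a minimal sketch under stated assumptions: it uses a normalised Levenshtein (edit) distance, and the names `levenshtein`, `similarity`, `best_match`, the shop list, and the `threshold` value are all hypothetical illustrations rather than the author's method.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(ocr_text: str,
               shop_names: Iterable[str],
               threshold: float = 0.6) -> Tuple[Optional[str], float]:
    """Return the closest shop name, or None if nothing is similar enough.

    The threshold is an assumed value; a None result is how this sketch
    models discarding a candidate whose text is not "meaningful".
    """
    best_name, best_score = None, 0.0
    for name in shop_names:
        score = similarity(ocr_text, name)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


if __name__ == "__main__":
    shops = ["Starbucks", "Uniqlo", "Muji"]  # hypothetical database entries
    print(best_match("STARBUCK5", shops))    # noisy OCR output still matches
    print(best_match("xq#7", shops))         # clutter is rejected
```

In this sketch a noisy OCR string that is one character off from a database entry is still accepted, while a string that resembles no entry falls below the threshold and can be treated as a non-text region.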
Files in This Item:
|b28110584.pdf||For All Users (off-campus access for PolyU Staff & Students only)||3.58 MB||Adobe PDF||View/Open|