Author: Yang, Qiang
Title: Towards context-aware voice interaction via acoustic sensing
Advisors: Zheng, Yuanqing (COMP)
Xiao, Bin (COMP)
Degree: Ph.D.
Year: 2023
Subject: Speech processing systems
Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: xv, 139 pages : color illustrations
Language: English
Abstract: Voice interaction has become a fundamental way of connecting humans and smart devices. Such an interface enables users to complete daily tasks with voice commands, which not only carry the user's explicit semantic meaning but also imply physical context such as the user's location and speaking direction. Although current speech recognition technology allows devices to accurately understand voice content and act on it, these contextual clues can help smart devices respond even more intelligently. For example, knowing that a user is in the kitchen helps narrow down the set of likely voice commands and enables customized services.
Acoustic sensing has been studied for a long time. However, unlike systems that actively transmit handcrafted sensing signals, we can only observe the voice on the receiver side, which makes sensing voice contexts challenging. In this thesis, we use voice signals as a sensing modality and propose new passive acoustic sensing techniques to extract the physical context of the voice and the user: location, speaking direction, and liveness. Specifically, (1) inspired by the human auditory system, we investigate the effects of human ears on binaural sound localization and design a bionic machine hearing framework that locates multiple sounds with binaural microphones. (2) We exploit the energy and frequency radiation patterns of voice to estimate the user's head orientation. By modeling the anisotropic propagation of voice, we can measure the user's speaking direction, a valuable context for smart voice assistants. (3) Attackers may use a loudspeaker to play pre-recorded voice commands to deceive voice assistants. We examine how sound generation differs between humans and loudspeakers and find that a human's rapidly changing mouth produces a more dynamic sound field. We can therefore detect voice liveness and defend against such replay attacks by examining sound-field dynamics.
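Binaural sound localization, as in contribution (1), is commonly built on interaural time differences between the two microphones. The following is a minimal, hypothetical sketch of the standard GCC-PHAT delay estimator with a synthetic two-channel signal; it illustrates the general principle only and is not the bionic framework proposed in the thesis (the 0.18 m microphone spacing and all signal parameters are assumptions):

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the delay (in seconds) of channel y relative to channel x
    using the Generalized Cross-Correlation with PHAT weighting."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12                # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    # Rearrange so index 0 of `cc` corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic binaural pair: the right channel lags the left by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(4096)
delay = 5
right = np.concatenate((np.zeros(delay), left[:-delay]))

tdoa = gcc_phat(left, right, fs)          # ~ +5/16000 s
# Map the delay to an azimuth angle, assuming a 0.18 m microphone
# spacing and 343 m/s speed of sound (far-field approximation).
angle = np.degrees(np.arcsin(np.clip(tdoa * 343.0 / 0.18, -1.0, 1.0)))
```

A single microphone pair yields one angle per dominant source; localizing multiple concurrent sounds, as the thesis targets, requires additional cues (e.g., the spectral filtering of the ears that the bionic framework models).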
To achieve such context-aware voice interactions, we look into the physical properties of voice, work across hardware and software, and introduce new algorithms drawing on principles from acoustic sensing, signal processing, and machine learning. We implement these systems and evaluate them in extensive experiments, demonstrating that they enable new real-world applications, including multiple-sound localization, speaking-direction estimation, and replay-attack defense.
Rights: All rights reserved
Access: open access

Files in This Item:
File: 6711.pdf
Description: For All Users
Size: 3.59 MB
Format: Adobe PDF (View/Open)



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/12290