Author: Fan, Junming
Title: A systematic vision-based methodology for holistic scene understanding in human-robot collaboration
Advisors: Zheng, Pai (ISE)
Lee, K. M. Carman (ISE)
Degree: Ph.D.
Year: 2024
Subject: Human-robot interaction -- Industrial applications
Human engineering
Hong Kong Polytechnic University -- Dissertations
Department: Department of Industrial and Systems Engineering
Pages: xxiii, 180 pages : color illustrations
Language: English
Abstract: The next generation of industry has depicted a visionary blueprint of human-centricity in futuristic manufacturing systems. In the modern manufacturing sector, a dramatic shift has already begun from the traditional mode of mass production towards mass personalization, driven by the increasing prevalence of personalization culture and customization requirements. The conventional approach to mass production has predominantly relied on automated production lines, along with machines and robots that operate on preprogrammed routines. Although this method has demonstrated effectiveness in the era of mass production, its lack of intelligence and flexibility largely restricts its capacity to adjust dynamically to the frequently changing production schedules and specifications typical of mass personalization scenarios. To mitigate these limitations, human-robot collaboration (HRC) has emerged as an advanced manufacturing paradigm and is gaining traction as a promising solution for mass personalization, since it can simultaneously leverage the consistent strength and repetitive precision of robots and the flexibility, creativity, and versatility of humans.
Over the past decade, considerable research effort has been dedicated to HRC, addressing issues such as system architecture, collaboration strategy planning, and safety considerations. Among these topics, context awareness has drawn significant attention, as it forms the bedrock of critical functionalities such as collision avoidance and robot motion planning. Existing research on context awareness has concentrated extensively on certain aspects of human recognition, such as activity recognition and intention prediction, owing to the paramount importance of human safety in HRC systems. Nevertheless, other vital components of the HRC scene, which can also substantially influence the collaborative working process, have received noticeably little attention. To fill this gap, this thesis aims to provide a systematic vision-based methodology for holistic scene understanding in HRC, which takes into account the cognition of HRC scene elements including 1) objects, 2) humans, and 3) environments, coupled with 4) visual reasoning to gather and compile visual information into semantic knowledge for subsequent robot decision-making and proactive collaboration. In this thesis, the four aspects are examined and potential solutions are explored to demonstrate the applicability of the vision-based holistic scene understanding scheme in HRC settings.
Firstly, a high-resolution network-based two-stage 6-DoF (Degree of Freedom) pose estimation model is constructed to enhance object perception for subsequent robotic manipulation and collaboration strategy planning. Given a visual observation of an industrial workpiece, the first stage makes a coarse estimate of the 6-DoF pose to narrow down the solution space, and the second stage takes the coarse result along with the original image to refine the pose parameters and produce a finer estimate. In HRC scenarios, workpieces are frequently manipulated by human hands, leading to another issue: hand-object occlusion. To address this problem, an integrated hand-object 3D dense pose estimation model is designed with an explicit occlusion-aware training strategy aimed at mitigating occlusion-related accuracy degradation (Chapter 3).
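To make the coarse-to-fine idea concrete, the following is a minimal PyTorch sketch, not the thesis's actual architecture: the tiny convolutional backbone, the quaternion-plus-translation pose encoding, and the additive residual refinement are all simplifying assumptions introduced here for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_backbone():
    # Tiny stand-in for the high-resolution backbone; real models are far deeper.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class CoarsePoseNet(nn.Module):
    # Stage 1: coarse 6-DoF pose to narrow down the solution space.
    def __init__(self):
        super().__init__()
        self.backbone = make_backbone()
        self.head = nn.Linear(64, 7)  # quaternion (4) + translation (3)

    def forward(self, img):
        p = self.head(self.backbone(img))
        q = F.normalize(p[:, :4], dim=1)        # keep the rotation a unit quaternion
        return torch.cat([q, p[:, 4:]], dim=1)

class PoseRefiner(nn.Module):
    # Stage 2: re-reads the image, conditioned on the coarse pose, and outputs a correction.
    def __init__(self):
        super().__init__()
        self.backbone = make_backbone()
        self.head = nn.Linear(64 + 7, 7)

    def forward(self, img, coarse):
        delta = self.head(torch.cat([self.backbone(img), coarse], dim=1))
        p = coarse + delta                      # additive residual refinement (an assumption)
        q = F.normalize(p[:, :4], dim=1)
        return torch.cat([q, p[:, 4:]], dim=1)

img = torch.randn(2, 3, 128, 128)               # a batch of RGB observations
coarse = CoarsePoseNet()(img)                   # stage 1: coarse estimate
refined = PoseRefiner()(img, coarse)            # stage 2: refined estimate, shape (2, 7)

The design point the sketch illustrates is that the second stage is conditioned on the first stage's output, so it only has to learn a small correction rather than the full pose from scratch.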
Then, a vision-based human digital twin (HDT) modelling approach is explored in HRC scenarios, intended to serve as a holistic and centralized digital representation of human operator status for seamless integration into the cyber-physical production system (Chapter 4). The proposed HDT model is primarily composed of a convolutional neural network designed to concurrently monitor various aspects of hierarchical human status, including 3D human posture, action intention, and ergonomic risk. Subsequently, based on the HDT information, a novel robotic motion planning strategy is introduced, focused on the adaptive optimization of the robotic motion trajectory to enhance the effectiveness and efficiency of robotic movements in complex environments. The proposed HDT modelling scheme provides an exemplary solution for modelling various human states from vision data with a unified deep learning model in an end-to-end manner.
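A unified multi-task network of this kind can be pictured as a shared encoder with one head per human-status aspect. The sketch below is an illustrative assumption, not the thesis's network; the joint count, intention classes, and scalar risk score are placeholder choices.

import torch
import torch.nn as nn

class HDTNet(nn.Module):
    # Shared encoder with three heads: 3D posture, action intention, ergonomic risk.
    def __init__(self, n_joints=17, n_intents=10):
        super().__init__()
        self.n_joints = n_joints
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.pose_head = nn.Linear(64, n_joints * 3)  # 3D joint coordinates
        self.intent_head = nn.Linear(64, n_intents)   # action-intention logits
        self.risk_head = nn.Linear(64, 1)             # scalar ergonomic risk score

    def forward(self, img):
        f = self.encoder(img)
        return {
            "pose3d": self.pose_head(f).view(-1, self.n_joints, 3),
            "intent": self.intent_head(f),
            "risk": self.risk_head(f).squeeze(1),
        }

out = HDTNet()(torch.randn(1, 3, 128, 128))
# out["pose3d"]: (1, 17, 3); out["intent"]: (1, 10); out["risk"]: (1,)

Training such a model end-to-end typically sums one loss term per head, which is what allows a single forward pass to yield all the human-status signals at once.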
Thirdly, a research endeavour is devoted to the perception of the HRC environment, for which a multi-granularity HRC scene segmentation scheme is proposed, along with a specifically devised semantic segmentation network incorporating several advanced network designs (Chapter 5). Traditional semantic segmentation models mostly rely on a single semantic granularity. This formulation cannot adapt to HRC situations with diversified granularity requirements, such as a close-range collaborative assembly task versus a robotic workspace navigation case. To address this issue, the proposed model is designed to provide a hierarchical representation of the HRC scene that can dynamically switch between semantic levels to flexibly accommodate the constantly changing needs of various HRC tasks.
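One simple way to realize granularity switching, sketched below purely as an assumption rather than the thesis's method, is to predict fine-grained per-pixel logits once and derive coarser maps by aggregating fine-class probabilities through a fixed fine-to-coarse label mapping; the class counts and mapping here are invented for illustration.

import torch
import torch.nn as nn

FINE_TO_COARSE = torch.tensor([0, 0, 1, 1, 1, 2])  # hypothetical: 6 fine classes -> 3 coarse groups

class MultiGranularitySeg(nn.Module):
    def __init__(self, n_fine=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_fine, 1),               # per-pixel fine-class logits
        )

    def forward(self, img, granularity="fine"):
        fine_logits = self.encoder(img)             # (B, n_fine, H, W)
        if granularity == "fine":
            return fine_logits
        # Coarse view: sum fine-class probabilities inside each coarse group.
        probs = fine_logits.softmax(dim=1)
        n_coarse = int(FINE_TO_COARSE.max()) + 1
        coarse = probs.new_zeros(probs.shape[0], n_coarse, *probs.shape[2:])
        return coarse.index_add(1, FINE_TO_COARSE, probs)

net = MultiGranularitySeg()
x = torch.randn(1, 3, 64, 64)
fine = net(x)                           # (1, 6, 64, 64) for close-range assembly
coarse = net(x, granularity="coarse")   # (1, 3, 64, 64) for workspace navigation

Because the coarse map is derived from the fine one, the network can switch semantic levels at inference time without retraining, which is the flexibility the multi-granularity scheme calls for.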
Lastly, a vision-language reasoning approach is investigated to take a step further from visual perception towards human-like reasoning and understanding of the HRC situation (Chapter 6). To address the inherent ambiguity of purely vision-based human-robot communication, such as unclear references to target objects or action intentions, linguistic data is introduced to complement visual data in the form of a vision-language guided referred object retrieval model. Based on the retrieved target object location, a large language model-based robotic action planning strategy is devised to adaptively generate executable robotic action code via natural language interaction with the human operator. The incorporation of vision-language data demonstrates a viable pathway to complex reasoning that enhances embodied robotic intelligence and maximizes HRC working efficiency.
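The pipeline can be pictured in two steps: score candidate object regions against the operator's utterance in a shared embedding space, then hand the winning object's location to an LLM prompt that asks for executable robot code. The sketch below is a hypothetical skeleton only: the random vectors stand in for real vision/text encoders, and move_to/grasp are invented robot primitives, not APIs from the thesis.

import torch
import torch.nn.functional as F

def retrieve_referred_object(region_feats, text_feat):
    # Cosine similarity between each candidate region and the language query.
    sims = F.cosine_similarity(region_feats, text_feat.unsqueeze(0), dim=1)
    return int(sims.argmax())

def build_planning_prompt(instruction, target_xyz):
    # The LLM is constrained to respond with calls to known robot primitives.
    return (
        "You control a collaborative robot arm.\n"
        f"Target object location (m): {target_xyz}\n"
        f"Operator instruction: {instruction}\n"
        "Respond only with executable Python calls to move_to(x, y, z) and grasp()."
    )

# Toy usage: random vectors stand in for embeddings from real vision/text encoders.
region_feats = torch.randn(5, 512)      # five detected candidate objects
text_feat = torch.randn(512)            # embedding of the operator's utterance
idx = retrieve_referred_object(region_feats, text_feat)
prompt = build_planning_prompt("hand me the small gear", (0.42, -0.10, 0.05))
# The prompt would then be sent to an LLM; the returned code is validated
# before being executed on the robot (omitted here).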
Rights: All rights reserved
Access: open access

Files in This Item:
File        Description     Size       Format
7508.pdf    For All Users   10.64 MB   Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13056