Artificial agents and, in particular, humanoid robots can interact with the surrounding environment, objects, and people using their cameras, actuators, and embodiment. They can employ verbal and non-verbal communication to engage with other agents. However, the behaviour of these robots is typically pre-programmed, limiting their capabilities to a predetermined set of actions. One alternative is to teleoperate the robot using additional devices that provide precise measurements of pose and speed; such devices, however, can be costly and require expertise to operate effectively. Another option is imitation, where a system replicates observed actions using only the robot's camera, akin to how humans learn. Consequently, one intriguing avenue of research is the acquisition of non-verbal communication skills through learning from demonstrations. This approach holds promise for applications such as teaching machines to comprehend and express sign language.

The overall aim of this thesis is to study imitation learning in a human-like fashion for artificial agents. As a case study, we teach a simulated humanoid agent American Sign Language (ASL) by imitating videos of people performing different signs. We use computer vision and imitation learning techniques, namely deep learning and reinforcement learning, to extract information from pre-recorded videos and teach the agent how to replicate the observed actions. Compared to other approaches, ours removes the need for additional hardware (such as motion capture suits or virtual reality headsets) to collect the information necessary for imitation. Additionally, our approach shows how to take advantage of data that lacks ground truth for regression by validating it weakly through classification.

We perform a first study to evaluate the data extracted from the videos using pre-trained vision models. We base our evaluation on subunits (i.e., phonological classes), as we believe they provide more fine-grained information than lemmas. We then extend our findings to a novel large-scale dataset that we generate in order to test the generalisability of our approach. Finally, we apply state-of-the-art techniques for generating animation from motion capture data to our scenario, teaching a virtual agent how to fingerspell and perform signs.

Our results show that keypoints extracted from videos provide a good reference for recognising the phonological classes of sign language. Additionally, we demonstrate how, albeit with some limitations, an imitation learning approach based on reinforcement learning and motion data extracted from videos offers a viable way of acquiring sign language. In particular, we show that our methodology is able to learn fingerspelling for 6 different letters and 5 different signs involving the whole upper body. To summarise, we generate two novel datasets in which each sign is associated with its phonological properties. Moreover, we demonstrate how to automate the recognition of such properties, even for unseen signs, in the first large-scale attempt at phonological property recognition. Finally, to the best of our knowledge, we showcase for the first time how to acquire sign language (both fingerspelling and full signs) in an embodied fashion using only videos as input data.
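The abstract only sketches the pipeline at a high level. As an illustration of the general idea, the fragment below shows one plausible way to extract upper-body and hand keypoints from a sign video and to turn them into a pose-tracking imitation reward for a simulated agent, in the style of motion-capture-driven animation methods such as DeepMimic. The choice of MediaPipe Holistic as the pose estimator, the frame-filtering scheme, and the reward scale `k` are assumptions made for this sketch, not the exact components used in the thesis.

```python
import cv2
import numpy as np
import mediapipe as mp


def extract_keypoints(video_path: str) -> np.ndarray:
    """Return an array of shape (frames, landmarks, 3) with upper-body and
    hand keypoints estimated from a pre-recorded sign video.

    NOTE: MediaPipe Holistic is an assumption made for this sketch; the
    abstract does not name the specific pre-trained vision model used.
    """
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        parts = [res.pose_landmarks, res.left_hand_landmarks, res.right_hand_landmarks]
        # Keep only frames where the body and both hands were detected.
        if all(p is not None for p in parts):
            frames.append(np.array([[lm.x, lm.y, lm.z]
                                    for p in parts for lm in p.landmark]))
    cap.release()
    holistic.close()
    return np.stack(frames) if frames else np.empty((0, 0, 3))


def imitation_reward(agent_pose: np.ndarray, reference_pose: np.ndarray,
                     k: float = 2.0) -> float:
    """DeepMimic-style tracking reward: exponentiated negative squared
    distance between the agent's keypoints and the reference keypoints
    extracted from the video at the same time step. `k` is an illustrative
    scaling constant, not a value taken from the thesis.
    """
    err = np.sum((agent_pose - reference_pose) ** 2)
    return float(np.exp(-k * err))
```

At each simulation step the reinforcement learner would receive `imitation_reward` for matching the reference motion, while the same keypoint sequences can also serve as input features for classifying phonological properties of the signs.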
| Date of Award | 31 Dec 2023 |
|---|---|
| Original language | English |
| Awarding Institution | The University of Manchester |
| Supervisor | Aphrodite Galata (Supervisor) & Angelo Cangelosi (Supervisor) |
- Phonological properties
- Reinforcement Learning
- American Sign Language
- Imitation Learning
- Machine Learning
- Artificial agents
- Robotics
Machine Learning for American Sign Language Recognition and Acquisition in Artificial Agents
Tavella, F. (Author). 31 Dec 2023
Student thesis: PhD