Recent years have seen renewed interest in RGB-based hand tracking, as opposed to the depth-based tracking that has dominated the field since the introduction of commodity depth cameras. This trend has been driven by the ability of convolutional neural networks to process large quantities of image data. In this paper, we propose an approach to hand tracking that operates on sets of dense semantic labels, and we present a full pipeline for RGB-based hand tracking. The pipeline uses convolutional neural networks to produce a per-pixel semantic map of the scene, then optimises the state of a kinematic model against this semantic map using a tracking algorithm based on Differential Evolution (DE). This technique allows us to simultaneously localise the hand in 3D space and recover its pose, and it requires only monocular RGB input. We apply our technique to a benchmark dataset and report semantic segmentation and 3D pose tracking results, which we compare to the current state of the art. We also compare our DE-based algorithm to an equivalent one based on Particle Swarm Optimisation (PSO) and show that the DE-based algorithm is superior.
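To make the optimisation step concrete, the following is a minimal sketch of the classic DE/rand/1/bin scheme in Python. It is not the tracking algorithm from the paper: the function `differential_evolution`, its parameters, and the quadratic `toy_cost` are hypothetical, and the toy cost merely stands in for whatever discrepancy between the rendered kinematic model and the semantic map the actual tracker minimises.

```python
import random


def differential_evolution(cost, bounds, pop_size=20, f=0.6, cr=0.9,
                           generations=200, seed=0):
    """Minimise `cost` inside box `bounds` using DE/rand/1/bin (illustrative)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialise candidate states uniformly inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Choose three distinct individuals, none of them equal to i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # forces at least one mutated component
            trial = []
            for j in range(dim):
                if j == j_rand or rng.random() < cr:
                    # Mutation: base vector plus a scaled difference vector.
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clamp to the search box
                else:
                    # Binomial crossover: keep the parent's component.
                    v = pop[i][j]
                trial.append(v)
            # Greedy selection: the trial replaces the parent if it is no worse.
            # Replacement happens immediately, so later individuals in the same
            # generation may already draw on the updated population.
            s = cost(trial)
            if s <= scores[i]:
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_cost(x):
    # Hypothetical stand-in for a model-to-semantic-map discrepancy:
    # squared distance to an arbitrary 3-parameter "pose".
    target = [0.3, -1.2, 2.0]
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


best_pose, best_cost = differential_evolution(toy_cost, [(-3.0, 3.0)] * 3)
print(best_pose, best_cost)
```

In a tracking setting, the state vector would hold the parameters of the kinematic model rather than three arbitrary numbers, and the population for each frame could plausibly be seeded around the previous frame's solution; both points are assumptions here, not details taken from the abstract.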
Published - Nov 2020