Uncertainty-Aware 3D Human Pose Estimation from Monocular Video

ACM International Conference on Multimedia

Abstract: Estimating 3D human pose from monocular video is challenging, mainly because of depth ambiguity and inaccurate 2D keypoint detections, both of which commonly arise in 2D-to-3D lifting solutions. To quantify the depth uncertainty of the 3D human pose with a neural network, we incorporate uncertainty modeling into depth prediction using evidential deep learning (EDL). Meanwhile, to calibrate the distribution uncertainty of the 2D detections, we explore a probabilistic representation that models their realistic distribution. Specifically, we exploit EDL to measure the depth prediction uncertainty of the network, and decompose the (x, y) coordinates into individual distributions to model the deviation uncertainty of the inaccurate 2D keypoints. We then optimize the depth uncertainty parameters and calibrate the 2D deviations to obtain accurate 3D human poses. In addition, to provide effective high-dimensional features for uncertainty learning, we design an encoder that combines a graph convolutional network (GCN) and a transformer to learn discriminative spatio-temporal representations of the input keypoint sequence. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva-I) to evaluate the proposed method. The comprehensive results show that our model surpasses state-of-the-art methods by a large margin.
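
As a rough illustration of how evidential regression can attach uncertainty to a per-joint depth estimate, the sketch below assumes the Normal-Inverse-Gamma (NIG) formulation commonly used for evidential regression. The abstract does not specify the exact parameterization, so the function name `nig_uncertainty`, the softplus activations, the small `eps` offset, and the four-channel output layout are all hypothetical choices made for illustration only.

```python
import numpy as np


def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return np.logaddexp(0.0, x)


def nig_uncertainty(raw, eps=1e-6):
    """Map raw outputs of shape (..., 4) to a depth estimate and uncertainties.

    Assumed (hypothetical) channel layout: [gamma, nu, alpha, beta],
    the four Normal-Inverse-Gamma parameters per joint.
    """
    gamma = raw[..., 0]                        # predicted depth (NIG mean)
    nu = softplus(raw[..., 1]) + eps           # nu > 0
    alpha = softplus(raw[..., 2]) + 1.0 + eps  # alpha > 1
    beta = softplus(raw[..., 3])               # beta >= 0

    # Standard NIG moments:
    #   aleatoric (data) uncertainty  = E[sigma^2] = beta / (alpha - 1)
    #   epistemic (model) uncertainty = Var[mu]    = beta / (nu * (alpha - 1))
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return gamma, aleatoric, epistemic


# Illustrative input: 17 joints, 4 raw channels each (values are placeholders).
raw = np.zeros((17, 4))
depth, aleatoric, epistemic = nig_uncertainty(raw)
```

Each returned array holds one value per joint, so a high epistemic value would flag joints whose depth the network is unsure about under these assumptions.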

Jinlu Zhang
Master's student in Computer Vision

My current research interests include 3D human pose estimation, human motion, and video understanding.