Abstract: Estimating 3D human pose from monocular video is challenging, mainly because of depth ambiguity and inaccurate 2D keypoint detections, both of which commonly arise in 2D-to-3D lifting solutions. To quantify the depth uncertainty of the predicted 3D pose, we incorporate uncertainty modeling into depth prediction using evidential deep learning (EDL). To calibrate the distribution uncertainty of the 2D detections, we explore a probabilistic representation that models their realistic distribution. Specifically, we use EDL to measure the uncertainty of the network's depth prediction, and we decompose the (x, y) coordinates into individual distributions to model the deviation uncertainty of the inaccurate 2D keypoints. We then optimize the depth uncertainty parameters and calibrate the 2D deviations to obtain accurate 3D human poses. In addition, to provide informative high-dimensional features for uncertainty learning, we design an encoder that combines a graph convolutional network (GCN) and a transformer to learn discriminative spatio-temporal representations of the input keypoint sequence. Extensive experiments on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva-I) show that our model surpasses state-of-the-art methods by a large margin.
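As a rough illustration of how EDL can attach an uncertainty to a scalar depth prediction, the sketch below uses the Normal-Inverse-Gamma (NIG) formulation of evidential regression. This is an assumption: the abstract does not state which evidential formulation is used, and the names `nig_nll`, `nig_uncertainty`, `gamma`, `nu`, `alpha`, and `beta` are hypothetical, chosen only for this example. Here the network is assumed to output four values per joint depth, and the loss is the negative log-likelihood of the true depth under the resulting Student-t predictive distribution.

```python
import math


def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of a target depth y under NIG evidential outputs.

    Assumed (hypothetical) network outputs for one joint depth:
      gamma : predicted depth (mean)
      nu    : > 0, evidence for the mean
      alpha : > 1, beta : > 0, evidence for the variance
    """
    omega = 2.0 * beta * (1.0 + nu)
    return (
        0.5 * math.log(math.pi / nu)
        - alpha * math.log(omega)
        + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
        + math.lgamma(alpha)
        - math.lgamma(alpha + 0.5)
    )


def nig_uncertainty(nu, alpha, beta):
    """Return (aleatoric, epistemic) variance estimates; requires alpha > 1."""
    aleatoric = beta / (alpha - 1.0)         # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))  # uncertainty of the mean, Var[mu]
    return aleatoric, epistemic


# A prediction that matches the target gets a lower loss than one that is far off.
loss_close = nig_nll(y=1.0, gamma=1.0, nu=1.0, alpha=2.0, beta=1.0)
loss_far = nig_nll(y=1.0, gamma=3.0, nu=1.0, alpha=2.0, beta=1.0)

# More evidence for the mean (larger nu) shrinks the epistemic term only.
low_evidence = nig_uncertainty(nu=1.0, alpha=2.0, beta=1.0)
high_evidence = nig_uncertainty(nu=4.0, alpha=2.0, beta=1.0)
```

Under this assumed formulation, a single forward pass yields both the depth estimate and separate estimates of data noise and model uncertainty, without sampling or ensembling.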