Human Pose Estimation Experiment

Today’s blog will be delivered by Yoshinobu Ogura, Data Scientist at HACARUS.

Hello, my name is Yoshinobu Ogura, and I am a HACARUS Data Scientist. In this article, I will be using deep learning to estimate human skeletal structure – known as “Human Pose Estimation”.

About Human Pose Estimation

The aim of this article is to create a baseline for Human Pose Estimation. Human Pose Estimation is a technique for modeling the human skeleton from images, much like inferring “bones” in 3D modeling. There are two types of pose estimation: 2D and 3D.

Reference: Simple baselines for Human Pose Estimation and Tracking([1])

Google Colaboratory was used as the execution environment. The following environment was assigned:

OS: Ubuntu 18.04.3 LTS

CPU: Intel(R) Xeon(R) CPU @ 2.30GHz(x2)

MemTotal: 12.7GB

GPU: Tesla K80(x1)


Researching and choosing the baseline model

Human Pose Estimation is a long-standing task, and all of the current state-of-the-art methods are based on deep learning. These methods can largely be divided into the following two approaches:


Bottom-up: determine key features → assign them to individuals


Top-down: detect individuals → find key features for each individual


An individual’s key features are the areas that are important for estimating posture, such as the eyes and joints. Key feature detection creates a heat map for each type of key feature (e.g., shoulders or knees). On its own, this is very similar to human part segmentation, which we covered in a previous blog post.


In this article we will use “Simple baselines for Human Pose Estimation and Tracking” ([1]), which was the state-of-the-art (SOTA) model as of April 2018. The networks used for Human Pose Estimation are often complex, but since the model is likely to be modified during experiments, the simplest one was chosen in this case.


The model proposed in “Simple baselines for Human Pose Estimation and Tracking” is based on a top-down approach. First, Faster R-CNN is used to estimate a bounding box, which represents the region occupied by an individual, and then the network proposed by the authors is used for key feature inference.


The network proposed by the authors is a simple one: a ResNet backbone (up to its last stage, C5) with three deconvolution layers (D3) added at the end. The authors haven’t given this model a particular name, so for ease of explanation, let’s call it C5-D3 in this article.
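To make the structure more concrete, here is a minimal sketch that traces only the tensor shape through C5-D3, without building the actual network. It assumes the settings described in the paper [1] and commonly used in its public implementation: a ResNet backbone that downsamples by a factor of 32, three deconvolution layers with kernel size 4, stride 2 and padding 1, a 256x192 input crop, and the 17 COCO keypoints. The function names are hypothetical and used for illustration only.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def c5_d3_heatmap_shape(height, width, num_keypoints=17,
                        backbone_stride=32, num_deconv=3):
    """Trace the shape from the input crop to the output heatmaps."""
    # Assumption: the ResNet C5 feature map is 1/32 of the input resolution.
    h, w = height // backbone_stride, width // backbone_stride
    # Each deconvolution layer (with the assumed settings) doubles the resolution.
    for _ in range(num_deconv):
        h, w = conv_transpose_out(h), conv_transpose_out(w)
    # A final 1x1 convolution produces one heatmap per key feature.
    return (num_keypoints, h, w)


print(c5_d3_heatmap_shape(256, 192))  # (17, 64, 48)
```

Under these assumptions the heatmaps are one quarter of the input resolution, so predicted key feature positions have to be scaled back up to the original crop.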

Reference: Simple baselines for Human Pose Estimation and Tracking([1])


C5-D3 takes an image of a person as input and outputs a heat map showing where each key feature is likely to be. The heat map used as training data for C5-D3 is a Gaussian function centered on the key feature. After C5-D3 creates a heat map for each key feature, the point with the maximum pixel value in each heat map is taken as the predicted key feature point.
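The heat map encoding and decoding described above can be sketched with NumPy as follows. This is an illustrative example rather than the authors' code: the function names, the 64x48 heat map size, and the value of `sigma` are assumptions made for the sketch.

```python
import numpy as np


def gaussian_heatmap(height, width, center_x, center_y, sigma=2.0):
    """Return a (height, width) heat map with a Gaussian peak at the key feature."""
    ys, xs = np.mgrid[0:height, 0:width]
    squared_dist = (xs - center_x) ** 2 + (ys - center_y) ** 2
    return np.exp(-squared_dist / (2.0 * sigma ** 2))


def decode_keypoints(heatmaps):
    """Take the arg-max of each heat map in a (K, H, W) array.

    Returns (K, 2) integer (x, y) coordinates and (K,) peak values.
    """
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    ys, xs = np.unravel_index(flat.argmax(axis=1), (height, width))
    return np.stack([xs, ys], axis=1), flat.max(axis=1)


# Two hypothetical key features on a 64x48 heat map grid.
targets = np.stack([
    gaussian_heatmap(64, 48, center_x=10, center_y=20),
    gaussian_heatmap(64, 48, center_x=30, center_y=5),
])
coords, scores = decode_keypoints(targets)
print(coords.tolist())  # [[10, 20], [30, 5]]
```

The peak value can also serve as a rough confidence score for each predicted key feature.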


Model Evaluation (Validation)

An implementation of this paper can be found on GitHub, and a pre-trained model is available for download. In this article, validation is performed using the COCO Dataset. In order to evaluate only the key feature detection, each person was cropped from the image based on the COCO Dataset annotations rather than a detector’s output.
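As a rough illustration of the cropping step, the sketch below cuts a person out of an image using a COCO-style bounding box, which is stored as `[x, y, width, height]` in pixels. The function name and the placeholder image are hypothetical, and the sketch leaves out any resizing to a fixed input size that a real pipeline would need before feeding the crop to the network.

```python
import numpy as np


def crop_person(image, bbox):
    """Crop an (H, W, C) image to a COCO-style [x, y, width, height] box."""
    x, y, w, h = bbox
    img_h, img_w = image.shape[:2]
    # Round outward and clamp so the crop stays inside the image.
    x0 = max(int(np.floor(x)), 0)
    y0 = max(int(np.floor(y)), 0)
    x1 = min(int(np.ceil(x + w)), img_w)
    y1 = min(int(np.ceil(y + h)), img_h)
    return image[y0:y1, x0:x1]


image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image
crop = crop_person(image, bbox=[100.5, 50.0, 120.0, 300.2])
print(crop.shape)  # (301, 121, 3)
```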


Also, to allow for a quick evaluation, I made a sample program on Google Colaboratory, which you can try here. Please check it out if you are interested!

Below are the experiment results. Predictions were performed for each individual.

Below is an image in which all individuals are shown.

The head, shoulders, elbows, knees, and toes are accurately estimated, regardless of the orientation or angle of the body.

In the code used here, inference was performed on two mini-batches consisting of 200 images and 100 images. The execution times were 3.856 s and 0.611 s, respectively. This works out to roughly 6-19 ms per image, or about 50-165 fps. The variation in speed was probably due to the number of people in the images.
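The per-image figures follow directly from the batch sizes and execution times reported above. A short calculation (the helper name is hypothetical) shows the arithmetic:

```python
def throughput(num_images, seconds):
    """Return (milliseconds per image, images per second) for one batch."""
    return seconds / num_images * 1000.0, num_images / seconds


# (number of images, execution time in seconds) from the two runs above
for num_images, seconds in [(200, 3.856), (100, 0.611)]:
    ms_per_image, fps = throughput(num_images, seconds)
    print(f"{num_images} images: {ms_per_image:.2f} ms/image, {fps:.1f} fps")
# 200 images: 19.28 ms/image, 51.9 fps
# 100 images: 6.11 ms/image, 163.7 fps
```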


In Conclusion

Using the network proposed in “Simple baselines for Human Pose Estimation and Tracking”, Human Pose Estimation was performed, and key features such as hands and feet could be inferred regardless of which way the bodies were facing. In the future, I will look to achieve real-time Human Pose Estimation on lightweight devices such as the Jetson Nano.

Human Pose Estimation has many applications, including correcting baseball swings, automated dance reviews and avatar filming for Vtubers. Please feel free to contact HACARUS if you have any inquiries concerning this technology.

Thank you very much for reading this post.



[1] Simple baselines for Human Pose Estimation and Tracking (2018)