China’s AI scientists teach a neural net to train itself
Researchers at China’s Sun Yat-Sen University, with help from Chinese startup SenseTime, improved upon their own attempt to get a computer to discern human poses in images by adding a bit of self-supervised training. The work suggests continued efforts to limit the reliance on human labels and “ground truth” in AI.
More and more, AI researchers are trying to make machines teach themselves with a minimum of human guidance. So-called self-supervision is an element that can be added to many machine learning tasks so that a computer learns with less human help, perhaps someday with none at all.
Scientists at China’s Sun Yat-Sen University and Hong Kong Polytechnic University use self-supervision in a new bit of research to help a computer learn the pose of a human figure in a video clip.
Understanding what a person is doing in a picture is its own rich vein of machine learning research, useful for a number of applications, including video surveillance. But such methods rely on “annotated” data sets where labels are carefully applied to the orientation of the joints of the body.
That’s a problem because larger and larger “deep” neural networks are hungry for more and more data, but there isn’t always enough labeled data to feed the network.
So, the Sun Yat-Sen researchers set out to show a neural network can refine its understanding by continually comparing the guesses of multiple networks with one another, ultimately lessening the need for the “ground truth” afforded by a labeled data set.
The authors demonstrate success in beating other AI methods in predicting the pose of a figure across a series of benchmark tests. They also show they even beat their own results from 2017 with the addition of this new self-supervision approach.
The paper, 3D Human Pose Machines with Self-supervised Learning, is posted on the arXiv pre-print server and is authored by Keze Wang, Liang Lin, Chenhan Jiang, Chen Qian, and Pengxu Wei. Notably, Qian is with SenseTime, the Chinese AI startup that sells software for various applications such as facial recognition, and which distributes a machine learning programming framework called “Parrots.”
In their original paper from 2017, the authors used an annotated data set, the “MPII Human Pose” data set compiled in 2014 by Mykhaylo Andriluka and colleagues at Germany’s Max Planck Institute for Informatics. They used that labeled data set to extract two-dimensional human body parts from still images — basically, stick-figure drawings of the limbs oriented in space. They then converted those 2D body-part representations into 3D representations that indicate orientation of the limbs in three-dimensional space.
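To make the stick-figure idea concrete, here is a minimal sketch of how a 2D pose and its 3D counterpart might be stored. The five-joint skeleton, the joint names, the coordinates, and the `limb_lengths` helper are all hypothetical, chosen for illustration; they are not the layout used by MPII or by the paper.

```python
import numpy as np

# Hypothetical five-joint skeleton; real data sets annotate more joints.
JOINT_NAMES = ["head", "neck", "hip", "left_hand", "right_hand"]
# Each limb is a pair of indices into JOINT_NAMES.
LIMBS = [(0, 1), (1, 2), (1, 3), (1, 4)]

# A 2D pose: one (x, y) pixel coordinate per joint (made-up values).
pose_2d = np.array([
    [50.0, 10.0],   # head
    [50.0, 30.0],   # neck
    [50.0, 80.0],   # hip
    [20.0, 50.0],   # left_hand
    [80.0, 50.0],   # right_hand
])

def limb_lengths(pose, limbs):
    """Return the Euclidean length of each limb; works for 2D or 3D poses."""
    pose = np.asarray(pose, dtype=float)
    return np.array([np.linalg.norm(pose[a] - pose[b]) for a, b in limbs])

# A 3D pose adds a depth value per joint (again, made-up values).
depth = np.array([[0.0], [0.0], [0.0], [30.0], [-30.0]])
pose_3d = np.hstack([pose_2d, depth])

lengths_2d = limb_lengths(pose_2d, LIMBS)
lengths_3d = limb_lengths(pose_3d, LIMBS)
print(lengths_2d)
print(lengths_3d)
```

In this toy example the arms look shorter in 2D than they are in 3D, because the hands are displaced in depth. That lost depth is the information a 2D-to-3D conversion has to recover.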
In the new paper, the authors do the same “pre-training” via the MPII data set, to extract the 2D poses from the images. And just as in 2017, they use another data set, “Human3.6M,” to extract the ground truth for 3D, as well. Human3.6M has 3.6 million images, taken in a laboratory setting, of paid actors carrying out a variety of tasks, from running to walking to smoking to eating.
What’s new this time is that in the final part of their neural net, they throw away the 2D and 3D annotations. They instead compare the 3D model’s prediction of what its 2D version should look like with the 2D poses that were produced in the first step. “After initialization, we substitute the predicted 2D poses and 3D poses for the 2D and 3D ground-truth to optimize” the model “in a self-supervised fashion.”
They “project the 3D coordinate(s)” of the 3D pose “into the image plane to obtain the projected 2D pose” and then they “minimize the dissimilarity” between this new 2D pose and the first one they had derived “as an optimization objective.”
In a sense, the neural network keeps asking if its 3D model of the body is predicting accurately in three dimensions what it thought at the beginning of the process in two dimensions, learning about how 3D and 2D correspond.
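The project-and-compare step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes a simple pinhole camera with a hypothetical `focal_length` parameter and uses mean squared distance as the "dissimilarity," whereas the paper may define both the projection and the loss differently. The function names and joint coordinates are made up.

```python
import numpy as np

def project_to_image(pose_3d, focal_length=1.0):
    """Project (J, 3) camera-space joints onto the image plane.

    Assumes a pinhole camera looking down the z axis (an assumption made
    for this sketch, not necessarily the projection used in the paper).
    """
    pose_3d = np.asarray(pose_3d, dtype=float)
    depth = pose_3d[:, 2:3]  # keep a column shape so division broadcasts
    return focal_length * pose_3d[:, :2] / depth

def reprojection_loss(pose_3d, pose_2d, focal_length=1.0):
    """Mean squared distance between projected 3D joints and 2D joints."""
    projected = project_to_image(pose_3d, focal_length)
    diff = projected - np.asarray(pose_2d, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Three hypothetical joints in camera space: (x, y, z) with z > 0.
pose_3d = np.array([
    [0.0, 0.0, 2.0],
    [0.5, 1.0, 2.0],
    [-0.5, 1.0, 4.0],
])

# Stand-in for the 2D pose estimated in the first stage.
pose_2d = project_to_image(pose_3d)

# A 3D guess that agrees with the 2D pose has zero loss.
consistent_loss = reprojection_loss(pose_3d, pose_2d)

# A 3D guess shifted sideways no longer matches, so the loss is positive.
perturbed = pose_3d + np.array([0.2, 0.0, 0.0])
perturbed_loss = reprojection_loss(perturbed, pose_2d)

print(consistent_loss, perturbed_loss)
```

In training, a loss of this kind would be minimized with respect to the network's parameters, nudging the 3D prediction toward agreement with the 2D pose without consulting any 3D label.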
There is a lot of now-standard machine learning stuff here: A convolutional neural network, or CNN, allows the system to extract the 2D stick figure. That approach is borrowed from an earlier piece of work by Carnegie Mellon researchers in 2014 and a follow-up they did in 2016.
Then a long short-term memory network, or LSTM, a type of neural network specialized to retain a memory of sequences of events, is used to extract the continuity of the body across multiple sequential video frames to create the 3D model. That component is modeled on work done in 2014 by Alex Graves and colleagues at Google’s DeepMind, which was originally built for speech recognition.
What’s novel here is imposing self-supervision to make the whole thing hold together without ground-truth labels. By taking this last step, the authors were able to lessen the need for 3D data and instead lean upon 2D images. “The imposed correction mechanism enables us to leverage the external large-scale 2D human pose data to boost 3D human pose estimation,” they write.