our baseline model. While our experiments utilize video data for training, our model is a single-image pose estimator and is fundamentally different from video pose estimation models [1, 19, 62], which take multiple consecutive frames as inputs. This gives our model the flexibility to perform pose estimation on static images, and thus it is not directly comparable to approaches that require video inputs. Our work is also related to the personalized human pose estimation of Charles et al. [10], which uses temporal and continuity constraints to propagate keypoints and generate more training data. Instead of tracking keypoints, we use a self-supervised objective to perform personalization at test time. Our method is not restricted to the continuity between nearby frames: the self-supervision can be applied to any two frames of a video, however far apart, as long as they show the same person.
Test-Time Adaptation.
Our personalization setting falls into the paradigm of Test-Time Adaptation, which has recently been proposed [51, 50, 3, 58, 35, 61, 69, 28, 42, 20] for generalization to out-of-distribution test data. For example, Shocher et al. [51] propose a super-resolution framework that is trained only at test time on a single image, down-scaling that image to create training pairs. Wang et al. [61] use the entropy of the predicted classification distribution as a fine-tuning signal on a given test image. Instead of optimizing the main task itself during test time, Sun et al. [58] propose a self-supervised rotation prediction task that improves the visual representation during inference and thereby indirectly improves semantic classification. Going beyond image classification, Joo et al. [30] propose test-time optimization for 3D human body regression. In our work on pose personalization, we aim to bring the self-supervised and supervised objectives closer together. We leverage a self-supervised keypoint estimation task and transform the self-supervised keypoints into supervised keypoints via a Transformer model. In this way, training with self-supervision directly improves the supervised keypoint outputs.
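To make the paradigm concrete, below is a minimal sketch of entropy-based test-time adaptation in the spirit of Wang et al. [61]; the toy model, the optimizer choice, and the decision to update only the normalization layers' affine parameters are illustrative assumptions, not a prescription from any specific paper.

```python
import torch
import torch.nn as nn

def entropy_minimization_step(model, x, optimizer):
    """One adaptation step in the spirit of Wang et al. [61]: minimize the
    entropy of the model's predictions on an unlabeled test batch."""
    log_probs = model(x).log_softmax(dim=-1)          # (B, num_classes)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# Illustrative setup: update only the normalization layers' affine
# parameters, a common (assumed) choice in test-time adaptation.
model = nn.Sequential(nn.Linear(128, 64), nn.BatchNorm1d(64),
                      nn.ReLU(), nn.Linear(64, 10))
norm_params = [p for m in model.modules()
               if isinstance(m, nn.BatchNorm1d) for p in m.parameters()]
optimizer = torch.optim.SGD(norm_params, lr=1e-3)
entropy_minimization_step(model, torch.randn(32, 128), optimizer)
```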
Self-supervised Keypoint Estimation.
There have been many recent developments in learning keypoint representations with self-supervision [55, 72, 26, 38, 32, 27, 68, 36, 40]. For example, Jakab et al. [26] propose a video frame reconstruction task that disentangles appearance features and keypoint structure in the bottleneck. This line of work has been extended to control and Reinforcement Learning [32, 36, 40], and the keypoints can be mapped to manually defined human poses by adding an adversarial learning loss [27]. While the results are encouraging, most of them are reported on relatively simple scenes and environments. In our paper, by leveraging the self-supervised task together with the supervised task, we can perform human pose personalization on images in the wild.
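To make this concrete, keypoint-bottleneck models in the style of [26] typically obtain keypoint coordinates from predicted heatmaps with a differentiable spatial soft-argmax; the sketch below shows that mechanism (shapes and names are our assumptions).

```python
import torch

def soft_argmax_2d(heatmaps):
    """Differentiable keypoint extraction: convert (B, K, H, W) heatmaps
    into (B, K, 2) normalized (x, y) coordinates in [-1, 1]."""
    B, K, H, W = heatmaps.shape
    probs = heatmaps.flatten(2).softmax(dim=-1).view(B, K, H, W)
    ys = torch.linspace(-1, 1, H, device=heatmaps.device)
    xs = torch.linspace(-1, 1, W, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalize rows, expect x
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalize cols, expect y
    return torch.stack([x, y], dim=-1)
```

Because the operation is differentiable, reconstruction gradients can flow through the keypoint coordinates, which is what lets the keypoints emerge without labels.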
Transformers.
The Transformer has been widely applied in both language processing [60, 16] and computer vision [63, 46, 23, 49, 56, 17, 11, 4, 73, 6, 37], and recently for pose estimation in particular [66, 54, 41, 33]. For example, Li et al. [33] propose to utilize the encoder-decoder model of the Transformer to perform keypoint regression, which allows for more general-purpose applications and requires fewer priors in the architecture design. Inspired by these works, we apply a Transformer to reason about the relation and mapping between the supervised and self-supervised keypoints.
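As a purely schematic illustration, not the exact architecture of our model, such a mapping can be expressed as a Transformer that encodes the self-supervised keypoints as tokens and decodes the supervised keypoints from learned queries; all dimensions and module choices below are assumptions.

```python
import torch
import torch.nn as nn

class KeypointMappingTransformer(nn.Module):
    """Schematic: map k_self self-supervised keypoints to k_sup supervised
    keypoints with a Transformer encoder-decoder (assumed design)."""
    def __init__(self, k_self=30, k_sup=17, d_model=128):
        super().__init__()
        self.embed = nn.Linear(2, d_model)            # lift (x, y) to tokens
        self.queries = nn.Parameter(torch.randn(k_sup, d_model))
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.head = nn.Linear(d_model, 2)             # decode (x, y) per joint

    def forward(self, self_kps):                      # (B, k_self, 2)
        tokens = self.embed(self_kps)
        queries = self.queries.unsqueeze(0).expand(self_kps.size(0), -1, -1)
        out = self.transformer(tokens, queries)       # (B, k_sup, d_model)
        return self.head(out)                         # (B, k_sup, 2)
```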
3 Method
Our method aims to generalize better for pose estimation on a single image by personalizing with unlabeled data. The model is first trained on diverse data with both a supervised pose estimation task and a self-supervised keypoint estimation task, using our proposed Transformer design to model the relation between the two tasks. During inference, the model conducts Test-Time Personalization, which requires only the self-supervised keypoint estimation task, boosting performance without costly labeling or sacrificing privacy. The whole pipeline is shown in Figure 2.
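To make the two phases concrete, below is a minimal sketch of the training and inference loops; `model.supervised_loss`, `model.selfsup_loss`, the loss weight, and the step count are hypothetical names and simplifications, not the exact implementation.

```python
import torch

def joint_training_step(model, labeled_batch, optimizer, w_selfsup=1.0):
    """Training phase: optimize the supervised pose loss and the
    self-supervised keypoint loss together on diverse labeled data."""
    loss = (model.supervised_loss(labeled_batch)      # hypothetical API
            + w_selfsup * model.selfsup_loss(labeled_batch))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def test_time_personalization(model, person_frames, steps=20, lr=1e-5):
    """Inference phase: fine-tune with the self-supervised task only, on the
    test person's unlabeled frames, then predict poses as usual."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        for unlabeled_batch in person_frames:
            loss = model.selfsup_loss(unlabeled_batch)  # no labels required
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```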
3.1 Joint Training for Pose Estimation with a Transformer
Given a set of $N$ labeled images of a single person $I = \{I_1, I_2, \ldots, I_N\}$, a shared encoder $\phi$ maps them into the feature space $F = \{F_1, F_2, \ldots, F_N\}$, which is shared by both the supervised and the self-supervised keypoint estimation tasks. We introduce both tasks and the joint framework as follows.
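In code, this shared-feature design might look like the following sketch; the convolutional backbone and the head shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoderPoseModel(nn.Module):
    """Sketch: one shared encoder phi feeds both a supervised pose head
    and a self-supervised keypoint head (dimensions are assumptions)."""
    def __init__(self, feat_ch=256, k_sup=17, k_self=30):
        super().__init__()
        self.phi = nn.Sequential(                     # shared encoder phi
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.sup_head = nn.Conv2d(feat_ch, k_sup, 1)    # supervised heatmaps
        self.self_head = nn.Conv2d(feat_ch, k_self, 1)  # self-sup. heatmaps

    def forward(self, img):                           # img: (B, 3, H, W)
        feats = self.phi(img)                         # shared features F_i
        return self.sup_head(feats), self.self_head(feats)
```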
3.1.1 Self-supervised Keypoint Estimation
For the self-supervised task, we build upon the work of Jakab et al. [26], which uses an image reconstruction task to disentangle human structure and appearance, yielding self-supervised keypoints as intermediate results. Given two images of a single person, $I_s$ and $I_t$, the task aims at reconstructing $I_t$ using structural keypoint information from the target $I_t$ and appearance information from the source $I_s$. The appearance information $F_s^{\mathrm{app}}$ of source image $I_s$ is extracted with a
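Although the paragraph above is cut off at the page break, the overall [26]-style reconstruction objective it describes can be sketched as follows; rendering keypoints as Gaussian heatmaps and the decoder interface are assumptions on our part.

```python
import torch

def render_gaussian_maps(kps, H, W, sigma=0.1):
    """Render (B, K, 2) normalized keypoints in [-1, 1] as (B, K, H, W)
    Gaussian heatmaps, the structural bottleneck representation."""
    ys = torch.linspace(-1, 1, H).view(1, 1, H, 1)
    xs = torch.linspace(-1, 1, W).view(1, 1, 1, W)
    x = kps[..., 0].view(*kps.shape[:2], 1, 1)
    y = kps[..., 1].view(*kps.shape[:2], 1, 1)
    return torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def reconstruction_loss(decoder, app_feat_s, kps_t, target_img):
    """Reconstruct I_t from the appearance of I_s and the keypoints of I_t;
    'decoder' is an assumed generator combining the two inputs."""
    H, W = target_img.shape[-2:]
    heatmaps_t = render_gaussian_maps(kps_t, H, W)    # structure of I_t
    recon = decoder(app_feat_s, heatmaps_t)           # predicted I_t
    return ((recon - target_img) ** 2).mean()         # pixel reconstruction
```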