our baseline model. While our experiments utilize video data for training, our model is a single-image pose estimator and is fundamentally different from video pose estimation models [1, 19, 62], which take multiple consecutive frames as inputs. This gives our model the flexibility to perform pose estimation on static images, and thus it is not directly comparable to approaches that require video inputs. Our work is also related to the personalized human pose estimation of Charles et al. [10], which uses temporal and continuity constraints to propagate keypoints and generate more training data. Instead of tracking keypoints, we use a self-supervised objective to perform personalization at test time. Our method is not restricted to the continuity between nearby frames: the self-supervision can be applied to any two frames of a video, however far apart, as long as they show the same person.
Test-Time Adaptation.
Our personalization setting falls into the paradigm of Test-Time Adaptation, which has recently been proposed [51, 50, 3, 58, 35, 61, 69, 28, 42, 20] for generalization to out-of-distribution test data. For example, Shocher et al. [51] propose a super-resolution framework that is trained only at test time on a single image, down-scaling that image to create training pairs. Wang et al. [61] use the entropy of the predicted classification distribution as a fine-tuning signal on a given test image. Instead of optimizing the main task itself during test time, Sun et al. [58] propose a self-supervised rotation prediction task that improves the visual representation during inference and thereby indirectly improves semantic classification. Going beyond image classification, Joo et al. [30] propose test-time optimization for 3D human body regression. In our work on pose personalization, we aim to bring the self-supervised and supervised objectives closer together. We leverage a self-supervised keypoint estimation task and transform the self-supervised keypoints into supervised keypoints via a Transformer model. In this way, training with self-supervision directly improves the supervised keypoint outputs.
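To make the paradigm concrete, below is a minimal sketch of entropy-based test-time adaptation in the spirit of Wang et al. [61]; the toy model, the optimizer choice, and the decision to update only the normalization layers' affine parameters are illustrative assumptions, not a prescription from any specific paper.

```python
import torch
import torch.nn as nn

def entropy_minimization_step(model, x, optimizer):
    """One adaptation step in the spirit of Wang et al. [61]: minimize the
    entropy of the model's predictions on an unlabeled test batch."""
    log_probs = model(x).log_softmax(dim=-1)          # (B, num_classes)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# Illustrative setup: update only the normalization layers' affine
# parameters, a common (assumed) choice in test-time adaptation.
model = nn.Sequential(nn.Linear(128, 64), nn.BatchNorm1d(64),
                      nn.ReLU(), nn.Linear(64, 10))
norm_params = [p for m in model.modules()
               if isinstance(m, nn.BatchNorm1d) for p in m.parameters()]
optimizer = torch.optim.SGD(norm_params, lr=1e-3)
entropy_minimization_step(model, torch.randn(32, 128), optimizer)
```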
Self-supervised Keypoint Estimation.
There have been many recent developments in learning keypoint representations with self-supervision [55, 72, 26, 38, 32, 27, 68, 36, 40]. For example, Jakab et al. [26] propose a video frame reconstruction task that disentangles appearance features and keypoint structure in the bottleneck. This line of work has been extended to control and Reinforcement Learning [32, 36, 40], and the keypoints can be mapped to manually defined human poses by adding an adversarial learning loss [27]. While the results are encouraging, most of them are reported on relatively simple scenes and environments. In our paper, by leveraging the self-supervised task together with the supervised task, we can perform human pose personalization on images in the wild.
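To make this concrete, keypoint-bottleneck models in the style of [26] typically obtain keypoint coordinates from predicted heatmaps with a differentiable spatial soft-argmax; the sketch below shows that mechanism (shapes and names are our assumptions).

```python
import torch

def soft_argmax_2d(heatmaps):
    """Differentiable keypoint extraction: convert (B, K, H, W) heatmaps
    into (B, K, 2) normalized (x, y) coordinates in [-1, 1]."""
    B, K, H, W = heatmaps.shape
    probs = heatmaps.flatten(2).softmax(dim=-1).view(B, K, H, W)
    ys = torch.linspace(-1, 1, H, device=heatmaps.device)
    xs = torch.linspace(-1, 1, W, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalize rows, expect x
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalize cols, expect y
    return torch.stack([x, y], dim=-1)
```

Because the operation is differentiable, reconstruction gradients can flow through the keypoint coordinates, which is what lets the keypoints emerge without labels.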
Transformers.
The Transformer has been widely applied in both language processing [60, 16] and computer vision [63, 46, 23, 49, 56, 17, 11, 4, 73, 6, 37], and recently for pose estimation in particular [66, 54, 41, 33]. For example, Li et al. [33] propose to utilize the encoder-decoder model of the Transformer to perform keypoint regression, which allows for more general-purpose applications and requires fewer priors in the architecture design. Inspired by these works, we apply a Transformer to reason about the relation and mapping between the supervised and self-supervised keypoints.
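As a purely schematic illustration, not the exact architecture of our model, such a mapping can be expressed as a Transformer that encodes the self-supervised keypoints as tokens and decodes the supervised keypoints from learned queries; all dimensions and module choices below are assumptions.

```python
import torch
import torch.nn as nn

class KeypointMappingTransformer(nn.Module):
    """Schematic: map k_self self-supervised keypoints to k_sup supervised
    keypoints with a Transformer encoder-decoder (assumed design)."""
    def __init__(self, k_self=30, k_sup=17, d_model=128):
        super().__init__()
        self.embed = nn.Linear(2, d_model)            # lift (x, y) to tokens
        self.queries = nn.Parameter(torch.randn(k_sup, d_model))
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.head = nn.Linear(d_model, 2)             # decode (x, y) per joint

    def forward(self, self_kps):                      # (B, k_self, 2)
        tokens = self.embed(self_kps)
        queries = self.queries.unsqueeze(0).expand(self_kps.size(0), -1, -1)
        out = self.transformer(tokens, queries)       # (B, k_sup, d_model)
        return self.head(out)                         # (B, k_sup, 2)
```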
3 Method
Our method aims to generalize better for pose estimation on a single image by personalizing with unlabeled data. The model is first trained on diverse data with both a supervised pose estimation task and a self-supervised keypoint estimation task, using our proposed Transformer design to model the relation between the two tasks. During inference, the model conducts Test-Time Personalization, which requires only the self-supervised keypoint estimation task, boosting performance without costly labeling or sacrificing privacy. The whole pipeline is shown in Figure 2.
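To make the two phases concrete, below is a minimal sketch of the training and inference loops; `model.supervised_loss`, `model.selfsup_loss`, the loss weight, and the step count are hypothetical names and simplifications, not the exact implementation.

```python
import torch

def joint_training_step(model, labeled_batch, optimizer, w_selfsup=1.0):
    """Training phase: optimize the supervised pose loss and the
    self-supervised keypoint loss together on diverse labeled data."""
    loss = (model.supervised_loss(labeled_batch)      # hypothetical API
            + w_selfsup * model.selfsup_loss(labeled_batch))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def test_time_personalization(model, person_frames, steps=20, lr=1e-5):
    """Inference phase: fine-tune with the self-supervised task only, on the
    test person's unlabeled frames, then predict poses as usual."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        for unlabeled_batch in person_frames:
            loss = model.selfsup_loss(unlabeled_batch)  # no labels required
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```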
3.1 Joint Training for Pose Estimation with a Transformer
Given a set of $N$ labeled images of a single person $I = \{I_1, I_2, \ldots, I_N\}$, a shared encoder $\phi$ maps them into the feature space $F = \{F_1, F_2, \ldots, F_N\}$, which is shared by both the supervised and the self-supervised keypoint estimation tasks. We introduce both tasks and the joint framework as follows.
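In code, this shared-feature design might look like the following sketch; the convolutional backbone and the head shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoderPoseModel(nn.Module):
    """Sketch: one shared encoder phi feeds both a supervised pose head
    and a self-supervised keypoint head (dimensions are assumptions)."""
    def __init__(self, feat_ch=256, k_sup=17, k_self=30):
        super().__init__()
        self.phi = nn.Sequential(                     # shared encoder phi
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.sup_head = nn.Conv2d(feat_ch, k_sup, 1)    # supervised heatmaps
        self.self_head = nn.Conv2d(feat_ch, k_self, 1)  # self-sup. heatmaps

    def forward(self, img):                           # img: (B, 3, H, W)
        feats = self.phi(img)                         # shared features F_i
        return self.sup_head(feats), self.self_head(feats)
```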
3.1.1 Self-supervised Keypoint Estimation
For the self-supervised task, we build upon the work of Jakab et al. [26], which uses an image reconstruction task to disentangle human structure and appearance, yielding self-supervised keypoints as intermediate results. Given two images of a single person, $I_s$ and $I_t$, the task aims at reconstructing $I_t$ using structural keypoint information from the target $I_t$ and appearance information from the source $I_s$. The appearance information $F_s^{\mathrm{app}}$ of source image $I_s$ is extracted with a
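Although the paragraph above is cut off at the page break, the overall [26]-style reconstruction objective it describes can be sketched as follows; rendering keypoints as Gaussian heatmaps and the decoder interface are assumptions on our part.

```python
import torch

def render_gaussian_maps(kps, H, W, sigma=0.1):
    """Render (B, K, 2) normalized keypoints in [-1, 1] as (B, K, H, W)
    Gaussian heatmaps, the structural bottleneck representation."""
    ys = torch.linspace(-1, 1, H).view(1, 1, H, 1)
    xs = torch.linspace(-1, 1, W).view(1, 1, 1, W)
    x = kps[..., 0].view(*kps.shape[:2], 1, 1)
    y = kps[..., 1].view(*kps.shape[:2], 1, 1)
    return torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def reconstruction_loss(decoder, app_feat_s, kps_t, target_img):
    """Reconstruct I_t from the appearance of I_s and the keypoints of I_t;
    'decoder' is an assumed generator combining the two inputs."""
    H, W = target_img.shape[-2:]
    heatmaps_t = render_gaussian_maps(kps_t, H, W)    # structure of I_t
    recon = decoder(app_feat_s, heatmaps_t)           # predicted I_t
    return ((recon - target_img) ** 2).mean()         # pixel reconstruction
```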