Masked Autoencoders Are Scalable Vision Learners
Kaiming He∗,†  Xinlei Chen∗  Saining Xie  Yanghao Li  Piotr Dollár  Ross Girshick
∗equal technical contribution  †project lead
Facebook AI Research (FAIR)
Abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
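To make the two core designs concrete, here is a minimal PyTorch-style sketch of the pipeline, written for this excerpt rather than taken from the paper's released code. The class name TinyMAE and all widths and depths are illustrative assumptions; only the structure follows the text above: embed patches, encode the visible 25%, append mask tokens after the encoder, decode to pixels, and average the per-patch MSE over masked patches only.

```python
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    """Minimal sketch of the asymmetric MAE pipeline (illustrative sizes,
    not the paper's configuration)."""

    def __init__(self, num_patches=196, patch_dim=768,
                 enc_dim=256, dec_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, enc_dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dec_dim, patch_dim)  # predict pixels per patch

    def forward(self, patches):  # patches: (B, N, patch_dim) flattened pixels
        B, N, _ = patches.shape
        keep = int(N * (1 - self.mask_ratio))            # e.g., keep 25%
        shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
        restore = shuffle.argsort(dim=1)                 # inverse permutation
        keep_idx = shuffle[:, :keep]                     # random visible subset

        x = self.embed(patches) + self.pos               # embed + positions
        visible = torch.gather(
            x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        latent = self.encoder(visible)                   # encoder: visible only

        y = self.enc_to_dec(latent)
        mask_tokens = self.mask_token.expand(B, N - keep, -1)
        y = torch.cat([y, mask_tokens], dim=1)           # mask tokens after encoder
        y = torch.gather(                                # unshuffle to image order
            y, 1, restore.unsqueeze(-1).expand(-1, -1, y.size(-1)))
        pred = self.head(self.decoder(y + self.dec_pos))

        is_masked = torch.ones(B, N, device=patches.device).scatter(
            1, keep_idx, 0.0)                            # 1 = masked patch
        loss = ((pred - patches) ** 2).mean(dim=-1)      # per-patch MSE
        return (loss * is_masked).sum() / is_masked.sum()  # masked patches only
```

A call such as `TinyMAE()(torch.randn(2, 196, 768))` returns the scalar reconstruction loss; after pre-training, only the encoder would be kept for recognition, as Figure 1 below describes.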
1. Introduction
Deep learning has witnessed an explosion of architectures of continuously growing capability and capacity [33, 25, 57]. Aided by the rapid gains in hardware, models today can easily overfit one million images [13] and begin to demand hundreds of millions of—often publicly inaccessible—labeled images [16].
This appetite for data has been successfully addressed in natural language processing (NLP) by self-supervised pre-training. The solutions, based on autoregressive language modeling in GPT [47, 48, 4] and masked autoencoding in BERT [14], are conceptually simple: they remove a portion of the data and learn to predict the removed content. These methods now enable training of generalizable NLP models containing over one hundred billion parameters [4].
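As a toy sketch of that pretext task (our own example, not from the paper): a BERT-style setup hides some tokens and trains the model to predict exactly the hidden content.

```python
# Toy illustration of the NLP pretext task: remove a portion of the data
# and train a model to predict what was removed (BERT-style masking).
sentence = ["the", "cat", "sat", "on", "the", "mat"]
masked   = ["the", "cat", "[MASK]", "on", "the", "[MASK]"]
targets  = {2: "sat", 5: "mat"}   # the model learns to fill these back in
```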
Figure 1. Our MAE architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder is applied to the small subset of visible patches. Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image in pixels. After pre-training, the decoder is discarded and the encoder is applied to uncorrupted images (full sets of patches) for recognition tasks.

The idea of masked autoencoders, a form of more general denoising autoencoders [58], is natural and applicable in computer vision as well. Indeed, closely related research in vision [59, 46] preceded BERT. However, despite significant interest in this idea following the success of BERT, progress of autoencoding methods in vision lags behind NLP. We ask: what makes masked autoencoding different between vision and language? We attempt to answer this question from the following perspectives:
(i) Until recently, architectures were different. In vision, convolutional networks [34] were dominant over the last decade [33]. Convolutions typically operate on regular grids and it is not straightforward to integrate ‘indicators’ such as mask tokens [14] or positional embeddings [57] into convolutional networks. This architectural gap, however, has been addressed with the introduction of Vision Transformers (ViT) [16] and should no longer present an obstacle.
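As a toy illustration of this point (our own example with made-up sizes, not from the paper): in a ViT-style token sequence, a mask ‘indicator’ is just a vector written into the sequence and a position is an additive embedding, whereas a convolution's dense pixel grid offers no natural slot for either.

```python
import torch

# In a token sequence, masking a patch is a simple substitution and position
# is additive; both vectors would be learned nn.Parameters in a real model.
dim, num_patches = 64, 16
tokens = torch.randn(1, num_patches, dim)      # ViT-style patch tokens
mask_token = torch.zeros(1, 1, dim)            # mask 'indicator' vector
pos_embed = torch.zeros(1, num_patches, dim)   # positional embeddings

masked_ids = torch.tensor([2, 5, 11])          # arbitrary patches to mask
tokens[:, masked_ids] = mask_token             # drop-in substitution per token
tokens = tokens + pos_embed                    # positions simply add
```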
(ii) Information density is different between language and vision. Languages are human-generated signals that are highly semantic and information-dense. When training a model to predict only a few missing words per sentence, this task appears to induce sophisticated language understanding. Images, on the contrary, are natural signals with heavy spatial redundancy—e.g., a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes.
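As a back-of-envelope illustration of why such aggressive masking also helps efficiency (our own arithmetic, using the standard ViT setting of 224×224 images and 16×16 patches, i.e., 196 tokens): at a 75% masking ratio the encoder processes only 196 × 0.25 = 49 tokens, so the quadratic self-attention terms shrink by roughly (196/49)² = 16× and the per-token MLP terms by 4×, consistent in spirit with the 3× or more training speedup claimed in the abstract.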