[정안나] Masked Autoencoders Are Scalable Vision Learners

Basic Information

Conference: CVPR 2022
Authors: Kaiming He, et al. (Facebook AI Research)
Link: https://arxiv.org/abs/2111.06377

Abstract

Approach: Masked autoencoders (MAE) mask random patches of the input image and reconstruct the missing pixels.
Model design:
1. Develop an asymmetric encoder-decoder architecture -> the encoder operates only on the visible subset of patches (no mask tokens) + a lightweight decoder reconstructs the original image from the latent representation and mask tokens
2. Mask a large portion of the image (e.g., 75%) to force the model to learn meaningful patterns by predicting the masked regions.
Result: MAE pretraining transfers to downstream tasks better than supervised pretraining, and accuracy keeps improving as the model is scaled up.

1. Introduction

Problem: Deep learning (DL) models are becoming more powerful but need extremely large labeled datasets, which are often hard to obtain.
Solution: Autoregressive language modeling (e.g., GPT) and masked autoencoding (e.g., BERT) remove a portion of the data and learn to predict the removed content.
- Autoregressive language modeling: the model predicts the next word using the previous words in a sequence.

What makes masked autoencoding different between vision and language?

Until recently, architectures were different: CNNs were dominant in vision but were hard to use with masking methods. Vision Transformer (ViT) solved this.
Information density differs between language and vision: language is highly semantic and information-dense, so predicting even a few missing words requires sophisticated language understanding. Images are spatially redundant, so a missing patch can often be recovered from neighboring patches without understanding the whole object or scene.
The autoencoder's decoder plays a different role for text and images: in language, the decoder predicts missing words, which carry rich semantic information, whereas in vision it reconstructs pixels, which sit at a lower semantic level.

Figure 2: (Left) Masked image, (Middle) MAE reconstruction, (Right) Ground truth. The image is split into 14 x 14 = 196 patches. 80% are removed, so only 39 patches remain visible. In the paper, Figure 3 shows the same kind of results on COCO validation images, and Figure 4 shows ImageNet images reconstructed by a model pre-trained with a 75% masking ratio.
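The patch counts in the caption can be checked with a little arithmetic. This is a minimal sketch, assuming a 224 x 224 input and 16 x 16 patches (the usual ViT setting that yields a 14 x 14 grid; these sizes are not stated in the notes above). The helper name `patch_counts` is hypothetical.

```python
def patch_counts(image_size=224, patch_size=16, mask_ratio=0.80):
    """Return (total, visible, masked) patch counts for a square image."""
    per_side = image_size // patch_size        # 224 // 16 = 14
    total = per_side * per_side                # 14 * 14 = 196
    visible = int(total * (1 - mask_ratio))    # round down to whole patches
    return total, visible, total - visible

print(patch_counts(mask_ratio=0.80))  # (196, 39, 157)
print(patch_counts(mask_ratio=0.75))  # (196, 49, 147)
```

At the 75% ratio used for most experiments, the encoder therefore sees 49 of the 196 patches.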

2. Related Work

Masked language modeling: BERT and GPT hold out a portion of the input sequence and train models to predict the missing content.
Autoencoding: learns representations by encoding data into a latent form and reconstructing it. MAE is a denoising version that reconstructs inputs from corrupted (masked) data.
Masked image encoding: examples below
- Context Encoder -> filling missing regions
- iGPT -> predicting missing pixels
- ViT -> predicting masked patches
- BEiT -> predicting discrete visual tokens
Self-supervised learning: includes contrastive learning, where a model learns by pulling similar samples together and pushing different samples apart.

3. Approach

MAE's encoder sees only the visible patches of the image, and the small decoder uses the encoded patches + mask tokens to rebuild the missing parts.

Masking: We randomly remove image patches. Such random sampling with high masking ratio creates a task that cannot be easily solved by extrapolation from visible neighboring patches.
MAE encoder: Our encoder is applied only to visible, unmasked patches. Masked patches are removed; no mask tokens are used.
MAE decoder: The input is the full set of tokens consisting of (i) encoded visible patches and (ii) mask tokens. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Because of the asymmetric encoder-decoder design, the full set of tokens is processed only by the lightweight decoder, which significantly reduces pre-training time.
Reconstruction target: Each element in the decoder's output is a vector of pixel values representing a patch. The output is then reshaped to form a reconstructed image. The loss function computes the mean squared error (MSE) between the reconstructed and original images in pixel space, and it is computed only on masked patches.
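The four pieces above can be sketched end to end in plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: patches are short lists of numbers, the encoder and decoder are identity stand-ins, the mask token is a fixed zero vector rather than a learned one, and the function names (`random_masking`, `build_decoder_input`, `masked_mse`) are hypothetical.

```python
import random

def random_masking(num_patches, mask_ratio, rng):
    """Shuffle patch indices and keep the first (1 - mask_ratio) share.

    Returns (keep_ids, mask), where mask[i] == 1 means patch i was removed.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    keep_ids = sorted(shuffled[:num_keep])   # sorted only for readability
    mask = [1] * num_patches
    for i in keep_ids:
        mask[i] = 0
    return keep_ids, mask

def build_decoder_input(encoded_visible, keep_ids, num_patches, mask_token):
    """Put encoded visible tokens back at their original positions and
    fill every removed position with the shared mask token."""
    tokens = [list(mask_token) for _ in range(num_patches)]
    for vec, idx in zip(encoded_visible, keep_ids):
        tokens[idx] = list(vec)
    return tokens

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    losses = [
        sum((p - t) ** 2 for p, t in zip(pred_patch, target_patch))
        / len(target_patch)
        for pred_patch, target_patch, m in zip(pred, target, mask)
        if m == 1
    ]
    return sum(losses) / len(losses) if losses else 0.0

rng = random.Random(0)                       # fixed seed for repeatability
num_patches, dim = 196, 4
patches = [[rng.random() for _ in range(dim)] for _ in range(num_patches)]

keep_ids, mask = random_masking(num_patches, 0.75, rng)
visible = [patches[i] for i in keep_ids]     # the encoder sees only these
encoded = visible                            # stand-in for the ViT encoder
decoder_input = build_decoder_input(encoded, keep_ids, num_patches, [0.0] * dim)
prediction = decoder_input                   # stand-in for the decoder

print(len(visible), len(decoder_input))      # 49 196
print(masked_mse(prediction, patches, mask))
```

Because the stand-in decoder copies its input, the visible patches are "reconstructed" perfectly, yet the loss is still positive: only the masked positions contribute. In the paper, positional embeddings are also added to every token of the full set so that mask tokens carry location information; the sketch omits this.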

4. ImageNet Experiments

The authors evaluate MAE using ImageNet-1K. The process has two stages:

Self-supervised pretraining
- Train MAE on ImageNet images without labels.
- The model learns to reconstruct masked patches.
Supervised evaluation
They test the learned representation in two ways:
- Fine-tuning: train the entire network for classification. -> “How good can the model become?”
- Linear probing: freeze the encoder and train only a linear classifier. -> “Were the learned features already good?”

They report Top-1 accuracy on ImageNet validation.

4.1 Main Properties

This section studies what design choices matter for MAE.

Masking Ratio:
- What did they do? They tested different percentages of masked patches.
- Result: 75% masking works best.
- Why does high masking help? Images contain a lot of redundancy. If only a small part is masked, the model can guess using nearby pixels. High masking forces the model to understand global object structure.
Decoder Design:
- What does it do? The decoder reconstructs pixels.
- Result: A deeper decoder improves linear probing, but fine-tuning performance changes little.
Mask token:
- Traditional masked models: mask tokens go into the encoder.
- MAE: The encoder sees only visible patches and mask tokens are added later in the decoder.
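- Why is this better? Mask tokens never appear in real, uncorrupted images, so keeping them out of the encoder reduces the mismatch between pretraining and downstream use. It also cuts compute, because the encoder processes only the visible patches.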
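A rough back-of-the-envelope sketch of the compute argument, assuming 196 patches and a 75% masking ratio. The quadratic estimate covers self-attention only and ignores the MLP layers and the decoder, so it illustrates the trend rather than a measured speedup. `encoder_token_stats` is a hypothetical helper name.

```python
def encoder_token_stats(num_patches=196, mask_ratio=0.75):
    """Estimate how much work the encoder skips when mask tokens are left out."""
    visible = int(num_patches * (1 - mask_ratio))
    token_fraction = visible / num_patches
    # Self-attention compares every token with every other token, so its
    # cost grows roughly with the square of the sequence length.
    attention_fraction = token_fraction ** 2
    return visible, token_fraction, attention_fraction

print(encoder_token_stats())  # (49, 0.25, 0.0625)
```

Under these assumptions the encoder handles a quarter of the tokens and roughly one sixteenth of the attention cost per layer.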
Reconstruction Target (target -> result):
- Pixels -> good
- Normalized pixels -> best
- PCA features -> worse
- Discrete tokens (BEiT style) -> similar
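The "normalized pixels" target means each patch is normalized by the mean and standard deviation of its own pixels before the loss is computed. A minimal sketch, assuming population variance and a small `eps` for numerical stability (both are choices made here for illustration); `normalize_patch` is a hypothetical name.

```python
import math

def normalize_patch(pixels, eps=1e-6):
    """Normalize one patch by its own mean and standard deviation."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [(p - mean) / math.sqrt(var + eps) for p in pixels]

patch = [10.0, 20.0, 30.0, 40.0]
normalized = normalize_patch(patch)
print([round(v, 3) for v in normalized])  # [-1.342, -0.447, 0.447, 1.342]
```

The normalized patch has zero mean and (approximately) unit variance, so the target emphasizes local contrast rather than absolute brightness.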
Data Augmentation:
- Result: MAE works well with very little augmentation.
- Why? Random masking already creates many training variations.
Mask Sampling Strategy (strategy -> result):
- Random masking -> best
- Block masking -> worse
- Grid masking -> sharper reconstructions, but lower-quality representations than random masking
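To make the three strategies concrete, the sketch below draws each kind of mask on a 14 x 14 patch grid. It is a simplification under stated assumptions: the block mask here removes a single random square, the grid mask keeps the top-left patch of every 2 x 2 cell, and all function names are hypothetical.

```python
import random

def random_mask(side, mask_ratio, rng):
    """Mask a uniformly random subset of patches on a side x side grid."""
    total = side * side
    num_masked = total - int(total * (1 - mask_ratio))
    masked = set(rng.sample(range(total), num_masked))
    return [[1 if r * side + c in masked else 0 for c in range(side)]
            for r in range(side)]

def block_mask(side, block, rng):
    """Mask one random block x block square (a simplification)."""
    top = rng.randrange(side - block + 1)
    left = rng.randrange(side - block + 1)
    return [[1 if top <= r < top + block and left <= c < left + block else 0
             for c in range(side)] for r in range(side)]

def grid_mask(side):
    """Keep one patch in every 2 x 2 cell and mask the other three."""
    return [[0 if r % 2 == 0 and c % 2 == 0 else 1 for c in range(side)]
            for r in range(side)]

def show(mask):
    """Render a mask as text: '#' is masked, '.' is visible."""
    return "\n".join("".join("#" if m else "." for m in row) for row in mask)

rng = random.Random(0)  # fixed seed for repeatability
masks = {
    "random": random_mask(14, 0.75, rng),
    "block": block_mask(14, 10, rng),
    "grid": grid_mask(14),
}
for name, mask in masks.items():
    masked = sum(map(sum, mask))
    print(f"{name}: {masked}/196 patches masked")
    print(show(mask))
```

Random and grid masking both remove 147 of 196 patches here, but the grid leaves a visible neighbor next to every masked patch, which makes the task easier to solve by local interpolation.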
Training Schedule (pretraining epochs -> accuracy):
- 100 -> lower
- 800 -> good
- 1600 -> best

4.2 Comparison with Previous Results

Table 3: MAE achieves state-of-the-art ImageNet accuracy across multiple ViT sizes, outperforming previous self-supervised methods.

Figure 8: X-axis: model size (parameters in millions). Y-axis: ImageNet Top-1 accuracy (%).

Dotted (JFT-300M) -> best but needs massive data
Blue (MAE) -> scales well with model size
Gray X (supervised IN1K) -> saturates quickly
Gray circles (earlier supervised IN1K results from the original ViT paper) -> lowest accuracy

5. Transfer Learning Experiments

Key Takeaways:
1. MAE representations transfer well to many tasks (e.g., object detection and instance segmentation, semantic segmentation, and other classification datasets).
2. MAE often outperforms supervised pretraining.
3. The method scales well with larger models.

6. Discussion and Conclusion

Self-supervised methods like autoencoders may allow computer vision models to scale and improve the same way self-supervised learning did in NLP.
Because images don't naturally contain semantic units like words, MAE masks random patches and reconstructs pixels. Even so, the model learns meaningful visual concepts internally.