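A Deep Dive into the Code of the ViT Model

The Vision Transformer (ViT) marks a notable milestone in the evolution of computer vision. ViT challenged the conventional view that images are best processed by convolutional layers, showing that sequence-based attention mechanisms can effectively capture the complex patterns, context, and semantics in images.

By breaking an image into manageable patches and applying self-attention, ViT captures both local and global relationships, which lets it perform well across a range of vision tasks, from image classification to object detection and beyond. In this article, we take a close look at how ViT for classification works under the hood.

Introduction

The core idea of ViT is to treat an image as a sequence of fixed-size patches, each of which is flattened and projected into a 1D vector. A Transformer encoder then processes this sequence, allowing the model to capture global context and dependencies across the whole image. Working on patches rather than individual pixels keeps the sequence short enough for attention over large images to remain tractable, while preserving the model's ability to represent complex spatial interactions.

First, we import the ViT model for classification from the Hugging Face transformers library:

from transformers import ViTForImageClassification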

import torch

import numpy as np

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

Here, patch16-224 indicates that the model accepts images of size 224x224 and that each patch is 16 pixels wide and 16 pixels high.
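As a quick sanity check on those numbers, the arithmetic behind the checkpoint name can be sketched in plain Python. The variable names are illustrative and not taken from the library, and three input channels (RGB) are assumed.

```python
# Sizes encoded in the checkpoint name "vit-base-patch16-224"
image_size = 224   # input height and width in pixels
patch_size = 16    # patch height and width in pixels
num_channels = 3   # assumption: RGB input

patches_per_side = image_size // patch_size          # patches along each dimension
num_patches = patches_per_side ** 2                  # length of the patch sequence
values_per_patch = num_channels * patch_size ** 2    # raw pixel values in one patch
sequence_length = num_patches + 1                    # one extra classification token

print(patches_per_side, num_patches, values_per_patch, sequence_length)
```

For this model size, the number of raw values in one patch happens to equal the hidden size of 768, so the patch projection neither compresses nor expands the data.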

The model architecture looks like this:

ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): PatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0): ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): ViTOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        .......
        (11): ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): ViTOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
    (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  )
  (classifier): Linear(in_features=768, out_features=1000, bias=True)
)

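Embeddings

Patch embeddings

Splitting the image into patches is done with a Conv2D layer. A Conv2D layer normally performs a 2D convolution over its input to learn features and patterns in the image. Here, however, it is used to cut the image into an NxN grid of patches by means of the stride parameter, which sets the step size with which the filter slides over the input. Since our image is 224x224 and each patch is 16 pixels on a side, there are 224/16 = 14 patches along each dimension. With a kernel size of 16 and a stride of 16, the image is therefore split into 14 x 14 = 196 non-overlapping patches.

To make this more concrete, suppose the image is 4x4 and the stride is 2: each 2x2 window becomes one patch, so the image is split into a 2x2 grid of four non-overlapping patches.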
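The toy case can be written out with NumPy. This is only an illustration of the slicing pattern on a single-channel 4x4 array whose values are made up; it is not the model's code.

```python
import numpy as np

image = np.arange(16).reshape(4, 4)   # toy single-channel 4x4 "image"
stride = 2                            # patch size equals the stride, so no overlap

# Slide a 2x2 window in steps of 2, row by row
patches = [
    image[i:i + stride, j:j + stride]
    for i in range(0, 4, stride)
    for j in range(0, 4, stride)
]

for k, patch in enumerate(patches):
    print(f"patch {k}:\n{patch}")
```

Every pixel lands in exactly one patch, which is what a kernel size equal to the stride guarantees.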

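In the real model, for example, the first and second patches are computed as follows (here image is a preprocessed input tensor of shape (batch_size, 3, 224, 224)):

proj = model.vit.embeddings.patch_embeddings.projection
# filters of shape (768, 3, 16, 16) and biases of shape (768,)
w, b = proj.weight, proj.bias

# rows 0-15, columns 0-15, first filter
torch.allclose(torch.sum(image[0, :, 0:16, 0:16] * w[0]) + b[0],
               proj(image)[0][0][0, 0], atol=1e-6)
# True

# rows 16-31, columns 0-15, first filter
torch.allclose(torch.sum(image[0, :, 16:32, 0:16] * w[0]) + b[0],
               proj(image)[0][0][1, 0], atol=1e-6)
# True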

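The pattern is clear: to compute each patch we step 16 pixels at a time, so the patches do not overlap. Applying this to the whole image gives a 1 x 14 x 14 tensor in which each patch is represented by a single number computed by the first Conv2D filter. There are 768 filters, however, so we end up with a 768 x 14 x 14 tensor. For each patch we now have a 768-dimensional representation, and this is our patch embedding. We then flatten and transpose the tensor so that the embedding has shape [batch_size, 196, 768], where the two spatial dimensions are flattened into 14 x 14 = 196. In effect we now have a sequence of 196 tokens, each with an embedding size of 768.

# shape (batch_size, 768, 14, 14)
embeddings = model.vit.embeddings.patch_embeddings.projection(image)
# shape (batch_size, 196, 768)
embeddings = embeddings.flatten(2).transpose(1, 2)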

If we want to reproduce the whole layer from scratch, this is the code:

batch_size = 1
F = 768      # number of filters
H1 = 14      # output height: 224 / 16
W1 = 14      # output width: 224 / 16
stride = 16
HH = 16      # patch height
WW = 16      # patch width
w = model.vit.embeddings.patch_embeddings.projection.weight
b = model.vit.embeddings.patch_embeddings.projection.bias
out = np.zeros((batch_size, F, H1, W1))
for n in range(batch_size):
    for f in range(F):
        for i in range(H1):
            for j in range(W1):
                # dot product of one 3 x 16 x 16 patch with filter f, plus its bias
                patch = image[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
                out[n, f, i, j] = (torch.sum(patch * w[f]) + b[f]).item()

# compare against the Conv2D output before it is flattened and transposed
conv_out = model.vit.embeddings.patch_embeddings.projection(image)
np.allclose(out[0], conv_out[0].detach().numpy(), atol=1e-5)

# True
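Because the kernel size equals the stride, the loop above amounts to a single linear projection applied to every flattened patch. The NumPy sketch below checks this on small random arrays; the sizes and names are illustrative assumptions chosen to keep it fast, not values read from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W, P, D = 2, 3, 32, 32, 16, 8      # toy sizes: 2x2 grid of patches, 8 filters
image = rng.standard_normal((B, C, H, W))
weight = rng.standard_normal((D, C, P, P))  # same layout as a Conv2d weight
bias = rng.standard_normal(D)
n = H // P                                  # patches per side

# Strided convolution written as an explicit loop over patch positions
conv_out = np.zeros((B, D, n, n))
for i in range(n):
    for j in range(n):
        patch = image[:, :, i * P:(i + 1) * P, j * P:(j + 1) * P]   # (B, C, P, P)
        conv_out[:, :, i, j] = np.einsum("bchw,dchw->bd", patch, weight) + bias

# flatten(2).transpose(1, 2): (B, D, n, n) -> (B, n*n, D)
conv_seq = conv_out.reshape(B, D, n * n).transpose(0, 2, 1)

# Same result: cut the image into flattened patches, then apply one matrix multiply
patches = image.reshape(B, C, n, P, n, P).transpose(0, 2, 4, 1, 3, 5).reshape(B, n * n, C * P * P)
linear_out = patches @ weight.reshape(D, -1).T + bias               # (B, n*n, D)

print(np.allclose(conv_seq, linear_out))
```

Seen this way, the Conv2D layer is just a convenient and efficient way to apply the same linear layer to each patch.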

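If you are familiar with language Transformers, you will recall the [CLS] token, whose representation serves as a compact, informative summary of the whole text and lets the model make predictions from the features extracted by the Transformer encoder. ViT also has a [CLS] token that plays the same role as in text; it is prepended to the patch representations computed above. The [CLS] token is a parameter that is learned through backpropagation:

import torch.nn as nn

# randomly initialized before training; in the pretrained model the learned
# value is stored in model.vit.embeddings.cls_token
cls_token = nn.Parameter(torch.randn(1, 1, 768))
cls_tokens = cls_token.expand(batch_size, -1, -1)
# prepend the [CLS] token
embeddings = torch.cat((cls_tokens, embeddings), dim=1)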

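Position embeddings

Just as in language Transformers, ViT includes position embeddings to preserve the positional information of the patches. Position embeddings help the model understand the spatial relationships between patches, which lets it capture the structure of the image.

The position embeddings are a learned tensor of shape [1, 197, 768]. It broadcasts over the batch dimension when added to the embeddings computed above, which have shape [batch_size, 197, 768] once the [CLS] token is included.

embeddings = embeddings + model.vit.embeddings.position_embeddings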
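The shape bookkeeping for the [CLS] token and the position embeddings can be checked with random stand-ins. In this NumPy sketch none of the arrays are the model's learned parameters; only the shapes match the text.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_patches, hidden = 2, 196, 768

patch_embeddings = rng.standard_normal((batch_size, num_patches, hidden))
cls_token = rng.standard_normal((1, 1, hidden))                           # stand-in parameter
position_embeddings = rng.standard_normal((1, num_patches + 1, hidden))   # stand-in parameter

# Repeat the single [CLS] vector for every image and put it in front of the patches
cls_tokens = np.broadcast_to(cls_token, (batch_size, 1, hidden))
tokens = np.concatenate([cls_tokens, patch_embeddings], axis=1)           # (2, 197, 768)

# The leading dimension of size 1 broadcasts over the batch
tokens_with_pos = tokens + position_embeddings

print(tokens_with_pos.shape)
```

Every image in the batch receives the same [CLS] vector and the same position offsets; only the patch content differs.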

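Dropout

The embedding layer ends with a Dropout layer. Dropout replaces some values with zero with a given drop probability. It helps reduce overfitting: because the signal from randomly chosen neurons is masked, the network has to find other paths to reduce the loss, and so it learns to generalize rather than rely on particular paths. Dropout can also be viewed as a model-ensembling technique: at every training step a random subset of neurons is deactivated, and at evaluation time these "different" networks are effectively combined.

At the end of the embedding layer we have:

# compute the embedding
embeddings = model.vit.embeddings.patch_embeddings.projection(image)
embeddings = embeddings.flatten(2).transpose(1, 2)
# prepend the [CLS] token
cls_token = model.vit.embeddings.cls_token
cls_tokens = cls_token.expand(batch_size, -1, -1)
embeddings = torch.cat((cls_tokens, embeddings), dim=1)
# position embeddings
embeddings = embeddings + model.vit.embeddings.position_embeddings
# dropout
embeddings = model.vit.embeddings.dropout(embeddings)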
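To make the Dropout step concrete, here is a minimal NumPy sketch of inverted dropout. The function name and signature are hypothetical; the rescaling by 1 / (1 - p) during training mirrors what torch.nn.Dropout does, so that the layer can be a plain pass-through at evaluation time.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each value with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x                                  # pass-through at evaluation time
    keep_mask = rng.random(x.shape) >= p          # True for the values that survive
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((4, 5))
train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)
print(train_out)
```

Note that the printed architecture lists Dropout(p=0.0, inplace=False), so for this checkpoint the layer leaves the embeddings unchanged even in training mode.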

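Encoder

ViT uses a stack of Transformer encoder blocks similar to those used in language models such as BERT. Each encoder block consists of multi-head self-attention and a feed-forward network. The self-attention mechanism lets the model capture relationships between patches, while the feed-forward network applies a non-linear transformation.

Specifically, each layer consists of a self-attention module, an intermediate module, and an output module.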

(0): ViTLayer(
  (attention): ViTAttention(
    (attention): ViTSelfAttention(
      (query): Linear(in_features=768, out_features=768, bias=True)
      (key): Linear(in_features=768, out_features=768, bias=True)
      (value): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (output): ViTSelfOutput(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (intermediate): ViTIntermediate(
    (dense): Linear(in_features=768, out_features=3072, bias=True)
  )
  (output): ViTOutput(
    (dense): Linear(in_features=3072, out_features=768, bias=True)
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)

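Self-attention

Self-attention is a key mechanism in the Vision Transformer (ViT) that lets it capture relationships and dependencies between the patches of an image. It plays a central role in extracting contextual information and in modelling both long-range and short-range interactions between patches.

Each patch is associated with three vectors: a key, a query, and a value. These vectors are obtained by learned linear transformations of the patch embedding. The query vector is what the current patch uses to ask about the other patches, the key vector is what each patch exposes to be matched against those queries, and the value vector holds the information that is passed on to the patches attending to it. Since we already computed the embeddings in the previous section, we obtain the keys, queries, and values by projecting the embeddings with the key, query, and value matrices:

import math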

import torch.nn as nn

torch.manual_seed(0)

hidden_size = 768

num_attention_heads = 12

attention_head_size = hidden_size // num_attention_heads # 64

hidden_states = embeddings

# apply LayerNorm to the embeddings

hidden_states = model.vit.encoder.layer[0].layernorm_before(hidden_states)

# take first layer of the Transformer

layer_0 = model.vit.encoder.layer[0]

# shape (768, 64)

key_matrix = layer_0.attention.attention.key.weight.T[:, :attention_head_size]

key_bias = layer_0.attention.attention.key.bias[:attention_head_size]

query_matrix = layer_0.attention.attention.query.weight.T[:, :attention_head_size]

query_bias = layer_0.attention.attention.query.bias[:attention_head_size]

value_matrix = layer_0.attention.attention.value.weight.T[:, :attention_head_size]

value_bias = layer_0.attention.attention.value.bias[:attention_head_size]

# compute key, query and value for the first head attention

# all of shape (b_size, 197, 64)

key_1head = hidden_states @ key_matrix + key_bias

query_1head = hidden_states @ query_matrix + query_bias

value_1head = hidden_states @ value_matrix + value_bias

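Note that we applied LayerNorm above without explaining it; we will come back to it later.

For each query vector, an attention score is computed by measuring the compatibility, or similarity, between the query and the key vectors of all patches. This is done with a dot product, followed by a Softmax to obtain normalized attention scores of shape [b_size, 197, 197]. The attention matrix is square because every patch attends to every other patch, which is why it is called self-attention. These scores indicate how much focus to place on each patch when processing the query patch. Because the new embedding of each patch in the next layer is derived from the attention scores and the values of all patches, we obtain a contextual embedding for each patch: it is a function of every other patch in the image.

To clarify this further, recall that at the start we used a Conv2D layer to split the image into patches and obtain a 768-dimensional embedding vector for each one. Those embeddings are independent, since the patches do not interact (they do not overlap). In the Transformer layers, however, the patch embeddings are mixed together and become functions of the other patch embeddings. For example, the embeddings after the first layer are computed as follows:

# shape (b_size, 197, 197)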

# compute the attention scores by dot product of query and key

attention_scores_1head = torch.matmul(query_1head, key_1head.transpose(-1, -2))

attention_scores_1head = attention_scores_1head / math.sqrt(attention_head_size)

attention_probs_1head = nn.functional.softmax(attention_scores_1head, dim=-1)

# contextualized embedding for this layer

context_layer_1head = torch.matmul(attention_probs_1head, value_1head)
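The same single-head computation can be written as a small standalone function. The NumPy sketch below uses toy sizes and random inputs in place of the model's projections, so the numbers it prints carry no meaning beyond illustrating the mechanics.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def single_head_attention(query, key, value):
    """Scaled dot-product attention for one head; inputs are (tokens, head_size)."""
    head_size = query.shape[-1]
    scores = query @ key.T / np.sqrt(head_size)   # (tokens, tokens)
    probs = softmax(scores)                       # each row sums to 1
    return probs @ value, probs

rng = np.random.default_rng(0)
tokens, head_size = 5, 4                          # toy sizes instead of 197 and 64
q, k, v = (rng.standard_normal((tokens, head_size)) for _ in range(3))
context, probs = single_head_attention(q, k, v)
print(probs.sum(axis=-1))
```

Each row of probs is a set of non-negative weights that sum to one, so each output row is a weighted average of the value vectors.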

If we zoom in on the first patch:

patch_n = 1
# shape (197,)

print(attention_probs_1head[0, patch_n])

[2.4195e-01, 7.3293e-01, ..,

2.6689e-06, 4.6498e-05, 1.1380e-04, 5.1591e-06, 2.1265e-05],

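Its new embedding (token index 0 is the [CLS] token) is a combination of the embeddings of different patches: it attends mostly to the first patch itself (0.73) and to the [CLS] token (0.24), with the remaining attention spread over all the other patches. This is not always the case, though. In the next layer, the first patch may attend more to its neighbouring patches than to itself and the [CLS] token, or even to distant patches, depending on what the model finds useful for the task at hand.

You may also have noticed that I selected only the first 64 columns of the query, key, and value weight matrices. These 64 columns make up the first attention head, but this model size has 12 attention heads, and each head produces a different representation of the patches. For instance, in the third attention head the first patch attends more to the second patch (0.27) than to itself, unlike in the first head.

# shape (197,)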

[2.6356e-01, 1.2783e-03, 2.6888e-01, ... , 1.8458e-02]

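Different attention heads therefore capture different kinds of relationships between patches, which helps the model look at the image from several perspectives.

To compute all of these heads in parallel, we can proceed as follows:

def transpose_for_scores(x: torch.Tensor) -> torch.Tensor:
    # (b_size, 197, 768) -> (b_size, 12, 197, 64)
    new_x_shape = x.size()[:-1] + (num_attention_heads, attention_head_size)
    x = x.view(new_x_shape)
    return x.permute(0, 2, 1, 3)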

mixed_query_layer = layer_0.attention.attention.query(hidden_states)

key_layer = transpose_for_scores(layer_0.attention.attention.key(hidden_states))

value_layer = transpose_for_scores(layer_0.attention.attention.value(hidden_states))

query_layer = transpose_for_scores(mixed_query_layer)

# Take the dot product between "query" and "key" to get the raw attention scores.

attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

attention_scores = attention_scores / math.sqrt(attention_head_size)

# Normalize the attention scores to probabilities.

attention_probs = nn.functional.softmax(attention_scores, dim=-1)

# This is actually dropping out entire tokens to attend to, which might

# seem a bit unusual, but is taken from the original Transformer paper.

attention_probs = layer_0.attention.attention.dropout(attention_probs)

context_layer = torch.matmul(attention_probs, value_layer)

context_layer = context_layer.permute(0, 2, 1, 3).contiguous()

new_context_layer_shape = context_layer.size()[:-2] + (hidden_size,)

context_layer = context_layer.view(new_context_layer_shape)
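To see that the parallel version really computes the same thing as the per-head slicing used earlier, the NumPy sketch below runs both on toy data and compares head 0. The projection matrices here are random and stored as (in_features, out_features), which is an assumption of this sketch; PyTorch's nn.Linear stores the transpose.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def split_heads(x, num_heads):
    """(batch, tokens, hidden) -> (batch, heads, tokens, head_size)."""
    batch, tokens, hidden = x.shape
    return x.reshape(batch, tokens, num_heads, hidden // num_heads).transpose(0, 2, 1, 3)

rng = np.random.default_rng(0)
batch, tokens, hidden, num_heads = 2, 5, 12, 3      # toy sizes
head_size = hidden // num_heads
x = rng.standard_normal((batch, tokens, hidden))
w_q, w_k, w_v = (rng.standard_normal((hidden, hidden)) for _ in range(3))

# All heads at once
q, k, v = (split_heads(x @ w, num_heads) for w in (w_q, w_k, w_v))
probs = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_size))
context = probs @ v                                  # (batch, heads, tokens, head_size)
merged = context.transpose(0, 2, 1, 3).reshape(batch, tokens, hidden)

# Head 0 on its own, using only the first head_size columns of each matrix
q0, k0, v0 = (x @ w[:, :head_size] for w in (w_q, w_k, w_v))
context0 = softmax(q0 @ k0.transpose(0, 2, 1) / np.sqrt(head_size)) @ v0

print(np.allclose(context[:, 0], context0))
```

After the heads are merged, the first head_size features of every token are exactly the output of head 0, the next head_size features belong to head 1, and so on.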

After self-attention, we apply another projection layer and Dropout, and the self-attention block is complete:

output_weight = layer_0.attention.output.dense.weight

output_bias = layer_0.attention.output.dense.bias

attention_output = context_layer @ output_weight.T + output_bias

attention_output = layer_0.attention.output.dropout(attention_output)

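Oh, wait: I promised to explain the LayerNorm operation. Layer normalization is a normalization technique used to improve the training and performance of deep learning models. It addresses the problem of internal covariate shift: as the weights of a network change during training, the distribution of the inputs to each layer can shift significantly, which makes it hard for the model to converge. Layer normalization stabilizes learning by ensuring that the inputs to each layer have a consistent mean and variance. It standardizes each patch embedding using its own mean and standard deviation so that it has zero mean and unit variance, and then applies a trained weight and bias so that the model can rescale and shift the result as needed during training. Because the mean and standard deviation are computed independently for each example, this differs from batch normalization, which normalizes across the batch dimension and therefore depends on the other examples in the batch.

Let us take the first token embedding as an example (index 0 is the [CLS] token; the same computation applies to every patch):

first_patch_embed = embeddings[0][0]
# mean over the 768 features
first_patch_mean = first_patch_embed.mean()
# (biased) variance over the 768 features
first_patch_var = (first_patch_embed - first_patch_mean).pow(2).mean()
# standardize, using the same eps=1e-12 as the model's LayerNorm
first_patch_standardized = (first_patch_embed - first_patch_mean) / torch.sqrt(first_patch_var + 1e-12)
# apply trained weight and bias vectors
first_patch_norm = layer_0.layernorm_before.weight * first_patch_standardized + layer_0.layernorm_before.bias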
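The per-token nature of layer normalization, and its independence from the rest of the batch, can be checked with a short NumPy sketch. The function below is a simplified stand-in and not the library's implementation; the weight and bias are set to ones and zeros so that only the standardization is visible.

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-12):
    """Normalize each token over its feature axis, then apply a scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)       # biased variance over the features
    return (x - mean) / np.sqrt(var + eps) * weight + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 197, 768)) * 3.0 + 5.0
normed = layer_norm(x, weight=np.ones(768), bias=np.zeros(768))

# The first example is normalized identically whether or not the second one is present
alone = layer_norm(x[:1], weight=np.ones(768), bias=np.zeros(768))

print(np.abs(normed.mean(axis=-1)).max(), normed.var(axis=-1).min())
```

Batch normalization would fail the second check, because its statistics are shared across the examples in a batch.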


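Intermediate layer

Before the Intermediate class we apply another layer normalization and a residual connection. By now it should be clear why we apply another layer normalization: we want to normalize the contextual embeddings coming out of self-attention to improve convergence. But you may wonder what the residual connection I mentioned is.

Residual connections are a key component of deep neural networks that help with the challenges of training very deep architectures. As we increase depth by stacking more layers, we run into vanishing or exploding gradients. With vanishing gradients, the model can no longer learn because the propagated gradients approach zero and the early layers stop updating their weights. With exploding gradients, the opposite happens: the updates become extreme, the weights cannot stabilize, and the gradients eventually blow up (go to infinity). Proper weight initialization and normalization help with this, but it has been observed that even when the network becomes more stable, performance degrades as depth grows because optimization becomes harder. Adding residual connections helps here: the network stays easier to optimize, and performance holds up even as we keep adding depth.

How is it implemented? Simply: we add the original input to the output of some transformations of that input:

# x is the input tensor; the layer sizes are illustrative
transformations = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))
output = x + transformations(x)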

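Another key insight is that if the transformation in the residual branch learns to output values close to zero, the block as a whole approximates the identity function, so adding the layer does no harm. When needed, the network can instead learn a non-zero transformation that modifies or refines the features.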
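One way to see why a residual branch is harmless when it has nothing to add: if the branch outputs zeros, the block returns its input unchanged. The NumPy sketch below uses a hypothetical two-layer ReLU branch with made-up sizes to illustrate this.

```python
import numpy as np

def residual_block(x, w1, w2):
    """output = x + transformation(x), with a small ReLU MLP as the transformation."""
    hidden = np.maximum(x @ w1, 0.0)
    return x + hidden @ w2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))

# If the last layer of the branch outputs zeros, the block passes x through unchanged
identity_out = residual_block(x, w1, np.zeros((16, 8)))
# With non-zero weights, the block adds a learned correction on top of x
refined_out = residual_block(x, w1, 0.01 * rng.standard_normal((16, 8)))

print(np.array_equal(identity_out, x))
```

A deeper stack built from such blocks can therefore always fall back to the behaviour of a shallower one, which is part of why residual networks are easier to optimize.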

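In our case, the residual connection is the sum of the initial embeddings and the embeddings after all the transformations in the self-attention block (attention_output).

# first residual connection: the residual branch is the layer input before
# layernorm_before, that is, the original `embeddings`
hidden_states = attention_output + embeddings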

# in ViT, layernorm is also applied after self-attention

layer_output = layer_0.layernorm_after(hidden_states)

In the Intermediate class, we apply a linear projection followed by a non-linearity:

layer_output_intermediate = layer_0.intermediate.dense(layer_output)

layer_output_intermediate = layer_0.intermediate.intermediate_act_fn(layer_output_intermediate)

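The non-linearity used in ViT is the GeLU activation function. It is defined as the input multiplied by the cumulative distribution function $\Phi$ of the standard normal distribution:

$$\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

To speed up computation, it is often approximated by:

$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$$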

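Comparing the two activations, ReLU, given by max(input, 0), is monotonic, convex, and linear in the positive domain. GeLU is non-monotonic (it dips slightly below zero for small negative inputs), non-convex, and curved rather than linear in the positive domain, so it can approximate complex functions more easily. GeLU is also smooth: unlike ReLU, which has a sharp corner at zero, it transitions smoothly across all input values, which makes gradient-based optimization easier during training.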
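The differences between the two activations are easy to inspect numerically. The sketch below assumes the standard definition GELU(x) = x * Phi(x), with Phi the standard normal CDF, and the widely used tanh approximation; the function names are illustrative.

```python
import math

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x: float) -> float:
    return max(x, 0.0)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu_exact(x):.4f}  tanh approx={gelu_tanh(x):.4f}")
```

The negative inputs show the non-monotonic dip: the GeLU output at -1 is lower than at -3, whereas ReLU is flat at zero over the whole negative range.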

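Output

The last part of the encoder layer is the Output class. We already have everything needed to compute it: a linear projection, Dropout, and a residual connection: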

# linear projection

output_dense = layer_0.output.dense(layer_output_intermediate)

# dropout

output_drop = layer_0.output.dropout(output_dense)

# second residual connection: these hidden_states are the result of the
# first residual connection, before layernorm_after was applied
output_res = output_drop + hidden_states # shape (b_size, 197, 768)
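Putting the pieces together, the order of operations in one encoder layer can be summarized in a compact NumPy sketch. It is a simplification under stated assumptions: biases, Dropout, and the LayerNorm scale and shift are omitted, the weights are random, and the sizes are toy values, so it only illustrates the data flow and not the library's implementation.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-12):
    # learned scale and shift omitted for brevity
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def attention(x, p, num_heads):
    tokens, hidden = x.shape
    d = hidden // num_heads

    def split(t):
        # (tokens, hidden) -> (heads, tokens, d)
        return t.reshape(tokens, num_heads, d).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    ctx = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d)) @ v      # (heads, tokens, d)
    return ctx.transpose(1, 0, 2).reshape(tokens, hidden) @ p["wo"]

def encoder_layer(x, p, num_heads):
    # 1) layernorm_before -> self-attention -> first residual connection
    x = x + attention(layer_norm(x), p, num_heads)
    # 2) layernorm_after -> intermediate (expand + GELU) -> output projection -> second residual
    return x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]

rng = np.random.default_rng(0)
tokens, hidden, num_heads = 5, 8, 2                  # toy sizes instead of 197, 768, 12
shapes = {"wq": (hidden, hidden), "wk": (hidden, hidden), "wv": (hidden, hidden),
          "wo": (hidden, hidden), "w1": (hidden, 4 * hidden), "w2": (4 * hidden, hidden)}
params = {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}
x = rng.standard_normal((tokens, hidden))
y = encoder_layer(x, params, num_heads)
print(y.shape)
```

The input and output have the same shape, which is what allows identical layers to be stacked one after another.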

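That completes the first ViT layer. There are 11 more layers to go, and this is the hard part...

Just kidding! We are in fact done: all the other layers are identical to the first one. The only difference is that the input to each subsequent layer is the output_res computed by the previous layer rather than the embeddings. The output after the 12 encoder layers is therefore:

torch.manual_seed(0)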

# masking heads in a given layer

layer_head_mask = None

# output attention probabilities

output_attentions = False

embeddings = model.vit.embeddings(image)

hidden_states = embeddings

for i in range(12):
    hidden_states = model.vit.encoder.layer[i](hidden_states, layer_head_mask, output_attentions)[0]

# final LayerNorm applied to the output of the last layer
output = model.vit.layernorm(hidden_states)

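Pooler

In Transformer models, the pooler is generally the component that aggregates information from the sequence of token embeddings after the Transformer encoder blocks. Its role is to produce a fixed-size representation that captures global context and summarizes the information extracted from the tokens, which in ViT are image patches. This compact, contextual representation of the image can then be used for downstream tasks such as image classification. Here the pooler is very simple: we take the [CLS] token and use it as the compact, contextual representation of the image.

pooled_output = output[:, 0, :] # shape (b_size, 768)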

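Classifier

Finally, we are ready to classify the image using pooled_output. The classifier is a simple linear layer whose output dimension equals the number of classes: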

logits = model.classifier(pooled_output) # shape (b_size, num_classes)
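For completeness, here is a hedged NumPy sketch of the last two steps plus the usual post-processing. The encoder output and classifier weights are random stand-ins, and the softmax and argmax at the end are a common way to read off a prediction rather than something taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden, num_classes = 2, 197, 768, 1000

encoder_output = rng.standard_normal((batch_size, seq_len, hidden))     # stand-in for `output`
classifier_weight = 0.02 * rng.standard_normal((num_classes, hidden))   # (out, in) layout
classifier_bias = np.zeros(num_classes)

pooled = encoder_output[:, 0, :]                          # keep only the [CLS] token
logits = pooled @ classifier_weight.T + classifier_bias   # (batch_size, num_classes)

# Softmax turns logits into class probabilities; argmax gives the predicted class index
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
predicted_class = probs.argmax(axis=-1)

print(logits.shape, predicted_class)
```

Since softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same predicted class.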

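Conclusion

ViT has reshaped computer vision and has become a strong alternative to convolutional neural networks in a wide range of applications, which is why it is worth understanding how it works. Keep in mind that the main component of ViT, the Transformer architecture, originated in natural language processing (NLP), so you may also want to read my earlier article on the BERT Transformer.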

References

[1] https://github.com/huggingface/transformers

[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929
