Self Attention with Convolutional Projection

class vformer.attention.convvt.ConvVTAttention(dim_in, dim_out, num_heads, img_size, attn_dropout=0.0, proj_dropout=0.0, method='dw_bn', kernel_size=3, stride_kv=1, stride_q=1, padding_kv=1, padding_q=1, with_cls_token=False)[source]

Bases: Module

Attention with convolutional projection, introduced in the paper CvT: Introducing Convolutions to Vision Transformers

The position-wise linear projections of Multi-Head Self-Attention (MHSA) are replaced with depth-wise separable convolutions; a usage sketch follows the parameter list below.

Parameters
  • dim_in (int) – Dimension of input tensor

  • dim_out (int) – Dimension of output tensor

  • num_heads (int) – Number of heads in attention

  • img_size (int) – Size of image

  • attn_dropout (float) – Probability of dropout in attention

  • proj_dropout (float) – Probability of dropout in convolution projection

  • method (str) – Method of projection: 'dw_bn' for depth-wise convolution with batch norm, 'avg' for average pooling; default is 'dw_bn'

  • kernel_size (int) – Size of kernel

  • stride_kv (int) – Stride of the convolutional projection for key and value

  • stride_q (int) – Stride of the convolutional projection for query

  • padding_kv (int) – Padding of the convolutional projection for key and value

  • padding_q (int) – Padding of the convolutional projection for query

  • with_cls_token (bool) – Whether to include a classification token; default is False
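
A minimal usage sketch. It assumes the input is a flattened feature map of shape (batch, img_size * img_size, dim_in), which follows from the img_size parameter and the default with_cls_token=False; the chosen dimensions and the printed shape are illustrative and should be checked against the installed version.

    import torch
    from vformer.attention.convvt import ConvVTAttention

    # 7x7 feature map, 64-dim tokens, 4 attention heads
    attn = ConvVTAttention(
        dim_in=64,
        dim_out=64,
        num_heads=4,
        img_size=7,
        attn_dropout=0.1,
        proj_dropout=0.1,
        method="dw_bn",   # depth-wise convolution + batch norm projection
    )

    x = torch.randn(2, 7 * 7, 64)   # (batch, num_tokens, dim_in), assumed layout
    out = attn(x)
    print(out.shape)                # expected: torch.Size([2, 49, 64]) with default strides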

forward(x)[source]
Parameters

x (torch.Tensor) – Input tensor

Returns

Output tensor obtained by applying self-attention to the input tensor

Return type

torch.Tensor
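
A sketch of the attention computation applied after the convolutional projections, following the CvT formulation rather than the library's exact internals; the function name conv_attention_sketch is hypothetical, and standard scaled dot-product attention over the projected query, key, and value is assumed.

    import torch

    def conv_attention_sketch(q, k, v, num_heads, attn_dropout=0.0):
        # q, k, v: (batch, tokens, dim) produced by the convolutional projections;
        # k and v may have fewer tokens than q when stride_kv > 1
        B, N, C = q.shape
        head_dim = C // num_heads
        scale = head_dim ** -0.5

        # split into heads: (batch, heads, tokens, head_dim)
        q = q.reshape(B, N, num_heads, head_dim).transpose(1, 2)
        k = k.reshape(B, k.shape[1], num_heads, head_dim).transpose(1, 2)
        v = v.reshape(B, v.shape[1], num_heads, head_dim).transpose(1, 2)

        # scaled dot-product attention with dropout on the attention weights
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.softmax(dim=-1)
        attn = torch.nn.functional.dropout(attn, p=attn_dropout)

        out = attn @ v                                  # (batch, heads, tokens, head_dim)
        return out.transpose(1, 2).reshape(B, N, C)     # merge heads back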

forward_conv(x)[source]
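
forward_conv(x) is where the convolutional projection of the input happens before attention. Below is a minimal sketch of the 'dw_bn' projection described in the method parameter, assuming a square img_size x img_size feature map; make_dw_bn_projection and the exact layer ordering are illustrative, not the library's internals.

    import torch
    import torch.nn as nn

    def make_dw_bn_projection(dim, kernel_size=3, stride=1, padding=1):
        # depth-wise convolution (groups=dim) followed by batch norm
        return nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, stride=stride,
                      padding=padding, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
        )

    # tokens (B, H*W, C) -> feature map (B, C, H, W) -> projected tokens
    proj = make_dw_bn_projection(dim=64)
    tokens = torch.randn(2, 7 * 7, 64)
    fmap = tokens.transpose(1, 2).reshape(2, 64, 7, 7)
    projected = proj(fmap).flatten(2).transpose(1, 2)   # back to (B, H*W, C)
    print(projected.shape)                              # torch.Size([2, 49, 64])
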
training: bool