Self Attention with Convolutional Projection

class vformer.attention.convvt.ConvVTAttention(dim_in, dim_out, num_heads, img_size, attn_dropout=0.0, proj_dropout=0.0, method='dw_bn', kernel_size=3, stride_kv=1, stride_q=1, padding_kv=1, padding_q=1, with_cls_token=False)

Bases: Module

Attention with convolutional projection, introduced in the paper "CvT: Introducing Convolutions to Vision Transformers".

The position-wise linear projections of Multi-Head Self-Attention (MHSA) are replaced by depth-wise separable convolutions.
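A minimal sketch of what such a projection can look like in PyTorch, to make the idea concrete. It is not the vformer implementation: the class name `DepthwiseConvProjection`, its `forward(x, height, width)` signature, and the assumed input layout `(batch, height * width, channels)` are all hypothetical choices made for this illustration.

```python
import torch
from torch import nn


class DepthwiseConvProjection(nn.Module):
    """Hypothetical helper: depth-wise conv + batch norm, then a point-wise linear layer."""

    def __init__(self, dim_in, dim_out, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # groups=dim_in gives one spatial filter per channel (depth-wise convolution).
        self.depthwise = nn.Conv2d(
            dim_in, dim_in, kernel_size,
            stride=stride, padding=padding, groups=dim_in, bias=False,
        )
        self.norm = nn.BatchNorm2d(dim_in)
        # Point-wise step: mixes channels independently at every token position.
        self.pointwise = nn.Linear(dim_in, dim_out)

    def forward(self, x, height, width):
        batch, _, channels = x.shape
        # (B, H*W, C) -> (B, C, H, W) so the convolution sees a 2D token map.
        x = x.transpose(1, 2).reshape(batch, channels, height, width)
        x = self.norm(self.depthwise(x))
        # (B, C, H', W') -> (B, H'*W', C) to return to a token sequence.
        x = x.flatten(2).transpose(1, 2)
        return self.pointwise(x)


tokens = torch.randn(2, 14 * 14, 64)  # assumed layout: (batch, tokens, channels)
query = DepthwiseConvProjection(64, 64, stride=1)(tokens, 14, 14)
key = DepthwiseConvProjection(64, 64, stride=2)(tokens, 14, 14)
print(query.shape, key.shape)
```

With a stride of 1 the query keeps all 196 tokens, while a stride of 2 reduces the key to a 7 × 7 map of 49 tokens, which is how a strided projection can shrink the attention matrix.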

Parameters

  • dim_in (int) – Dimension of input tensor

  • dim_out (int) – Dimension of output tensor

  • num_heads (int) – Number of attention heads

  • img_size (int) – Size of image

  • attn_dropout (float) – Dropout probability applied in attention. Default is 0.0.

  • proj_dropout (float) – Dropout probability applied in the convolutional projection. Default is 0.0.

  • method (str) – Method of projection: 'dw_bn' for depth-wise convolution followed by batch norm, 'avg' for average pooling. Default is 'dw_bn'.

  • kernel_size (int) – Size of the kernel. Default is 3.

  • stride_kv (int) – Stride used when projecting the key and value. Default is 1.

  • stride_q (int) – Stride used when projecting the query. Default is 1.

  • padding_kv (int) – Padding used when projecting the key and value. Default is 1.

  • padding_q (int) – Padding used when projecting the query. Default is 1.

  • with_cls_token (bool) – Whether to include the classification token. Default is `False`.
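The effect of `kernel_size`, the stride parameters, and the padding parameters on sequence length can be worked out with the standard convolution output-size formula. The sketch below assumes the tokens form a square `img_size` × `img_size` map before projection and that no dilation is used; the helper names `conv_output_size` and `token_counts` are hypothetical and not part of vformer.

```python
def conv_output_size(size, kernel_size=3, stride=1, padding=1):
    """Side length after a convolution (no dilation), per the usual formula."""
    return (size + 2 * padding - kernel_size) // stride + 1


def token_counts(img_size, kernel_size=3, stride_q=1, stride_kv=1,
                 padding_q=1, padding_kv=1):
    """Number of query tokens and key/value tokens after projection.

    Assumes a square img_size x img_size token map (an assumption of this sketch).
    """
    q_side = conv_output_size(img_size, kernel_size, stride_q, padding_q)
    kv_side = conv_output_size(img_size, kernel_size, stride_kv, padding_kv)
    return q_side * q_side, kv_side * kv_side


print(token_counts(14))               # (196, 196)
print(token_counts(14, stride_kv=2))  # (196, 49)
```

With the defaults (`kernel_size=3`, stride 1, padding 1) the map size is preserved. Setting `stride_kv=2` cuts the number of key and value tokens to roughly a quarter while the query keeps full resolution.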


forward(x)

Parameters

x (torch.Tensor) – Input tensor

Returns

Output tensor obtained by applying self-attention to the input tensor

Return type

torch.Tensor

training: bool