Self Attention with Convolutional Projection
- class vformer.attention.convvt.ConvVTAttention(dim_in, dim_out, num_heads, img_size, attn_dropout=0.0, proj_dropout=0.0, method='dw_bn', kernel_size=3, stride_kv=1, stride_q=1, padding_kv=1, padding_q=1, with_cls_token=False)[source]
Bases:
Module
Attention with Convolutional Projection, introduced in the paper CvT: Introducing Convolutions to Vision Transformers
The position-wise linear projections used in Multi-Head Self-Attention (MHSA) are replaced with depth-wise separable convolutions
- Parameters
dim_in (int) – Dimension of input tensor
dim_out (int) – Dimension of output tensor
num_heads (int) – Number of heads in attention
img_size (int) – Size of image
attn_dropout (float) – Probability of dropout in attention
proj_dropout (float) – Probability of dropout in convolution projection
method (str) – Projection method: 'dw_bn' for depth-wise convolution with batch norm, 'avg' for average pooling; default is 'dw_bn'
kernel_size (int) – Size of kernel
stride_kv (int) – Size of stride for key value
stride_q (int) – Size of stride for query
padding_kv (int) – Padding for key value
padding_q (int) – Padding for query
with_cls_token (bool) – Whether to include the classification token; default is False
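As a rough illustration of what the convolutional projection replaces, the sketch below builds a depth-wise convolution followed by batch norm (the 'dw_bn' method) in plain PyTorch. Names and the helper function are illustrative assumptions, not vformer's internal API; a stride greater than 1 (as with stride_kv) sub-samples the key/value feature map.

```python
import torch
import torch.nn as nn

def conv_projection(dim, kernel_size=3, stride=1, padding=1):
    # Hypothetical helper, not vformer's API: depth-wise conv (groups=dim,
    # one filter per channel, no channel mixing) followed by batch norm,
    # mirroring the 'dw_bn' projection method described above.
    return nn.Sequential(
        nn.Conv2d(dim, dim, kernel_size, stride, padding,
                  bias=False, groups=dim),
        nn.BatchNorm2d(dim),
    )

dim, H, W = 64, 14, 14
x = torch.randn(2, dim, H, W)          # (B, C, H, W) feature map
proj = conv_projection(dim, stride=2)  # stride 2, as stride_kv might use
out = proj(x)                          # spatial size halved, channels kept
print(out.shape)
```

Because the convolution is depth-wise, the projection adds far fewer parameters than a full position-wise linear layer of the same width.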
- forward(x)[source]
- Parameters
x (torch.Tensor) – Input tensor
- Returns
Output tensor obtained by applying self-attention to the input tensor
- Return type
torch.Tensor
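The forward pass operates on a token sequence, so internally the tokens must be reshaped into a 2-D feature map before the convolutional projection and flattened back afterwards. The sketch below shows only that reshaping step under the assumption of a square image; it is a hedged illustration, not vformer's implementation.

```python
import torch

B, N, C = 2, 196, 64                  # 196 tokens from a 14x14 patch grid
x = torch.randn(B, N, C)              # input as forward(x) would receive it
H = W = int(N ** 0.5)                 # assumes a square token grid

# tokens -> feature map for the convolutional projection
fmap = x.transpose(1, 2).reshape(B, C, H, W)   # (B, C, H, W)
# ... convolutional projection of q/k/v would be applied here ...

# feature map -> tokens for the attention computation
tokens = fmap.flatten(2).transpose(1, 2)       # (B, N, C)
print(tokens.shape)
```

When with_cls_token is True, the classification token would be split off before this reshape and concatenated back afterwards, since it has no spatial position.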
- training: bool