ConViT

class vformer.models.classification.convit.ConViT(img_size, patch_size, n_classes, embedding_dim=1024, head_dim=64, depth_sa=6, depth_gpsa=6, attn_heads_sa=16, attn_heads_gpsa=16, encoder_mlp_dim=2048, in_channels=3, decoder_config=None, pool='cls', p_dropout_encoder=0, p_dropout_embedding=0)[source]

Implementation of ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Parameters
  • img_size (int) – Size of the image

  • patch_size (int) – Size of a patch

  • n_classes (int) – Number of classes for classification

  • embedding_dim (int) – Dimension of the embedding (hidden) layer

  • head_dim (int) – Dimension of each attention head

  • depth_sa (int) – Number of self-attention (SA) layers in the encoder

  • depth_gpsa (int) – Number of gated positional self-attention (GPSA) layers in the encoder

  • attn_heads_sa (int) – Number of attention heads in each self-attention layer

  • attn_heads_gpsa (int) – Number of attention heads in each gated positional self-attention layer

  • encoder_mlp_dim (int) – Dimension of the hidden layer in the encoder's MLP block

  • in_channels (int) – Number of input channels

  • decoder_config (int or tuple or list, optional) – Configuration of the decoder. If None, the default configuration is used.

  • pool (str) – Feature pooling type; one of {``cls``, ``mean``}

  • p_dropout_encoder (float) – Dropout probability in the encoder

  • p_dropout_embedding (float) – Dropout probability in the embedding layer
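
A minimal construction sketch based on the signature above; the values for the required arguments (img_size, patch_size, n_classes) are illustrative, and all other parameters keep their documented defaults:

    import torch
    from vformer.models.classification.convit import ConViT

    # Required arguments only; the remaining parameters
    # fall back to the defaults in the signature above.
    model = ConViT(
        img_size=224,   # illustrative: 224 x 224 input images
        patch_size=16,  # illustrative: 16 x 16 patches
        n_classes=10,   # illustrative: 10 output classes
    )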

forward(x)[source]

Parameters
  • x (torch.Tensor) – Input tensor of shape (batch_size, in_channels, img_size, img_size)

Returns
  Output tensor of shape (batch_size, n_classes)

Return type
  torch.Tensor
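
A minimal inference sketch, reusing the model built in the construction example above; the input shape follows the constructor's in_channels and img_size:

    # Random batch of 4 RGB images of size img_size x img_size
    x = torch.randn(4, 3, 224, 224)

    logits = model(x)
    print(logits.shape)  # torch.Size([4, 10]), i.e. (batch_size, n_classes)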