Vanilla Vision Transformer

class vformer.models.classification.vanilla.VanillaViT(img_size, patch_size, n_classes, embedding_dim=1024, head_dim=64, depth=6, num_heads=16, encoder_mlp_dim=2048, in_channels=3, decoder_config=None, pool='cls', p_dropout_encoder=0.0, p_dropout_embedding=0.0)

Implementation of ‘An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale’ (https://arxiv.org/abs/2010.11929)

Parameters
  • img_size (int) – Size of the input image

  • patch_size (int) – Size of a patch

  • n_classes (int) – Number of classes for classification

  • embedding_dim (int) – Dimension of the hidden layer (patch embedding dimension)

  • head_dim (int) – Dimension of the attention head

  • depth (int) – Number of attention layers in the encoder

  • num_heads (int) – Number of attention heads

  • encoder_mlp_dim (int) – Dimension of the hidden layer in the encoder MLP

  • in_channels (int) – Number of input channels

  • decoder_config (int or tuple or list, optional) – Configuration of the decoder. If None, the default configuration is used.

  • pool ({"cls","mean"}) – Feature pooling type

  • p_dropout_encoder (float) – Dropout probability in the encoder

  • p_dropout_embedding (float) – Dropout probability in the embedding layer
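
A minimal instantiation sketch, assuming vformer and torch are installed and using the import path documented above; the values for img_size, patch_size, and n_classes are illustrative only and not taken from this page:

    import torch
    from vformer.models.classification.vanilla import VanillaViT

    # Illustrative values: img_size should be divisible by patch_size
    # so the image splits evenly into patches.
    model = VanillaViT(
        img_size=224,
        patch_size=16,
        n_classes=10,
        embedding_dim=1024,  # documented default
        depth=6,             # documented default
        num_heads=16,        # documented default
    )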

forward(x)

Parameters

  x (torch.Tensor) – Input tensor

Returns

  Output tensor of size n_classes

Return type

  torch.Tensor
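
A forward-pass sketch under the same assumptions as the instantiation example above (a hypothetical batch of 4 RGB images; the output shape follows from the return description):

    # in_channels defaults to 3, so the input is (batch, 3, img_size, img_size).
    x = torch.randn(4, 3, 224, 224)
    logits = model(x)
    print(logits.shape)  # expected: torch.Size([4, 10]), one score per class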