Vanilla Vision Transformer

class vformer.models.classification.vanilla.VanillaViT(img_size, patch_size, n_classes, embedding_dim=1024, head_dim=64, depth=6, num_heads=16, encoder_mlp_dim=2048, in_channels=3, decoder_config=None, pool='cls', p_dropout_encoder=0.0, p_dropout_embedding=0.0)

Implementation of ‘An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale’ (https://arxiv.org/abs/2010.11929)

Parameters
  • img_size (int) – Size of the input image

  • patch_size (int) – Size of a patch

  • n_classes (int) – Number of classes for classification

  • embedding_dim (int) – Dimension of the hidden layer (patch embedding dimension)

  • head_dim (int) – Dimension of the attention head

  • depth (int) – Number of attention layers in the encoder

  • num_heads (int) – Number of attention heads

  • encoder_mlp_dim (int) – Dimension of the hidden layer in the encoder MLP

  • in_channels (int) – Number of input channels

  • decoder_config (int or tuple or list, optional) – Configuration of the decoder. If None, the default configuration is used.

  • pool ({"cls","mean"}) – Feature pooling type

  • p_dropout_encoder (float) – Dropout probability in the encoder

  • p_dropout_embedding (float) – Dropout probability in the embedding layer
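
A minimal instantiation sketch, assuming vformer and torch are installed and using the import path documented above; the values for img_size, patch_size, and n_classes are illustrative only and not taken from this page:

    import torch
    from vformer.models.classification.vanilla import VanillaViT

    # Illustrative values: img_size should be divisible by patch_size
    # so the image splits evenly into patches.
    model = VanillaViT(
        img_size=224,
        patch_size=16,
        n_classes=10,
        embedding_dim=1024,  # documented default
        depth=6,             # documented default
        num_heads=16,        # documented default
    )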

forward(x)

Parameters

  x (torch.Tensor) – Input tensor

Returns

  Output tensor of size n_classes

Return type

  torch.Tensor
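
A forward-pass sketch under the same assumptions as the instantiation example above (a hypothetical batch of 4 RGB images; the output shape follows from the return description):

    # in_channels defaults to 3, so the input is (batch, 3, img_size, img_size).
    x = torch.randn(4, 3, 224, 224)
    logits = model(x)
    print(logits.shape)  # expected: torch.Size([4, 10]), one score per class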