ConViT

class vformer.models.classification.convit.ConViT(img_size, patch_size, n_classes, embedding_dim=1024, head_dim=64, depth_sa=6, depth_gpsa=6, attn_heads_sa=16, attn_heads_gpsa=16, encoder_mlp_dim=2048, in_channels=3, decoder_config=None, pool='cls', p_dropout_encoder=0, p_dropout_embedding=0)[source]

Implementation of ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Parameters
  • img_size (int) – Size of the image

  • patch_size (int) – Size of a patch

  • n_classes (int) – Number of classes for classification

  • embedding_dim (int) – Dimension of the embedding (hidden) layer

  • head_dim (int) – Dimension of each attention head

  • depth_sa (int) – Number of self-attention (SA) layers in the encoder

  • depth_gpsa (int) – Number of gated positional self-attention (GPSA) layers in the encoder

  • attn_heads_sa (int) – Number of attention heads in each self-attention layer

  • attn_heads_gpsa (int) – Number of attention heads in each gated positional self-attention layer

  • encoder_mlp_dim (int) – Dimension of the hidden layer in the encoder's MLP block

  • in_channels (int) – Number of input channels

  • decoder_config (int or tuple or list, optional) – Configuration of the decoder. If None, the default configuration is used.

  • pool (str) – Feature pooling type; one of {``cls``, ``mean``}

  • p_dropout_encoder (float) – Dropout probability in the encoder

  • p_dropout_embedding (float) – Dropout probability in the embedding layer
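
A minimal construction sketch based on the signature above; the values for the required arguments (img_size, patch_size, n_classes) are illustrative, and all other parameters keep their documented defaults:

    import torch
    from vformer.models.classification.convit import ConViT

    # Required arguments only; the remaining parameters
    # fall back to the defaults in the signature above.
    model = ConViT(
        img_size=224,   # illustrative: 224 x 224 input images
        patch_size=16,  # illustrative: 16 x 16 patches
        n_classes=10,   # illustrative: 10 output classes
    )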

forward(x)[source]

Parameters
  • x (torch.Tensor) – Input tensor of shape (batch_size, in_channels, img_size, img_size)

Returns
  Output tensor of shape (batch_size, n_classes)

Return type
  torch.Tensor
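
A minimal inference sketch, reusing the model built in the construction example above; the input shape follows the constructor's in_channels and img_size:

    # Random batch of 4 RGB images of size img_size x img_size
    x = torch.randn(4, 3, 224, 224)

    logits = model(x)
    print(logits.shape)  # torch.Size([4, 10]), i.e. (batch_size, n_classes)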