Vision-friendly Transformer

class vformer.models.classification.visformer.Visformer(img_size, n_classes, depth: tuple, config: tuple, channel_config: tuple, num_heads=8, conv_group=8, p_dropout_conv=0.0, p_dropout_attn=0.0, activation=<class 'torch.nn.modules.activation.GELU'>, pos_embedding=True)[source]

A builder to construct a Vision-Friendly transformer model as in the paper: “Visformer: The Vision-friendly Transformer” https://arxiv.org/abs/2104.12533

Parameters
  • img_size (int,tuple) – Size of the input image

  • n_classes (int) – Number of classes in the dataset

  • depth (tuple[int]) – Number of layers before each embedding reduction

  • config (tuple[int]) – Choice of convolution block (0) or attention block (1) for the corresponding layer

  • channel_config (tuple[int]) – Number of channels for each layer

  • num_heads (int) – Number of heads for attention block, default is 8

  • conv_group (int) – Number of groups for convolution block, default is 8

  • p_dropout_conv (float) – Dropout rate for convolution block, default is 0.0

  • p_dropout_attn (float) – Dropout rate for attention block, default is 0.0

  • activation (torch.nn.Module) – Activation function between layers, default is nn.GELU

  • pos_embedding (bool) – Whether to use positional embedding, default is True

forward(x)[source]

Parameters
  • x (torch.Tensor) – Input tensor

Returns
  Tensor of size n_classes

Return type
  torch.Tensor
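
A minimal usage sketch for forward(). It builds the model through the Visformer_Ti factory documented below, so the depth/config/channel_config tuples do not have to be hand-picked, and assumes a 224×224 RGB input:

    import torch

    from vformer.models.classification.visformer import Visformer_Ti

    # Visformer_Ti fills in the builder's depth/config/channel_config tuples.
    model = Visformer_Ti(img_size=224, n_classes=1000)

    x = torch.randn(1, 3, 224, 224)  # one RGB image
    logits = model(x)
    print(logits.shape)  # expected: (1, 1000), i.e. n_classes logits per image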

class vformer.models.classification.visformer.VisformerAttentionBlock(in_channels, num_heads=8, activation=<class 'torch.nn.modules.activation.GELU'>, p_dropout=0.0)[source]

Attention Block for Vision-Friendly transformers https://arxiv.org/abs/2104.12533

Parameters
  • in_channels (int) – Number of input channels

  • num_heads (int) – Number of heads for attention, default is 8

  • activation (torch.nn.Module) – Activation function between layers, default is nn.GELU

  • p_dropout (float) – Dropout rate, default is 0.0

forward(x)[source]

Parameters
  • x (torch.Tensor) – Input tensor

Returns
  Tensor of the same size as the input

Return type
  torch.Tensor
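
A short sketch of the block in isolation. The channels-first (B, C, H, W) input layout is an assumption based on how the model's convolutional stages produce activations; in_channels must match C:

    import torch

    from vformer.models.classification.visformer import VisformerAttentionBlock

    # Assumption: the block consumes channels-first feature maps (B, C, H, W);
    # in_channels must match the C dimension of the input.
    block = VisformerAttentionBlock(in_channels=192, num_heads=8)

    x = torch.randn(1, 192, 14, 14)
    out = block(x)
    print(out.shape)  # same size as the input: (1, 192, 14, 14)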

class vformer.models.classification.visformer.VisformerConvBlock(in_channels, group=8, activation=<class 'torch.nn.modules.activation.GELU'>, p_dropout=0.0)[source]

Convolution Block for Vision-Friendly transformers https://arxiv.org/abs/2104.12533

Parameters
  • in_channels (int) – Number of input channels

  • group (int) – Number of groups for convolution, default is 8

  • activation (torch.nn.Module) – Activation function between layers, default is nn.GELU

  • p_dropout (float) – Dropout rate, default is 0.0

forward(x)[source]

Parameters
  • x (torch.Tensor) – Input tensor

Returns
  Tensor of the same size as the input

Return type
  torch.Tensor
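
As with the attention block, a minimal sketch assuming a channels-first (B, C, H, W) feature map; in_channels is assumed to be divisible by group for the grouped convolution:

    import torch

    from vformer.models.classification.visformer import VisformerConvBlock

    # Assumption: channels-first input (B, C, H, W), with in_channels
    # divisible by `group`.
    block = VisformerConvBlock(in_channels=96, group=8)

    x = torch.randn(1, 96, 28, 28)
    out = block(x)
    print(out.shape)  # same size as the input: (1, 96, 28, 28)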

vformer.models.classification.visformer.VisformerV2_S(img_size, n_classes, in_channels=3)[source]

VisformerV2-S model from the paper: “Visformer: The Vision-friendly Transformer” https://arxiv.org/abs/2104.12533

Parameters
  • img_size (int,tuple) – Size of the input image

  • n_classes (int) – Number of classes in the dataset

  • in_channels (int) – Number of channels in the input

vformer.models.classification.visformer.VisformerV2_Ti(img_size, n_classes, in_channels=3)[source]

VisformerV2-Ti model from the paper: “Visformer: The Vision-friendly Transformer” https://arxiv.org/abs/2104.12533

Parameters
  • img_size (int,tuple) – Size of the input image

  • n_classes (int) – Number of classes in the dataset

  • in_channels (int) – Number of channels in the input

vformer.models.classification.visformer.Visformer_S(img_size, n_classes, in_channels=3)[source]

Visformer-S model from the paper: “Visformer: The Vision-friendly Transformer” https://arxiv.org/abs/2104.12533

Parameters
  • img_size (int,tuple) – Size of the input image

  • n_classes (int) – Number of classes in the dataset

  • in_channels (int) – Number of channels in the input

vformer.models.classification.visformer.Visformer_Ti(img_size, n_classes, in_channels=3)[source]

Visformer-Ti model from the paper: “Visformer: The Vision-friendly Transformer” https://arxiv.org/abs/2104.12533

Parameters
  • img_size (int,tuple) – Size of the input image

  • n_classes (int) – Number of classes in the dataset

  • in_channels (int) – Number of channels in the input
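
All four factories share the same signature, so they can be swapped freely. A brief sketch, assuming 224×224 RGB inputs and 100 target classes:

    import torch

    from vformer.models.classification.visformer import (
        Visformer_S,
        Visformer_Ti,
        VisformerV2_S,
        VisformerV2_Ti,
    )

    x = torch.randn(1, 3, 224, 224)
    for factory in (Visformer_Ti, Visformer_S, VisformerV2_Ti, VisformerV2_S):
        model = factory(img_size=224, n_classes=100)
        print(factory.__name__, model(x).shape)  # (1, 100) for each variant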