Pyramid Vision Transformer

class vformer.models.classification.pyramid.PVTClassification(img_size=224, patch_size=[7, 3, 3, 3], in_channels=3, n_classes=1000, embed_dims=[64, 128, 256, 512], num_heads=[1, 2, 4, 8], mlp_ratio=[4, 4, 4, 4], qkv_bias=False, qk_scale=None, p_dropout=0.0, attn_dropout=0.0, drop_path_rate=0.0, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, depths=[3, 4, 6, 3], sr_ratios=[8, 4, 2, 1], decoder_config=None, linear=False, use_dwconv=False, ape=True)[source]

Implementation of Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Parameters
  • img_size (int) – Image size

  • patch_size (list(int)) – List of patch sizes, one per stage

  • in_channels (int) – Input channels in image, default=3

  • n_classes (int) – Number of classes for classification

  • embed_dims (list(int)) – Patch embedding dimension for each stage

  • num_heads (tuple[int]) – Number of heads in each transformer layer

  • depths (tuple[int]) – Depth in each Transformer layer

  • mlp_ratio (list(float)) – Ratio of MLP hidden dimension to embedding dimension in each stage

  • qkv_bias (bool, default=False) – Adds bias to the qkv projection if true

  • qk_scale (float, optional) – Override the default qk scale of head_dim ** -0.5 in Spatial Attention if set

  • p_dropout (float) – Dropout rate, default is 0.0

  • attn_dropout (float) – Attention dropout rate, default is 0.0

  • drop_path_rate (float) – Stochastic depth rate, default is 0.0

  • norm_layer – Normalization layer, default is nn.LayerNorm

  • sr_ratios (list(int)) – Spatial reduction ratio for each stage

  • decoder_config (int or tuple[int], optional) – Configuration of the decoder. If None, the default configuration is used.

  • linear (bool) – Whether to use linear Spatial attention, default is False

  • use_dwconv (bool) – Whether to use Depth-wise convolutions, default is False

  • ape (bool) – Whether to use absolute position embedding, default is True

forward(x)[source]
Parameters

x (torch.Tensor) – Input tensor

Returns

Tensor of size n_classes

Return type

torch.Tensor
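
A minimal usage sketch for PVTClassification, assuming vformer and torch are installed; the import path and constructor arguments follow the signature above, and the standard (batch, channels, height, width) input layout is assumed:

    import torch
    from vformer.models.classification.pyramid import PVTClassification

    # Build the model with the documented defaults: 224x224 RGB input, 1000 classes.
    model = PVTClassification(img_size=224, in_channels=3, n_classes=1000)

    x = torch.randn(1, 3, 224, 224)  # one dummy RGB image of size img_size
    logits = model(x)                # expected shape: (1, n_classes) == (1, 1000)
    print(logits.shape)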

class vformer.models.classification.pyramid.PVTClassificationV2(img_size=224, patch_size=[7, 3, 3, 3], in_channels=3, n_classes=1000, embedding_dims=[64, 128, 256, 512], num_heads=[1, 2, 4, 8], mlp_ratio=[4, 4, 4, 4], qkv_bias=False, qk_scale=0.0, p_dropout=0.0, attn_dropout=0.0, drop_path_rate=0.0, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, depths=[3, 4, 6, 3], sr_ratios=[8, 4, 2, 1], decoder_config=None, use_dwconv=True, linear=False, ape=False)[source]

Implementation of PVT v2: Improved Baselines with Pyramid Vision Transformer

Parameters
  • img_size (int) – Image size

  • patch_size (list(int)) – List of patch sizes, one per stage

  • in_channels (int) – Input channels in image, default is 3

  • n_classes (int) – Number of classes for classification

  • embedding_dims (list(int)) – Patch embedding dimension for each stage

  • num_heads (tuple[int]) – Number of heads in each transformer layer

  • depths (tuple[int]) – Depth in each Transformer layer

  • mlp_ratio (list(float)) – Ratio of MLP hidden dimension to embedding dimension in each stage

  • qkv_bias (bool, default=False) – Adds bias to the qkv projection if true

  • qk_scale (float, optional) – Override the default qk scale of head_dim ** -0.5 in Spatial Attention if set

  • p_dropout (float) – Dropout rate, default is 0.0

  • attn_dropout (float) – Attention dropout rate, default is 0.0

  • drop_path_rate (float) – Stochastic depth rate, default is 0.0

  • norm_layer (nn.Module) – Normalization layer, default is nn.LayerNorm

  • sr_ratios (list(int)) – Spatial reduction ratio for each stage

  • decoder_config (int or tuple[int], optional) – Configuration of the decoder. If None, the default configuration is used.

  • linear (bool) – Whether to use linear Spatial attention, default is False

  • use_dwconv (bool) – Whether to use Depth-wise convolutions, default is True

  • ape (bool) – Whether to use absolute position embedding, default is False
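
A minimal usage sketch for PVTClassificationV2, again hedged on the signature above; n_classes=10 and the batch size are arbitrary illustrative values:

    import torch
    from vformer.models.classification.pyramid import PVTClassificationV2

    # PVT v2 defaults differ from v1: depth-wise convolutions are enabled and the
    # absolute position embedding is disabled (use_dwconv=True, ape=False).
    model = PVTClassificationV2(img_size=224, in_channels=3, n_classes=10)

    x = torch.randn(2, 3, 224, 224)  # a batch of two dummy RGB images
    logits = model(x)                # expected shape: (2, 10)
    print(logits.shape)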