Video Vision Transformer

class vformer.models.classification.vivit.ViViTModel2(img_size, in_channels, patch_size, embedding_dim, num_frames, depth, num_heads, head_dim, n_classes, mlp_dim=None, pool='cls', p_dropout=0.0, attn_dropout=0.0, drop_path_rate=0.02)[source]

Model 2 implementation of A Video Vision Transformer

Parameters

img_size (int) – Size of a single frame/image in the video
in_channels (int) – Number of channels
patch_size (int) – Patch size
embedding_dim (int) – Embedding dimension of a patch
num_frames (int) – Number of frames in each video
depth (int) – Number of encoder layers
num_heads (int) – Number of attention heads
head_dim (int) – Dimension of each attention head
n_classes (int) – Number of classes
mlp_dim (int) – Dimension of hidden layer, optional (default is None)
pool (str) – Pooling operation, must be one of {“cls”, “mean”}; default is “cls”
p_dropout (float) – Dropout probability, default is 0.0
attn_dropout (float) – Attention dropout probability, default is 0.0
drop_path_rate (float) – Stochastic drop path rate, default is 0.02
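In the ViViT paper, Model 2 is the factorised encoder: a spatial encoder runs over the patches of each frame, then a temporal encoder runs over one token per frame. The constructor arguments therefore fix the sequence lengths each stage sees. The sketch below estimates them with plain arithmetic. It assumes non-overlapping square patches and a single prepended class token when pool is “cls”; neither detail is confirmed by this page, and the helper name vivit2_token_shapes is hypothetical, not part of vformer.

```python
def vivit2_token_shapes(img_size, in_channels, patch_size, num_frames,
                        num_heads, head_dim, pool="cls"):
    """Estimate token counts for a ViViTModel2-style configuration.

    Assumptions (not taken from the vformer source): each frame is cut
    into non-overlapping square patches, and pool="cls" prepends one
    class token to both the spatial and the temporal sequence.
    """
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    if pool not in {"cls", "mean"}:
        raise ValueError("pool must be 'cls' or 'mean'")

    patches_per_side = img_size // patch_size
    patches_per_frame = patches_per_side ** 2
    extra = 1 if pool == "cls" else 0  # optional class token
    return {
        "patches_per_frame": patches_per_frame,
        # Length of one flattened patch before the linear projection.
        "patch_vector_dim": in_channels * patch_size ** 2,
        # Sequence length seen by the spatial encoder (per frame).
        "spatial_tokens": patches_per_frame + extra,
        # Sequence length seen by the temporal encoder (one token per frame).
        "temporal_tokens": num_frames + extra,
        # Total width of the multi-head attention projection.
        "attention_inner_dim": num_heads * head_dim,
    }


shapes = vivit2_token_shapes(img_size=224, in_channels=3, patch_size=16,
                             num_frames=8, num_heads=4, head_dim=64)
print(shapes)
```

A check like this is a cheap way to catch an img_size that is not a multiple of patch_size before building the model.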

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the forward pass must be defined in this method, call the Module instance itself (for example model(x)) rather than forward(x) directly: calling the instance runs the registered hooks, whereas calling forward directly silently skips them.
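The difference between calling the instance and calling forward can be illustrated with a toy stand-in. The class below is not torch.nn.Module and is not how PyTorch is implemented; ToyModule and its simplified hook signature are assumptions made purely to show the dispatch pattern the note describes.

```python
class ToyModule:
    """Toy stand-in for the call/forward split. NOT torch.nn.Module."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        # Simplified: hooks are called as hook(module, input, output).
        self._forward_hooks.append(hook)

    def forward(self, x):
        # The actual computation lives here.
        return x * 2

    def __call__(self, x):
        # Calling the instance runs forward and then every hook.
        out = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, x, out)
        return out


seen = []
module = ToyModule()
module.register_forward_hook(lambda mod, inp, out: seen.append(out))

via_call = module(3)             # hook fires, output is recorded
via_forward = module.forward(3)  # hook is skipped, nothing is recorded
print(via_call, via_forward, seen)
```

Both calls return the same value, but only the call through the instance leaves a trace in seen.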

class vformer.models.classification.vivit.ViViTModel3(img_size, patch_t, patch_h, patch_w, in_channels, n_classes, num_frames, embedding_dim, depth, num_heads, head_dim, p_dropout, mlp_dim=None)[source]

Model 3 of A Video Vision Transformer - https://arxiv.org/abs/2103.15691

Parameters

img_size (int or tuple[int]) – Size of a frame
patch_t (int) – Temporal length of single tube/patch in tubelet embedding
patch_h (int) – Height of single tube/patch in tubelet embedding
patch_w (int) – Width of single tube/patch in tubelet embedding
in_channels (int) – Number of input channels (required; the signature defines no default)
n_classes (int) – Number of classes
num_frames (int) – Number of frames in each video
embedding_dim (int) – Embedding dimension of a patch
depth (int) – Number of Encoder layers
num_heads (int) – Number of attention heads
head_dim (int) – Dimension of attention head
p_dropout (float) – Dropout rate/probability (required; the signature defines no default)
mlp_dim (int) – Hidden dimension, optional
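The tubelet parameters above determine how many tokens a clip produces. A minimal sketch of that arithmetic follows, assuming non-overlapping tubelets and that a tuple img_size is ordered (height, width); both are assumptions, and tubelet_grid is a hypothetical helper rather than a vformer function.

```python
def tubelet_grid(img_size, num_frames, patch_t, patch_h, patch_w, in_channels):
    """Return (grid, num_tubelets, tubelet_vector_dim) for a clip.

    Assumptions (not taken from the vformer source): tubelets do not
    overlap, and a tuple img_size is ordered (height, width).
    """
    if isinstance(img_size, int):
        height, width = img_size, img_size
    else:
        height, width = img_size

    for name, total, step in (("num_frames", num_frames, patch_t),
                              ("height", height, patch_h),
                              ("width", width, patch_w)):
        if total % step != 0:
            raise ValueError(f"{name}={total} is not divisible by {step}")

    # Number of tubelets along the time, height, and width axes.
    grid = (num_frames // patch_t, height // patch_h, width // patch_w)
    num_tubelets = grid[0] * grid[1] * grid[2]
    # Length of one flattened tubelet before projection to embedding_dim.
    tubelet_vector_dim = in_channels * patch_t * patch_h * patch_w
    return grid, num_tubelets, tubelet_vector_dim


grid, num_tubelets, tubelet_vector_dim = tubelet_grid(
    img_size=(224, 224), num_frames=16,
    patch_t=2, patch_h=16, patch_w=16, in_channels=3,
)
print(grid, num_tubelets, tubelet_vector_dim)
```

Doubling patch_t halves the number of tokens along the time axis, which is the main lever for trading temporal resolution against sequence length.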

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the forward pass must be defined in this method, call the Module instance itself (for example model(x)) rather than forward(x) directly: calling the instance runs the registered hooks, whereas calling forward directly silently skips them.