Video Vision Transformer

class vformer.models.classification.vivit.ViViTModel2(img_size, in_channels, patch_size, embedding_dim, num_frames, depth, num_heads, head_dim, n_classes, mlp_dim=None, pool='cls', p_dropout=0.0, attn_dropout=0.0, drop_path_rate=0.02)[source]

Model 2 implementation of A Video Vision Transformer

Parameters
  • img_size (int) – Size of a single frame/image in the video

  • in_channels (int) – Number of channels

  • patch_size (int) – Patch size

  • embedding_dim (int) – Embedding dimension of a patch

  • num_frames (int) – Number of frames in each video

  • depth (int) – Number of encoder layers

  • num_heads (int) – Number of attention heads

  • head_dim (int) – Dimension of attention head

  • n_classes (int) – Number of classes

  • mlp_dim (int) – Dimension of hidden layer, optional

  • pool (str) – Pooling operation, must be one of {"cls", "mean"}, default is "cls"

  • p_dropout (float) – Dropout probability, default is 0.0

  • attn_dropout (float) – Dropout probability for attention, default is 0.0

  • drop_path_rate (float) – Stochastic drop path rate, default is 0.02
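The two values accepted by pool can be illustrated with a small pure-Python sketch. It assumes the usual Vision Transformer convention, in which "cls" returns the class token stored at index 0 and "mean" averages over all tokens; the helper name pool_tokens is hypothetical and is not part of vformer.

```python
def pool_tokens(tokens, pool="cls"):
    """Reduce a sequence of token vectors to a single vector.

    tokens: list of equal-length lists of floats. Index 0 is assumed
    to hold the class token (an assumption of this sketch).
    """
    if pool not in {"cls", "mean"}:
        raise ValueError('pool must be one of {"cls", "mean"}')
    if pool == "cls":
        # Keep only the class token.
        return list(tokens[0])
    # Average each embedding dimension over all tokens.
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pool_tokens(tokens, "cls"))   # [1.0, 2.0]
print(pool_tokens(tokens, "mean"))  # [3.0, 4.0]
```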

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance rather than forward() directly: calling the instance takes care of running the registered hooks, while calling forward() silently ignores them.
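The difference described in the note can be sketched with a toy class. This is a simplified stand-in written for illustration, not PyTorch's actual implementation; TinyModule and its hook signature are assumptions of the sketch.

```python
class TinyModule:
    """Toy stand-in for a module that supports forward hooks."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def forward(self, x):
        # The "recipe" for the forward pass.
        return x * 2

    def __call__(self, x):
        # Calling the instance runs forward() and then every hook.
        out = self.forward(x)
        for hook in self._forward_hooks:
            result = hook(self, x, out)
            if result is not None:
                out = result
        return out


seen = []
m = TinyModule()
m.register_forward_hook(lambda module, inp, out: seen.append(out))

m(3)          # hook runs, so seen now holds one entry
m.forward(3)  # hook is skipped, so seen is unchanged
print(seen)   # [6]
```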

class vformer.models.classification.vivit.ViViTModel3(img_size, patch_t, patch_h, patch_w, in_channels, n_classes, num_frames, embedding_dim, depth, num_heads, head_dim, p_dropout, mlp_dim=None)[source]

Model 3 of A Video Vision Transformer: https://arxiv.org/abs/2103.15691

Parameters
  • img_size (int or tuple[int]) – Size of a frame

  • patch_t (int) – Temporal length of single tube/patch in tubelet embedding

  • patch_h (int) – Height of single tube/patch in tubelet embedding

  • patch_w (int) – Width of single tube/patch in tubelet embedding

  • in_channels (int) – Number of input channels

  • n_classes (int) – Number of classes

  • num_frames (int) – Number of frames in each video

  • embedding_dim (int) – Embedding dimension of a patch

  • depth (int) – Number of encoder layers

  • num_heads (int) – Number of attention heads

  • head_dim (int) – Dimension of attention head

  • p_dropout (float) – Dropout rate/probability

  • mlp_dim (int) – Hidden dimension, optional
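How patch_t, patch_h and patch_w relate to the sequence length can be sketched as follows. The sketch assumes non-overlapping tubelets, as in the tubelet embedding of the ViViT paper, and rejects sizes that do not divide evenly; whether the library enforces this is not stated here. The function tubelet_token_count is hypothetical and is not part of vformer.

```python
def tubelet_token_count(num_frames, img_size, patch_t, patch_h, patch_w):
    """Number of tokens produced by non-overlapping tubelets.

    img_size may be an int (square frame) or a (height, width) tuple.
    """
    if isinstance(img_size, int):
        height, width = img_size, img_size
    else:
        height, width = img_size

    for name, total, step in (
        ("num_frames", num_frames, patch_t),
        ("height", height, patch_h),
        ("width", width, patch_w),
    ):
        if total % step != 0:
            raise ValueError(f"{name}={total} is not divisible by {step}")

    # One token per tubelet along time, height and width.
    return (num_frames // patch_t) * (height // patch_h) * (width // patch_w)


# 16 frames of 224x224 with 2x16x16 tubelets: 8 * 14 * 14 tokens
print(tubelet_token_count(16, 224, 2, 16, 16))  # 1568
```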

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance rather than forward() directly: calling the instance takes care of running the registered hooks, while calling forward() silently ignores them.