Video Vision Transformer

class vformer.models.classification.vivit.ViViTModel2(img_size, in_channels, patch_size, embedding_dim, num_frames, depth, num_heads, head_dim, n_classes, mlp_dim=None, pool='cls', p_dropout=0.0, attn_dropout=0.0, drop_path_rate=0.02)[source]

Model 2 implementation of: ViViT: A Video Vision Transformer

Parameters
  • img_size (int) – Size of a single frame (image) in the video

  • in_channels (int) – Number of channels

  • patch_size (int) – Patch size

  • embedding_dim (int) – Embedding dimension of a patch

  • num_frames (int) – Number of frames in each video

  • depth (int) – Number of encoder layers

  • num_heads (int) – Number of attention heads

  • head_dim (int) – Dimension of each attention head

  • n_classes (int) – Number of classes

  • mlp_dim (int) – Dimension of the hidden layer, optional; default is None

  • pool (str) – Pooling operation, must be one of {"cls", "mean"}; default is "cls"

  • p_dropout (float) – Dropout probability

  • attn_dropout (float) – Dropout probability for the attention layers

  • drop_path_rate (float) – Stochastic drop path rate
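How these arguments interact can be sketched in plain Python. The helper names below (`factorised_encoder_shapes`, `pool_tokens`) are hypothetical and not part of vformer. The sketch assumes non-overlapping square patches, one class token prepended to each sequence, and an attention inner dimension of `num_heads * head_dim`. None of these assumptions is confirmed by this page.

```python
def factorised_encoder_shapes(img_size, patch_size, num_frames,
                              num_heads, head_dim, pool="cls"):
    """Rough token counts for a spatial-then-temporal encoder (illustrative only)."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    if pool not in {"cls", "mean"}:
        raise ValueError('pool must be one of {"cls", "mean"}')
    patches_per_frame = (img_size // patch_size) ** 2
    return {
        "patches_per_frame": patches_per_frame,
        # Assumption: one class token is prepended to each sequence.
        "spatial_seq_len": patches_per_frame + 1,
        "temporal_seq_len": num_frames + 1,
        # Assumption: heads are concatenated, so widths add up.
        "attention_inner_dim": num_heads * head_dim,
    }


def pool_tokens(tokens, pool="cls"):
    """Reduce a list of token vectors (class token first) to a single vector."""
    if pool == "cls":
        return list(tokens[0])  # keep only the class token
    if pool == "mean":
        count = len(tokens)
        return [sum(column) / count for column in zip(*tokens)]  # average per feature
    raise ValueError('pool must be one of {"cls", "mean"}')


shapes = factorised_encoder_shapes(img_size=224, patch_size=16, num_frames=16,
                                   num_heads=8, head_dim=64)
cls_vector = pool_tokens([[1.0, 2.0], [3.0, 4.0]], pool="cls")
mean_vector = pool_tokens([[1.0, 2.0], [3.0, 4.0]], pool="mean")
```

With a 224-pixel frame and 16-pixel patches, each frame yields 14 × 14 = 196 patches. The value of `pool` only changes how the final token sequence is reduced before classification.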

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
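The difference described in the note can be illustrated with a toy class. `TinyModule` is a hypothetical stand-in written for this sketch, not `torch.nn.Module`; it only mimics the idea that calling the instance wraps `forward` with the registered hooks.

```python
class TinyModule:
    """Toy stand-in showing why the instance is called rather than forward()."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def forward(self, x):
        # The "recipe" for the computation lives here.
        return x * 2

    def __call__(self, x):
        # Calling the instance runs forward() and then every registered hook.
        output = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, x, output)
        return output


seen = []
module = TinyModule()
module.register_forward_hook(lambda mod, inputs, output: seen.append(output))

via_call = module(3)             # hooks run, so `seen` records the output
via_forward = module.forward(3)  # same result, but hooks are skipped
```

Both calls return the same value, but only the first is recorded by the hook. For the real model this means writing `model(x)` rather than `model.forward(x)`.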

class vformer.models.classification.vivit.ViViTModel3(img_size, patch_t, patch_h, patch_w, in_channels, n_classes, num_frames, embedding_dim, depth, num_heads, head_dim, p_dropout, mlp_dim=None)[source]

Model 3 implementation of: ViViT: A Video Vision Transformer

Parameters
  • img_size (int or tuple[int]) – Size of a frame

  • patch_t (int) – Temporal length of single tube/patch in tubelet embedding

  • patch_h (int) – Height of single tube/patch in tubelet embedding

  • patch_w (int) – Width of single tube/patch in tubelet embedding

  • in_channels (int) – Number of input channels

  • n_classes (int) – Number of classes

  • num_frames (int) – Number of frames in each video

  • embedding_dim (int) – Embedding dimension of a patch

  • depth (int) – Number of encoder layers

  • num_heads (int) – Number of attention heads

  • head_dim (int) – Dimension of attention head

  • p_dropout (float) – Dropout rate/probability

  • mlp_dim (int) – Hidden dimension, optional
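The relationship between the frame size, `num_frames`, and the tubelet dimensions can be sketched as follows. `tubelet_grid` is a hypothetical helper, not a vformer function. It assumes non-overlapping tubelets that tile the video exactly, and that a tuple `img_size` is ordered `(height, width)`.

```python
def tubelet_grid(img_size, num_frames, patch_t, patch_h, patch_w, in_channels):
    """Count the tubelets cut from a video clip (illustrative only)."""
    # Assumption: an int means a square frame, a tuple means (height, width).
    height, width = (img_size, img_size) if isinstance(img_size, int) else img_size
    for name, size, patch in (("num_frames", num_frames, patch_t),
                              ("height", height, patch_h),
                              ("width", width, patch_w)):
        if size % patch != 0:
            raise ValueError(f"{name}={size} is not divisible by its patch size {patch}")
    n_t, n_h, n_w = num_frames // patch_t, height // patch_h, width // patch_w
    return {
        "grid": (n_t, n_h, n_w),
        "num_tubelets": n_t * n_h * n_w,
        # Raw values flattened from one tubelet before projection to embedding_dim.
        "values_per_tubelet": in_channels * patch_t * patch_h * patch_w,
    }


grid = tubelet_grid(img_size=224, num_frames=32, patch_t=2,
                    patch_h=16, patch_w=16, in_channels=3)
```

Under these assumptions, a 32-frame clip of 224 × 224 RGB frames with 2 × 16 × 16 tubelets gives a 16 × 14 × 14 grid. Each tubelet would then be projected to a vector of size `embedding_dim`.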

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.