Video Vision Transformer

class vformer.models.classification.vivit.ViViTModel2(img_size, in_channels, patch_size, embedding_dim, num_frames, depth, num_heads, head_dim, n_classes, mlp_dim=None, pool='cls', p_dropout=0.0, attn_dropout=0.0, drop_path_rate=0.02)[source]

Model 2 (factorised encoder) implementation from ViViT: A Video Vision Transformer

Parameters

img_size (int) – Size of a single video frame (image)
in_channels (int) – Number of input channels
patch_size (int) – Patch size
embedding_dim (int) – Embedding dimension of a patch
num_frames (int) – Number of frames in each video
depth (int) – Number of encoder layers
num_heads (int) – Number of attention heads
head_dim (int) – Dimension of each attention head
n_classes (int) – Number of classes
mlp_dim (int, optional) – Dimension of the hidden layer; default is None
pool (str) – Pooling operation, must be one of {“cls”, “mean”}; default is “cls”
p_dropout (float) – Dropout probability; default is 0.0
attn_dropout (float) – Attention dropout probability; default is 0.0
drop_path_rate (float) – Stochastic drop path rate; default is 0.02
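
The relationship between img_size, patch_size, num_frames and in_channels can be checked with a little arithmetic before constructing the model. The following is a minimal sketch, assuming square frames cut into square, non-overlapping patches; the helper name factorised_encoder_shapes is hypothetical and not part of vformer, and any class tokens the model may prepend are not counted.

```python
def factorised_encoder_shapes(img_size, patch_size, num_frames, in_channels):
    """Patch bookkeeping for square frames split into square patches.

    Assumption: patches do not overlap, so img_size must be a multiple
    of patch_size. This helper is illustrative, not a vformer function.
    """
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")

    patches_per_side = img_size // patch_size
    patches_per_frame = patches_per_side ** 2

    return {
        # Spatial patches extracted from each frame
        "patches_per_frame": patches_per_frame,
        # Length of one flattened patch before projection to embedding_dim
        "patch_dim": in_channels * patch_size ** 2,
        # Patches across the whole clip
        "total_patches": num_frames * patches_per_frame,
    }


shapes = factorised_encoder_shapes(
    img_size=224, patch_size=16, num_frames=16, in_channels=3
)
print(shapes)
```

A check like this catches an img_size that is not a multiple of patch_size before any tensors are allocated.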

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
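
The difference between calling the instance and calling forward directly can be illustrated with a toy stand-in. The sketch below is a simplified pure-Python imitation of the call-then-run-hooks pattern, not PyTorch's actual implementation; ToyModule and Doubler are hypothetical names introduced only for this illustration.

```python
class ToyModule:
    """Simplified stand-in for a module with forward hooks (not PyTorch code)."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        # hook is called as hook(module, inputs, output) after forward
        self._forward_hooks.append(hook)

    def forward(self, x):
        raise NotImplementedError

    def __call__(self, x):
        # Calling the instance runs forward and then every registered hook
        output = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, (x,), output)
        return output


class Doubler(ToyModule):
    def forward(self, x):
        return 2 * x


seen = []
model = Doubler()
model.register_forward_hook(lambda module, inputs, output: seen.append(output))

via_call = model(3)             # hook runs, output is recorded in seen
via_forward = model.forward(3)  # hook is bypassed, seen is unchanged
print(via_call, via_forward, seen)
```

Both calls return the same value, but only the first one triggers the hook, which is why model(x) is preferred over model.forward(x).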

class vformer.models.classification.vivit.ViViTModel3(img_size, patch_t, patch_h, patch_w, in_channels, n_classes, num_frames, embedding_dim, depth, num_heads, head_dim, p_dropout, mlp_dim=None)[source]

Model 3 (factorised self-attention) implementation from ViViT: A Video Vision Transformer

Parameters

img_size (int or tuple[int]) – Size of a frame
patch_t (int) – Temporal length of a single tube/patch in tubelet embedding
patch_h (int) – Height of a single tube/patch in tubelet embedding
patch_w (int) – Width of a single tube/patch in tubelet embedding
in_channels (int) – Number of input channels
n_classes (int) – Number of classes
num_frames (int) – Number of frames in each video
embedding_dim (int) – Embedding dimension of a patch
depth (int) – Number of encoder layers
num_heads (int) – Number of attention heads
head_dim (int) – Dimension of each attention head
p_dropout (float) – Dropout rate/probability
mlp_dim (int, optional) – Hidden dimension; default is None
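
How patch_t, patch_h and patch_w divide a clip into tubes can be worked out by hand. The sketch below assumes non-overlapping tubelets, so each clip dimension must be a multiple of the matching patch dimension; the helper name tubelet_counts is hypothetical and not part of vformer.

```python
def tubelet_counts(img_size, patch_t, patch_h, patch_w, num_frames, in_channels):
    """Count non-overlapping tubes in a clip and the size of one flattened tube.

    img_size may be an int (square frame) or a (height, width) tuple.
    This helper is illustrative, not a vformer function.
    """
    if isinstance(img_size, int):
        height, width = img_size, img_size
    else:
        height, width = img_size

    for total, step, name in (
        (num_frames, patch_t, "num_frames/patch_t"),
        (height, patch_h, "height/patch_h"),
        (width, patch_w, "width/patch_w"),
    ):
        if total % step != 0:
            raise ValueError(f"{name}: {total} is not divisible by {step}")

    n_t = num_frames // patch_t  # tubes along time
    n_h = height // patch_h      # tubes along height
    n_w = width // patch_w       # tubes along width

    return {
        "num_tubes": n_t * n_h * n_w,
        # Values in one tube before projection to embedding_dim
        "tube_dim": in_channels * patch_t * patch_h * patch_w,
    }


square = tubelet_counts(224, patch_t=2, patch_h=16, patch_w=16,
                        num_frames=16, in_channels=3)
rect = tubelet_counts((224, 112), patch_t=2, patch_h=16, patch_w=16,
                      num_frames=16, in_channels=3)
print(square, rect)
```

Halving patch_t doubles the number of tubes, so the sequence length grows quickly as the temporal patch shrinks.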

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.