Kseniya Parkhamchuk

Normalisation and Attention

HEAD_DIM – the dimensionality of a single attention head (EMB_DIM / N_HEADS)

N_HEADS – the number of heads in the attention layer, an architecture hyperparameter
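
As a quick worked example (the concrete numbers are purely illustrative; GPT-2 small happens to use them), the relationship between these constants looks like this:

```python
EMB_DIM = 768                  # embedding dimensionality
N_HEADS = 12                   # number of attention heads (hyperparameter)
HEAD_DIM = EMB_DIM // N_HEADS  # 768 // 12 = 64, dimensionality of one head
```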

Below is a table with the dimensions. The table assumes training mode.

| Value | Dimensionality | Reasoning |
| --- | --- | --- |
| Normalisation layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The dimensionality does not change; the purpose of the layer is to normalise the embeddings |
| β, γ (normalisation layer parameters) | [EMB_DIM] | 1D tensors of the embedding dimensionality, so they can be multiplied with and added to the embeddings element-wise; each position of a token embedding is scaled by the corresponding position of the γ vector (and shifted by β) |
| Mean and variance inside the normalisation layer | [BATCH_SIZE, SEQ_LEN, 1] | A separate mean and variance is computed for each token embedding, and the same mean and variance are applied to every dimension of that token |
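
To make these shapes concrete, here is a minimal PyTorch sketch (the batch size, sequence length and embedding size are purely illustrative). It recomputes the per-token mean and variance by hand and checks the parameter shapes of nn.LayerNorm:

```python
import torch
import torch.nn as nn

# Illustrative shapes only
BATCH_SIZE, SEQ_LEN, EMB_DIM = 2, 16, 768

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)       # normalisation layer input

ln = nn.LayerNorm(EMB_DIM)                           # holds γ (weight) and β (bias)
print(ln.weight.shape, ln.bias.shape)                # torch.Size([768]) torch.Size([768])

# Mean and variance are computed per token, over the embedding dimension
mean = x.mean(dim=-1, keepdim=True)                  # [BATCH_SIZE, SEQ_LEN, 1]
var = x.var(dim=-1, unbiased=False, keepdim=True)    # [BATCH_SIZE, SEQ_LEN, 1]

x_norm = (x - mean) / torch.sqrt(var + ln.eps)       # broadcasts over EMB_DIM
out = x_norm * ln.weight + ln.bias                   # scale by γ, shift by β
print(out.shape)                                     # torch.Size([2, 16, 768])
```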

Attention block

The dimensions inside the attention mechanism are given for a single attention head.

| Value | Dimensionality | Reasoning |
| --- | --- | --- |
| W_q, W_k, W_v (query, key, value projection matrices), learnt parameters | [EMB_DIM, HEAD_DIM] | Projecting onto the W matrices reduces the initial dimensionality from EMB_DIM to HEAD_DIM; the [EMB_DIM, HEAD_DIM] shape makes this possible |
| Q, K, V matrices | [BATCH_SIZE, SEQ_LEN, HEAD_DIM] | The outputs of projecting the input onto W_q, W_k, W_v; the embedding dimensionality is reduced so that each attention head can focus on a separate subset of features |
| Attention scores (Q x Kᵀ) | [BATCH_SIZE, SEQ_LEN, SEQ_LEN] | Q x Kᵀ (K is transposed to satisfy the matrix multiplication rules); the scores indicate how relevant the tokens are to each other |
| Attention head output (attention weights x V) | [BATCH_SIZE, SEQ_LEN, HEAD_DIM] | attn_weights x V = [BATCH_SIZE, SEQ_LEN, SEQ_LEN] x [BATCH_SIZE, SEQ_LEN, HEAD_DIM]; by the matrix multiplication rules, the last dimension becomes HEAD_DIM |
| Attention layer concatenation output (all heads concatenated) | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | The outputs of all heads are concatenated along the last dimension to restore the original embedding dimensionality |
| Final projection | [EMB_DIM, EMB_DIM] | Preserves the dimensions of the attention output and mixes the features learnt by the individual heads |
| Bias of the final projection | [EMB_DIM] | Must match the projection matrix dimensionality; it is added at every position of the projected output |
| Attention layer output | [BATCH_SIZE, SEQ_LEN, EMB_DIM] | attention_output x projection, following the matrix multiplication rules |
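
The same shape bookkeeping can be verified with a short PyTorch sketch of a single attention head followed by the head concatenation and the final projection. The sizes are illustrative, and the concatenation simply repeats one head N_HEADS times to demonstrate the resulting shape (a real implementation would use N_HEADS independent heads):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative shapes only
BATCH_SIZE, SEQ_LEN, EMB_DIM, N_HEADS = 2, 16, 768, 12
HEAD_DIM = EMB_DIM // N_HEADS                         # 64

x = torch.randn(BATCH_SIZE, SEQ_LEN, EMB_DIM)         # normalised embeddings

# One attention head: learnt projections reduce EMB_DIM to HEAD_DIM
W_q = nn.Linear(EMB_DIM, HEAD_DIM, bias=False)
W_k = nn.Linear(EMB_DIM, HEAD_DIM, bias=False)
W_v = nn.Linear(EMB_DIM, HEAD_DIM, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)                      # [BATCH_SIZE, SEQ_LEN, HEAD_DIM]

scores = Q @ K.transpose(-2, -1) / HEAD_DIM ** 0.5    # [BATCH_SIZE, SEQ_LEN, SEQ_LEN]
weights = F.softmax(scores, dim=-1)                   # [BATCH_SIZE, SEQ_LEN, SEQ_LEN]
head_out = weights @ V                                # [BATCH_SIZE, SEQ_LEN, HEAD_DIM]

# Concatenating N_HEADS head outputs restores the embedding size;
# the same head is repeated here only to show the shape
concat = torch.cat([head_out] * N_HEADS, dim=-1)      # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

# Final projection mixes features across heads and keeps the shape;
# its weight is [EMB_DIM, EMB_DIM] and its bias is [EMB_DIM]
W_o = nn.Linear(EMB_DIM, EMB_DIM, bias=True)
out = W_o(concat)                                     # [BATCH_SIZE, SEQ_LEN, EMB_DIM]

print(Q.shape, scores.shape, head_out.shape, concat.shape, out.shape)
```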