
🧠 Essential Attention Mechanisms in Deep Learning: The Neural Network Technique You Need to Master!

Hey there! Ready to dive into attention mechanisms in deep learning? This friendly guide will walk you through everything step by step with easy-to-follow examples. Perfect for beginners and pros alike!

SuperML Team

🚀 Understanding Attention Mechanism Fundamentals - Made Simple!

💡 Pro tip: This is one of those techniques that will make you look like a data science wizard!

The attention mechanism revolutionizes how neural networks process sequential data by implementing a dynamic weighting system. It lets models selectively focus on different parts of the input sequence when generating each element of the output sequence, similar to how humans pay attention to specific details.

Let’s break this down together! Here’s how we can tackle this:

import numpy as np

def attention_score(query, key):
    """
    Computes basic scaled dot-product attention weights
    query: shape (query_len, d_k)
    key: shape (key_len, d_k)
    """
    # Compute dot product attention
    scores = np.dot(query, key.T)
    
    # Scale by sqrt(d_k) to prevent exploding gradients
    d_k = query.shape[-1]
    scaled_scores = scores / np.sqrt(d_k)
    
    # Apply softmax for a probability distribution (subtract max for numerical stability)
    scaled_scores -= np.max(scaled_scores, axis=-1, keepdims=True)
    attention_weights = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=-1, keepdims=True)
    
    return attention_weights

# Example usage
query = np.random.randn(4, 8)  # 4 queries, dimension 8
key = np.random.randn(6, 8)    # 6 keys, dimension 8
weights = attention_score(query, key)
print("Attention Weights Shape:", weights.shape)
print("Sample Weights:\n", weights[0])  # Weights for first query

🚀 Implementing Self-Attention Layer - Made Simple!

🎉 You’re doing great! This concept might seem tricky at first, but you’ve got this!

Self-attention allows a sequence to attend to itself, capturing relationships between all positions. This example shows you the core mathematical operations behind self-attention, including the query, key, and value transformations that form the foundation of modern attention mechanisms.

Let me walk you through this step by step! Here’s how we can tackle this:

class SelfAttention:
    def __init__(self, embed_dim):
        self.embed_dim = embed_dim
        # Initialize transformation matrices
        self.W_q = np.random.randn(embed_dim, embed_dim)
        self.W_k = np.random.randn(embed_dim, embed_dim)
        self.W_v = np.random.randn(embed_dim, embed_dim)
    
    def forward(self, X):
        """
        X: Input sequence (batch_size, seq_len, embed_dim)
        """
        # Generate Q, K, V matrices
        Q = np.dot(X, self.W_q)  # Query
        K = np.dot(X, self.W_k)  # Key
        V = np.dot(X, self.W_v)  # Value
        
        # Compute attention scores (batched matrix multiply)
        scores = np.matmul(Q, K.transpose(0, 2, 1))
        scores = scores / np.sqrt(self.embed_dim)
        
        # Apply softmax (subtract max for numerical stability)
        scores -= np.max(scores, axis=-1, keepdims=True)
        attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
        
        # Compute weighted sum of values
        output = np.matmul(attention_weights, V)
        return output, attention_weights

# Example usage
batch_size, seq_len, embed_dim = 2, 4, 8
X = np.random.randn(batch_size, seq_len, embed_dim)
attention = SelfAttention(embed_dim)
output, weights = attention.forward(X)
print("Output shape:", output.shape)
print("Attention weights shape:", weights.shape)

🚀 Multi-Head Attention Implementation - Made Simple!

Cool fact: Many professional data scientists use this exact approach in their daily work!

Multi-head attention extends single-head attention by allowing the model to jointly attend to information from different representation subspaces. This parallel processing lets the model capture various types of relationships within the data simultaneously.

Let’s make this super clear! Here’s how we can tackle this:

class MultiHeadAttention:
    def __init__(self, embed_dim, num_heads):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        
        # Initialize weights for each head
        self.W_q = [np.random.randn(embed_dim, self.head_dim) for _ in range(num_heads)]
        self.W_k = [np.random.randn(embed_dim, self.head_dim) for _ in range(num_heads)]
        self.W_v = [np.random.randn(embed_dim, self.head_dim) for _ in range(num_heads)]
        self.W_o = np.random.randn(embed_dim, embed_dim)
    
    def attention(self, Q, K, V):
        # Batched scaled dot-product attention for a single head
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(self.head_dim)
        scores -= np.max(scores, axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
        return np.matmul(weights, V)
    
    def forward(self, X):
        batch_size, seq_len, _ = X.shape
        
        # Process each attention head
        head_outputs = []
        for i in range(self.num_heads):
            Q = np.dot(X, self.W_q[i])
            K = np.dot(X, self.W_k[i])
            V = np.dot(X, self.W_v[i])
            head_output = self.attention(Q, K, V)
            head_outputs.append(head_output)
        
        # Concatenate and project
        multi_head_output = np.concatenate(head_outputs, axis=-1)
        final_output = np.dot(multi_head_output, self.W_o)
        return final_output

# Example usage
embed_dim, num_heads = 512, 8
mha = MultiHeadAttention(embed_dim, num_heads)
x = np.random.randn(1, 10, embed_dim)  # Batch size 1, sequence length 10
output = mha.forward(x)
print("Multi-head attention output shape:", output.shape)

🚀 Scaled Dot-Product Attention Mathematics - Made Simple!

🔥 Level up: Once you master this, you’ll be solving problems like a pro!

The mathematical foundation of attention mechanisms relies on the scaled dot-product operation, which prevents gradient issues in deep networks. The 1/sqrt(d_k) scaling factor is crucial because it keeps gradients stable during training, especially with large dimension sizes.
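
To see why that scaling matters, here's a tiny sanity check (a standalone sketch, separate from the main example below): without the 1/sqrt(d_k) factor, the variance of the dot products grows linearly with d_k, which pushes the softmax into saturated, tiny-gradient regions.

import numpy as np

np.random.seed(0)
for d_k in [8, 64, 512]:
    q = np.random.randn(10000, d_k)
    k = np.random.randn(10000, d_k)
    raw = np.sum(q * k, axis=-1)        # unscaled dot products
    scaled = raw / np.sqrt(d_k)         # scaled dot products
    print(f"d_k={d_k:4d}  var(raw)={raw.var():8.1f}  var(scaled)={scaled.var():.2f}")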

Don’t worry, this is easier than it looks! Here’s how we can tackle this:

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    builds the mathematical formula:
    Attention(Q,K,V) = softmax(QK^T/sqrt(d_k))V
    """
    d_k = query.shape[-1]
    
    # Compute attention scores
    attention_logits = np.matmul(query, np.transpose(key, (0, 2, 1)))
    attention_logits = attention_logits / np.sqrt(d_k)
    
    # Apply mask if provided
    if mask is not None:
        attention_logits += (mask * -1e9)
    
    # Softmax normalization
    attention_weights = np.exp(attention_logits) / np.sum(np.exp(attention_logits), axis=-1, keepdims=True)
    
    # Compute output
    output = np.matmul(attention_weights, value)
    
    return output, attention_weights

# Example with dimensions
batch_size, num_heads, seq_length, depth = 2, 4, 6, 8
q = np.random.random((batch_size, num_heads, seq_length, depth))
k = np.random.random((batch_size, num_heads, seq_length, depth))
v = np.random.random((batch_size, num_heads, seq_length, depth))

output, weights = scaled_dot_product_attention(q, k, v)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

🚀 Positional Encoding Implementation - Made Simple!

Positional encoding adds information about the position of tokens in a sequence, enabling the attention mechanism to consider sequential order. This example shows you both sinusoidal and learned positional encodings, crucial for sequence processing.

Let me walk you through this step by step! Here’s how we can tackle this:

import numpy as np

def sinusoidal_positional_encoding(position, d_model):
    """
    Implements the sinusoidal positional encoding formula:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    position_enc = np.zeros((position, d_model))
    
    for pos in range(position):
        for i in range(0, d_model, 2):
            # i already steps by 2, so the exponent is i/d_model
            div_term = np.exp(-(i * np.log(10000.0) / d_model))
            position_enc[pos, i] = np.sin(pos * div_term)
            if i + 1 < d_model:
                position_enc[pos, i + 1] = np.cos(pos * div_term)
    
    return position_enc

class LearnedPositionalEncoding:
    def __init__(self, max_position, d_model):
        self.encoding = np.random.randn(max_position, d_model) * 0.1
    
    def __call__(self, position):
        return self.encoding[position]

# Example usage
seq_length, d_model = 100, 512

# Sinusoidal encoding
sin_pos_encoding = sinusoidal_positional_encoding(seq_length, d_model)
print("Sinusoidal Positional Encoding shape:", sin_pos_encoding.shape)

# Learned encoding
learned_pos_encoding = LearnedPositionalEncoding(seq_length, d_model)
sample_position = 5
print("Learned Positional Encoding for position 5:", 
      learned_pos_encoding(sample_position).shape)

# Visualize encoding patterns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
plt.imshow(sin_pos_encoding[:20, :20], cmap='viridis', aspect='auto')
plt.colorbar()
plt.title('First 20x20 of Positional Encoding Matrix')
plt.show()
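
In practice the encoding is simply added to the token embeddings before the first attention layer. Here's a minimal sketch reusing the matrix computed above (the embeddings below are random placeholders, not real token vectors):

# Add positional information to a batch of token embeddings
token_embeddings = np.random.randn(2, seq_length, d_model)        # placeholder embeddings
encoded_inputs = token_embeddings + sin_pos_encoding[np.newaxis]  # broadcast over the batch
print("Encoded inputs shape:", encoded_inputs.shape)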

🚀 Attention Masking for Sequence Processing - Made Simple!

Masking is essential in attention mechanisms to stop certain positions from being attended to, such as padding tokens or, during training, future tokens. This example shows how to create and apply different types of attention masks, including padding masks and causal masks.

Let’s break this down together! Here’s how we can tackle this:

import numpy as np

def create_padding_mask(seq_length, valid_length):
    """
    Creates a mask for padding tokens in sequences
    """
    mask = np.zeros((seq_length, seq_length))
    mask[:, valid_length:] = -np.inf
    return mask

def create_causal_mask(seq_length):
    """
    Creates a mask for causal attention (cannot look at future tokens)
    """
    mask = np.triu(np.ones((seq_length, seq_length)) * -np.inf, k=1)
    return mask

def apply_attention_mask(scores, mask):
    """
    Applies the mask to attention scores
    """
    masked_scores = scores + mask
    return masked_scores

# Example usage
seq_length = 6
valid_length = 4

# Create masks
padding_mask = create_padding_mask(seq_length, valid_length)
causal_mask = create_causal_mask(seq_length)

# Generate sample attention scores
attention_scores = np.random.randn(seq_length, seq_length)

# Apply masks
masked_padding = apply_attention_mask(attention_scores, padding_mask)
masked_causal = apply_attention_mask(attention_scores, causal_mask)

print("Original Attention Scores:\n", attention_scores)
print("\nPadding Masked Scores:\n", masked_padding)
print("\nCausal Masked Scores:\n", masked_causal)

# Visualize masks
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.imshow(padding_mask, cmap='viridis')
ax1.set_title('Padding Mask')

ax2.imshow(causal_mask, cmap='viridis')
ax2.set_title('Causal Mask')

plt.show()
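
To confirm the masks do what we expect, push the masked scores through a softmax and check that masked positions receive zero weight (a quick sketch reusing the arrays above):

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

causal_weights = softmax(masked_causal)
print("\nCausal attention weights (upper triangle should be 0):\n",
      np.round(causal_weights, 3))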

🚀 Implementing Cross-Attention Mechanism - Made Simple!

Cross-attention lets a model attend to information from a different sequence, which is crucial for tasks like translation or question answering. This example shows how to create a cross-attention layer that processes queries from one sequence while using keys and values from another.

Let’s make this super clear! Here’s how we can tackle this:

class CrossAttention:
    def __init__(self, query_dim, key_dim, value_dim, output_dim):
        self.W_q = np.random.randn(query_dim, output_dim) * 0.1
        self.W_k = np.random.randn(key_dim, output_dim) * 0.1
        self.W_v = np.random.randn(value_dim, output_dim) * 0.1
        self.output_dim = output_dim
    
    def forward(self, query_seq, key_value_seq):
        """
        query_seq: shape (batch_size, query_len, query_dim)
        key_value_seq: shape (batch_size, kv_len, key_dim)
        """
        # Project inputs
        Q = np.dot(query_seq, self.W_q)
        K = np.dot(key_value_seq, self.W_k)
        V = np.dot(key_value_seq, self.W_v)
        
        # Compute attention scores (batched matrix multiply)
        scores = np.matmul(Q, K.transpose(0, 2, 1))
        scaled_scores = scores / np.sqrt(self.output_dim)
        
        # Apply softmax (subtract max for numerical stability)
        scaled_scores -= np.max(scaled_scores, axis=-1, keepdims=True)
        attention_weights = np.exp(scaled_scores) / np.sum(
            np.exp(scaled_scores), axis=-1, keepdims=True
        )
        
        # Compute weighted sum of values
        output = np.matmul(attention_weights, V)
        return output, attention_weights

# Example usage
batch_size = 2
query_len, kv_len = 5, 8
query_dim, key_dim, value_dim = 64, 64, 64
output_dim = 32

# Create sample sequences
query_seq = np.random.randn(batch_size, query_len, query_dim)
key_value_seq = np.random.randn(batch_size, kv_len, key_dim)

# Initialize and apply cross-attention
cross_attention = CrossAttention(query_dim, key_dim, value_dim, output_dim)
output, weights = cross_attention.forward(query_seq, key_value_seq)

print("Output shape:", output.shape)
print("Attention weights shape:", weights.shape)

🚀 Attention Visualization Tools - Made Simple!

Understanding attention patterns is super important for model interpretation. This example provides tools to visualize attention weights and analyze how the model attends to different parts of the input sequence.

Let’s make this super clear! Here’s how we can tackle this:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

class AttentionVisualizer:
    def __init__(self):
        self.attention_maps = []
    
    def plot_attention_weights(self, weights, tokens_source=None, tokens_target=None):
        """
        Visualizes attention weights between source and target sequences
        weights: shape (target_len, source_len)
        """
        plt.figure(figsize=(10, 8))
        sns.heatmap(
            weights,
            xticklabels=tokens_source if tokens_source else 'auto',
            yticklabels=tokens_target if tokens_target else 'auto',
            cmap='viridis',
            annot=True,
            fmt='.2f'
        )
        plt.xlabel('Source Tokens')
        plt.ylabel('Target Tokens')
        plt.title('Attention Weights Visualization')
        return plt.gcf()
    
    def analyze_attention_patterns(self, weights):
        """
        Analyzes attention patterns and returns statistics
        """
        avg_attention = np.mean(weights, axis=0)
        max_attention = np.max(weights, axis=0)
        entropy = -np.sum(weights * np.log(weights + 1e-9), axis=1)
        
        return {
            'average_attention': avg_attention,
            'max_attention': max_attention,
            'attention_entropy': entropy
        }

# Example usage
source_tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat']
target_tokens = ['Le', 'chat', 'est', 'assis', 'sur', 'le', 'tapis']

# Generate sample attention weights
sample_weights = np.random.rand(len(target_tokens), len(source_tokens))
sample_weights = sample_weights / sample_weights.sum(axis=1, keepdims=True)

# Create visualizer and plot
visualizer = AttentionVisualizer()
attention_fig = visualizer.plot_attention_weights(
    sample_weights,
    source_tokens,
    target_tokens
)

# Analyze patterns
patterns = visualizer.analyze_attention_patterns(sample_weights)
print("\nAttention Analysis:")
print("Average attention per source token:", patterns['average_attention'])
print("Maximum attention per source token:", patterns['max_attention'])
print("Attention entropy per target token:", patterns['attention_entropy'])

🚀 Implementing Attention with PyTorch - Made Simple!

This example shows you a production-ready attention mechanism using PyTorch; you write the forward pass and autograd takes care of backpropagation. The implementation includes dropout on the attention weights and scaling for numerical stability.

Let’s make this super clear! Here’s how we can tackle this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads, dropout=0.1):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        
        # Linear layers for Q, K, V projections
        self.q_linear = nn.Linear(hidden_dim, hidden_dim)
        self.k_linear = nn.Linear(hidden_dim, hidden_dim)
        self.v_linear = nn.Linear(hidden_dim, hidden_dim)
        self.out_linear = nn.Linear(hidden_dim, hidden_dim)
        
        self.dropout = nn.Dropout(dropout)
        self.scale = self.head_dim ** 0.5  # plain float, avoids device/dtype mismatches
    
    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        
        # Linear projections and reshape for multi-head
        Q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.k_linear(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.v_linear(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Apply attention to values
        context = torch.matmul(attention_weights, V)
        
        # Reshape and apply output projection
        context = context.transpose(1, 2).contiguous().view(
            batch_size, -1, self.hidden_dim
        )
        output = self.out_linear(context)
        
        return output, attention_weights

# Example usage
hidden_dim = 512
num_heads = 8
seq_length = 10
batch_size = 4

model = EfficientAttention(hidden_dim, num_heads)
x = torch.randn(batch_size, seq_length, hidden_dim)
mask = torch.ones(batch_size, num_heads, seq_length, seq_length)

output, weights = model(x, x, x, mask)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

🚀 Real-World Application - Neural Machine Translation - Made Simple!

This example shows how attention mechanisms are used in neural machine translation: a bidirectional LSTM encoder, an attentive LSTM decoder, and a toy forward pass you can inspect (see the visualization sketch after the example).

Let’s make this super clear! Here’s how we can tackle this:

import torch
import torch.nn as nn
import numpy as np
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class AttentionNMT(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, embed_dim)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
        
        self.encoder = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(embed_dim + hidden_dim*2, hidden_dim, batch_first=True)
        
        self.attention = nn.Linear(hidden_dim*3, 1)
        self.output_layer = nn.Linear(hidden_dim, tgt_vocab_size)
        
    def encode(self, src, src_lengths):
        embedded = self.encoder_embedding(src)
        packed = pack_padded_sequence(embedded, src_lengths, batch_first=True, enforce_sorted=False)
        outputs, (hidden, cell) = self.encoder(packed)
        outputs, _ = pad_packed_sequence(outputs, batch_first=True)
        return outputs, hidden, cell
    
    def attend(self, decoder_state, encoder_outputs):
        attention_inputs = torch.cat([
            decoder_state.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1),
            encoder_outputs
        ], dim=-1)
        
        scores = self.attention(attention_inputs).squeeze(-1)
        weights = torch.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs)
        return context.squeeze(1), weights
    
    def decode_step(self, encoder_outputs, decoder_input, decoder_state):
        embedded = self.decoder_embedding(decoder_input)
        context, weights = self.attend(decoder_state[0][-1], encoder_outputs)
        
        lstm_input = torch.cat([embedded, context.unsqueeze(1)], dim=-1)
        output, (hidden, cell) = self.decoder(lstm_input, decoder_state)
        
        predictions = self.output_layer(output.squeeze(1))
        return predictions, (hidden, cell), weights

    def forward(self, src, tgt, src_lengths):
        encoder_outputs, hidden, cell = self.encode(src, src_lengths)
        
        # The bidirectional encoder returns states for both directions; keep the
        # last one so shapes match the single-layer unidirectional decoder
        decoder_state = (hidden[-1:].contiguous(), cell[-1:].contiguous())
        outputs = []
        attentions = []
        
        for t in range(tgt.size(1) - 1):
            decoder_input = tgt[:, t:t+1]
            predictions, decoder_state, weights = self.decode_step(
                encoder_outputs, decoder_input, decoder_state
            )
            outputs.append(predictions)
            attentions.append(weights)
            
        return torch.stack(outputs, dim=1), torch.stack(attentions, dim=1)

# Example usage with toy data
src_vocab_size = 1000
tgt_vocab_size = 1000
embed_dim = 256
hidden_dim = 512

model = AttentionNMT(src_vocab_size, tgt_vocab_size, embed_dim, hidden_dim)

# Sample batch
batch_size = 4
src_seq_len = 10
tgt_seq_len = 12

src = torch.randint(0, src_vocab_size, (batch_size, src_seq_len))
tgt = torch.randint(0, tgt_vocab_size, (batch_size, tgt_seq_len))
src_lengths = torch.randint(5, src_seq_len+1, (batch_size,))

outputs, attentions = model(src, tgt, src_lengths)
print(f"Outputs shape: {outputs.shape}")
print(f"Attention weights shape: {attentions.shape}")

🚀 Real-World Application - Document Classification with Hierarchical Attention - Made Simple!

This example shows you a hierarchical attention network for document classification, which applies attention at both word and sentence levels, commonly used for sentiment analysis and text categorization.

Let’s break this down together! Here’s how we can tackle this:

import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim, word_hidden_dim, 
                 sent_hidden_dim, num_classes):
        super().__init__()
        # Word level
        self.word_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.word_gru = nn.GRU(embed_dim, word_hidden_dim, bidirectional=True, 
                              batch_first=True)
        self.word_attention = nn.Linear(2*word_hidden_dim, 2*word_hidden_dim)
        self.word_context = nn.Parameter(torch.randn(2*word_hidden_dim))
        
        # Sentence level
        self.sent_gru = nn.GRU(2*word_hidden_dim, sent_hidden_dim, 
                              bidirectional=True, batch_first=True)
        self.sent_attention = nn.Linear(2*sent_hidden_dim, 2*sent_hidden_dim)
        self.sent_context = nn.Parameter(torch.randn(2*sent_hidden_dim))
        
        # Classification
        self.fc = nn.Linear(2*sent_hidden_dim, num_classes)
        self.dropout = nn.Dropout(0.5)
    
    def word_level_attention(self, word_hidden, word_mask):
        """Apply attention at word level"""
        word_att = torch.tanh(self.word_attention(word_hidden))
        word_att = torch.matmul(word_att, self.word_context)
        
        if word_mask is not None:
            word_att = word_att.masked_fill(word_mask == 0, float('-inf'))
        
        word_att_weights = torch.softmax(word_att, dim=1)
        word_att_out = torch.bmm(word_att_weights.unsqueeze(1), word_hidden)
        return word_att_out.squeeze(1), word_att_weights
    
    def sentence_level_attention(self, sent_hidden, sent_mask):
        """Apply attention at sentence level"""
        sent_att = torch.tanh(self.sent_attention(sent_hidden))
        sent_att = torch.matmul(sent_att, self.sent_context)
        
        if sent_mask is not None:
            sent_att = sent_att.masked_fill(sent_mask == 0, float('-inf'))
        
        sent_att_weights = torch.softmax(sent_att, dim=1)
        sent_att_out = torch.bmm(sent_att_weights.unsqueeze(1), sent_hidden)
        return sent_att_out.squeeze(1), sent_att_weights
    
    def forward(self, documents, word_mask=None, sent_mask=None):
        """
        documents: (batch_size, num_sentences, num_words)
        word_mask: (batch_size, num_sentences, num_words)
        sent_mask: (batch_size, num_sentences)
        """
        batch_size, num_sentences, num_words = documents.size()
        
        # Process each sentence
        sentence_vectors = []
        word_attention_weights = []
        
        for i in range(num_sentences):
            # Word embedding
            words = documents[:, i, :]
            word_embed = self.word_embedding(words)
            
            # Word encoder
            word_hidden, _ = self.word_gru(word_embed)
            
            # Word attention
            curr_word_mask = word_mask[:, i, :] if word_mask is not None else None
            sent_vector, word_weights = self.word_level_attention(
                word_hidden, curr_word_mask)
            
            sentence_vectors.append(sent_vector)
            word_attention_weights.append(word_weights)
        
        # Stack sentence vectors
        sentence_vectors = torch.stack(sentence_vectors, dim=1)
        
        # Sentence encoder
        sent_hidden, _ = self.sent_gru(sentence_vectors)
        
        # Sentence attention
        doc_vector, sent_attention_weights = self.sentence_level_attention(
            sent_hidden, sent_mask)
        
        # Classification
        doc_vector = self.dropout(doc_vector)
        output = self.fc(doc_vector)
        
        return output, word_attention_weights, sent_attention_weights

# Example usage
vocab_size = 10000
embed_dim = 200
word_hidden_dim = 100
sent_hidden_dim = 100
num_classes = 5
batch_size = 16
num_sentences = 10
num_words = 50

model = HierarchicalAttention(vocab_size, embed_dim, word_hidden_dim, 
                            sent_hidden_dim, num_classes)

# Sample batch
documents = torch.randint(0, vocab_size, (batch_size, num_sentences, num_words))
word_mask = torch.ones(batch_size, num_sentences, num_words)
sent_mask = torch.ones(batch_size, num_sentences)

output, word_weights, sent_weights = model(documents, word_mask, sent_mask)
print(f"Output shape: {output.shape}")
print(f"Word attention weights shape: {word_weights[0].shape}")
print(f"Sentence attention weights shape: {sent_weights.shape}")

🚀 Transformer Block Implementation with Advanced Attention - Made Simple!

This example shows a complete transformer block including multi-head attention with a relative positional bias, layer normalization, and a feed-forward network. A short gradient-checkpointing sketch follows the code.

Let’s make this super clear! Here’s how we can tackle this:

import torch
import torch.nn as nn
import math

class TransformerBlock(nn.Module):
    def __init__(self, hidden_dim, num_heads, ff_dim, dropout=0.1,
                 use_relative_pos=True):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.use_relative_pos = use_relative_pos
        
        # Multi-head attention
        self.q_linear = nn.Linear(hidden_dim, hidden_dim)
        self.k_linear = nn.Linear(hidden_dim, hidden_dim)
        self.v_linear = nn.Linear(hidden_dim, hidden_dim)
        self.out_linear = nn.Linear(hidden_dim, hidden_dim)
        
        # Relative positional encoding: one learned scalar bias per head and relative offset
        if use_relative_pos:
            self.max_rel_pos = 1024  # assumed maximum sequence length
            self.rel_pos_embed = nn.Parameter(
                torch.randn(2 * self.max_rel_pos - 1, num_heads) * 0.02)
            
        # Layer normalization
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        
        # Feed-forward network
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, hidden_dim)
        )
        
        self.dropout = nn.Dropout(dropout)
        
    def relative_position_to_absolute(self, x):
        """Convert relative position representation to absolute"""
        batch_size, num_heads, seq_length, _ = x.size()
        
        # Pad for shifting
        col_pad = torch.zeros(batch_size, num_heads, seq_length, 1,
                            device=x.device)
        x = torch.cat([x, col_pad], dim=-1)
        
        flat_x = x.view(batch_size, num_heads, seq_length * 2 - 1)
        flat_pad = torch.zeros(batch_size, num_heads, seq_length - 1,
                             device=x.device)
        
        flat_x_padded = torch.cat([flat_x, flat_pad], dim=-1)
        
        # Reshape and slice out the padded elements
        final_x = flat_x_padded.view(batch_size, num_heads, seq_length + 1,
                                   seq_length)
        final_x = final_x[:, :, :seq_length, :]
        
        return final_x
    
    def forward(self, x, mask=None):
        batch_size, seq_length, _ = x.size()
        
        # Multi-head attention
        q = self.q_linear(x).view(batch_size, seq_length, self.num_heads,
                                 self.head_dim).transpose(1, 2)
        k = self.k_linear(x).view(batch_size, seq_length, self.num_heads,
                                 self.head_dim).transpose(1, 2)
        v = self.v_linear(x).view(batch_size, seq_length, self.num_heads,
                                 self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        
        # Add relative positional encoding
        if self.use_relative_pos:
            rel_pos_bias = self.compute_relative_position_bias(seq_length)
            scores = scores + rel_pos_bias
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        context = torch.matmul(attention_weights, v)
        
        # Reshape and apply output projection
        context = context.transpose(1, 2).contiguous().view(
            batch_size, seq_length, self.hidden_dim)
        out = self.out_linear(context)
        
        # Residual connection and layer normalization
        x = self.norm1(x + self.dropout(out))
        
        # Feed-forward network
        ff_out = self.ff(x)
        
        # Residual connection and layer normalization
        x = self.norm2(x + self.dropout(ff_out))
        
        return x, attention_weights
    
    def compute_relative_position_bias(self, seq_length):
        """Compute a per-head relative positional bias of shape (1, heads, seq, seq)"""
        if not self.use_relative_pos:
            return 0
            
        positions = torch.arange(seq_length, device=self.rel_pos_embed.device)
        relative_positions = positions.unsqueeze(1) - positions.unsqueeze(0)
        relative_positions += self.max_rel_pos - 1  # shift to non-negative indices
        
        bias = self.rel_pos_embed[relative_positions]  # (seq, seq, num_heads)
        return bias.permute(2, 0, 1).unsqueeze(0)      # (1, num_heads, seq, seq)

# Example usage
hidden_dim = 512
num_heads = 8
ff_dim = 2048
seq_length = 100
batch_size = 16

transformer = TransformerBlock(hidden_dim, num_heads, ff_dim)
x = torch.randn(batch_size, seq_length, hidden_dim)
mask = torch.ones(batch_size, num_heads, seq_length, seq_length)

output, attention = transformer(x, mask)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention.shape}")

🚀 Advanced Attention Regularization and Loss Functions - Made Simple!

This example shows you advanced techniques for regularizing attention mechanisms and specialized loss functions that improve attention learning, including entropy regularization, sparse attention penalties, and a coverage penalty.

Let’s make this super clear! Here’s how we can tackle this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRegularizer:
    def __init__(self, entropy_weight=0.1, sparsity_weight=0.1, coverage_weight=0.1):
        self.entropy_weight = entropy_weight
        self.sparsity_weight = sparsity_weight
        self.coverage_weight = coverage_weight
    
    def entropy_loss(self, attention_weights):
        """
        Encourages attention to be more focused or dispersed
        Lower entropy = more focused attention
        """
        eps = 1e-7
        entropy = -(attention_weights * torch.log(attention_weights + eps)).sum(dim=-1)
        return entropy.mean()
    
    def sparsity_loss(self, attention_weights):
        """
        Encourages sparse attention distributions
        Uses L1 regularization on attention weights
        """
        return torch.norm(attention_weights, p=1, dim=-1).mean()
    
    def coverage_loss(self, attention_weights, coverage_vector):
        """
        Discourages repeatedly attending to the same input tokens by penalizing
        overlap between the accumulated coverage and the current attention
        """
        penalty = torch.min(coverage_vector, attention_weights)
        return penalty.sum(dim=-1).mean()

class RegularizedAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads, dropout=0.1):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        
        self.q_linear = nn.Linear(hidden_dim, hidden_dim)
        self.k_linear = nn.Linear(hidden_dim, hidden_dim)
        self.v_linear = nn.Linear(hidden_dim, hidden_dim)
        self.out_linear = nn.Linear(hidden_dim, hidden_dim)
        
        self.dropout = nn.Dropout(dropout)
        self.regularizer = AttentionRegularizer()
        self.coverage_vector = None
    
    def forward(self, query, key, value, mask=None, return_regularization=True):
        batch_size = query.shape[0]
        
        # Linear projections and reshape
        Q = self.q_linear(query).view(batch_size, -1, self.num_heads, 
                                    self.head_dim).transpose(1, 2)
        K = self.k_linear(key).view(batch_size, -1, self.num_heads, 
                                   self.head_dim).transpose(1, 2)
        V = self.v_linear(value).view(batch_size, -1, self.num_heads, 
                                    self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(
            torch.tensor(self.head_dim, dtype=torch.float))
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Initialize coverage vector if None
        if self.coverage_vector is None:
            self.coverage_vector = torch.zeros_like(attention_weights)
        
        # Calculate regularization losses
        reg_losses = {
            'entropy': self.regularizer.entropy_loss(attention_weights),
            'sparsity': self.regularizer.sparsity_loss(attention_weights),
            'coverage': self.regularizer.coverage_loss(
                attention_weights, self.coverage_vector
            )
        }
        
        # Update coverage vector
        self.coverage_vector = self.coverage_vector + attention_weights
        
        # Apply attention to values
        context = torch.matmul(attention_weights, V)
        
        # Reshape and apply output projection
        context = context.transpose(1, 2).contiguous().view(
            batch_size, -1, self.hidden_dim)
        output = self.out_linear(context)
        
        if return_regularization:
            return output, attention_weights, reg_losses
        return output, attention_weights

class RegularizedAttentionLoss(nn.Module):
    def __init__(self, base_criterion=nn.CrossEntropyLoss()):
        super().__init__()
        self.base_criterion = base_criterion
        
    def forward(self, outputs, targets, reg_losses):
        base_loss = self.base_criterion(outputs, targets)
        
        # Combine regularization losses
        total_reg_loss = (
            reg_losses['entropy'] * 0.1 +
            reg_losses['sparsity'] * 0.1 +
            reg_losses['coverage'] * 0.1
        )
        
        return base_loss + total_reg_loss

# Example usage
hidden_dim = 512
num_heads = 8
batch_size = 16
seq_length = 20
num_classes = 10

model = RegularizedAttention(hidden_dim, num_heads)
criterion = RegularizedAttentionLoss()

# Sample data
x = torch.randn(batch_size, seq_length, hidden_dim)
targets = torch.randint(0, num_classes, (batch_size,))
mask = torch.ones(batch_size, num_heads, seq_length, seq_length)

# Forward pass
output, attention_weights, reg_losses = model(x, x, x, mask)

# Calculate loss
logits = torch.randn(batch_size, num_classes)  # Simulated classification logits
loss = criterion(logits, targets, reg_losses)

print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
print(f"Regularization losses: {reg_losses}")
print(f"Total loss: {loss.item()}")

🚀 Additional Resources - Made Simple!

  1. “Attention Is All You Need” - Original Transformer paper https://arxiv.org/abs/1706.03762
  2. “On Layer Normalization in the Transformer Architecture” https://arxiv.org/abs/2002.04745
  3. “Self-Attention with Relative Position Representations” https://arxiv.org/abs/1803.02155
  4. “Longformer: The Long-Document Transformer” https://arxiv.org/abs/2004.05150
  5. “Synthesizer: Rethinking Self-Attention in Transformer Models” https://arxiv.org/abs/2005.00743
  6. “Reformer: The Efficient Transformer” https://arxiv.org/abs/2001.04451
  7. “Sparse Transformer: Concentrated Attention Through Constrained Matrix Factorization” https://arxiv.org/abs/1912.11637

🎊 Awesome Work!

You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.

What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.

Keep coding, keep learning, and keep being awesome! 🚀
