🔥 Complete Guide to Self-Attention and Cross-Attention in Transformers with Python: Boost Your Transformer Expertise!
Hey there! Ready to dive into self-attention and cross-attention in transformers with Python? This friendly guide walks you through everything step by step with easy-to-follow examples. Perfect for beginners and pros alike!
🚀 Introduction to Self-Attention & Cross-Attention - Made Simple!
💡 Pro tip: This is one of those techniques that will make you look like a data science wizard!
Self-attention and cross-attention are fundamental mechanisms in transformer architectures. They allow models to weigh the importance of different parts of the input sequence when processing each element. This slideshow will explore these concepts with practical Python examples.
Let me walk you through this step by step! Here’s how we can tackle this:
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value):
        # Project inputs, compute scaled dot-product scores,
        # and return the attention-weighted values
        q, k, v = self.query(query), self.key(key), self.value(value)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.embed_dim ** 0.5)
        attn_weights = torch.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, v)
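As a quick sanity check, here's a minimal sketch that runs the module on random inputs (the shapes are just illustrative assumptions):

attn = Attention(embed_dim=64)
x = torch.rand(1, 10, 64)  # batch of 1, 10 tokens, 64-dim embeddings
out = attn(x, x, x)        # self-attention: query, key, and value are the same tensor
print(f"Attention output shape: {out.shape}")  # torch.Size([1, 10, 64])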
🚀 Self-Attention: The Basics - Made Simple!
🎉 You’re doing great! This concept might seem tricky at first, but you’ve got this!
Self-attention allows a sequence to attend to itself, capturing relationships between different positions within the same sequence. It’s a key component in understanding context and dependencies in sequential data.
Let’s break this down together! Here’s how we can tackle this:
# A self-attention pass, written as a method of the Attention class from the
# previous slide (query, key, and value all come from the same input x)
def self_attention(self, x):
    # x: input tensor of shape (batch_size, seq_len, embed_dim)
    q = self.query(x)
    k = self.key(x)
    v = self.value(x)

    # Compute scaled dot-product attention scores
    scores = torch.matmul(q, k.transpose(-2, -1)) / (self.embed_dim ** 0.5)
    attn_weights = torch.softmax(scores, dim=-1)

    # Apply the attention weights to the values
    return torch.matmul(attn_weights, v)
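To try it out, you can attach this function to the Attention class as a method — a small sketch purely for demonstration:

Attention.self_attention = self_attention  # attach the method for demonstration
attn = Attention(embed_dim=64)
x = torch.rand(1, 10, 64)
print(f"Self-attention output shape: {attn.self_attention(x).shape}")  # torch.Size([1, 10, 64])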
🚀 Visualizing Self-Attention - Made Simple!
✨ Cool fact: Many professional data scientists use this exact approach in their daily work!
Let’s create a simple visualization of self-attention weights to better understand how it works.
Let me walk you through this step by step! Here’s how we can tackle this:
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(attn_weights, tokens):
    plt.figure(figsize=(10, 8))
    sns.heatmap(attn_weights.detach().numpy(), annot=True, cmap='coolwarm',
                xticklabels=tokens, yticklabels=tokens)
    plt.title('Self-Attention Weights')
    plt.xlabel('Key/Value Tokens')
    plt.ylabel('Query Tokens')
    plt.show()
# Example usage
tokens = ['I', 'love', 'natural', 'language', 'processing']
attn_weights = torch.rand(5, 5) # Random weights for demonstration
visualize_attention(attn_weights, tokens)
🚀
🔥 Level up: Once you master this, you’ll be solving problems like a pro! Multi-Head Attention - Made Simple!
Multi-head attention allows the model to jointly attend to information from different representation subspaces, enhancing the model’s ability to capture various aspects of the input.
Don’t worry, this is easier than it looks! Here’s how we can tackle this:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value):
        batch_size = query.shape[0]

        # Linear projections, then split into (batch, heads, seq_len, head_dim)
        q = self.query(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.key(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.value(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # Compute scaled attention scores per head
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn_weights = torch.softmax(scores, dim=-1)

        # Apply attention to the values
        context = torch.matmul(attn_weights, v)

        # Merge the heads back together and apply the output projection
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)
        return self.out(context)
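Here's a minimal sketch running the module on random inputs as a shape check (the sizes are illustrative assumptions):

mha = MultiHeadAttention(embed_dim=64, num_heads=8)
x = torch.rand(2, 10, 64)  # batch of 2, 10 tokens, 64-dim embeddings
out = mha(x, x, x)
print(f"Multi-head attention output shape: {out.shape}")  # torch.Size([2, 10, 64])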
🚀 Cross-Attention: Bridging Sequences - Made Simple!
Cross-attention allows a model to attend to a different sequence than the one being processed. This is crucial in tasks like machine translation, where the model needs to align source and target sequences.
Let’s break this down together! Here’s how we can tackle this:
def cross_attention(query, key, value):
    # query: tensor of shape (batch_size, query_len, embed_dim)
    # key, value: tensors of shape (batch_size, key_len, embed_dim)
    scores = torch.matmul(query, key.transpose(-2, -1)) / (query.size(-1) ** 0.5)
    attn_weights = torch.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, value)
# Example usage
query = torch.rand(1, 5, 64) # 5 query tokens
key = value = torch.rand(1, 7, 64) # 7 key/value tokens
output = cross_attention(query, key, value)
print(f"Cross-attention output shape: {output.shape}")
🚀 Real-Life Example: Text Summarization - Made Simple!
Text summarization is a practical application of self-attention and cross-attention. The model attends to important parts of the input text to generate a concise summary.
Here’s where it gets exciting! Here’s how we can tackle this:
class SummarizationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.self_attention = MultiHeadAttention(embed_dim, num_heads)
        self.cross_attention = MultiHeadAttention(embed_dim, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, source, target):
        src_embed = self.embedding(source)
        tgt_embed = self.embedding(target)

        # Self-attention on the source sequence
        src_attended = self.self_attention(src_embed, src_embed, src_embed)

        # Cross-attention: target queries attend to the encoded source
        cross_attended = self.cross_attention(tgt_embed, src_attended, src_attended)

        # Position-wise feed-forward, then project to the vocabulary
        output = self.feed_forward(cross_attended)
        return self.output(output)
# Example usage
vocab_size, embed_dim, num_heads = 10000, 256, 8
model = SummarizationModel(vocab_size, embed_dim, num_heads)
source = torch.randint(0, vocab_size, (1, 100)) # 100 source tokens
target = torch.randint(0, vocab_size, (1, 20)) # 20 target tokens
output = model(source, target)
print(f"Summarization model output shape: {output.shape}")
🚀 Masked Self-Attention for Language Modeling - Made Simple!
In language modeling tasks, we use masked self-attention to prevent the model from attending to future tokens during training. This ensures the model only uses past and present information to predict the next token.
Ready for some cool stuff? Here’s how we can tackle this:
def masked_self_attention(x):
    # x: (batch_size, seq_len, embed_dim); for simplicity, x itself serves as
    # queries, keys, and values (no learned projections in this sketch)
    seq_len, embed_dim = x.size(1), x.size(2)

    # Upper-triangular mask: position i may not attend to positions after i
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

    scores = torch.matmul(x, x.transpose(-2, -1)) / (embed_dim ** 0.5)
    scores = scores.masked_fill(mask, float('-inf'))
    attn_weights = torch.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, x)
# Example usage
x = torch.rand(1, 10, 64) # Batch size 1, 10 tokens, 64 dimensions
output = masked_self_attention(x)
print(f"Masked self-attention output shape: {output.shape}")
🚀 Positional Encoding - Made Simple!
Transformers don’t inherently understand token positions. Positional encoding adds this information to the input embeddings, allowing the model to leverage sequence order.
Here’s where it gets exciting! Here’s how we can tackle this:
import math

def positional_encoding(seq_len, embed_dim):
    pos = torch.arange(seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2) * -(math.log(10000.0) / embed_dim))

    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(pos * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div_term)   # odd dimensions
    return pe
# Visualize positional encoding
seq_len, embed_dim = 100, 64
pe = positional_encoding(seq_len, embed_dim)
plt.figure(figsize=(10, 8))
sns.heatmap(pe.detach().numpy(), cmap='coolwarm')
plt.title('Positional Encoding')
plt.xlabel('Embedding Dimension')
plt.ylabel('Sequence Position')
plt.show()
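To actually use the encoding, you simply add it to the token embeddings before the first attention layer. Here's a minimal sketch (the embedding layer and vocabulary size are illustrative assumptions):

embed = nn.Embedding(10000, embed_dim)
token_ids = torch.randint(0, 10000, (1, seq_len))     # batch of 1, 100 tokens
x = embed(token_ids) + pe.unsqueeze(0)                # broadcast PE over the batch
print(f"Position-aware embeddings shape: {x.shape}")  # torch.Size([1, 100, 64])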
🚀 Attention Scaling - Made Simple!
Attention scaling is super important for keeping gradients stable, especially when the key dimension is large. Without scaling, the dot products grow with the dimension and push the softmax into regions with vanishingly small gradients, so we divide by the square root of the key (embedding) dimension before applying the softmax.
Let’s break this down together! Here’s how we can tackle this:
def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    attn_weights = torch.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, value)
# Compare scaled vs. unscaled attention
q = k = v = torch.rand(1, 8, 10, 64) # 8 heads, 10 tokens, 64 dimensions
unscaled_output = torch.matmul(torch.softmax(torch.matmul(q, k.transpose(-2, -1)), dim=-1), v)
scaled_output = scaled_dot_product_attention(q, k, v)
print(f"Unscaled max value: {unscaled_output.max().item():.4f}")
print(f"Scaled max value: {scaled_output.max().item():.4f}")
🚀 Real-Life Example: Named Entity Recognition - Made Simple!
Named Entity Recognition (NER) is another practical application of self-attention. The model attends to contextual information to classify words as entities like person names, organizations, or locations.
Don’t worry, this is easier than it looks! Here’s how we can tackle this:
class NERModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.self_attention = MultiHeadAttention(embed_dim, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x = self.self_attention(x, x, x)
        x = self.feed_forward(x)
        return self.classifier(x)
# Example usage
vocab_size, embed_dim, num_heads, num_classes = 10000, 256, 8, 9 # 9 NER classes
model = NERModel(vocab_size, embed_dim, num_heads, num_classes)
input_sequence = torch.randint(0, vocab_size, (1, 50)) # Batch size 1, 50 tokens
output = model(input_sequence)
print(f"NER model output shape: {output.shape}")
🚀 Attention Visualization for NER - Made Simple!
Let’s visualize how the attention mechanism focuses on different parts of the input for Named Entity Recognition.
Don’t worry, this is easier than it looks! Here’s how we can tackle this:
import numpy as np

def visualize_ner_attention(sentence, attention_weights, entities):
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.imshow(attention_weights.detach().numpy(), cmap='YlOrRd')

    ax.set_xticks(np.arange(len(sentence)))
    ax.set_yticks(np.arange(len(sentence)))
    ax.set_xticklabels(sentence)
    ax.set_yticklabels(sentence)
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

    # Annotate each cell with its attention weight
    for i in range(len(sentence)):
        for j in range(len(sentence)):
            ax.text(j, i, f"{attention_weights[i, j].item():.2f}",
                    ha="center", va="center", color="black")

    # Highlight tokens that are part of a named entity
    for i, word in enumerate(sentence):
        if entities[i] != 'O':
            ax.get_xticklabels()[i].set_color('red')
            ax.get_xticklabels()[i].set_fontweight('bold')

    ax.set_title("Attention Weights for Named Entity Recognition")
    fig.tight_layout()
    plt.show()
# Example usage
sentence = ["John", "works", "at", "Google", "in", "New", "York"]
entities = ["B-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
attention_weights = torch.rand(len(sentence), len(sentence))
visualize_ner_attention(sentence, attention_weights, entities)
🚀 Efficient Attention: Linear Attention - Made Simple!
As sequence lengths grow, the quadratic complexity of standard attention becomes problematic. Linear attention offers a more efficient alternative for long sequences.
Let’s break this down together! Here’s how we can tackle this:
def linear_attention(query, key, value):
    # Feature map phi(x) = elu(x) + 1 keeps all values positive
    q = torch.nn.functional.elu(query) + 1
    k = torch.nn.functional.elu(key) + 1
    v = value

    # Aggregate keys and key-value products once, instead of forming
    # the full (seq_len x seq_len) attention matrix
    k_sum = k.sum(dim=-2)
    kv = torch.matmul(k.transpose(-2, -1), v)

    z = 1 / torch.matmul(q, k_sum.unsqueeze(-1))  # per-query normalization term
    return torch.matmul(q, kv) * z
# Compare standard and linear attention
seq_len = 1000
d_model = 64
q = k = v = torch.rand(1, seq_len, d_model)
# The %timeit magics below assume an IPython/Jupyter session
%timeit scaled_dot_product_attention(q, k, v)
%timeit linear_attention(q, k, v)
print("Standard attention shape:", scaled_dot_product_attention(q, k, v).shape)
print("Linear attention shape:", linear_attention(q, k, v).shape)
🚀 Attention in Vision Transformers - Made Simple!
Vision Transformers (ViT) apply the attention mechanism to image patches, demonstrating the versatility of attention beyond text data.
Here’s where it gets exciting! Here’s how we can tackle this:
class VisionTransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )

    def forward(self, x):
        # Pre-norm residual blocks: attention, then MLP
        normed = self.norm1(x)
        x = x + self.attention(normed, normed, normed)
        x = x + self.mlp(self.norm2(x))
        return x
# Example usage
embed_dim, num_heads = 768, 12
vit_block = VisionTransformerBlock(embed_dim, num_heads)
image_patches = torch.rand(1, 196, embed_dim) # 14x14 patches from 224x224 image
output = vit_block(image_patches)
print(f"Vision Transformer block output shape: {output.shape}")
🚀 Attention Mechanisms in Natural Language Processing - Made Simple!
Attention mechanisms have revolutionized natural language processing tasks. Let’s explore a simple sentiment analysis model using self-attention.
Here’s where it gets exciting! Here’s how we can tackle this:
class SentimentAnalysisModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.fc = nn.Linear(embed_dim, 2)  # Binary classification

    def forward(self, x):
        x = self.embedding(x)
        x = self.attention(x, x, x)
        x = x.mean(dim=1)  # Global average pooling over tokens
        return self.fc(x)
# Example usage
vocab_size, embed_dim, num_heads = 10000, 256, 8
model = SentimentAnalysisModel(vocab_size, embed_dim, num_heads)
input_ids = torch.randint(0, vocab_size, (1, 50)) # Batch size 1, 50 tokens
output = model(input_ids)
print(f"Sentiment analysis output shape: {output.shape}")
🚀 Additional Resources - Made Simple!
For those interested in diving deeper into self-attention and cross-attention in transformers, here are some valuable resources:
- “Attention Is All You Need” (Vaswani et al., 2017) - The original transformer paper: ArXiv link: https://arxiv.org/abs/1706.03762
- “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018): ArXiv link: https://arxiv.org/abs/1810.04805
- “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (Dosovitskiy et al., 2020) - Vision Transformers: ArXiv link: https://arxiv.org/abs/2010.11929
These papers provide in-depth explanations of the concepts we’ve covered and their applications in various domains.
🎊 Awesome Work!
You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.
What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.
Keep coding, keep learning, and keep being awesome! 🚀