🔥 Positional Encodings in Transformer LLMs: Secrets That Will 10x Your Understanding!
Hey there! Ready to dive into Positional Encodings In Transformer Llms? This friendly guide will walk you through everything step-by-step with easy-to-follow examples. Perfect for beginners and pros alike!
💡 Pro tip: This is one of those techniques that will make you look like a data science wizard!
🚀 Understanding Positional Encodings Fundamentals - Made Simple!
Positional encodings form the backbone of modern transformer architectures, enabling models to understand sequential information. They inject position-dependent signals into the input embeddings through deterministic sinusoidal (or learned) transformations, preserving word-order information even though tokens are processed in parallel.
Let’s break this down together! Here’s how we can tackle this:
import numpy as np
def positional_encoding(position, d_model):
# Create empty encoding matrix
encoding = np.zeros((position, d_model))
# Calculate positional encodings using sine and cosine
for pos in range(position):
for i in range(0, d_model, 2):
            denominator = np.power(10000, i / d_model)  # i already steps by 2, so this is 10000^(2k/d_model)
encoding[pos, i] = np.sin(pos / denominator)
encoding[pos, i + 1] = np.cos(pos / denominator)
return encoding
# Example usage
sequence_length = 10
embedding_dim = 512
encodings = positional_encoding(sequence_length, embedding_dim)
print(f"Shape of positional encodings: {encodings.shape}")
print("\nFirst position encoding (partial):")
print(encodings[0, :10]) # Show first 10 values
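The matrix above only generates the position signal; in a real transformer it gets added element-wise to the token embeddings before the first attention layer. As a rough sketch of that injection step (the token embeddings here are random stand-ins, not a real vocabulary lookup):

```python
import numpy as np

# Random stand-in for token embeddings of a 10-token sequence
# (a real model would look these up from a learned embedding table)
token_embeddings = np.random.randn(sequence_length, embedding_dim)

# Inject position information by element-wise addition
position_aware_inputs = token_embeddings + encodings

print(f"Position-aware input shape: {position_aware_inputs.shape}")  # (10, 512)
```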
🎉 You’re doing great! This concept might seem tricky at first, but you’ve got this!
🚀 Implementing Absolute Positional Encodings - Made Simple!
Absolute positional encodings assign unique position-dependent values to each token in the sequence. This example shows you how to create learnable position embeddings that can be trained alongside the model parameters.
This next part is really neat! Here’s how we can tackle this:
import torch
import torch.nn as nn
class AbsolutePositionalEncoding(nn.Module):
def __init__(self, max_seq_length, embed_dim):
super().__init__()
self.position_embeddings = nn.Embedding(max_seq_length, embed_dim)
def forward(self, x):
# x shape: (batch_size, seq_length, embed_dim)
batch_size, seq_length, _ = x.size()
positions = torch.arange(seq_length, device=x.device)
positions = positions.unsqueeze(0).expand(batch_size, -1)
position_embeddings = self.position_embeddings(positions)
return x + position_embeddings
# Example usage
seq_length, batch_size, embed_dim = 16, 4, 128
input_embeddings = torch.randn(batch_size, seq_length, embed_dim)
pos_encoder = AbsolutePositionalEncoding(max_seq_length=100, embed_dim=embed_dim)
output = pos_encoder(input_embeddings)
print(f"Output shape: {output.shape}")
✨ Cool fact: Many professional data scientists use this exact approach in their daily work!
🚀 Relative Positional Encodings Implementation - Made Simple!
Relative positional encodings capture relationships between tokens based on their relative distances. This approach generalizes better to varying sequence lengths and produces more flexible position-aware representations.
Let’s make this super clear! Here’s how we can tackle this:
import torch
import torch.nn as nn
import torch.nn.functional as F
class RelativePositionalEncoding(nn.Module):
def __init__(self, dim, max_distance=32):
super().__init__()
self.max_distance = max_distance
self.rel_embeddings = nn.Parameter(torch.randn(2 * max_distance + 1, dim))
def forward(self, q, k):
# q, k shapes: (batch, heads, seq_length, dim)
seq_length = q.size(2)
# Create relative position matrix
positions = torch.arange(seq_length).unsqueeze(0) - torch.arange(seq_length).unsqueeze(1)
positions = positions.clamp(-self.max_distance, self.max_distance) + self.max_distance
rel_pos_emb = self.rel_embeddings[positions]
        # Calculate relative attention scores: scores[b, h, i, j] = q[b, h, i] . r[i, j]
        # (k is kept in the signature for interface symmetry but only q interacts with the relative embeddings)
        return torch.einsum('bhid,ijd->bhij', q, rel_pos_emb)
# Example usage
batch_size, heads, seq_length, dim = 2, 8, 20, 64
queries = torch.randn(batch_size, heads, seq_length, dim)
keys = torch.randn(batch_size, heads, seq_length, dim)
rel_pos = RelativePositionalEncoding(dim=dim)
rel_scores = rel_pos(queries, keys)
print(f"Relative attention scores shape: {rel_scores.shape}")
🔥 Level up: Once you master this, you’ll be solving problems like a pro!
🚀 Sinusoidal Position Encoding Mathematics - Made Simple!
The mathematical foundation of sinusoidal position encodings relies on wavelength variations across dimensions. This example implements the core math in NumPy and plots how different frequency components combine into a unique signature for each position.
Let’s break this down together! Here’s how we can tackle this:
import numpy as np
import matplotlib.pyplot as plt
def sinusoidal_position_encoding(max_seq_length, d_model):
"""
Mathematical implementation showing wavelength progression
"""
position = np.arange(max_seq_length)[:, np.newaxis]
div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
pe = np.zeros((max_seq_length, d_model))
pe[:, 0::2] = np.sin(position * div_term)
pe[:, 1::2] = np.cos(position * div_term)
# Demonstrate wavelength variation
plt.figure(figsize=(12, 4))
for i in range(4):
plt.plot(pe[:, i], label=f'dim_{i}')
plt.legend()
plt.title('Sinusoidal Position Encoding Patterns')
plt.show()
return pe
# Generate and visualize
seq_length, d_model = 100, 64
encodings = sinusoidal_position_encoding(seq_length, d_model)
print(f"Position encoding matrix shape: {encodings.shape}")
🚀 Transformer Position-Aware Self-Attention - Made Simple!
Position-aware self-attention integrates positional information directly into the attention mechanism. This example shows how positional encodings influence token relationships during the attention computation phase.
Let me walk you through this step by step! Here’s how we can tackle this:
import torch
import torch.nn as nn
class PositionAwareSelfAttention(nn.Module):
def __init__(self, embed_dim, num_heads):
super().__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
self.pos_embedding = nn.Parameter(torch.randn(1, 512, embed_dim))
self.projection = nn.Linear(embed_dim, embed_dim)
def forward(self, x, mask=None):
batch_size, seq_length, _ = x.shape
# Add positional embeddings
positions = self.pos_embedding[:, :seq_length, :]
x = x + positions
# Transform input into Q, K, V
qkv = self.qkv(x)
qkv = qkv.reshape(batch_size, seq_length, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[2]
# Compute attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)  # scale by sqrt(d_k) without needing numpy
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attention = torch.softmax(scores, dim=-1)
context = torch.matmul(attention, v)
# Reshape and project
context = context.permute(0, 2, 1, 3).reshape(batch_size, seq_length, -1)
return self.projection(context)
# Example usage
batch_size, seq_length, embed_dim = 8, 32, 256
x = torch.randn(batch_size, seq_length, embed_dim)
attention = PositionAwareSelfAttention(embed_dim, num_heads=8)
output = attention(x)
print(f"Output shape: {output.shape}")
🚀 Custom Learned Positional Encodings - Made Simple!
This example showcases learned positional encodings that adapt to the specific characteristics of the training data. Instead of a fixed formula, the model learns its position representations through backpropagation.
Let’s make this super clear! Here’s how we can tackle this:
import torch
import torch.nn as nn
class LearnedPositionalEncoding(nn.Module):
def __init__(self, max_seq_length, embed_dim, dropout=0.1):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
# Create learnable position embeddings
self.pos_embeddings = nn.Parameter(
torch.randn(1, max_seq_length, embed_dim)
)
# Position-dependent scaling factors
self.scale_factors = nn.Parameter(
torch.ones(1, max_seq_length, 1)
)
self.layer_norm = nn.LayerNorm(embed_dim)
def forward(self, x):
seq_length = x.size(1)
# Apply scaled positional embeddings
positions = self.pos_embeddings[:, :seq_length, :] * self.scale_factors[:, :seq_length, :]
x = x + positions
# Normalize and apply dropout
x = self.layer_norm(x)
return self.dropout(x)
# Example usage
max_length, batch_size, dim = 50, 16, 256
input_tensor = torch.randn(batch_size, max_length, dim)
pos_encoder = LearnedPositionalEncoding(max_length, dim)
encoded = pos_encoder(input_tensor)
print(f"Encoded output shape: {encoded.shape}")
# Demonstrate learning process
optimizer = torch.optim.Adam(pos_encoder.parameters())
criterion = nn.MSELoss()
# Simple training loop example
for _ in range(5):
encoded = pos_encoder(input_tensor)
loss = criterion(encoded, torch.randn_like(encoded)) # Dummy target
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"Training loss: {loss.item():.4f}")
🚀 Positional Encoding Visualization Tools - Made Simple!
The visualization module provides complete tools for analyzing and understanding positional encoding patterns. This example creates detailed visualizations of encoding matrices and attention patterns for debugging and analysis.
Let’s break this down together! Here’s how we can tackle this:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
class PositionalEncodingVisualizer:
def __init__(self):
self.fig_size = (12, 8)
def visualize_encodings(self, encodings, title="Positional Encoding Heatmap"):
plt.figure(figsize=self.fig_size)
sns.heatmap(encodings, cmap='RdBu', center=0)
plt.title(title)
plt.xlabel('Encoding Dimension')
plt.ylabel('Position')
plt.show()
def compare_encoding_methods(self, seq_length=50, d_model=128):
# Generate different types of encodings
sine_cos = self._generate_sinusoidal(seq_length, d_model)
learned = self._generate_learned(seq_length, d_model)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
sns.heatmap(sine_cos[:20, :20], ax=ax1, cmap='RdBu', center=0)
sns.heatmap(learned[:20, :20], ax=ax2, cmap='RdBu', center=0)
ax1.set_title('Sinusoidal Encodings')
ax2.set_title('Learned Encodings')
plt.tight_layout()
plt.show()
def _generate_sinusoidal(self, seq_length, d_model):
position = np.arange(seq_length)[:, np.newaxis]
div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
pe = np.zeros((seq_length, d_model))
pe[:, 0::2] = np.sin(position * div_term)
pe[:, 1::2] = np.cos(position * div_term)
return pe
def _generate_learned(self, seq_length, d_model):
return np.random.randn(seq_length, d_model)
# Example usage
visualizer = PositionalEncodingVisualizer()
# Generate and visualize encodings
seq_length, d_model = 100, 128
sine_cos_encodings = visualizer._generate_sinusoidal(seq_length, d_model)
visualizer.visualize_encodings(sine_cos_encodings, "Sinusoidal Positional Encodings")
# Compare different encoding methods
visualizer.compare_encoding_methods()
🚀 Real-world Application: Machine Translation - Made Simple!
Implementation of a translation system demonstrating how positional encodings enhance sequence-to-sequence translation tasks. This example shows preprocessing, model implementation, and translation results.
This next part is really neat! Here’s how we can tackle this:
import math
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
class TranslatorWithPositionalEncoding(nn.Module):
def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8):
super().__init__()
self.d_model = d_model
# Embeddings and positional encodings
self.src_embed = nn.Embedding(src_vocab_size, d_model)
self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        # Assumes a sinusoidal PositionalEncoding module; a minimal sketch follows this example
        self.pos_encoder = PositionalEncoding(d_model, max_seq_length=5000)
# Transformer layers
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
self.output_layer = nn.Linear(d_model, tgt_vocab_size)
    def create_mask(self, src, tgt):
        # No padding in this toy example, so the encoder needs no mask;
        # the decoder gets a boolean causal mask (True = position is blocked)
        src_mask = None
        tgt_mask = torch.triu(
            torch.ones((tgt.shape[1], tgt.shape[1]), dtype=torch.bool), diagonal=1
        )
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.create_mask(src, tgt)

        # Apply embeddings (scaled by sqrt(d_model)) and positional encodings
        src = self.src_embed(src) * math.sqrt(self.d_model)
        tgt = self.tgt_embed(tgt) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        tgt = self.pos_encoder(tgt)

        # Encode the source, then decode the target against the encoder memory
        memory = self.transformer_encoder(src, mask=src_mask)
        output = self.transformer_decoder(tgt, memory, tgt_mask=tgt_mask)
return self.output_layer(output)
# Example usage
src_vocab_size, tgt_vocab_size = 5000, 5000
model = TranslatorWithPositionalEncoding(src_vocab_size, tgt_vocab_size)
# Dummy translation data
src_tokens = torch.randint(0, src_vocab_size, (8, 32))
tgt_tokens = torch.randint(0, tgt_vocab_size, (8, 32))
# Forward pass
output = model(src_tokens, tgt_tokens)
print(f"Translation output shape: {output.shape}")
🚀 Attention Visualization with Position Information - Made Simple!
This example creates detailed visualizations of attention patterns, showing how positional information influences token relationships in transformer models. The visualization helps understand position-aware attention mechanisms.
Let’s make this super clear! Here’s how we can tackle this:
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
class AttentionVisualizer:
def __init__(self, model_dim=512, num_heads=8):
self.model_dim = model_dim
self.num_heads = num_heads
def compute_attention_patterns(self, query, key, mask=None):
d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) / np.sqrt(d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
return torch.softmax(scores, dim=-1)
def plot_attention_heads(self, attention_weights, tokens=None):
fig = plt.figure(figsize=(20, 10))
for head in range(min(self.num_heads, 4)): # Plot first 4 heads
ax = fig.add_subplot(2, 2, head + 1)
# Plot attention weights
sns.heatmap(attention_weights[0, head].detach().numpy(),
xticklabels=tokens if tokens else 'auto',
yticklabels=tokens if tokens else 'auto',
cmap='viridis',
ax=ax)
ax.set_title(f'Head {head + 1} Attention Pattern')
plt.tight_layout()
plt.show()
def visualize_position_influence(self, seq_length=20):
# Generate position-aware attention pattern
positions = torch.arange(seq_length).unsqueeze(1)
rel_positions = positions - positions.T
# Create position-based attention bias
position_bias = torch.exp(-torch.abs(rel_positions).float() / 5.0)
plt.figure(figsize=(10, 8))
sns.heatmap(position_bias.numpy(),
cmap='RdBu_r',
center=0,
xticklabels=range(seq_length),
yticklabels=range(seq_length))
plt.title('Position-based Attention Bias')
plt.show()
# Example usage
visualizer = AttentionVisualizer()
# Generate sample attention patterns
query = torch.randn(1, 8, 20, 64) # (batch, heads, seq_length, dim)
key = torch.randn(1, 8, 20, 64)
attention_weights = visualizer.compute_attention_patterns(query, key)
# Visualize attention patterns
sample_tokens = [f'Token_{i}' for i in range(20)]
visualizer.plot_attention_heads(attention_weights, sample_tokens)
# Show position influence
visualizer.visualize_position_influence()
🚀 Position-Aware Text Generation Model - Made Simple!
This example shows you how positional encodings enhance text generation capabilities. The model uses position information to maintain coherence and context awareness during generation.
Let’s make this super clear! Here’s how we can tackle this:
import torch
import torch.nn as nn
import torch.nn.functional as F
class PositionAwareGenerator(nn.Module):
def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
super().__init__()
self.d_model = d_model
# Token and position embeddings
self.token_embedding = nn.Embedding(vocab_size, d_model)
self.pos_embedding = nn.Parameter(torch.randn(1, 1024, d_model))
# Transformer decoder
decoder_layer = nn.TransformerDecoderLayer(
d_model=d_model,
nhead=nhead,
dim_feedforward=4*d_model
)
self.transformer = nn.TransformerDecoder(decoder_layer, num_layers)
self.output_layer = nn.Linear(d_model, vocab_size)
self.dropout = nn.Dropout(0.1)
def generate_square_subsequent_mask(self, sz):
mask = torch.triu(torch.ones(sz, sz), diagonal=1)
mask = mask.masked_fill(mask==1, float('-inf'))
return mask
def forward(self, x, memory=None):
# Apply embeddings and positional encoding
seq_len = x.size(1)
        x = self.token_embedding(x) * (self.d_model ** 0.5)  # scale embeddings by sqrt(d_model) without needing numpy
x = x + self.pos_embedding[:, :seq_len, :]
x = self.dropout(x)
# Create causal mask
mask = self.generate_square_subsequent_mask(seq_len).to(x.device)
# Transform and generate
if memory is None:
memory = torch.zeros_like(x)
output = self.transformer(x.transpose(0, 1), memory.transpose(0, 1), tgt_mask=mask)
return self.output_layer(output.transpose(0, 1))
def generate(self, start_tokens, max_length=50, temperature=1.0):
self.eval()
current_sequence = start_tokens
with torch.no_grad():
for _ in range(max_length):
# Generate next token probabilities
logits = self.forward(current_sequence)
next_token_logits = logits[:, -1, :] / temperature
next_token = torch.multinomial(F.softmax(next_token_logits, dim=-1), 1)
# Append to sequence
current_sequence = torch.cat([current_sequence, next_token], dim=1)
# Check for end of sequence token
if next_token.item() == 2: # Assuming 2 is EOS token
break
return current_sequence
# Example usage
vocab_size = 10000
model = PositionAwareGenerator(vocab_size)
# Generate text
start_sequence = torch.tensor([[1, 345, 678]]) # Example start tokens
generated = model.generate(start_sequence)
print(f"Generated sequence shape: {generated.shape}")
🚀 Performance Analysis and Benchmarking - Made Simple!
This example provides tools for measuring and comparing the effectiveness of different positional encoding schemes, including metrics for sequence modeling tasks and attention pattern analysis.
Here’s a handy trick you’ll love! Here’s how we can tackle this:
import torch
import time
import numpy as np
from typing import Dict, List
from dataclasses import dataclass
@dataclass
class EncodingBenchmark:
encoding_time: float
memory_usage: int
attention_quality: float
sequence_coherence: float
class PositionalEncodingBenchmark:
def __init__(self, max_seq_length: int, d_model: int):
self.max_seq_length = max_seq_length
self.d_model = d_model
def benchmark_encoding(self, encoding_fn, num_trials=100):
total_time = 0
max_memory = 0
for _ in range(num_trials):
torch.cuda.empty_cache()
start_mem = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
start_time = time.time()
encoded = encoding_fn(self.max_seq_length, self.d_model)
end_time = time.time()
end_mem = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
total_time += (end_time - start_time)
max_memory = max(max_memory, end_mem - start_mem)
# Calculate attention quality metric
attention_quality = self._compute_attention_quality(encoded)
# Calculate sequence coherence
sequence_coherence = self._measure_sequence_coherence(encoded)
return EncodingBenchmark(
encoding_time=total_time / num_trials,
memory_usage=max_memory,
attention_quality=attention_quality,
sequence_coherence=sequence_coherence
)
def _compute_attention_quality(self, encoded_tensor):
# Compute cosine similarity between positions
encoded_norm = torch.nn.functional.normalize(encoded_tensor, dim=-1)
similarity = torch.matmul(encoded_norm, encoded_norm.transpose(-2, -1))
# Calculate average attention quality metric
diagonal_mask = torch.eye(similarity.size(0))
off_diagonal = similarity * (1 - diagonal_mask)
return float(off_diagonal.abs().mean())
def _measure_sequence_coherence(self, encoded_tensor):
# Measure how well positions are distinguished
positions = torch.arange(encoded_tensor.size(0))
position_diffs = positions.unsqueeze(1) - positions.unsqueeze(0)
# Calculate correlation with position differences
encoded_flat = encoded_tensor.view(encoded_tensor.size(0), -1)
encoding_diffs = torch.cdist(encoded_flat, encoded_flat)
correlation = np.corrcoef(
position_diffs.abs().flatten().numpy(),
encoding_diffs.flatten().numpy()
)[0, 1]
return float(correlation)
# Example usage
def run_benchmarks():
seq_length, d_model = 1024, 512
benchmark = PositionalEncodingBenchmark(seq_length, d_model)
# Define encoding methods for comparison
encodings = {
        'sinusoidal': lambda s, d: torch.tensor([
            [np.sin(pos / np.power(10000, (j // 2) * 2 / d)) if j % 2 == 0
             else np.cos(pos / np.power(10000, (j // 2) * 2 / d))
             for j in range(d)]
            for pos in range(s)
        ], dtype=torch.float32),
        'learned': lambda s, d: torch.randn(s, d),
        # Toy stand-in for a relative scheme: positions beyond a max distance share bucketed embeddings
        'relative': lambda s, d: torch.randn(2 * 32 + 1, d)[torch.arange(s).clamp(max=32) + 32]
}
results = {}
for name, enc_fn in encodings.items():
results[name] = benchmark.benchmark_encoding(enc_fn)
print(f"\nResults for {name} encoding:")
print(f"Average encoding time: {results[name].encoding_time:.6f} seconds")
print(f"Memory usage: {results[name].memory_usage} bytes")
print(f"Attention quality: {results[name].attention_quality:.4f}")
print(f"Sequence coherence: {results[name].sequence_coherence:.4f}")
# Run benchmarks
run_benchmarks()
🚀 Dynamic Position Adaptation System - Made Simple!
This example showcases a dynamic positional encoding system that adapts to varying sequence lengths and content types, demonstrating cool position-aware processing capabilities.
Let’s make this super clear! Here’s how we can tackle this:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class DynamicPositionalEncoder(nn.Module):
def __init__(self, d_model: int, max_seq_length: int = 5000):
super().__init__()
self.d_model = d_model
self.max_seq_length = max_seq_length
# Learnable components
self.content_scale = nn.Parameter(torch.ones(1, 1, d_model))
self.position_scale = nn.Parameter(torch.ones(1, 1, d_model))
# Position embedding generators
self.pos_embedding_generator = nn.Sequential(
nn.Linear(d_model, d_model * 2),
nn.GELU(),
nn.Linear(d_model * 2, d_model)
)
# Adaptive components
self.length_factor = nn.Parameter(torch.ones(1))
self.content_factor = nn.Parameter(torch.ones(1))
def generate_position_codes(self, seq_length: int):
position = torch.arange(seq_length, dtype=torch.float32)
omega = torch.exp(
torch.arange(0, self.d_model, 2) *
-(np.log(10000.0) / self.d_model)
)
out = torch.zeros(seq_length, self.d_model)
out[:, 0::2] = torch.sin(position.unsqueeze(1) * omega)
out[:, 1::2] = torch.cos(position.unsqueeze(1) * omega)
return out
def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
batch_size, seq_length, _ = x.shape
# Generate base positional codes
pos_codes = self.generate_position_codes(seq_length).to(x.device)
# Apply content-based scaling
content_importance = torch.sigmoid(
self.pos_embedding_generator(x) * self.content_factor
)
# Combine with input taking sequence length into account
length_scale = torch.sigmoid(seq_length / self.max_seq_length * self.length_factor)
position_embedding = pos_codes.unsqueeze(0) * self.position_scale
content_embedding = x * self.content_scale
output = content_embedding + position_embedding * content_importance * length_scale
if mask is not None:
output = output.masked_fill(mask.unsqueeze(-1) == 0, 0)
return output
# Example usage and testing
def test_dynamic_encoder():
d_model = 256
encoder = DynamicPositionalEncoder(d_model)
# Test with different sequence lengths
test_lengths = [10, 50, 100, 500]
for length in test_lengths:
x = torch.randn(2, length, d_model)
encoded = encoder(x)
print(f"\nTesting sequence length: {length}")
print(f"Input shape: {x.shape}")
print(f"Output shape: {encoded.shape}")
# Verify position sensitivity
        # torch.corrcoef expects a single 2D tensor of observations, so stack the two series
        pos_correlation = torch.corrcoef(torch.stack([
            encoded[0].flatten(),
            torch.arange(length).repeat_interleave(d_model).float()
        ]))
print(f"Position correlation: {pos_correlation[0,1]:.4f}")
# Run tests
test_dynamic_encoder()
🚀 Advanced Position-Aware Attention Mechanism - Made Simple!
This example introduces a smart attention mechanism that dynamically adjusts to both local and global positional relationships, demonstrating enhanced context awareness in sequence processing.
Let me walk you through this step by step! Here’s how we can tackle this:
import torch
import torch.nn as nn
import math
from typing import Optional, Tuple
class AdvancedPositionAwareAttention(nn.Module):
def __init__(self, dim: int, num_heads: int = 8, window_size: int = 16):
super().__init__()
self.dim = dim
self.num_heads = num_heads
self.window_size = window_size
self.head_dim = dim // num_heads
self.scale = self.head_dim ** -0.5
# Multi-scale position embeddings
self.local_pos_embedding = nn.Parameter(
torch.randn(2 * window_size - 1, self.head_dim)
)
self.global_pos_embedding = nn.Parameter(
torch.randn(1024, self.head_dim)
)
# Attention projections
self.qkv = nn.Linear(dim, dim * 3, bias=False)
self.proj = nn.Linear(dim, dim)
# Dynamic position-aware components
self.pos_scale = nn.Parameter(torch.ones(num_heads, 1, 1))
self.content_scale = nn.Parameter(torch.ones(num_heads, 1, 1))
    def get_relative_positions(self, seq_length: int) -> torch.Tensor:
        positions = torch.arange(seq_length)
        # Raw signed offsets; clamping and shifting to table indices happens once in forward()
        relative_positions = positions.unsqueeze(1) - positions.unsqueeze(0)
        return relative_positions
def forward(
self,
x: torch.Tensor,
mask: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
batch_size, seq_length, _ = x.shape
# Generate QKV representations
qkv = self.qkv(x).chunk(3, dim=-1)
q, k, v = map(
lambda t: t.reshape(batch_size, seq_length, self.num_heads, self.head_dim)
.transpose(1, 2),
qkv
)
# Compute attention scores
content_scores = (q @ k.transpose(-2, -1)) * self.scale
# Add positional bias
relative_positions = self.get_relative_positions(seq_length).to(x.device)
local_pos_bias = self.local_pos_embedding[
relative_positions.clamp(-self.window_size + 1, self.window_size - 1)
+ self.window_size - 1
]
# Combine local and global position information
position_scores = (
(q.unsqueeze(-2) @ local_pos_bias.transpose(-2, -1))
.squeeze(-2)
* self.pos_scale
)
# Final attention scores
attention_scores = (
content_scores * self.content_scale
+ position_scores
)
if mask is not None:
attention_scores = attention_scores.masked_fill(
mask.unsqueeze(1).unsqueeze(2) == 0,
float('-inf')
)
attention_probs = torch.softmax(attention_scores, dim=-1)
# Apply attention to values
output = (attention_probs @ v).transpose(1, 2).reshape(
batch_size, seq_length, self.dim
)
return self.proj(output), attention_probs
# Example usage and testing
def test_advanced_attention():
batch_size = 4
seq_length = 32
dim = 256
attention = AdvancedPositionAwareAttention(dim)
x = torch.randn(batch_size, seq_length, dim)
mask = torch.ones(batch_size, seq_length)
output, attention_weights = attention(x, mask)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
# Analyze position sensitivity
avg_attention_by_distance = []
for dist in range(seq_length):
diag_indices = torch.arange(seq_length - dist)
attention_at_distance = attention_weights[0, 0, diag_indices, diag_indices + dist]
avg_attention_by_distance.append(attention_at_distance.mean().item())
print("\nAttention decay with distance:")
for dist, avg_attn in enumerate(avg_attention_by_distance[:5]):
print(f"Distance {dist}: {avg_attn:.4f}")
# Run tests
test_advanced_attention()
🚀 Additional Resources - Made Simple!
- “Attention Is All You Need” - Original Transformer Paper https://arxiv.org/abs/1706.03762
- “On Position Embeddings in BERT” https://arxiv.org/abs/2010.15099
- “RoFormer: Enhanced Transformer with Rotary Position Embedding” https://arxiv.org/abs/2104.09864
- “Position Information in Transformers: An Overview” https://arxiv.org/abs/2102.11090
- “RealFormer: Transformer Likes Residual Attention” https://arxiv.org/abs/2012.11747
🎊 Awesome Work!
You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.
What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.
Keep coding, keep learning, and keep being awesome! 🚀