From LLMs to Agentic RAG Architecture
A practical architecture guide to moving from standalone LLMs to RAG and agentic RAG systems with retrieval, planning, tool use, evaluation, and governance controls.
Introduction to Large Language Models
Large Language Models are foundation models trained on large text corpora to understand, transform, and generate language. They are useful for summarization, classification, extraction, reasoning assistance, code generation, and conversational interfaces.
In production systems, an LLM by itself is rarely enough. The model needs grounding, retrieval, guardrails, evaluation, observability, and integration with deterministic application logic.
The following example shows a basic local text-generation pattern using a pretrained model.
import transformers
# Load a pre-trained LLM
model_name = "gpt2"
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
# Generate text
input_text = "The future of AI is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Training Large Language Models
LLMs are commonly trained with self-supervised objectives such as next-token prediction across large-scale text datasets. This allows the model to learn statistical language patterns, representations, and task behavior that can later be adapted through prompting, fine-tuning, instruction tuning, or retrieval augmentation.
For most enterprise teams, training a foundation model from scratch is not practical. The more realistic pattern is to select a base model, adapt it for domain behavior, and wrap it with retrieval, evaluation, and governance controls.
The following example shows a simplified fine-tuning setup.
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Prepare dataset
# Note: TextDataset is deprecated in recent transformers releases in favor of the datasets library
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text/file.txt",
    block_size=128
)
# mlm=False selects the causal (next-token) language modeling objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Start training
trainer.train()
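The next-token objective that this kind of training minimizes can be worked through by hand on a tiny example. The sketch below is illustrative only: the three-token vocabulary, the logit values, and the function name `next_token_loss` are all assumptions, not part of any library.

```python
import math

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy between predicted distributions and the actual next tokens."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        # Softmax with max-subtraction for numerical stability
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        prob_of_target = exps[target] / sum(exps)
        total += -math.log(prob_of_target)
    return total / len(target_ids)

# Toy values: a vocabulary of 3 tokens and 2 positions
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = next_token_loss(logits, targets)
print(f"mean loss: {loss:.4f}")
```

The second position has uniform logits, so its loss is ln(3): the model is no better than guessing among three tokens. Training pushes the probability of the observed next token up, which drives this number down.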
Limitations of Standalone LLMs
Standalone LLMs are powerful, but they have important production limitations: stale knowledge, no native access to private enterprise data, weak source attribution, hallucination risk, inconsistent output structure, and limited auditability.
These limitations are why production LLM applications usually require retrieval, tool access, structured outputs, monitoring, and human review for high-risk workflows.
The following example illustrates the risk of asking a model for current or changing facts without grounding.
import openai

# Legacy Completions interface (openai<1.0). The "text-davinci-002" model has since
# been retired, so substitute a model that is available to your account.
openai.api_key = 'your-api-key'

def query_llm(prompt):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=100
    )
    return response.choices[0].text.strip()

# Example of a limitation: outdated information
prompt = "What is the current population of New York City?"
result = query_llm(prompt)
print(f"LLM Response: {result}")
print("Note: This information might be outdated or inaccurate.")
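Inconsistent output structure, one of the limitations listed above, is usually handled by validating the model's text before the application uses it. The following is a minimal sketch that assumes the model was asked to reply with a JSON object; the schema in `REQUIRED_FIELDS` and the function name `parse_model_output` are hypothetical.

```python
import json

# Hypothetical schema: field name -> expected Python type
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def parse_model_output(raw_text):
    """Return (data, error). Exactly one of the two is None."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return None, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return None, f"wrong type for field: {field}"
    return data, None

good, good_error = parse_model_output('{"answer": "Paris", "confidence": 0.9}')
print(good, good_error)
bad, bad_error = parse_model_output("The answer is Paris.")
print(bad, bad_error)
```

When validation fails, the application can retry, fall back to a default, or route the request to human review instead of passing free text downstream.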
Introduction to Retrieval-Augmented Generation
Retrieval-Augmented Generation combines an LLM with an external retrieval layer. Instead of relying only on model memory, the application retrieves relevant documents, chunks, metadata, or records and passes that evidence into the prompt before generation.
RAG is a practical architecture pattern for grounding model responses in enterprise knowledge, reducing unsupported claims, improving freshness, and enabling source-aware answers.
The following example demonstrates a basic RAG-style generation flow.
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration
from datasets import load_dataset

# Load tokenizer, retriever, and model.
# use_dummy_dataset=True loads a small sample index instead of the full Wikipedia index.
model_name = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(model_name)
retriever = RagRetriever.from_pretrained(model_name, index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained(model_name, retriever=retriever)

# Load dataset
dataset = load_dataset("nq_open", split="train[:100]")

# Function to generate answer
def generate_answer(question):
    inputs = tokenizer(question, return_tensors="pt")
    # The model calls the retriever internally before generating
    outputs = model.generate(input_ids=inputs["input_ids"])
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
question = dataset[0]["question"]
answer = generate_answer(question)
print(f"Question: {question}")
print(f"Generated Answer: {answer}")
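The step of passing retrieved evidence into the prompt can also be written out explicitly, which is how most application-level RAG pipelines work. This is a minimal sketch under stated assumptions: the passage dictionaries, their `id` and `text` keys, the sample passages, and the function name `build_grounded_prompt` are all made up for illustration.

```python
def build_grounded_prompt(question, passages, max_chars=1500):
    """Assemble a prompt from retrieved passages, each tagged with a source id."""
    lines = []
    used = 0
    for passage in passages:
        entry = f"[{passage['id']}] {passage['text']}"
        if used + len(entry) > max_chars:
            break  # stay within the context budget
        lines.append(entry)
        used += len(entry)
    context = "\n".join(lines)
    return (
        "Answer using only the sources below and cite source ids in brackets.\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Illustrative passages, as a retriever might return them
passages = [
    {"id": "doc-1", "text": "The returns window is 30 days from delivery."},
    {"id": "doc-2", "text": "Refunds are issued to the original payment method."},
]
prompt = build_grounded_prompt("How long do I have to return an item?", passages)
print(prompt)
```

Tagging each passage with an id is what makes source-aware answers possible later: the generator can cite `[doc-1]`, and the application can map that id back to a document.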
Components of RAG Systems
A RAG system usually includes ingestion, chunking, embedding, indexing, retrieval, reranking, context assembly, generation, citation handling, and evaluation. The retriever optimizes recall, the reranker improves precision, and the generator converts grounded context into a user-facing response.
The following example shows a simplified retriever and generator abstraction.
import faiss
import numpy as np

# Simplified RAG components.
# question_encoder and context_encoder are callables that map a list of strings
# to a 2D array of embeddings (for example, wrappers around DPRQuestionEncoder
# and DPRContextEncoder together with their tokenizers).
class Retriever:
    def __init__(self, question_encoder, context_encoder, passages):
        self.question_encoder = question_encoder
        self.context_encoder = context_encoder
        self.passages = passages
        self.index = self._build_index()

    def _build_index(self):
        # FAISS expects contiguous float32 arrays
        embeddings = np.ascontiguousarray(self.context_encoder(self.passages), dtype=np.float32)
        index = faiss.IndexFlatIP(embeddings.shape[1])
        index.add(embeddings)
        return index

    def retrieve(self, query, k=5):
        # Queries go through the question encoder, passages through the context encoder
        query_embedding = np.ascontiguousarray(self.question_encoder([query]), dtype=np.float32)
        k = min(k, len(self.passages))
        _, indices = self.index.search(query_embedding, k)
        return [self.passages[i] for i in indices[0]]

class Generator:
    def __init__(self, model):
        self.model = model

    def generate(self, query, retrieved_passages):
        context = " ".join(retrieved_passages)
        input_text = f"Query: {query}\nContext: {context}\nAnswer:"
        return self.model(input_text)

# Usage example (pseudo-code)
# retriever = Retriever(question_encoder, context_encoder, passages)
# generator = Generator(llm_model)
# query = "What is the capital of France?"
# retrieved_passages = retriever.retrieve(query)
# answer = generator.generate(query, retrieved_passages)
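Chunking, which happens before embedding and indexing, is not covered by the classes above. A common approach is to split on a fixed window with some overlap so that a sentence cut at a boundary still appears whole in one chunk. The sketch below splits on whitespace-separated words; the function name `chunk_text` and the default sizes are assumptions, and production systems often chunk by tokens or by document structure instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Ten placeholder words, small sizes so the overlap is easy to see
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=2):
    print(chunk)
```

Larger chunks carry more context per retrieved passage but dilute the embedding; smaller chunks retrieve more precisely but may lose surrounding meaning. The right setting is usually found by measuring retrieval quality on a test set.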
Implementing RAG with Hugging Face Transformers
Hugging Face provides models and utilities that can be used to experiment with retrieval-augmented generation. For production systems, this baseline should be extended with domain indexing, retrieval evaluation, latency controls, and source governance.
The following example uses a Hugging Face RAG model for question answering.
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration
from datasets import load_dataset

# Load RAG components (the rag-token checkpoint pairs with RagTokenForGeneration)
model_name = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(model_name)
retriever = RagRetriever.from_pretrained(model_name, index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained(model_name, retriever=retriever)

# Load a sample dataset
dataset = load_dataset("nq_open", split="train[:5]")

# Function to generate answer using RAG
def generate_rag_answer(question):
    input_dict = tokenizer(question, return_tensors="pt")
    # Pass only input_ids; the model retrieves supporting passages internally
    generated = model.generate(input_ids=input_dict["input_ids"])
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Example usage
for sample in dataset:
    question = sample["question"]
    answer = generate_rag_answer(question)
    print(f"Question: {question}")
    print(f"RAG Answer: {answer}\n")
Advantages of RAG over Standalone LLMs
RAG systems can provide fresher information, reduce hallucination risk, support source attribution, and make enterprise knowledge available to the model without retraining. RAG also gives engineering teams more control over what information is eligible for generation.
The following example contrasts a standalone LLM response with a retrieval-grounded response.
import re
from datetime import datetime

class TraditionalLLM:
    def generate(self, prompt):
        return "Generated response based on training data up to 2022."

class RAGSystem:
    def __init__(self):
        self.knowledge_base = {
            "AI advancements": "Latest AI models achieve human-level performance in various tasks.",
            "Climate change": "Global temperature rise of 1.1°C observed since pre-industrial times.",
            "COVID-19": "New variants continue to emerge, highlighting the importance of vaccination."
        }

    def retrieve(self, query):
        # Pick the entry whose topic shares the most words with the query
        query_words = set(re.findall(r"[\w-]+", query.lower()))

        def overlap(topic):
            return len(query_words & set(re.findall(r"[\w-]+", topic.lower())))

        best_topic = max(self.knowledge_base, key=overlap)
        if overlap(best_topic) == 0:
            return "no matching entry was found in the knowledge base."
        return self.knowledge_base[best_topic]

    def generate(self, prompt):
        retrieved_info = self.retrieve(prompt)
        current_date = datetime.now().strftime("%Y-%m-%d")
        return f"As of {current_date}, {retrieved_info}"

# Compare traditional LLM and RAG system
llm = TraditionalLLM()
rag = RAGSystem()
prompt = "Tell me about recent developments in AI."
print(f"Traditional LLM: {llm.generate(prompt)}")
print(f"RAG System: {rag.generate(prompt)}")
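The control over what information is eligible for generation typically comes from filtering on document metadata before retrieval results reach the prompt. The following sketch assumes each document carries a review status and a set of roles allowed to read it; the `Document` fields, the role names, and the sample corpus are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    allowed_roles: frozenset
    status: str  # for example "approved" or "draft"

def eligible_documents(documents, user_role):
    """Keep only approved documents that the caller's role may read."""
    return [
        doc for doc in documents
        if doc.status == "approved" and user_role in doc.allowed_roles
    ]

# Illustrative corpus
corpus = [
    Document("Public pricing overview.", frozenset({"employee", "contractor"}), "approved"),
    Document("Internal salary bands.", frozenset({"hr"}), "approved"),
    Document("Unreviewed policy draft.", frozenset({"employee"}), "draft"),
]
visible = eligible_documents(corpus, "employee")
print([doc.text for doc in visible])
```

Applying the filter before ranking, rather than after generation, means restricted or unreviewed text never enters the model's context in the first place.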
Fine-Tuning RAG Models
Fine-tuning can adapt model behavior for domain language, answer style, and task-specific patterns. However, fine-tuning should not be treated as a replacement for retrieval when the issue is factual freshness, private knowledge, or source grounding.
The following example shows a simplified RAG fine-tuning workflow.
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained RAG model (the rag-token checkpoint pairs with RagTokenForGeneration)
model_name = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(model_name)
retriever = RagRetriever.from_pretrained(model_name, index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained(model_name, retriever=retriever)

# Prepare dataset (example using a QA dataset)
dataset = load_dataset("squad", split="train[:1000]")

def preprocess_function(examples):
    # With batched=True, each field is a list with one entry per example
    first_answers = [answer["text"][0] for answer in examples["answers"]]
    # Questions use the question-encoder tokenizer, targets use the generator tokenizer
    inputs = tokenizer.question_encoder(examples["question"], truncation=True, padding="max_length")
    outputs = tokenizer.generator(first_answers, truncation=True, padding="max_length")
    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": outputs["input_ids"],
    }

processed_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./rag_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
)

# Initialize Trainer
# Trainer compatibility with RAG models varies across transformers versions,
# so treat this as a sketch of the workflow rather than a tested recipe.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset,
)

# Start fine-tuning
trainer.train()
Evaluating RAG Systems
RAG evaluation should assess both retrieval quality and generation quality. Useful signals include recall@k, precision@k, groundedness, answer relevance, faithfulness, citation accuracy, latency, cost, and failure rates across representative test sets.
The following example demonstrates basic generation-quality evaluation with ROUGE and BLEU.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import nltk

# Tokenizer data for nltk.word_tokenize (newer NLTK releases look for 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')

def evaluate_rag(rag_model, test_data):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # Smoothing avoids zero BLEU scores on short answers with no higher-order n-gram matches
    smoothing = SmoothingFunction().method1
    rouge_scores = []
    bleu_scores = []
    for sample in test_data:
        question = sample['question']
        reference = sample['answer']
        # Generate answer using RAG model
        generated = rag_model.generate(question)
        # Calculate ROUGE scores
        rouge_score = scorer.score(reference, generated)
        rouge_scores.append(rouge_score)
        # Calculate BLEU score
        reference_tokens = nltk.word_tokenize(reference)
        generated_tokens = nltk.word_tokenize(generated)
        bleu_score = sentence_bleu([reference_tokens], generated_tokens, smoothing_function=smoothing)
        bleu_scores.append(bleu_score)
    # Calculate average scores
    avg_rouge = {key: sum(score[key].fmeasure for score in rouge_scores) / len(rouge_scores) for key in rouge_scores[0]}
    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    return {
        "ROUGE": avg_rouge,
        "BLEU": avg_bleu
    }

# Example usage (pseudo-code)
# test_data = load_test_data()
# rag_model = load_rag_model()
# evaluation_results = evaluate_rag(rag_model, test_data)
# print(evaluation_results)
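ROUGE and BLEU only score the generated text. The retrieval side, measured by recall@k and precision@k, can be computed from ranked document ids and a labeled set of relevant ids. The sketch below uses made-up ids; it divides precision by `k`, which is the common convention, so a retriever that returns fewer than `k` results is penalized.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k slots filled with relevant items."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k retrieved items."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Illustrative ranking and relevance labels for one query
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
print(f"precision@3 = {precision_at_k(retrieved, relevant, k=3):.3f}")
print(f"recall@3 = {recall_at_k(retrieved, relevant, k=3):.3f}")
```

Here two of the top three results are relevant and two of the three relevant documents were found, so both metrics equal 2/3. Averaging these per-query values across a representative test set gives the retrieval signal described above.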
Introduction to Agentic RAG
Agentic RAG extends traditional RAG by adding planning, decision-making, tool use, task decomposition, memory, and multi-step execution. Instead of only retrieving and answering, the system can decide what information to retrieve, which tool to call, how to sequence actions, and when to ask for clarification.
This increases capability, but it also increases operational risk. Agentic RAG needs stronger controls for tool permissions, action validation, audit logs, retries, and human approval for high-impact actions.
The following example shows a simplified action-selection pattern.
class AgenticRAG:
    def __init__(self):
        self.knowledge_base = {
            "weather": "It's sunny today with a high of 25°C.",
            "schedule": "You have a meeting at 2 PM.",
            "email": "You have 3 unread emails."
        }

    def retrieve(self, query):
        return self.knowledge_base.get(query.lower(), "No information found.")

    def decide_action(self, user_input):
        text = user_input.lower()
        if "weather" in text:
            return "check_weather"
        elif "schedule" in text or "meeting" in text:
            return "check_schedule"
        elif "email" in text:
            return "check_email"
        else:
            return "ask_for_clarification"

    def execute_action(self, action):
        if action == "check_weather":
            return self.retrieve("weather")
        elif action == "check_schedule":
            return self.retrieve("schedule")
        elif action == "check_email":
            return self.retrieve("email")
        else:
            return "I'm not sure what you're asking. Can you please clarify?"

    def interact(self, user_input):
        action = self.decide_action(user_input)
        return self.execute_action(action)

# Example usage
agent = AgenticRAG()
user_queries = [
    "What's the weather like?",
    "Do I have any meetings today?",
    "Check my emails",
    "What's for lunch?"
]
for query in user_queries:
    response = agent.interact(query)
    print(f"User: {query}")
    print(f"Agent: {response}\n")
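The controls mentioned earlier in this section (tool permissions, human approval, and audit logs) can be combined in a single gate that every tool call passes through. The following is a minimal sketch; the tool names, the `TOOLS` registry layout, and the `run_tool` function are hypothetical, and the tools are stand-in lambdas rather than real integrations.

```python
from datetime import datetime, timezone

# Hypothetical registry: tool name -> (callable, requires_human_approval)
TOOLS = {
    "read_calendar": (lambda: "Meeting at 2 PM", False),
    "send_email": (lambda: "Email sent", True),
}
audit_log = []

def run_tool(name, approved_by=None):
    """Run a registered tool, enforcing approval rules and recording every attempt."""
    if name not in TOOLS:
        outcome, result = "rejected_unknown_tool", None
    else:
        func, needs_approval = TOOLS[name]
        if needs_approval and approved_by is None:
            outcome, result = "blocked_needs_approval", None
        else:
            outcome, result = "executed", func()
    # Log blocked and rejected attempts as well as successful ones
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "tool": name,
        "approved_by": approved_by,
        "outcome": outcome,
    })
    return outcome, result

print(run_tool("read_calendar"))
print(run_tool("send_email"))
print(run_tool("send_email", approved_by="alice"))
print(run_tool("delete_files"))
```

Because the agent can only act through the registry, a tool name invented by the model is rejected rather than executed, and the log shows who approved each high-impact call.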
Components of Agentic RAG Systems
Agentic RAG systems typically include a planner, retriever, generator, tool router, memory layer, policy engine, evaluator, and execution controller. The system should separate reasoning, retrieval, generation, and action execution so each step can be monitored and governed.
The following example shows a simplified planner-retriever-generator structure.
import random

class Planner:
    def create_plan(self, goal):
        # Simplified planning logic: the same three stages, each scoped to the goal
        steps = [f"{stage}: {goal}" for stage in ("research", "analyze", "summarize")]
        return steps

class Retriever:
    def retrieve(self, query):
        # Simulated retrieval (a real retriever would rank documents against the query)
        documents = [
            "Document about AI advancements.",
            "Paper on machine learning algorithms.",
            "Article on natural language processing."
        ]
        return random.choice(documents)

class Generator:
    def generate(self, context, query):
        # Simulated text generation
        return f"Generated response based on {context} and query: {query}"

class AgenticRAG:
    def __init__(self):
        self.planner = Planner()
        self.retriever = Retriever()
        self.generator = Generator()

    def execute_task(self, goal):
        plan = self.planner.create_plan(goal)
        results = []
        for step in plan:
            retrieved_info = self.retriever.retrieve(step)
            results.append(self.generator.generate(retrieved_info, step))
        return " ".join(results)

# Example usage
agentic_rag = AgenticRAG()
task_goal = "Explain recent advancements in AI"
result = agentic_rag.execute_task(task_goal)
print(f"Task: {task_goal}")
print(f"Result: {result}")
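Keeping the stages separate pays off when each one is wrapped in a tracing layer, so a bad answer can be traced to the step that caused it. The sketch below is one way to do that; the `StepTrace` class and the three stand-in stage functions are hypothetical and not part of any framework.

```python
import time

class StepTrace:
    """Collects one record per pipeline stage so each step can be inspected later."""

    def __init__(self):
        self.records = []

    def run(self, stage, func, *args):
        started = time.perf_counter()
        status = "error"
        try:
            output = func(*args)
            status = "ok"
            return output
        finally:
            # Runs on success and on failure, so failed stages are recorded too
            self.records.append({
                "stage": stage,
                "status": status,
                "seconds": time.perf_counter() - started,
            })

# Stand-in stage functions for illustration
def plan(goal):
    return [f"research {goal}"]

def retrieve(step):
    return f"notes for '{step}'"

def generate(context):
    return f"Answer drawn from {context}"

trace = StepTrace()
steps = trace.run("plan", plan, "vector indexes")
context = trace.run("retrieve", retrieve, steps[0])
answer = trace.run("generate", generate, context)
for record in trace.records:
    print(record["stage"], record["status"])
```

In a real system the records would go to a logging or observability backend, and could also carry inputs, outputs, and token counts for each stage.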
Use Case: Personal Assistant
A personal assistant is a common Agentic RAG use case because it requires retrieval from personal data, task planning, schedule awareness, and controlled action execution. The assistant should distinguish between reading information, recommending action, and taking action on the user’s behalf.
The following example builds a simple daily planning assistant.
class PersonalAssistantRAG:
    def __init__(self):
        self.knowledge_base = {
            "weather": {"condition": "sunny", "temperature": 25},
            "calendar": [
                {"event": "Team Meeting", "time": "14:00"},
                {"event": "Dentist Appointment", "time": "10:00"}
            ],
            "tasks": ["Buy groceries", "Finish report", "Call mom"]
        }

    def retrieve(self, query):
        return self.knowledge_base.get(query, "No information found.")

    def plan_day(self):
        weather = self.retrieve("weather")
        calendar = self.retrieve("calendar")
        tasks = self.retrieve("tasks")
        plan = f"Today's weather: {weather['condition']}, {weather['temperature']}°C\n\n"
        plan += "Schedule:\n"
        # List events in chronological order (HH:MM strings sort correctly)
        for event in sorted(calendar, key=lambda item: item["time"]):
            plan += f"- {event['time']}: {event['event']}\n"
        plan += "\nTasks:\n"
        for task in tasks:
            plan += f"- {task}\n"
        return plan

assistant = PersonalAssistantRAG()
daily_plan = assistant.plan_day()
print(daily_plan)
Use Case: Automated Research Assistant
An automated research assistant can retrieve relevant sources, summarize findings, compare evidence, and produce structured research notes. For production use, it should preserve citations, separate facts from synthesis, and expose uncertainty where sources disagree.
The following example shows a simplified research assistant workflow.
class ResearchAssistantRAG:
    def __init__(self):
        self.knowledge_base = {
            "AI": ["Recent advancements in neural networks",
                   "Applications of machine learning in healthcare",
                   "Ethical considerations in AI development"],
            "Climate": ["Impact of greenhouse gases on global warming",
                        "Renewable energy technologies",
                        "Climate change mitigation strategies"]
        }

    def retrieve(self, topic):
        return self.knowledge_base.get(topic, [])

    def summarize(self, texts):
        # Simulated summarization: list the retrieved sources instead of calling a model
        if not texts:
            return "No sources found."
        return f"Summary of key findings from {len(texts)} sources: " + "; ".join(texts) + "."

    def conduct_research(self, topic):
        relevant_texts = self.retrieve(topic)
        summary = self.summarize(relevant_texts)
        return f"Research on {topic}:\n{summary}"

assistant = ResearchAssistantRAG()
research_topic = "AI"
research_report = assistant.conduct_research(research_topic)
print(research_report)
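The requirements to preserve citations, separate facts from synthesis, and expose disagreement are easier to enforce when the output is a structured record rather than free text. The sketch below shows one possible shape; the `Finding` and `ResearchNote` classes, the source ids, and the placeholder claims are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    claim: str
    source_id: str  # citation kept next to the claim it supports

@dataclass
class ResearchNote:
    topic: str
    findings: list = field(default_factory=list)       # sourced facts
    synthesis: str = ""                                # the assistant's own interpretation
    disagreements: list = field(default_factory=list)  # points where sources conflict

    def render(self):
        lines = [f"Topic: {self.topic}", "Findings:"]
        lines += [f"- {finding.claim} [{finding.source_id}]" for finding in self.findings]
        lines += ["Synthesis (not directly sourced):", f"- {self.synthesis}"]
        if self.disagreements:
            lines.append("Open disagreements:")
            lines += [f"- {item}" for item in self.disagreements]
        return "\n".join(lines)

# Placeholder content, not real research results
note = ResearchNote(
    topic="Renewable energy technologies",
    findings=[Finding("Placeholder claim A", "src-1"), Finding("Placeholder claim B", "src-2")],
    synthesis="Placeholder interpretation that combines A and B",
    disagreements=["src-1 and src-2 report different figures"],
)
report = note.render()
print(report)
```

Keeping findings and synthesis in separate fields lets a reviewer check every sourced claim against its citation while treating the synthesis as the assistant's own inference.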
Challenges and Future Directions
Agentic RAG systems introduce challenges around multi-step coherence, ambiguity handling, retrieval drift, tool misuse, hallucinated plans, cost control, latency, security, and governance. Future improvements will depend on stronger planners, better retrieval evaluation, more reliable tool use, and policy-aware execution.
The following example includes a simplified ethical approval step before execution.
class FutureAgenticRAG:
    def __init__(self):
        self.knowledge_base = {"AI Ethics": "Principles for responsible AI development"}
        # Simplified policy: terms that block execution
        self.blocked_terms = ["surveillance", "deceive"]

    def retrieve(self, query):
        return self.knowledge_base.get(query, "No information found.")

    def generate(self, task, context):
        return f"{task} (guided by: {context})"

    def ethical_check(self, action):
        # Simulated ethical decision-making: reject actions that mention a blocked term
        return not any(term in action.lower() for term in self.blocked_terms)

    def execute_task(self, task):
        ethics_guidelines = self.retrieve("AI Ethics")
        proposed_action = self.generate(task, ethics_guidelines)
        if self.ethical_check(proposed_action):
            return f"Executing: {proposed_action}"
        else:
            return "Action not taken due to ethical concerns."

future_rag = FutureAgenticRAG()
task = "Develop a new AI model"
result = future_rag.execute_task(task)
print(result)
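Cost control and latency, two of the challenges named above, are often handled by giving each agent run a hard budget. The sketch below stops a multi-step run once a step limit or an estimated token limit would be exceeded; the class names, the step names, and the token estimates are assumptions for illustration.

```python
class BudgetExceeded(Exception):
    pass

class RunBudget:
    """Tracks steps and estimated tokens for one agent run."""

    def __init__(self, max_steps, max_tokens):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens):
        # Refuse the step before it runs if it would break either limit
        if self.steps + 1 > self.max_steps or self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded("run budget exhausted")
        self.steps += 1
        self.tokens += tokens

def run_agent(planned_steps, budget):
    completed = []
    for name, estimated_tokens in planned_steps:
        try:
            budget.charge(estimated_tokens)
        except BudgetExceeded:
            return completed, "stopped: budget exhausted"
        completed.append(name)  # a real agent would execute the step here
    return completed, "finished"

# Illustrative plan: (step name, estimated tokens)
plan = [("retrieve", 300), ("rerank", 100), ("generate", 800), ("verify", 400)]
outcome = run_agent(plan, RunBudget(max_steps=10, max_tokens=1300))
print(outcome)
```

With a 1,300-token budget the first three steps fit (1,200 tokens) and the fourth is refused, so the run ends in a known state instead of looping or overspending.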
Additional Resources
For more information on LLMs, RAG, and Agentic RAG, consider exploring these resources:
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020), arXiv: https://arxiv.org/abs/2005.11401
- “Language Models are Few-Shot Learners” (Brown et al., 2020), arXiv: https://arxiv.org/abs/2005.14165
- “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al., 2022), arXiv: https://arxiv.org/abs/2201.11903
These papers provide useful background on language models, retrieval-augmented generation, and reasoning-oriented prompting patterns.
Closing Thoughts
The progression from standalone LLMs to RAG and then to Agentic RAG reflects a broader shift in production AI architecture. Standalone LLMs are useful for language generation, RAG adds grounded retrieval, and Agentic RAG adds planning and controlled action execution.
The architecture decision should be based on the task risk and operational need. Use standalone LLMs for low-risk language tasks, RAG when answers must be grounded in external knowledge, and Agentic RAG when the system must plan, retrieve, reason, and execute across multiple steps. The more autonomy the system has, the stronger the governance, evaluation, and observability controls must be.