Agentic + LLM Systems

From LLMs to Agentic RAG Architecture

A practical architecture guide to moving from standalone LLMs to RAG and agentic RAG systems with retrieval, planning, tool use, evaluation, and governance controls.

Share this article
Comments
Share:
Table of Contents

Introduction to Large Language Models

Large Language Models are foundation models trained on large text corpora to understand, transform, and generate language. They are useful for summarization, classification, extraction, reasoning assistance, code generation, and conversational interfaces.

In production systems, an LLM by itself is rarely enough. The model needs grounding, retrieval, guardrails, evaluation, observability, and integration with deterministic application logic.

The following example shows a basic local text-generation pattern using a pretrained model.

import transformers

# Load a pre-trained LLM
model_name = "gpt2"
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Generate text
input_text = "The future of AI is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50, num_return_sequences=1)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Training Large Language Models

LLMs are commonly trained with self-supervised objectives such as next-token prediction across large-scale text datasets. This allows the model to learn statistical language patterns, representations, and task behavior that can later be adapted through prompting, fine-tuning, instruction tuning, or retrieval augmentation.

For most enterprise teams, training a foundation model from scratch is not practical. The more realistic pattern is to select a base model, adapt it for domain behavior, and wrap it with retrieval, evaluation, and governance controls.

The following example shows a simplified fine-tuning setup.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Prepare dataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text/file.txt",
    block_size=128
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Start training
trainer.train()

Limitations of Standalone LLMs

Standalone LLMs are powerful, but they have important production limitations: stale knowledge, no native access to private enterprise data, weak source attribution, hallucination risk, inconsistent output structure, and limited auditability.

These limitations are why production LLM applications usually require retrieval, tool access, structured outputs, monitoring, and human review for high-risk workflows.

The following example illustrates the risk of asking a model for current or changing facts without grounding.

import openai

openai.api_key = 'your-api-key'

def query_llm(prompt):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=100
    )
    return response.choices[0].text.strip()

# Example of a limitation: outdated information
prompt = "What is the current population of New York City?"
result = query_llm(prompt)
print(f"LLM Response: {result}")
print("Note: This information might be outdated or inaccurate.")

Introduction to Retrieval-Augmented Generation

Retrieval-Augmented Generation combines an LLM with an external retrieval layer. Instead of relying only on model memory, the application retrieves relevant documents, chunks, metadata, or records and passes that evidence into the prompt before generation.

RAG is a practical architecture pattern for grounding model responses in enterprise knowledge, reducing unsupported claims, improving freshness, and enabling source-aware answers.

The following example demonstrates a basic RAG-style generation flow.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset

# Load model and tokenizer
model_name = "facebook/rag-token-nq"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load dataset
dataset = load_dataset("nq_open", split="train[:100]")

# Function to generate answer
def generate_answer(question):
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
question = dataset[0]["question"]
answer = generate_answer(question)
print(f"Question: {question}")
print(f"Generated Answer: {answer}")

Components of RAG Systems

A RAG system usually includes ingestion, chunking, embedding, indexing, retrieval, reranking, context assembly, generation, citation handling, and evaluation. The retriever optimizes recall, the reranker improves precision, and the generator converts grounded context into a user-facing response.

The following example shows a simplified retriever and generator abstraction.

import faiss
import numpy as np
from transformers import DPRQuestionEncoder, DPRContextEncoder

# Simplified RAG components

class Retriever:
    def __init__(self, context_encoder, passages):
        self.context_encoder = context_encoder
        self.passages = passages
        self.index = self._build_index()

    def _build_index(self):
        embeddings = self.context_encoder(self.passages)
        index = faiss.IndexFlatIP(embeddings.shape[1])
        index.add(embeddings)
        return index

    def retrieve(self, query, k=5):
        query_embedding = self.context_encoder([query])
        _, indices = self.index.search(query_embedding, k)
        return [self.passages[i] for i in indices[0]]

class Generator:
    def __init__(self, model):
        self.model = model

    def generate(self, query, retrieved_passages):
        context = " ".join(retrieved_passages)
        input_text = f"Query: {query}\nContext: {context}\nAnswer:"
        return self.model(input_text)

# Usage example (pseudo-code)
# retriever = Retriever(context_encoder, passages)
# generator = Generator(llm_model)
# query = "What is the capital of France?"
# retrieved_passages = retriever.retrieve(query)
# answer = generator.generate(query, retrieved_passages)

Implementing RAG with Hugging Face Transformers

Hugging Face provides models and utilities that can be used to experiment with retrieval-augmented generation. For production systems, this baseline should be extended with domain indexing, retrieval evaluation, latency controls, and source governance.

The following example uses a Hugging Face RAG model for question answering.

from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration
from datasets import load_dataset

# Load RAG components
model_name = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(model_name)
retriever = RagRetriever.from_pretrained(model_name, index_name="exact", use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained(model_name, retriever=retriever)

# Load a sample dataset
dataset = load_dataset("nq_open", split="train[:5]")

# Function to generate answer using RAG
def generate_rag_answer(question):
    input_dict = tokenizer(question, return_tensors="pt")
    generated = model.generate(**input_dict)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Example usage
for sample in dataset:
    question = sample["question"]
    answer = generate_rag_answer(question)
    print(f"Question: {question}")
    print(f"RAG Answer: {answer}\n")

Advantages of RAG over Standalone LLMs

RAG systems can provide fresher information, reduce hallucination risk, support source attribution, and make enterprise knowledge available to the model without retraining. RAG also gives engineering teams more control over what information is eligible for generation.

The following example contrasts a standalone LLM response with a retrieval-grounded response.

import random
from datetime import datetime

class TraditionalLLM:
    def generate(self, prompt):
        return "Generated response based on training data up to 2022."

class RAGSystem:
    def __init__(self):
        self.knowledge_base = {
            "AI advancements": "Latest AI models achieve human-level performance in various tasks.",
            "Climate change": "Global temperature rise of 1.1°C observed since pre-industrial times.",
            "COVID-19": "New variants continue to emerge, highlighting the importance of vaccination."
        }

    def retrieve(self, query):
        return random.choice(list(self.knowledge_base.values()))

    def generate(self, prompt):
        retrieved_info = self.retrieve(prompt)
        current_date = datetime.now().strftime("%Y-%m-%d")
        return f"As of {current_date}, {retrieved_info}"

# Compare traditional LLM and RAG system
llm = TraditionalLLM()
rag = RAGSystem()

prompt = "Tell me about recent developments in AI."
print(f"Traditional LLM: {llm.generate(prompt)}")
print(f"RAG System: {rag.generate(prompt)}")

Fine-Tuning RAG Models

Fine-tuning can adapt model behavior for domain language, answer style, and task-specific patterns. However, fine-tuning should not be treated as a replacement for retrieval when the issue is factual freshness, private knowledge, or source grounding.

The following example shows a simplified RAG fine-tuning workflow.

from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained RAG model
model_name = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(model_name)
retriever = RagRetriever.from_pretrained(model_name, index_name="exact", use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained(model_name, retriever=retriever)

# Prepare dataset (example using a QA dataset)
dataset = load_dataset("squad", split="train[:1000]")

def preprocess_function(examples):
    inputs = tokenizer(examples["question"], truncation=True, padding="max_length")
    outputs = tokenizer(examples["answers"]["text"][0], truncation=True, padding="max_length")
    return {
        "input_ids": inputs.input_ids,
        "attention_mask": inputs.attention_mask,
        "labels": outputs.input_ids,
    }

processed_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./rag_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset,
)

# Start fine-tuning
trainer.train()

Evaluating RAG Systems

RAG evaluation should assess both retrieval quality and generation quality. Useful signals include recall@k, precision@k, groundedness, answer relevance, faithfulness, citation accuracy, latency, cost, and failure rates across representative test sets.

The following example demonstrates basic generation-quality evaluation with ROUGE and BLEU.

from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import nltk
nltk.download('punkt')

def evaluate_rag(rag_model, test_data):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = []
    bleu_scores = []

    for sample in test_data:
        question = sample['question']
        reference = sample['answer']
        
        # Generate answer using RAG model
        generated = rag_model.generate(question)
        
        # Calculate ROUGE scores
        rouge_score = scorer.score(reference, generated)
        rouge_scores.append(rouge_score)
        
        # Calculate BLEU score
        reference_tokens = nltk.word_tokenize(reference)
        generated_tokens = nltk.word_tokenize(generated)
        bleu_score = sentence_bleu([reference_tokens], generated_tokens)
        bleu_scores.append(bleu_score)

    # Calculate average scores
    avg_rouge = {key: sum(score[key].fmeasure for score in rouge_scores) / len(rouge_scores) for key in rouge_scores[0]}
    avg_bleu = sum(bleu_scores) / len(bleu_scores)

    return {
        "ROUGE": avg_rouge,
        "BLEU": avg_bleu
    }

# Example usage (pseudo-code)
# test_data = load_test_data()
# rag_model = load_rag_model()
# evaluation_results = evaluate_rag(rag_model, test_data)
# print(evaluation_results)

Introduction to Agentic RAG

Agentic RAG extends traditional RAG by adding planning, decision-making, tool use, task decomposition, memory, and multi-step execution. Instead of only retrieving and answering, the system can decide what information to retrieve, which tool to call, how to sequence actions, and when to ask for clarification.

This increases capability, but it also increases operational risk. Agentic RAG needs stronger controls for tool permissions, action validation, audit logs, retries, and human approval for high-impact actions.

The following example shows a simplified action-selection pattern.

import random

class AgenticRAG:
    def __init__(self):
        self.knowledge_base = {
            "weather": "It's sunny today with a high of 25°C.",
            "schedule": "You have a meeting at 2 PM.",
            "email": "You have 3 unread emails."
        }

    def retrieve(self, query):
        return self.knowledge_base.get(query.lower(), "No information found.")

    def decide_action(self, user_input):
        if "weather" in user_input.lower():
            return "check_weather"
        elif "schedule" in user_input.lower():
            return "check_schedule"
        elif "email" in user_input.lower():
            return "check_email"
        else:
            return "ask_for_clarification"

    def execute_action(self, action):
        if action == "check_weather":
            return self.retrieve("weather")
        elif action == "check_schedule":
            return self.retrieve("schedule")
        elif action == "check_email":
            return self.retrieve("email")
        else:
            return "I'm not sure what you're asking. Can you please clarify?"

    def interact(self, user_input):
        action = self.decide_action(user_input)
        return self.execute_action(action)

# Example usage
agent = AgenticRAG()
user_queries = [
    "What's the weather like?",
    "Do I have any meetings today?",
    "Check my emails",
    "What's for lunch?"
]

for query in user_queries:
    response = agent.interact(query)
    print(f"User: {query}")
    print(f"Agent: {response}\n")

Components of Agentic RAG Systems

Agentic RAG systems typically include a planner, retriever, generator, tool router, memory layer, policy engine, evaluator, and execution controller. The system should separate reasoning, retrieval, generation, and action execution so each step can be monitored and governed.

The following example shows a simplified planner-retriever-generator structure.

import random

class Planner:
    def create_plan(self, goal):
        # Simplified planning logic
        steps = ["research", "analyze", "summarize"]
        return steps

class Retriever:
    def retrieve(self, query):
        # Simulated retrieval
        documents = [
            "Document about AI advancements.",
            "Paper on machine learning algorithms.",
            "Article on natural language processing."
        ]
        return random.choice(documents)

class Generator:
    def generate(self, context, query):
        # Simulated text generation
        return f"Generated response based on {context} and query: {query}"

class AgenticRAG:
    def __init__(self):
        self.planner = Planner()
        self.retriever = Retriever()
        self.generator = Generator()

    def execute_task(self, goal):
        plan = self.planner.create_plan(goal)
        result = ""
        for step in plan:
            retrieved_info = self.retriever.retrieve(step)
            result += self.generator.generate(retrieved_info, step) + " "
        return result.strip()

# Example usage
agentic_rag = AgenticRAG()
task_goal = "Explain recent advancements in AI"
result = agentic_rag.execute_task(task_goal)
print(f"Task: {task_goal}")
print(f"Result: {result}")

Use Case: Personal Assistant

A personal assistant is a common Agentic RAG use case because it requires retrieval from personal data, task planning, schedule awareness, and controlled action execution. The assistant should distinguish between reading information, recommending action, and taking action on the user’s behalf.

The following example builds a simple daily planning assistant.

import random
from datetime import datetime, timedelta

class PersonalAssistantRAG:
    def __init__(self):
        self.knowledge_base = {
            "weather": {"condition": "sunny", "temperature": 25},
            "calendar": [
                {"event": "Team Meeting", "time": "14:00"},
                {"event": "Dentist Appointment", "time": "10:00"}
            ],
            "tasks": ["Buy groceries", "Finish report", "Call mom"]
        }

    def retrieve(self, query):
        return self.knowledge_base.get(query, "No information found.")

    def plan_day(self):
        weather = self.retrieve("weather")
        calendar = self.retrieve("calendar")
        tasks = self.retrieve("tasks")

        plan = f"Today's weather: {weather['condition']}, {weather['temperature']}°C\n\n"
        plan += "Schedule:\n"
        for event in calendar:
            plan += f"- {event['time']}: {event['event']}\n"
        plan += "\nTasks:\n"
        for task in tasks:
            plan += f"- {task}\n"

        return plan

assistant = PersonalAssistantRAG()
daily_plan = assistant.plan_day()
print(daily_plan)

Use Case: Automated Research Assistant

An automated research assistant can retrieve relevant sources, summarize findings, compare evidence, and produce structured research notes. For production use, it should preserve citations, separate facts from synthesis, and expose uncertainty where sources disagree.

The following example shows a simplified research assistant workflow.

class ResearchAssistantRAG:
    def __init__(self):
        self.knowledge_base = {
            "AI": ["Recent advancements in neural networks",
                   "Applications of machine learning in healthcare",
                   "Ethical considerations in AI development"],
            "Climate": ["Impact of greenhouse gases on global warming",
                        "Renewable energy technologies",
                        "Climate change mitigation strategies"]
        }

    def retrieve(self, topic):
        return self.knowledge_base.get(topic, [])

    def summarize(self, texts):
        # Simulated summarization
        return "Summary of key findings from multiple sources."

    def conduct_research(self, topic):
        relevant_texts = self.retrieve(topic)
        summary = self.summarize(relevant_texts)
        return f"Research on {topic}:\n{summary}"

assistant = ResearchAssistantRAG()
research_topic = "AI"
research_report = assistant.conduct_research(research_topic)
print(research_report)

Challenges and Future Directions

Agentic RAG systems introduce challenges around multi-step coherence, ambiguity handling, retrieval drift, tool misuse, hallucinated plans, cost control, latency, security, and governance. Future improvements will depend on stronger planners, better retrieval evaluation, more reliable tool use, and policy-aware execution.

The following example includes a simplified ethical approval step before execution.

import random

class FutureAgenticRAG:
    def __init__(self):
        self.knowledge_base = {"AI Ethics": "Principles for responsible AI development"}

    def retrieve(self, query):
        return self.knowledge_base.get(query, "No information found.")

    def generate(self, context):
        return f"Generated response based on: {context}"

    def ethical_check(self, action):
        ethics_guidelines = self.retrieve("AI Ethics")
        # Simulated ethical decision-making
        return random.choice([True, False])

    def execute_task(self, task):
        retrieved_info = self.retrieve(task)
        proposed_action = self.generate(retrieved_info)
        
        if self.ethical_check(proposed_action):
            return f"Executing: {proposed_action}"
        else:
            return "Action not taken due to ethical concerns."

future_rag = FutureAgenticRAG()
task = "Develop a new AI model"
result = future_rag.execute_task(task)
print(result)

Additional Resources

For more information on LLMs, RAG, and Agentic RAG, consider exploring these resources:

  1. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020) ArXiv: https://arxiv.org/abs/2005.11401
  2. “Language Models are Few-Shot Learners” (Brown et al., 2020) ArXiv: https://arxiv.org/abs/2005.14165
  3. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al., 2022) ArXiv: https://arxiv.org/abs/2201.11903

These papers provide useful background on language models, retrieval-augmented generation, and reasoning-oriented prompting patterns.

Closing Thoughts

The progression from standalone LLMs to RAG and then to Agentic RAG reflects a broader shift in production AI architecture. Standalone LLMs are useful for language generation, RAG adds grounded retrieval, and Agentic RAG adds planning and controlled action execution.

The architecture decision should be based on the task risk and operational need. Use standalone LLMs for low-risk language tasks, RAG when answers must be grounded in external knowledge, and Agentic RAG when the system must plan, retrieve, reason, and execute across multiple steps. The more autonomy the system has, the stronger the governance, evaluation, and observability controls must be.

Enterprise AI Architecture

Want more enterprise AI architecture breakdowns?

Subscribe to SuperML.

Comments

Sign in to leave a comment

Back to Blog

Related Posts

View All Posts »