NL-to-SQL Deep Dive: Schema Linking, Table Clustering, Volumetric Routing, and Why ReAct Changes Everything
Most NL-to-SQL demos work on toy schemas with 5 tables. Real databases have 500. This deep dive covers schema linking, table clustering, volumetric routing, domain bootstrapping, ReAct reasoning loops, SQL safety, and what benchmark scores actually tell you — with a detailed walk-through of a production-grade reasoning agent.
There is a party trick that every NL-to-SQL vendor performs at their demo. They show you a clean schema with four tables — orders, customers, products, regions — ask a question in plain English, and the model produces perfect SQL in one shot. The audience applauds. The investor wire transfers.
Then you try to connect it to your actual database. You have 400 tables. Half the column names are abbreviated relics of a 2009 ETL job. Your customer_id is a VARCHAR in three tables and an INT in two others. The “domain knowledge” required to know that net_rev means net revenue after returns but rev means gross revenue is stored entirely in the heads of your three most senior analysts. The demo dies on contact.
This is the honest starting point for understanding NL-to-SQL in 2026. Not as a solved problem — the research community has been claiming imminent perfection since 2018 — but as a genuinely hard engineering challenge that requires more than a prompt and a schema dump.
This post covers the real architecture: schema linking, table clustering, volumetric reasoning, domain bootstrapping, the ReAct loop, SQL safety, and what the benchmarks actually measure. Along the way we’ll walk through a concrete open-source implementation — nl-2-sql-agent — that applies these ideas in production.
The Two Hard Problems That Most Implementations Ignore
Text-to-SQL has two distinct hard problems, and most implementations only solve one of them.
Hard Problem 1: Linguistic ambiguity. Natural language is underspecified. “Show me the top customers this month” is at least five different SQL queries depending on what “top” means (by revenue? by order count? by margin?), what “this month” means (calendar month? rolling 30 days? fiscal month?), and whether you want the top 5 or the top 10 or the top 1%. A model that generates syntactically correct SQL is not the same as a model that generates semantically correct SQL. The industry has gotten quite good at hard problem 1 on clean schemas with curated examples.
Hard Problem 2: Schema navigation at scale. Real enterprise schemas have 50 to 5,000 tables. Before you can write a single line of SQL, you need to figure out which tables are relevant to the question, how those tables are connected through foreign keys, which columns carry the information you need, and whether the data volumes and distributions in those tables are actually trustworthy for the metric you’re computing. This is hard problem 2, and it’s where production NL-to-SQL systems fail in ways that no demo will ever show you.
The elegant insight underlying modern approaches — including the one we’ll examine — is that these two problems are best solved in sequence, not simultaneously. First route, then reason.
Schema Linking: The Foundational Step Everything Else Depends On
Schema linking is the process of aligning the tokens in a natural language question with the relevant tables and columns in a database schema. It sounds deceptively simple. It is not.
Consider the question: “Which product line had the lowest return rate in the Northeast region last quarter?”
The schema linking task has to figure out:
- product line → maps to which column? product_category? product_family? segment? All three might exist in different tables.
- return rate → is there a return_rate column, or does this require dividing returns by sales from two different tables?
- Northeast region → is region a string ('Northeast') or a region ID that joins to a separate regions table?
- last quarter → is there a quarter column, or does this require date arithmetic on a transaction_date column?
The naive approach — embed the question, embed every column name, take the top-k closest matches — breaks badly on abbreviations (nr, rev_adj, cust_seg_cd), business-specific synonyms (NE region = Northeast), and derived metrics that don’t map to a single column at all.
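To see why, here is a minimal sketch of that baseline, assuming the sentence-transformers package; the model choice and column names are illustrative:

```python
# A minimal sketch of the naive baseline, assuming the sentence-transformers
# package; the model choice and column names are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def naive_schema_link(question: str, columns: list[str], k: int = 10) -> list[str]:
    # Rank every column name by cosine similarity to the question, keep top-k.
    vecs = model.encode([question] + columns, normalize_embeddings=True)
    q, col_vecs = vecs[0], vecs[1:]
    sims = col_vecs @ q  # cosine similarity: vectors are already normalized
    return [columns[i] for i in np.argsort(-sims)[:k]]

# Abbreviations like 'rev_adj' or 'cust_seg_cd' embed nowhere near their
# business meanings, so they rarely survive the top-k cut -- exactly the
# failure mode described above.
```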
The 2025-2026 research literature has converged on a few approaches that actually work at scale.
Bidirectional retrieval (RSL-SQL, arXiv 2411.00073, and the related context-aware bidirectional approach of arXiv 2510.14296) runs two parallel retrieval passes: table-first then column selection, and column-first then table selection. The intersection of the two passes gives a much more robust schema subset than either alone. RSL-SQL achieves a strict schema linking recall of 94% on BIRD — meaning 94% of the time, it catches all the relevant tables and columns. That sounds high until you realise that a 6% miss rate in a multi-table query often means completely wrong SQL.
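The mechanic is easy to sketch. This is the shape of the idea rather than RSL-SQL's actual code, and it can reuse any ranking function, such as the embedding ranker above:

```python
# The shape of the bidirectional idea, not RSL-SQL's actual code. retrieve()
# can be any ranking function, e.g. naive_schema_link from the sketch above.
def bidirectional_link(question, schema, retrieve):
    """schema: dict mapping table name -> list of column names."""
    # Pass 1: table-first. Rank tables, then rank columns within the winners.
    tables = retrieve(question, list(schema), k=8)
    pass1 = {f"{t}.{c}" for t in tables
             for c in retrieve(question, schema[t], k=5)}
    # Pass 2: column-first. Rank all qualified columns across the whole schema.
    all_cols = [f"{t}.{c}" for t, cols in schema.items() for c in cols]
    pass2 = set(retrieve(question, all_cols, k=20))
    # The intersection has far fewer false positives than either pass alone.
    return pass1 & pass2
```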
Graph-based pathfinding (SchemaGraphSQL, 2025) models the schema as a graph where tables are nodes and foreign key relationships are edges. When the question mentions entities from two different corners of the schema, the pathfinding algorithm computes the shortest join path between the relevant tables. This is important because valid SQL for multi-table questions requires that all predicted tables have a connecting path of foreign keys — a table cluster that isn’t connected cannot be joined without a Cartesian product, which is almost always wrong.
Autonomous schema exploration (AutoLink, arXiv 2511.17190) takes an agentic approach: instead of predicting the full schema subset in one shot, it starts with a small seed set of likely tables and columns and iteratively expands it based on what the SQL intermediate results look like. AutoLink achieves 97.4% strict recall on BIRD-Dev, the highest reported at the time of writing, precisely because the iterative exploration catches join paths that a single-pass retrieval would miss.
The key insight across all three: schema linking is not a lookup problem. It is a reasoning problem about the relational structure of the database.
Volumetric Analysis: The Dimension Nobody Talks About
Here’s something that almost no NL-to-SQL paper covers but every data engineer knows: the schema tells you what tables exist; it doesn’t tell you which ones you should actually use.
Volumetric analysis is the practice of examining data density, row counts, cardinality distributions, and table freshness before committing to a routing decision. It matters for several reasons:
Dead tables. Enterprise databases accumulate tables that haven’t been written to in three years. They’re still in the schema. The column names look relevant. But if you join against a table that has 0 rows in the relevant time window, your query returns an empty result and the LLM concludes “there are no orders” when the reality is that the data is in a different table.
Cardinality surprises. If a customer_id column has 3 million distinct values in the orders table but only 12,000 in the customer table, that’s a data quality signal. Either there’s a massive number of guest orders not linked to customer accounts, or the foreign key relationship is broken, or the definition of “customer” is different between the two tables. A query that joins them without understanding this will produce a result that looks plausible but is statistically wrong.
Partition and time-window awareness. Many analytical questions are implicitly time-bounded (“last quarter,” “this week,” “since the product launch”). If a table has 500 million rows but your question is about the past 7 days and there’s a created_at partition column, the router needs to know that filtering by created_at > NOW() - INTERVAL '7 days' will use an index. Running an unfiltered scan of 500M rows to answer a question about 50,000 rows is not just slow — it may time out and be interpreted as “no data.”
Metric reliability. Some columns have high null rates, or are populated only for specific product lines, or contain sentinel values (-1 for “unknown”) that look like real data to an LLM. A well-designed NL-to-SQL system inspects the actual data distribution — or at minimum, uses domain metadata to encode these facts — before generating SQL that will silently aggregate over null or sentinel values.
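None of these checks requires exotic tooling; a handful of profiling queries covers most of them. A minimal sketch with SQLAlchemy, where the connection string and column names are illustrative and identifiers come from schema introspection, never from user input:

```python
# Profiling sketch with SQLAlchemy. The DSN, table, and column names are
# illustrative; identifiers come from schema introspection, never user input.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/db")  # hypothetical DSN

def profile(table: str, time_col: str, key_col: str) -> dict:
    checks = {
        # Dead-table check: any rows at all in the window we care about?
        "recent_rows": f"SELECT count(*) FROM {table} "
                       f"WHERE {time_col} > now() - interval '90 days'",
        # Cardinality check: compare with the parent table's key count.
        "distinct_keys": f"SELECT count(DISTINCT {key_col}) FROM {table}",
        # Null-rate check: how much of the join key is unusable?
        "null_rate": f"SELECT avg(CASE WHEN {key_col} IS NULL "
                     f"THEN 1.0 ELSE 0.0 END) FROM {table}",
    }
    with engine.connect() as conn:
        return {name: conn.execute(text(sql)).scalar()
                for name, sql in checks.items()}

# A 'relevant' table with recent_rows == 0 should be routed around, not joined.
```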
The nl-2-sql-agent addresses volumetric concerns through its domain YAML structure, which is designed to capture metric definitions, known data quality issues, and business-approved column interpretations. This is the mechanism through which a human’s knowledge of “rev means gross revenue before returns” gets encoded into the system, rather than being rediscovered — incorrectly — by the LLM on every query.
Domain Bootstrap: The Semantic Layer You Can’t Skip
There’s a reason that every serious enterprise BI vendor — dbt, Looker, AtScale, Oracle Cloud — has built a concept of a “semantic layer.” The semantic layer is the machinery that translates raw database schema into business-facing concepts. It’s the difference between cust_acq_dt and “the date the customer first purchased.”
The dirty secret of most NL-to-SQL systems is that they skip the semantic layer entirely and hope the LLM figures it out from schema metadata. Sometimes it does. Often it doesn’t. When it doesn’t, you get queries that are syntactically valid and semantically wrong — the most dangerous kind of failure because the output looks like an answer.
Domain bootstrapping is the process of building a lightweight semantic model automatically, typically by:
- Reading the raw database schema (table names, column names, data types, foreign keys)
- Providing a business description (“this is a stock portfolio monitoring database”)
- Using an LLM to generate semantic annotations: what each table represents, what each metric means, what the canonical relationships are between entities
The result is not a full enterprise ontology. It’s a working hypothesis — a starting point that is much better than raw schema, but still requires human review and refinement before production use. Think of it as the difference between a first-draft document and a published standard.
In nl-2-sql-agent, this is implemented in domain_bootstrap.py. The bootstrap process generates a domain YAML that encodes table-level and metric-level semantics that the router and the ReAct engine use downstream. The workflow is intentionally progressive: bootstrap → test with --trace → refine the YAML → promote to production. This mirrors how a new analyst actually learns a database — not by reading every column description, but by asking questions, getting unexpected results, and updating their mental model.
The bootstrap YAML serves as the grounding layer that makes NL-to-SQL reliable rather than probabilistic. An LLM can write grammatically correct SQL that returns the wrong answer because it doesn’t understand what “net revenue” means in this specific business. The domain YAML is how you prevent that — by encoding the definition once and reusing it across every query.
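To make the artifact concrete, here is a hypothetical fragment of what a bootstrapped domain YAML might contain — the field names are illustrative, and nl-2-sql-agent's actual YAML schema may differ:

```python
# A hypothetical fragment of a bootstrapped domain YAML. The field names are
# illustrative; nl-2-sql-agent's actual YAML schema may differ.
import yaml  # pip install pyyaml

DOMAIN = yaml.safe_load("""
description: stock portfolio monitoring database
tables:
  positions:
    meaning: current open positions, keyed by symbol and account_id
    caveats: snapshotted daily; filter to max(date) for "current"
metrics:
  net_pnl:
    definition: market_value - cost_basis
    tables: [positions]
  gross_revenue:
    definition: sum(rev)
    tables: [transactions]
    caveats: rev is gross revenue BEFORE returns; net_rev is after returns
""")

# The router and ReAct engine read these definitions instead of re-deriving
# "what does rev mean?" on every query.
print(DOMAIN["metrics"]["gross_revenue"]["caveats"])
```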
Table Clustering: Which Tables Actually Need to Join?
Given a natural language question, you don’t just need to know which tables are relevant — you need to know which tables need to join to produce the answer. This is the table clustering problem, and it’s harder than it sounds.
For the question “What was the customer acquisition cost by channel last month?” a naive schema linker might return: customers, marketing_spend, channels, campaigns, campaign_channels, customer_acquisition_events.
But not all of these need to join in the same query. You might need:
- marketing_spend → channels (to get spend by channel)
- customer_acquisition_events → channels (to get new customer count by channel)
- Then divide to get cost-per-acquisition
That’s two separate subqueries, each with its own table cluster, combined in an outer query. Feeding all six tables to the LLM at once and asking it to write the JOIN is likely to produce either a wrong result or a correct result that’s catastrophically slow because it joins tables that don’t need to be joined.
The right approach is graph-based: model the schema as a node-edge graph where edges are foreign key relationships, then for each question, identify the minimal connected subgraph that includes all the necessary tables. This minimal spanning subgraph is the table cluster for that query.
This is why tools like SchemaGraphSQL use pathfinding algorithms (Dijkstra’s, BFS over the FK graph) rather than semantic retrieval alone. Semantic retrieval tells you which tables are probably relevant by topic. Graph pathfinding tells you how to connect them into a valid JOIN chain.
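The pathfinding step itself is small. A sketch in the spirit of SchemaGraphSQL — plain BFS over the FK adjacency map, not their exact algorithm:

```python
# Plain BFS over the FK adjacency map -- the spirit of SchemaGraphSQL's
# pathfinding step, not their exact algorithm.
from collections import deque

def join_path(fk_edges, src, dst):
    """fk_edges: dict mapping table -> set of FK-connected tables.
    Returns the shortest chain of tables joining src to dst, or None."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in fk_edges.get(path[-1], set()) - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no FK path: a join here means a Cartesian product

edges = {
    "marketing_spend": {"channels"},
    "channels": {"marketing_spend", "customer_acquisition_events"},
    "customer_acquisition_events": {"channels"},
}
print(join_path(edges, "marketing_spend", "customer_acquisition_events"))
# -> ['marketing_spend', 'channels', 'customer_acquisition_events']
```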
The failure mode of skipping this step is what the 2026 research has documented as the “multi-hop bottleneck”: for questions that require three or more table joins, even frontier models like GPT-4o and Claude 3.7 produce incorrect SQL at rates that would be unacceptable in production. On Spider 2.0, the hardest publicly available NL-to-SQL benchmark (enterprise-scale schemas, multi-hop queries), GPT-4o achieves only 10.1% task success. o1-preview maxes out at 17.1%. For context, Spider 1.0 — the simpler academic benchmark that most demos are built on — sees >86% on the same models.
The gap between 86% on Spider 1.0 and 10-17% on Spider 2.0 is, in large part, the table clustering problem.
The ReAct Loop: From SQL Autocomplete to Analytical Reasoning
Here is the most important architectural idea in modern NL-to-SQL systems, and the one that differentiates an agent from a translator.
Traditional text-to-SQL: question → SQL → answer
ReAct-based NL-to-SQL: question → hypothesis → SQL → observation → refined hypothesis → SQL → observation → ... → answer
The ReAct (Reasoning + Acting) framework, originally proposed by Google Research in 2022, interleaves verbal reasoning traces with tool actions. In the SQL context, the “tool” is the database, and the “actions” are SQL queries. The loop looks like this:
Thought: “The user wants to know why their portfolio is down. First I should look at overall position-level P&L for the past 7 days to identify which holdings are negative.”
Action: SELECT symbol, sum(unrealized_pnl) as total_pnl FROM positions WHERE date >= current_date - 7 GROUP BY symbol ORDER BY total_pnl ASC LIMIT 10
Observation: [{symbol: 'NVDA', total_pnl: -18420}, {symbol: 'TSLA', total_pnl: -12300}, ...]
Thought: “NVDA and TSLA are the biggest contributors. Let me check whether this is price movement or position size change.”
Action: SELECT symbol, avg(price_at_open) as avg_entry, current_price, shares FROM positions WHERE symbol IN ('NVDA', 'TSLA') AND date = current_date
Observation: [{symbol: 'NVDA', avg_entry: 875.20, current_price: 812.40, shares: 200}, ...]
Thought: “NVDA is down 7.2% from average entry at 200 shares. That accounts for the loss. Now let me check if there’s a macro correlation — did the whole semiconductor sector move?”
And so on, until the agent has enough evidence to synthesize a grounded answer: “Your portfolio is down primarily due to a 7.2% decline in NVDA, which represents 34% of your total position value. This aligns with a broader semiconductor sector sell-off following [relevant macro event].”
This is qualitatively different from one-shot SQL generation. The agent is doing what an analyst would do: forming a hypothesis, testing it against the data, revising, drilling down, and building an evidence-backed conclusion. The SQL it generates at each step is simpler and more targeted than a single monster query that tries to answer everything at once — which also makes it more likely to be correct.
In nl-2-sql-agent, this is implemented in react_engine.py. The loop runs iteratively: form a hypothesis, generate a SQL step, validate it through the SQL guard, execute it, observe the results, and decide whether to continue or synthesize an answer. The --trace flag exposes the full thought-action-observation chain, which is invaluable for debugging incorrect answers or verifying the reasoning quality.
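Structurally the loop is simple — the value lives in the prompts, the routing context, and the guard, not the control flow. A skeleton sketch, noting that react_engine.py's actual names, prompts, and step format will differ:

```python
# A skeleton of the think -> act -> validate -> execute -> observe loop.
# react_engine.py's actual names, prompts, and step format will differ.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str          # "sql" to keep investigating, "answer" to stop
    thought: str = ""
    sql: str = ""
    text: str = ""

def react_loop(question, tables, metrics, llm, guard, execute, max_steps=6):
    """llm, guard, and execute are injection points: any LLM client,
    SQL validator, and database layer can fill them."""
    evidence = []
    for _ in range(max_steps):
        step = llm.next_step(question, tables, metrics, evidence)  # think
        if step.kind == "answer":
            return step.text, evidence       # enough evidence: synthesize
        guard.validate(step.sql)             # single-statement, read-only
        rows = execute(step.sql)             # act
        evidence.append((step.thought, step.sql, rows))  # observe, loop
    # Step budget exhausted: best-effort synthesis with the trace attached.
    return llm.synthesize(question, evidence), evidence
```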
The ReAct pattern also handles a category of questions that one-shot SQL simply cannot: root-cause questions. “Why did fulfillment delays spike this week?” is not answerable by a single SQL query. The answer emerges through an investigation: check delay rates, slice by warehouse, check staffing levels, check carrier performance, check order volume spikes. Each step informs the next. ReAct makes that iterative investigation possible.
The Two-Stage Router: How nl-2-sql-agent Narrows the Search Space
Before the ReAct engine starts its SQL investigation, nl-2-sql-agent’s router.py performs a two-stage narrowing operation that is critical for correctness and performance.
Stage 1: Table routing. Given the question and the domain YAML, the router identifies which tables are likely to be relevant. This is not schema linking in the full academic sense — it’s a coarser first pass that narrows a database with hundreds of tables down to the 5-15 tables most likely to be involved in the answer. The domain YAML is essential here: it encodes business-level semantic descriptions that help the router distinguish between transactions (operational ledger) and transaction_summary (pre-aggregated for reporting) when the question is about raw events vs. derived metrics.
Stage 2: Metric routing. Once the candidate tables are identified, the router selects which domain metrics are likely needed. Domain metrics are defined in the YAML and represent business-approved calculations — things like “return rate = count(returns) / count(sales) WHERE sales IS NOT NULL.” This prevents the ReAct engine from reinventing metric definitions on each query, which is a major source of inconsistency in production systems.
This two-stage routing approach mirrors the bidirectional retrieval pattern from the academic literature, but grounds it in a domain-specific semantic model rather than generic schema embeddings. The result is a much smaller, more targeted context that gets passed to the ReAct engine — which matters both for answer quality (less noise to reason over) and for cost (fewer tokens in the LLM context).
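A sketch of the two-stage narrowing, reusing the hypothetical domain YAML shape from the bootstrap section — this is the shape of the idea, not router.py's code:

```python
# The shape of the two-stage narrowing, not router.py's code. score() can be
# any question-vs-text similarity; DOMAIN is the parsed YAML sketched earlier.
def route(question, domain, score, k=10):
    # Stage 1: table routing against business descriptions, not raw names --
    # the 'meaning' text distinguishes transactions from transaction_summary.
    ranked = sorted(domain["tables"],
                    key=lambda t: score(question, domain["tables"][t]["meaning"]),
                    reverse=True)
    tables = set(ranked[:k])
    # Stage 2: metric routing -- keep a metric only if every table it needs
    # survived stage 1, so the engine never gets half-usable definitions.
    metrics = {name: m for name, m in domain["metrics"].items()
               if set(m.get("tables", [])) <= tables}
    return tables, metrics
```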
SQL Safety: The Non-Negotiable Layer
Let’s talk about the part that most NL-to-SQL papers skip entirely but that every production engineer loses sleep over: safety.
An NL-to-SQL system that connects directly to a production database is a potential vulnerability of the first order. The threat model has several dimensions:
Hallucinated destructive queries. An LLM can, under the right (or wrong) prompt conditions, generate a DELETE, UPDATE, or DROP statement. If your NL-to-SQL system executes it without validation, you may not find out until you’re restoring from backup. This is not theoretical — there are documented cases of LLM-generated SQL causing data corruption in production systems.
SQL injection via natural language. Research published in 2025 demonstrated backdoor attacks on LLM-based text-to-SQL models, where adversarially crafted prompts can cause the model to generate malicious SQL. The attack surface is the natural language input itself — a user can craft a question that, through the LLM’s translation, produces SQL that reads sensitive data or modifies records.
Schema exfiltration. Generated SQL can expose schema metadata (table names, column structures) through error messages or through direct INFORMATION_SCHEMA queries. In environments with strict data governance, this is a compliance issue.
Silent wrong answers. The least discussed safety issue, but arguably the most dangerous for business decisions. An LLM can generate SQL that executes without error, returns plausible-looking numbers, and is wrong. It aggregates over null values, ignores a filter condition, or joins on the wrong key. The result looks like an answer. It’s not.
nl-2-sql-agent addresses the first class of risks through a SQL guard in db.py: every generated SQL statement is validated as single-statement, read-only before execution. DDL/DML statements (INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE) are blocked at the AST level, not by string matching. Result rows are capped per query to prevent runaway result sets. This is the correct approach — AST-based validation catches obfuscated statements that string matching would miss.
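An allowlist-over-AST check is easy to sketch; this version uses the sqlglot library, though the actual parser in db.py may differ:

```python
# An allowlist-over-AST sketch using the sqlglot library; db.py's actual
# parser and rules may differ.
import sqlglot
from sqlglot import expressions as exp

class UnsafeSQL(Exception):
    pass

def validate_readonly(sql: str, dialect: str = "postgres") -> str:
    statements = sqlglot.parse(sql, read=dialect)  # raises on invalid syntax
    if len(statements) != 1:
        raise UnsafeSQL("exactly one statement allowed")
    tree = statements[0]
    # Allowlist, not blocklist: only plain SELECTs (including WITH ... SELECT)
    # pass. Obfuscated DML can never match, unlike string matching.
    if not isinstance(tree, exp.Select):
        raise UnsafeSQL(f"read-only queries only, got {type(tree).__name__}")
    return sql  # row capping (e.g. appending a LIMIT) would go here

validate_readonly("SELECT symbol, sum(net_pnl) FROM positions GROUP BY symbol")
# validate_readonly("DROP TABLE positions")  raises UnsafeSQL
```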
For the silent-wrong-answer class, the ReAct loop provides a partial mitigation: because the engine synthesizes answers from multiple SQL observations rather than a single query, egregious errors tend to surface as inconsistencies in the evidence chain. If step 2 contradicts step 1, the reasoning trace will show it. But this is not a complete solution — a reasoning agent can be consistently wrong in a direction that doesn’t create internal contradictions.
The honest answer is that production NL-to-SQL systems require a validation layer that goes beyond what any fully autonomous system can provide. The nl-2-sql-agent README is explicit about this: it’s an evidence-backed analytical reasoning tool, not a production analytics system. You should review the domain YAML, test with --trace enabled, and treat the output as a grounded starting point for analysis, not a ground truth.
What the Benchmarks Actually Tell You (and What They Don’t)
Since we’re doing a proper deep dive, let’s address the benchmark situation honestly, because it has become genuinely misleading.
Spider 1.0 is the benchmark that most NL-to-SQL demos are tested against. It has 10,181 questions across 200 databases, each with 5-6 tables. Frontier models score 86-92% execution accuracy on Spider 1.0. This sounds impressive. It is also largely useless as a predictor of real-world performance.
BIRD (Big Bench for Large-Scale Database Grounded Text-to-SQL Evaluation) is harder: denser schemas, more ambiguous questions, evidence columns that require domain knowledge to interpret. Frontier models score 58-68% on BIRD. A 2025 study found that BIRD’s strict execution-accuracy scoring agrees with domain experts only 62% of the time — a disagreement rate of nearly 40% — so even those mid-60s scores overestimate production reliability.
Spider 2.0 is the current frontier benchmark: enterprise-scale schemas, multi-hop analytical queries, complex data environments that mirror real production databases. GPT-4o: 10.1% task success. o1-preview: 17.1%. Claude 3.7 Sonnet: approximately 15%. These numbers should be tattooed somewhere visible by anyone evaluating NL-to-SQL vendors.
BIRD-Interact introduces conversational and agentic modes. o3-mini achieves 24.4% on conversational interaction. Claude 3.7 Sonnet reaches 17.78% on agentic interaction. These are the numbers closest to what a ReAct-style agent would see in real deployment — still low, but directionally higher for agentic approaches than for single-shot approaches.
The takeaway is not that NL-to-SQL is broken. It’s that the performance gap between “clean demo schema” and “your actual database” is large enough that you should plan for it architecturally. A ReAct agent that can iterate and self-correct will fail less catastrophically on hard cases than a one-shot translator, because its errors are visible in the reasoning trace rather than buried in a plausible-looking wrong number.
A Walk-Through of nl-2-sql-agent’s Architecture
Let’s put all of this together by walking through the nl-2-sql-agent codebase component by component.
db.py — Database connection and safety layer. This module handles SQLAlchemy connections, schema introspection (tables, columns, data types, foreign key relationships), and the SQL safety guard. The introspection step builds a runtime schema snapshot that the router and ReAct engine work from. The safety guard validates every SQL statement before execution — single-statement, read-only, no DDL/DML. Result rows are capped. This is where the system’s trustworthiness is enforced.
domain_bootstrap.py — Semantic domain generation. Given a business description and the schema snapshot, this module generates a domain YAML that describes what each table represents, what key metrics mean, and how entities relate. For a stock portfolio database, it might generate entries like: "positions: current open positions keyed by symbol and account_id. net_pnl = market_value - cost_basis. Use this table for portfolio-level P&L, not for historical trades." This is the semantic grounding layer that makes subsequent routing and reasoning much more accurate.
router.py — Two-stage table and metric routing. Given the question and the domain YAML, the router performs the first stage of narrowing: table routing (which tables are relevant?) and metric routing (which defined metrics apply?). The output is a focused table set and metric set that gets passed to the ReAct engine. This step prevents the ReAct engine from drowning in irrelevant schema context.
react_engine.py — The iterative reasoning loop. This is the core. The engine takes the question, the routed table set, and the metric definitions, and runs the ReAct loop: think → act (generate SQL) → validate → execute → observe → think again. The loop terminates when the agent has sufficient evidence to synthesize a confident answer, or when it hits a maximum step count (to prevent runaway loops). The reasoning trace at each step is available via --trace and is invaluable for debugging.
agent.py — CLI entry point. Supports interactive mode (REPL), single-question mode (-q), and bootstrap mode (--bootstrap). The interactive mode is particularly useful for exploratory analysis — you can ask follow-up questions that build on prior context, much like a conversation with an analyst.
api.py — FastAPI server. Exposes three endpoints: /healthz (liveness check), /v1/query (answer a question given a domain path), and /v1/bootstrap (generate a domain YAML from a description). This makes nl-2-sql-agent embeddable as a microservice in larger data infrastructure, with the domain YAML managed independently of the service.
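Calling it looks like the sketch below — the endpoint paths are the ones listed above, while the request payload fields are hypothetical, so check the repository README for the actual shapes:

```python
# Hitting the endpoints with httpx. The paths are the ones listed above; the
# payload field names are hypothetical -- check the repository README.
import httpx

BASE = "http://localhost:8000"

print(httpx.get(f"{BASE}/healthz").json())          # liveness check

resp = httpx.post(f"{BASE}/v1/query", timeout=120, json={
    "question": "Why is my portfolio down this week?",
    "domain_path": "domains/portfolio.yaml",        # hypothetical field name
})
print(resp.json())
```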
The complete flow for a production query:
User Question ("Why is my portfolio down?")
↓
router.py
[table routing: positions, trades, market_data, accounts]
[metric routing: net_pnl, return_pct, position_weight]
↓
react_engine.py
Step 1 — Thought: check overall P&L by symbol
SQL: SELECT symbol, sum(net_pnl) AS net_pnl FROM positions GROUP BY symbol ORDER BY net_pnl
Observe: NVDA -18k, TSLA -12k ...
Step 2 — Thought: verify if price movement or sizing
SQL: SELECT symbol, avg_entry_price, current_price, shares FROM positions WHERE symbol IN ('NVDA','TSLA')
Observe: NVDA down 7.2% at 200 shares ...
Step 3 — Thought: check sector-level context
SQL: SELECT sector, sum(net_pnl) FROM positions JOIN sector_map USING (symbol) GROUP BY sector
Observe: Technology -31k, Energy +4k ...
Synthesize: Portfolio down due to semiconductor selloff, primary exposure NVDA (34% of NAV)
↓
db.py SQL guard (validates each step before execution)
↓
Grounded Answer + Optional Trace
The Hard Parts That Remain Unsolved
Being honest about what doesn’t work yet is as important as celebrating what does.
Ambiguous metric definitions. If your business uses revenue to mean three different things in three different departments, no amount of schema linking fixes that. The domain YAML is only as good as the human who curates it. Bootstrap generates a working hypothesis; production requires human review.
Complex temporal reasoning. “Compared to the same period last year, adjusted for the acquisition we made in March” requires understanding fiscal calendars, business events, and time-adjusted comparisons that are genuinely hard for LLMs to get right in SQL. The ReAct loop helps — you can break it into steps — but the temporal logic still requires careful domain encoding.
Write operations and state updates. Read-only SQL is the right default for an analytical agent. But the moment a user wants to “update the target allocation” or “flag this trade for review,” you’re in a different risk category. That’s not this tool’s scope, and it shouldn’t be bolted on lightly.
Explainability for wrong answers. When the agent is wrong, the reasoning trace tells you what it did, not why it was wrong. Debugging a subtle join error or a null-handling issue from a reasoning trace is still hard. Better tooling for SQL diff and answer auditing is needed.
Token cost at scale. A 5-step ReAct loop over a domain with 200 tables, even after routing, can consume significant token budget per query. At low query volumes this is fine; at high query volumes you need to think carefully about caching domain context, pre-computing common query patterns, and routing model choices (smaller/faster models for simple routing, larger models for reasoning).
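One memoization sketch for the caching point, reusing the router sketch from earlier — load_domain and score are hypothetical stand-ins, and the cache-key normalization is illustrative:

```python
# A memoization sketch for the token-cost point: cache the routed context per
# (normalized question, domain version). route() is the two-stage sketch from
# earlier; load_domain() and score() are hypothetical stand-ins.
from functools import lru_cache

def normalize(question: str) -> str:
    return " ".join(question.lower().split())

@lru_cache(maxsize=512)
def cached_route(norm_question: str, domain_version: str):
    return route(norm_question, load_domain(domain_version), score)

# Keying on domain_version means a YAML edit naturally invalidates every
# cached routing built against the old semantics.
```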
What to Watch Next
The NL-to-SQL research landscape is moving fast. A few directions worth following:
Autonomous schema linking with self-expanding search (AutoLink’s approach) is likely to become the standard for large-schema environments. The iterative, agent-driven schema exploration that achieves 97.4% recall on BIRD-Dev represents a significant step beyond static retrieval.
Graph-RAG for schema understanding. Constructing knowledge graphs from database schema metadata and using them to ground LLM reasoning over schema relationships — rather than feeding raw schema text — is showing strong early results. Expect this to be the dominant approach in production systems within 12-18 months.
Domain YAML as a managed artifact. The shift from “LLM figures it out” to “human-curated domain model grounds the LLM” is the key architectural insight that separates reliable systems from unreliable ones. Tooling for versioning, testing, and validating domain YAMLs — similar to how dbt handles metric definitions — will become an important part of the NL-to-SQL stack.
RLHF for SQL correctness. Fine-tuning models on user feedback signals (“this answer was wrong, here’s why”) is an underexplored path for domain-specific NL-to-SQL improvement. As more production systems deploy, the labeled feedback data required for this approach is starting to accumulate.
Multi-modal BI agents. The next frontier is agents that can combine SQL over structured data with vision models over charts, documents, and dashboards, and web search over external context — a unified reasoning loop that draws evidence from structured and unstructured sources simultaneously.
The Core Insight You Should Walk Away With
NL-to-SQL is not a translation problem. It never was. The framing as a translation problem — question in, SQL out — is what produces systems that work on demos and fail in production.
It is a grounded reasoning problem. The agent needs to understand the business domain (semantic layer), identify the relevant tables and how they connect (schema linking and table clustering), examine what data is actually there (volumetric analysis), build SQL incrementally to test hypotheses (ReAct loop), validate each step before it touches production data (SQL safety), and synthesize evidence into an answer that a business user can act on.
Done right, this is genuinely powerful. The question “Why is my portfolio down?” answered with a grounded, evidence-backed, multi-step investigation is materially better than a dashboard someone built three quarters ago when the portfolio composition was different. The question “What changed in fulfillment this week?” answered through an iterative SQL investigation is how an analyst would approach it — minus the four hours of manual query writing.
That’s the real value proposition. Not SQL autocomplete. Grounded analytical reasoning over structured data, at the speed of a question.
nl-2-sql-agent is open-source and available at github.com/crazyaiml/nl-2-sql-agent. It supports any SQLAlchemy-compatible database and uses Gemini for LLM reasoning. Star it if this post was useful.
Sources
- nl-2-sql-agent GitHub repository
- AutoLink: Autonomous Schema Exploration for Scalable Schema Linking (arXiv 2511.17190)
- RSL-SQL: Robust Schema Linking in Text-to-SQL Generation (arXiv 2411.00073)
- Rethinking Schema Linking: Bidirectional Retrieval for Text-to-SQL (arXiv 2510.14296)
- SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph (arXiv 2505.18363)
- Spider 2.0: Enterprise-Scale Text-to-SQL Benchmark
- BIRD Benchmark for Large-Scale Text-to-SQL
- ReAct: Synergizing Reasoning and Acting in Language Models (Google Research)
- In-depth Analysis of LLM-based Schema Linking (EDBT 2026)
- Text-to-SQL Performance Cliff 2026 — Medium
- NL2SQL System Design Guide 2025 — Medium
- Grounding LLMs for Database Exploration — VLDB 2025
- DBCopilot: Natural Language Querying over Massive Databases via Schema Routing (EDBT 2025)
- SQL Injection via Backdoor Attacks on LLM Text-to-SQL Models (ACM 2025)
- Why Your AI Initiatives Fail Without a Semantic Layer
- Text-to-SQL Benchmarks for Enterprise Realities (OpenReview)
- Natural Language to SQL: State of the Art and Open Problems — VLDB 2025 (Tsinghua)