AI & Machine Learning

Ontology-Driven NL-to-SQL Architecture: BigQuery, Gemini, Looker, and Semantic Guardrails

Google's BigQuery + Gemini text-to-SQL guidance exposes the core problem every team hits: LLMs generate syntactically valid SQL that is semantically wrong. The fix is not a bigger prompt — it is a governed semantic layer that maps business language to schema, metrics, joins, filters, and entitlements.

Share this article
Comments
Share:
Google's BigQuery + Gemini text-to-SQL guidance exposes the core problem every team hits: LLMs generate syntactically valid SQL that is semantically wrong. The fix is not a bigger prompt — it is a governed semantic layer that maps business language to schema, metrics, joins, filters, and entitlements.
Table of Contents

Why NL2SQL Agents Need a Semantic and Ontology Layer

Learn how semantic-model grounding, ontology-backed schema mediation, and SQL validation reduce hallucinations when using LLM agents over large analytical warehouses.

There is a class of AI failure that is uniquely insidious because it looks exactly like success.

A business analyst asks: “What was our revenue from enterprise customers last quarter?” An NL-to-SQL agent processes the question, writes a SQL query, executes it against the data warehouse, and returns a number. The number is formatted correctly. It ran without errors. The pipeline logs show a green checkmark. And the number is completely wrong — because the agent joined on the customers table but filtered by account_type = 'enterprise' instead of the correct segment_tier = 'ENT', pulled from gross_revenue instead of net_revenue_after_returns, and used the calendar quarter instead of the fiscal quarter your company runs on.

This is the NL-2-SQL agent trap. And it is not a problem you solve by writing a better prompt.

Google’s public guidance on NL2SQL with BigQuery and Gemini is one of the clearest explanations of why natural language to SQL is difficult in real enterprise data environments: ambiguity checks, semantic search, vector embeddings, contribution analysis, and user clarification all matter before SQL generation becomes trustworthy. Google’s Conversational Analytics in Looker makes the semantic-layer point even more directly: Gemini-powered analytics is grounded in the Looker semantic model, where business definitions, fields, joins, permissions, and formatting rules already live.

The short version: you need a semantic layer — and, for complex domains, an ontology layer — that connects the language users speak to the schema your data actually lives in before any of the LLM magic becomes trustworthy. Without it, you are essentially asking the model to guess your business logic from column names, and column names are one of the worst-named things in software engineering.

This post breaks down why that’s true, what an ontology layer actually is in this context, how Google’s architecture addresses the problem, and what the remaining failure modes look like even when you do it right.

Why LLMs Are Dangerous at SQL Generation (But Not for the Reason People Think)

The common assumption is that LLMs are bad at SQL because they make syntax errors or write queries for the wrong database dialect. This is almost entirely wrong. Modern LLMs are excellent at SQL syntax. Given a schema dump, they can write structurally valid queries for BigQuery, Postgres, Snowflake, or DuckDB with very high reliability. They understand CTEs, window functions, complex aggregations, nested subqueries. The syntax problem is largely solved.

The problem is semantic. And it has three roots.

First: LLMs don’t know what your business means by its words. “Revenue” means something specific at your company — it might be sum(line_item_amount) with a particular join to exclude returns, credits, and intra-company transactions. The model doesn’t know that. It will write what looks reasonable based on column name similarity, which is often wrong. “Customer” might mean accounts, end_users, organizations, or legal_entities depending on which team built which table, which product line you’re reporting on, and what quarter the schema was designed in.

Second: Enterprise schemas are enormous and inconsistently named. A mid-size company’s data warehouse easily has 300–1,000+ tables, many with overlapping semantic territory built by different teams at different times. A question about “sales performance” could plausibly draw from salesforce_opportunities, revenue_recognition_events, deal_pipeline_v2, closed_won_historical, and quarterly_targets — and the correct answer uses exactly two of those five, joined in a specific way, with a filter that isn’t derivable from the schema alone. The model, without grounding, will pick the most name-similar tables and write a query that runs but measures the wrong thing.

Third: Schema metadata is not the same as semantic metadata. A schema tells you that a column called acv exists in the contracts table and is of type FLOAT64. It does not tell you that ACV means Annual Contract Value, that it’s net of discounts but gross of professional services, that it should never be aggregated directly because it’s a point-in-time snapshot, and that the correct metric for revenue reporting is normalized_arr in the finance_metrics table instead. That knowledge lives in the heads of your data team, not in the database.

These three properties combine to create a failure mode that is almost impossible to catch automatically: the query runs, returns a plausible-looking number, and is wrong in a way that requires domain expertise to detect.

What a Semantic and Ontology Layer Actually Is (And Isn’t)

In the context of NL-to-SQL, “ontology” gets used loosely. It’s worth being precise about what it means and what forms it can take.

In the strictest sense, an ontology is a formal specification of concepts, relationships, and constraints in a domain — the kind of thing you’d express in OWL/RDF, with class hierarchies, property restrictions, and inference rules. Academic NL2SQL systems sometimes use this form. In practice, enterprise NL2SQL teams almost never do, because maintaining a formal OWL ontology for a live data warehouse is a significant ongoing engineering investment that most organizations cannot sustain.

What actually works in production is a semantic layer — a governed mapping between business concepts and schema entities, maintained by the data team, queried at inference time by the NL2SQL agent. For harder domains, that semantic layer starts behaving like a lightweight ontology because it must encode concepts, relationships, constraints, allowed dimensions, metric grain, access rules, and canonical join paths. This layer can take several forms along a spectrum of formality:

A business glossary at the lightweight end maps terms to definitions and canonical column references. “Revenue” → “sum of net_revenue from finance_revenue_events table, filtered to exclude transaction_type IN ('credit', 'reversal', 'internal')”. Simple, maintainable, powerful.

A metric catalog extends this to pre-defined aggregations: named metrics with their precise SQL definitions, dimensions they can be sliced by, and grain at which they’re defined. This is what tools like dbt metrics, Looker’s LookML, and Google’s Semantic Models formalize.

A full semantic ontology adds concept hierarchies, cross-concept relationships, and inference rules — “Enterprise Customer is a subclass of Customer, where segment_tier = ‘ENT’ in the accounts table” — which allows the agent to reason about whether a question about “enterprise customers” should use a filter, a join, or a subquery depending on how the schema is structured.

Google’s BigQuery + Gemini text-to-SQL guidance operates at the prompt, ambiguity-check, semantic-search, and execution-assistance level. Looker’s Gemini-powered conversational analytics shows the more governed semantic-model pattern: natural language questions are interpreted against LookML definitions rather than directly asking the model to invent SQL over raw warehouse tables. The key insight is that you don’t need full OWL/RDF machinery to get most of the benefit — a well-maintained business glossary plus a metric catalog covers the overwhelming majority of enterprise reporting queries.

The Five Failure Modes of Naive NL-to-SQL

Before going into the architecture, it’s worth naming the specific failure modes that an ontology layer is designed to address. Each one has a different mechanism and a different mitigation.

Schema explosion. When the full schema (300+ tables, 10,000+ columns) is included in the prompt, two things happen: the context window fills up and performance degrades, and the model performs poorly at identifying the relevant subset of tables for the query. Naive implementations try to squeeze the whole schema into the prompt and get inconsistent results. The ontology layer addresses this through semantic retrieval — embedding the question and doing similarity search over ontology nodes to surface only the relevant tables, columns, and metrics before prompting.

Term ambiguity. When a business term maps to multiple possible schema interpretations, the model picks the most syntactically plausible one, which is often not the semantically correct one. The ontology layer addresses this by explicitly disambiguating: the business glossary has one canonical definition per term, and the agent uses that definition rather than inferring from column names.

Aggregation mismatch. Metrics like revenue, churn rate, or customer lifetime value have specific aggregation logic that’s not inferrable from the schema. The model might write SUM(revenue) when the correct answer requires SUM(DISTINCT revenue) / COUNT(DISTINCT month) with specific fiscal calendar adjustments. The metric catalog addresses this by providing pre-defined aggregation SQL for named metrics, which the agent uses verbatim rather than generating.

Join hallucination. When the model needs to join tables, it often guesses join conditions based on column name similarity — joining on customer_id columns that happen to have the same name but actually reference different entity types in your schema. The ontology layer addresses this by encoding the correct join relationships explicitly, so the agent can look up “how do I join contracts to accounts?” and get the right answer.

Dialect drift. Even when the model knows it’s generating BigQuery SQL, subtle dialect differences — ARRAY_AGG vs STRING_AGG, DATE_DIFF argument order, QUALIFY clause availability — cause failures that are hard to reproduce and debug. The ontology layer can address this by encoding dialect-specific SQL patterns for common operations, which the agent retrieves and uses rather than generating from scratch.

How Google Cloud’s BigQuery, Gemini, and Looker Direction Addresses This

Google Cloud’s public direction is not a single model doing end-to-end translation over a raw schema dump. Across BigQuery + Gemini text-to-SQL guidance and Looker’s Gemini-powered conversational analytics, the pattern is multi-step: classify the request, reduce ambiguity, ground the question in governed metadata, generate or compose the query, validate it, execute it, and interpret the result. The core components are:

Schema and semantic metadata extraction. Before any query is processed, the system needs a governed representation of the warehouse schema and the business semantics around it. In a BigQuery-native implementation, this can mean schema metadata, descriptions, sample values, embeddings, curated SQL examples, and metric definitions. In a Looker-backed implementation, this means LookML definitions for dimensions, measures, joins, filters, formatting, and permissions. This is the semantic/ontology layer: a structured, queryable representation of what the data means, not just what it contains.

Semantic schema retrieval. When a user submits a question, the agent should retrieve the most relevant tables, columns, metrics, examples, dimensions, and join paths before SQL planning begins. This can use embeddings and semantic search over schema documentation, metric catalogs, LookML metadata, and golden query examples. This is retrieval-augmented generation applied to schema — the model only sees the schema elements that are semantically relevant to the question, not the entire warehouse. This addresses schema explosion and improves the signal-to-noise ratio of the prompt.

Explicit SQL planning before generation. With the retrieved semantic context injected into the prompt, the model should produce an auditable plan before SQL is executed: which entities are involved, which metric is requested, what grain applies, what filters are required, what join path is allowed, and what ambiguity remains. The SQL should be generated from that plan, not directly from the user’s sentence. This planning step catches many of the semantic errors that direct generation misses.

SQL validation and self-correction. The generated SQL is validated before execution — both syntactically (parse it, check BigQuery compatibility) and semantically (verify that the tables and columns used actually match the semantic context that was retrieved). If validation fails, the agent enters a self-correction loop: it takes the error, the original question, the semantic context, and the failed SQL, and generates a corrected version. Google’s pipeline supports multiple rounds of self-correction before falling back to a human-readable error.

Execution and result interpretation. After BigQuery executes the query, the result is optionally passed back through Gemini for natural language interpretation — translating a table of numbers into a prose explanation, flagging potential data quality issues, or suggesting follow-up questions. This last step is optional but significantly improves the user experience for non-technical analysts.

The Role of Looker’s Semantic Model

One piece of Google’s ecosystem that often gets underemphasized is Looker’s LookML as a governed semantic foundation. LookML is essentially a business-logic definition language for analytical schemas — it defines dimensions, measures, join relationships, filters, access rules, formatting, and drill paths in a way that maps business concepts to governed query behavior. When Gemini-powered conversational analytics in Looker uses the Looker semantic model, it gets a business-team-maintained semantic layer instead of asking the model to infer everything from raw table names.

This is significant for two reasons. First, it means the ontology is owned by the data team, not the ML team — the people who know what “revenue” means maintain the canonical definition, and the NL2SQL agent consumes it. This is the right ownership model. ML teams are good at building agents; they’re not good at knowing that fiscal Q4 ends February 28th for this particular company.

Second, it provides a natural integration point with existing governance workflows. If your company already has a Looker semantic layer, you have much of the raw material for an NL2SQL semantic layer without building from scratch. The incremental investment is in retrieval, planning, validation, entitlement checks, and evaluation — not in asking the LLM to rediscover business logic every time someone asks a question.

For teams without Looker, the equivalent is a dbt project with well-defined metrics and thorough documentation, or a standalone business glossary maintained in something like DataHub, Atlan, or Alation. The format matters less than the practice: having a team-maintained, structured mapping between business language and schema that the NL2SQL agent can query at inference time.

Building the Ontology for Your NL2SQL Agent: What to Include

If you’re building this from scratch, the minimum viable semantic/ontology layer for an NL2SQL agent has six components.

Business term definitions. A mapping from natural language terms to canonical SQL expressions. Every ambiguous term your users are likely to ask about should have an entry: what table it lives in, what column or expression computes it, what filters apply, and what it does NOT include (negative definitions are often more important than positive ones — “revenue excludes internal transfers” is critical, not obvious).

Metric definitions. Pre-defined SQL aggregations for named metrics, with grain specification and allowed dimensions. “Monthly Active Users” → specific SQL, grain = day, dimensions = [region, product_line, user_segment]. The agent retrieves these definitions and uses them verbatim rather than deriving them.

Entity relationship map. The correct join conditions between tables, expressed as bidirectional relationships with join type, key columns, and any filter conditions on the join. “accounts JOIN contracts ON accounts.id = contracts.account_id WHERE contracts.status = ‘active’” — not just the column names, the full correct join condition.

Canonical entity resolution. A mapping from natural language entity references (“enterprise customers”, “US customers”, “paying users”) to their correct SQL filter expressions. This is where a lot of NL2SQL agents fail — the question says “enterprise customers” and the correct filter is a combination of segment tier AND minimum contract value AND account status that’s nowhere in the column names.

Entitlement and masking rules. A mapping from user roles, data domains, columns, and result shapes to allowed access. The SQL generator should not decide whether a user can see customer PII, compensation data, revenue by account, or regulated attributes. That decision belongs in the policy layer and must be enforced before execution.

Golden query examples. Verified natural-language question and SQL pairs for common analytical patterns. These examples anchor the model to approved query shapes and give the evaluation pipeline a regression set. In practice, golden queries are often more useful than generic prompt examples because they encode how your organization actually answers business questions.

The Production Gap: What Still Breaks Even With an Ontology

An ontology layer dramatically improves NL2SQL reliability, but it does not eliminate all failure modes. It’s worth being honest about what still breaks.

Ontology staleness. The semantic layer is only as good as its last update. If the data team adds a new table, renames a column, or changes the definition of a metric, the ontology needs to be updated in sync — or the NL2SQL agent will use the old definition and generate wrong queries. This is an ongoing maintenance burden that teams frequently underestimate. The ontology is not a one-time artifact; it’s a living document that needs to evolve with the schema.

Coverage gaps. No ontology covers every possible question. Users will ask things that fall outside the defined terms and metrics, and the agent will fall back to schema-based inference — with all the failure modes that entails. Monitoring which questions fall outside ontology coverage, and using that signal to prioritize ontology expansion, is a necessary operational loop.

Complex multi-hop reasoning. Some questions require reasoning that even a well-structured ontology can’t short-circuit. “Which sales reps had the highest net revenue growth in Q2 but the lowest customer satisfaction scores?” requires correlating across at least three different data domains (revenue, quotas, support tickets), and the correct join path may involve intermediate tables that the ontology’s entity relationship map doesn’t anticipate. For these queries, even Google’s pipeline can produce wrong answers.

Prompt injection via user input. Natural language questions from users can contain inadvertent (or deliberate) content that distorts the SQL generation — questions that reference tables that shouldn’t be accessible, filter conditions that bypass intended restrictions, or phrasing that causes the model to ignore ontology grounding in favor of direct schema inference. Sanitization and guardrails at the input layer are necessary, especially for enterprise deployments.

Permission leakage. Even semantically correct SQL can be unsafe if the agent can reference unauthorized tables or columns. The execution layer must validate generated SQL against the authenticated user’s permissions, row-level rules, column-level rules, and masking policies before the query reaches BigQuery.

Cost blowups. A query can be semantically correct and still operationally dangerous if it scans too much data. NL2SQL systems need dry-run cost estimation, row-scan limits, query timeouts, and workload isolation so exploratory natural-language questions do not compete with production BI or batch workloads.

What to Build If You’re Doing This Today

The practical implication of all of this is that NL-to-SQL is an engineering problem, not a prompting problem. Getting it right in production requires three investments that most teams resist making until they’ve shipped a half-working prototype and seen the failure modes in production.

Invest in the semantic layer first, not last. Most teams build the NL2SQL pipeline first and bolt on the ontology after they hit the semantic failure modes. This is backwards. The semantic layer should be the foundation — even a lightweight business glossary document is enough to start. Build the retrieval pipeline around it from day one, and you’ll avoid rewriting the prompt architecture three times.

Instrument for semantic failures, not just execution failures. An NL2SQL agent that returns wrong answers without errors is worse than one that fails loudly. You need instrumentation that captures the question, the retrieved semantic context, the generated SQL, and the query result — and periodic human review to catch semantic failures that don’t surface as exceptions. Eval datasets of known question/answer pairs, with expected SQL, are the only reliable way to measure this.

Own the ontology with the data team, not the ML team. The knowledge about what your data means lives with the analysts and data engineers who built it. The NL2SQL ontology should be maintained in the same workflow as your data documentation — not in a separate AI system owned by the ML team. The closer the ontology is to the source of truth about your data semantics, the more likely it is to stay accurate over time.

Google’s BigQuery + Gemini architecture is one of the most sophisticated public implementations of this approach, and it still has gaps. That’s not a criticism — it’s an honest signal about the difficulty of the problem. Natural language is imprecise by design, and enterprise schemas are complicated by history. The ontology layer is the mechanism that mediates between the two, and building it well is the difference between an NL2SQL agent that your data team trusts and one that your data team disables after a month because it kept returning wrong revenue numbers to the CFO.

The technology stack — Gemini, BigQuery, Vertex AI, embeddings, semantic retrieval — is available to everyone now. The semantic layer is the hard part. Start there.

C4 Architecture Diagram (Container View)

flowchart LR
    U[Business User] --> API[NL2SQL API Gateway]
    API --> CLASSIFY[Intent & Risk Classifier]
    CLASSIFY --> AMBIG[Ambiguity Check]
    AMBIG --> PLAN[Query Planning Service]
    PLAN --> ONT[Semantic & Ontology Layer]
    PLAN --> SCHEMA[Schema and Metric Retriever]
    PLAN --> LLM[Gemini SQL Planner]
    LLM --> VAL[SQL Validation Service]
    VAL --> DRYRUN[BigQuery Dry Run & Cost Gate]
    DRYRUN --> ENT[Security & Entitlement Gate]
    ENT --> BQ[BigQuery Execution Engine]
    BQ --> RESP[Result Interpreter]
    RESP --> API

Ontology-to-Schema Mapping Pattern

Use explicit mappings from business terms to canonical schema definitions. Example:

business_terms:
  revenue:
    canonical_metric: net_revenue_after_returns
    source_table: finance_revenue_events
    dimensions: [region, product_line, fiscal_quarter]
    constraints:
      - exclude_transaction_types: [credit, reversal, internal_transfer]
  enterprise_customer:
    source_table: accounts
    filter: "segment_tier = 'ENT' AND account_status = 'active'"

This mapping should be versioned with data-model changes and reviewed by both analytics and platform teams.

Query Planning JSON Contract

Before SQL generation, build an explicit planning object:

{
  "question": "What was enterprise net revenue last fiscal quarter?",
  "intent": "metric_lookup",
  "risk_level": "medium",
  "ambiguities": [],
  "resolved_entities": ["enterprise_customer", "revenue"],
  "candidate_tables": ["finance_revenue_events", "accounts", "fiscal_calendar"],
  "selected_metric": "net_revenue_after_returns",
  "grain": "fiscal_quarter",
  "allowed_dimensions": ["region", "product_line", "segment_tier"],
  "join_path": [
    "finance_revenue_events.account_id = accounts.id",
    "finance_revenue_events.date_key = fiscal_calendar.date_key"
  ],
  "required_filters": [
    "accounts.segment_tier = 'ENT'",
    "accounts.account_status = 'active'",
    "fiscal_calendar.is_last_fiscal_quarter = true"
  ],
  "entitlement_scope": {
    "requires_pii": false,
    "allowed_data_domains": ["finance", "accounts"],
    "masking_required": false
  },
  "execution_guardrails": {
    "dry_run_required": true,
    "max_bytes_billed": "10GB",
    "timeout_seconds": 30
  }
}

This contract makes planning auditable and easier to evaluate than raw prompt-only generation.

SQL Validation Flow

  1. Parse and lint generated SQL.
  2. Validate table/column references against the authorized schema graph.
  3. Verify metric, grain, and aggregation compatibility.
  4. Check join-path consistency with semantic and ontology definitions.
  5. Enforce row-level, column-level, and domain-level entitlements.
  6. Enforce execution guardrails such as cost, row-scan, and timeout limits.
  7. Run dry-run validation in BigQuery before execution.
  8. Log the question, plan, SQL, policy result, dry-run estimate, and final outcome.

Security and Entitlement Gate

Enforce policy prior to query execution:

  • Row-level and column-level access checks per user role.
  • Query-template allowlist for high-risk domains.
  • PII masking and redaction in result payloads.
  • Audit log record containing prompt, plan, SQL, and policy decisions.

Without this gate, NL-to-SQL systems become privilege-escalation surfaces.

BigQuery-Specific Implementation Notes

  • Use dry runs for cost estimation before execution.
  • Prefer parameterized SQL and avoid dynamic string interpolation.
  • Encode fiscal calendar logic in canonical views, not prompts.
  • Materialize frequently used semantic joins into curated datasets.
  • Track slot usage and query priority for NL-to-SQL workloads separately from batch BI jobs.

Common Failure Modes

  • Term ambiguity: business terms map to multiple metrics.
  • Join hallucination: syntactically valid joins with wrong semantics.
  • Metric drift: updated definitions not reflected in ontology.
  • Entitlement bypass: generated SQL reaches unauthorized fields.
  • Prompt leakage: user asks for hidden schema internals.

Build defenses for these explicitly; they do not disappear with better prompts.

Evaluation Checklist

  • Accuracy on gold business questions, measured by semantic correctness, not just SQL validity.
  • Match rate between generated SQL and approved metric definitions.
  • Join-path correctness against the semantic relationship map.
  • Policy compliance rate for entitlement and masking rules.
  • Query execution success rate, dry-run rejection rate, and median latency.
  • Cost per successful business answer.
  • Regression stability across model, prompt, ontology, and schema-version upgrades.
  • Reviewer agreement for high-impact finance and compliance questions.

Try The Lab

Use the hands-on implementation while reading this article:

Enterprise AI Architecture

Want more enterprise AI architecture breakdowns?

Subscribe to SuperML.

Comments

Sign in to leave a comment

Back to Blog

Related Posts

View All Posts »

The Ontology Layer Every Enterprise AI System Needs (But Almost None Have)

Most enterprise AI systems fail in production not because the models are wrong, but because nobody defined what 'customer', 'transaction', or 'risk' means consistently across systems. This is a practical implementation guide for building the semantic layer that makes AI grounded, governed, and production-ready.