Lakehouse for AI Agents

AI agents that need to reason over data are only as reliable as the data infrastructure underneath them. An agent that hallucinates schema details, queries stale data, or returns results from tables it was never supposed to access is not useful. A lakehouse built for AI agents addresses each of those failure modes.

What AI Agents Need from Data Infrastructure

graph TD
    A["AI Agent Requirements"]
    A --> B["Documented data: Agent must understand table meanings, not just column names"]
    A --> C["Governed access: Agent is authenticated and sees only what it is allowed to"]
    A --> D["Consistent data: No partial writes, no phantom reads, no stale caches"]
    A --> E["Queryable interface: SQL, MCP resource, or API the agent framework understands"]

Component 1: Apache Iceberg as the Data Layer

Apache Iceberg contributes three properties that matter for agents:

Snapshot isolation: every query runs against a complete, immutable snapshot. The agent never sees a table mid-write.
Time travel: if an agent's answer needs to be reproduced or audited, you can re-query the exact snapshot the agent used.
Branching: agents that write data can write to a branch rather than production, allowing review before publishing.
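
The three properties above can be illustrated with a toy in-memory model. This is a sketch of the idea only, not the Iceberg API: the `ToyTable` class, its methods, and the sample rows are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int]


@dataclass
class ToyTable:
    """In-memory stand-in for a snapshot-based table (not the Iceberg API)."""
    # Snapshot 0 is the empty table; every commit appends a new immutable tuple.
    snapshots: List[Tuple[Row, ...]] = field(default_factory=lambda: [()])
    # A branch is only a named pointer to a snapshot id.
    branches: Dict[str, int] = field(default_factory=lambda: {"main": 0})

    def commit(self, rows: List[Row], branch: str = "main") -> int:
        """Append rows as a new snapshot and move the branch pointer to it."""
        base = self.snapshots[self.branches[branch]]
        self.snapshots.append(base + tuple(rows))
        snapshot_id = len(self.snapshots) - 1
        self.branches[branch] = snapshot_id
        return snapshot_id

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        """Create a new pointer at the current head of an existing branch."""
        self.branches[name] = self.branches[from_branch]

    def read(self, branch: str = "main",
             snapshot_id: Optional[int] = None) -> Tuple[Row, ...]:
        """Read a pinned snapshot (time travel) or the head of a branch."""
        if snapshot_id is None:
            snapshot_id = self.branches[branch]
        return self.snapshots[snapshot_id]


table = ToyTable()
v1 = table.commit([("orders", 100)])

# The agent records the snapshot it queried so the answer can be audited later.
agent_snapshot = table.branches["main"]
agent_view = table.read(snapshot_id=agent_snapshot)

# A later production write does not change what the pinned snapshot returns.
table.commit([("orders", 250)])
assert table.read(snapshot_id=agent_snapshot) == agent_view

# An agent write goes to a branch; main is untouched until it is published.
table.create_branch("agent_staging")
table.commit([("refunds", 7)], branch="agent_staging")
```

The point of the model is that readers never hold a reference to mutable state: they hold a snapshot id, and a snapshot id always resolves to the same rows.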

Component 2: A Governed Catalog

The catalog is the access control point. When an agent queries a table, the catalog checks what the agent's principal is authorized to access and vends storage credentials scoped to that access. With Apache Polaris as the catalog, you create a principal for each agent identity, assign it a role with the appropriate table-level grants, and Polaris enforces those boundaries on every catalog request.
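
The principal, role, and grant relationship can be sketched as a small in-memory model. This is an assumption-laden illustration, not the Polaris API: `ToyCatalog`, its method names, the role and table names, and the credential string format are all hypothetical, and a real catalog enforces these checks server-side.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog used only to illustrate the flow."""
    role_grants: Dict[str, Set[str]] = field(default_factory=dict)      # role -> tables
    principal_roles: Dict[str, Set[str]] = field(default_factory=dict)  # principal -> roles

    def grant(self, role: str, table: str) -> None:
        """Give a role read access to one table."""
        self.role_grants.setdefault(role, set()).add(table)

    def assign(self, principal: str, role: str) -> None:
        """Attach a role to an agent identity."""
        self.principal_roles.setdefault(principal, set()).add(role)

    def allowed_tables(self, principal: str) -> Set[str]:
        """Union of the tables granted to every role the principal holds."""
        tables: Set[str] = set()
        for role in self.principal_roles.get(principal, set()):
            tables |= self.role_grants.get(role, set())
        return tables

    def load_table(self, principal: str, table: str) -> str:
        """Return a credential scoped to one table, or refuse the request."""
        if table not in self.allowed_tables(principal):
            raise PermissionError(f"{principal} may not read {table}")
        # Stand-in for vended storage credentials limited to this table's files.
        return f"scoped-credential:{principal}:{table}"


catalog = ToyCatalog()
catalog.grant("sales_reader", "sales.orders")
catalog.assign("support-agent", "sales_reader")

credential = catalog.load_table("support-agent", "sales.orders")

try:
    catalog.load_table("support-agent", "hr.salaries")
    denied = ""
except PermissionError as exc:
    denied = str(exc)
```

Because the check happens at the catalog rather than in the agent's prompt, an agent that generates a query against an ungranted table is refused before any storage credential exists for it.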

Component 3: The Semantic Layer

Raw schemas are not enough for agents. Column names like rev, cnt, or flag_b are opaque. A semantic layer provides the business vocabulary that makes agent SQL generation accurate: table descriptions, column meanings with units and valid values, pre-defined metric calculations, relationship declarations, and business filter rules.
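
One way to picture what a semantic layer hands to an agent is as structured documentation rendered into the context used for SQL generation. The sketch below assumes hypothetical `ColumnDoc` and `TableDoc` types and a hypothetical table name, and the meanings given for rev, cnt, and flag_b are invented for illustration; it is not how any particular product stores or serves this metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnDoc:
    name: str
    meaning: str
    unit: Optional[str] = None
    valid_values: Optional[List[str]] = None


@dataclass
class TableDoc:
    name: str
    description: str
    columns: List[ColumnDoc]


def render_context(table: TableDoc) -> str:
    """Render table and column documentation as plain text for an agent prompt."""
    lines = [f"Table {table.name}: {table.description}"]
    for col in table.columns:
        detail = col.meaning
        if col.unit:
            detail += f" (unit: {col.unit})"
        if col.valid_values:
            detail += f" (valid values: {', '.join(col.valid_values)})"
        lines.append(f"- {col.name}: {detail}")
    return "\n".join(lines)


# Invented meanings for the opaque column names mentioned in the text.
orders = TableDoc(
    name="sales.orders",
    description="One row per completed order",
    columns=[
        ColumnDoc("rev", "Net revenue per order", unit="USD"),
        ColumnDoc("cnt", "Number of line items in the order"),
        ColumnDoc("flag_b", "Whether the order is business-to-business",
                  valid_values=["Y", "N"]),
    ],
)

context = render_context(orders)
print(context)
```

With this context in the prompt, the agent no longer has to guess that rev is a currency amount or that flag_b accepts only two values.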

Component 4: Agent Connection Interfaces

graph LR
    A["AI Agent (Claude, GPT-4, custom LLM)"]
    B["MCP Client"]
    C["MCP Server (Dremio MCP Server)"]
    D["Query Engine (Dremio)"]
    E["Iceberg Tables (via Apache Polaris)"]
    A --> B --> C --> D --> E

Claude Desktop MCP settings for Dremio (the host and token values are placeholders):
{
  "mcpServers": {
    "dremio": {
      "command": "uvx",
      "args": ["dremio-mcp"],
      "env": {
        "DREMIO_BASE_URL": "https://your-dremio-host",
        "DREMIO_TOKEN": "your-pat-token"
      }
    }
  }
}

Write Safety: The Write-Audit-Publish (WAP) Pattern

flowchart TD
    A["Agent computes result"] --> B["Write to staging branch"]
    B --> C{"Automated validation: Row count plausible? No nulls in key columns? Schema unchanged?"}
    C -->|"Pass"| D["Fast-forward main branch: Production table updated"]
    C -->|"Fail"| E["Drop staging branch, log failure, alert reviewer"]
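
The flow in the diagram can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not an Iceberg or Dremio implementation: branches are modeled as entries in a dict, the staged rows represent the full table contents after the agent's write, and the function names and the `max_growth` threshold are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def audit(staged: List[Row], production: List[Row],
          key_columns: List[str], max_growth: float = 2.0) -> List[str]:
    """Return the names of failed checks; an empty list means the audit passed."""
    failures: List[str] = []
    # Row count plausible: not empty and not more than max_growth times production.
    if not staged or (production and len(staged) > max_growth * len(production)):
        failures.append("row count implausible")
    # No nulls in key columns (a missing key column also counts as null).
    if any(row.get(col) is None for row in staged for col in key_columns):
        failures.append("null in key column")
    # Schema unchanged: every staged row has exactly the production column set.
    if production:
        expected = set(production[0])
        if any(set(row) != expected for row in staged):
            failures.append("schema changed")
    return failures


def write_audit_publish(branches: Dict[str, List[Row]], staged_rows: List[Row],
                        key_columns: List[str]) -> Tuple[bool, List[str]]:
    """Write to a staging branch, audit it, then publish or drop it."""
    branches["staging"] = staged_rows                                  # write
    failures = audit(staged_rows, branches["main"], key_columns)       # audit
    if failures:
        del branches["staging"]                                        # drop on failure
        return False, failures
    branches["main"] = branches.pop("staging")                         # publish
    return True, []


main_rows: List[Row] = [{"id": 1, "rev": 10.0}, {"id": 2, "rev": 20.0}]
branches: Dict[str, List[Row]] = {"main": list(main_rows)}

good = main_rows + [{"id": 3, "rev": 5.0}]
published, good_failures = write_audit_publish(branches, good, ["id"])

bad = good + [{"id": None, "rev": 1.0}]
rejected, bad_failures = write_audit_publish(branches, bad, ["id"])
```

In this sketch a failed audit leaves the main entry exactly as it was, which is the property the pattern exists to guarantee; logging and alerting a reviewer would hang off the returned failure list.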

A Reference Stack

Layer	Component	What it provides to agents
Storage	S3 / GCS / ADLS	Cheap durable object storage
Table format	Apache Iceberg	ACID, time travel, branching, snapshot isolation
Catalog	Apache Polaris	RBAC, credential vending, multi-engine access
Query engine	Dremio	SQL execution + AI Semantic Layer
Semantic layer	Dremio Virtual Datasets	Business context, documented metrics, filter rules
Agent interface	Dremio MCP Server	MCP resources and tools for any LLM client
Agent framework	LangChain / custom / Claude	The agent loop itself