πŸŒ‰ AI Gateway Governance


Mosaic AI Gateway

Centralized governance, monitoring, and production readiness for all your AI model serving endpoints. Control access, enforce policies, and gain observability across Databricks and external LLMs.

Run, secure, and govern AI traffic to democratize and accelerate AI adoption


What is AI Gateway?

Mosaic AI Gateway is Databricks' centralized service for governing and monitoring access to generative AI models and their serving endpoints.

Key Capabilities:
Governance, monitoring, and production readiness for model serving endpointsβ€”whether serving Databricks-hosted models, external LLMs, or your own custom models/agents.

All telemetry data flows into Delta tables in Unity Catalog, enabling SQL queries, notebooks, dashboards, and alerts using the full Databricks platform.

AI Gateway Features

AI Gateway provides a comprehensive set of governance and monitoring features:

  • Permission & Rate Limiting β€” Control who has access and how much
  • Payload Logging β€” Monitor requests/responses via inference tables
  • Usage Tracking β€” Monitor operational usage and costs via system tables
  • AI Guardrails β€” Prevent unsafe/harmful content (Public Preview)
  • Fallbacks β€” Minimize outages with automatic model failover
  • Traffic Splitting β€” Load balance traffic across models
πŸ’° Paid features: Payload logging, usage tracking. Free features: Permissions, rate limiting, fallbacks, traffic splitting.

Which Endpoints Support AI Gateway?

AI Gateway can be configured on various Model Serving endpoint types, with different feature availability:

  • External Models β€” Full support (OpenAI, Anthropic, Cohere, Bedrock, Vertex)
  • Foundation Model APIs (PT) β€” Provisioned throughput endpoints
  • Foundation Model APIs (Pay-per-token) β€” On-demand endpoints
  • Deployed AI Agents β€” Payload logging supported
  • Custom Model Endpoints — Most features supported, except guardrails and fallbacks

External Models via AI Gateway

External models are third-party LLMs hosted outside Databricks. AI Gateway provides a unified interface for managing multiple providers:

  • OpenAI β€” GPT-4, GPT-4o, embeddings
  • Anthropic β€” Claude models
  • Cohere β€” Command, embeddings
  • Amazon Bedrock β€” Claude, Titan, Llama
  • Google Cloud Vertex AI β€” Gemini, PaLM
  • Azure OpenAI β€” With Entra ID support
  • Custom Providers β€” Any OpenAI-compatible endpoint
Centralized Credential Management: API keys stored securely in one location, never exposed in code or to end users.
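
The shape of an external-model endpoint is easiest to see as a config payload. Below is a rough sketch that creates an OpenAI-backed endpoint over the REST API with Python requests; the workspace URL, token, endpoint name, and secret scope are placeholders, and the exact config schema should be checked against the Model Serving API reference.

import os
import requests

# Placeholders: your workspace URL and a personal access token
host = "https://<workspace>"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Sketch of an external-model endpoint config: the provider API key is
# referenced from a Databricks secret, never embedded in code
payload = {
  "name": "my-gpt4o-endpoint",
  "config": {
    "served_entities": [{
      "name": "gpt-4o",
      "external_model": {
        "name": "gpt-4o",
        "provider": "openai",
        "task": "llm/v1/chat",
        "openai_config": {"openai_api_key": "{{secrets/my_scope/openai_key}}"},
      },
    }]
  },
}

resp = requests.post(f"{host}/api/2.0/serving-endpoints", headers=headers, json=payload)
resp.raise_for_status()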

Rate Limiting

Control request volume with Queries Per Minute (QPM) or Tokens Per Minute (TPM) limits at multiple levels:

  • Endpoint Level — Global maximum for all traffic (e.g., 10,000 TPM)
  • User Default — Per-user limit applied to every user (e.g., 1,000 TPM)
  • Custom User/SP — Specific limits for individual users or service principals that override the default (e.g., 5,000 TPM)
  • User Groups — Shared limit across group members (e.g., 3,000 TPM total)
If both QPM and TPM are specified, the more restrictive limit is enforced. Max 20 rate limits per endpoint.
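
Rate limits are attached to an existing endpoint through its AI Gateway configuration. The sketch below assumes the PUT /api/2.0/serving-endpoints/{name}/ai-gateway route and a rate_limits list keyed by principal; treat the field names as assumptions to verify against the Serving API reference.

import os
import requests

host = "https://<workspace>"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Sketch: one endpoint-wide limit plus a per-user default (queries per minute)
gateway_config = {
  "rate_limits": [
    {"calls": 500, "key": "endpoint", "renewal_period": "minute"},
    {"calls": 50, "key": "user", "renewal_period": "minute"},
  ]
}

resp = requests.put(
  f"{host}/api/2.0/serving-endpoints/my-agent-endpoint/ai-gateway",
  headers=headers,
  json=gateway_config,
)
resp.raise_for_status()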

AI Guardrails

Enforce data compliance and block harmful content at the endpoint level:

  • Safety Filtering β€” Block violent, self-harm, hate speech content (powered by Meta Llama Guard 2)
  • PII Detection β€” Detect or mask sensitive data:
    • Credit card numbers
    • Email addresses
    • Phone numbers (US)
    • Bank account numbers
    • Social security numbers
PII Options: Block (reject request), Mask (redact sensitive data), or None
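
Guardrails live in the same AI Gateway configuration. A minimal sketch of the payload only (PUT it to the ai-gateway route exactly as in the rate-limit example above); the guardrails, safety, and pii field names are assumptions to verify before relying on them.

# Sketch of a guardrails block for the AI Gateway config
guardrails_config = {
  "guardrails": {
    "input": {                        # checks applied to incoming prompts
      "safety": True,                 # block violent, self-harm, hate speech content
      "pii": {"behavior": "MASK"},    # MASK, BLOCK, or NONE
    },
    "output": {                       # checks applied to model responses
      "pii": {"behavior": "BLOCK"},
    },
  }
}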

Traffic Splitting & Fallbacks

Load balance and ensure high availability for external model endpoints:

Traffic Splitting: Route percentages of traffic to different models. Useful for A/B testing, gradual rollouts, or cost optimization.

Fallbacks automatically redirect on errors:

  • Triggered on 429 (rate limit) or 5XX errors
  • Falls back in order of served entities
  • Set 0% traffic for fallback-only models
  • Maximum of 2 fallback models
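
Traffic splitting is expressed in the endpoint's own config (each served entity gets a traffic percentage), while fallbacks are enabled in the AI Gateway config. The sketch below mirrors the 60/30/10 split pictured on the original page; traffic_config, routes, and fallback_config are assumed field names to verify against the API reference, and the external_model blocks are left empty for brevity.

# Endpoint config fragment: three served models with a 60/30/10 split.
# A route you want used only as a fallback would get traffic_percentage 0.
endpoint_config = {
  "served_entities": [
    {"name": "gpt-4o-primary", "external_model": {}},     # provider config omitted
    {"name": "claude-secondary", "external_model": {}},   # provider config omitted
    {"name": "backup-model", "external_model": {}},       # provider config omitted
  ],
  "traffic_config": {
    "routes": [
      {"served_model_name": "gpt-4o-primary", "traffic_percentage": 60},
      {"served_model_name": "claude-secondary", "traffic_percentage": 30},
      {"served_model_name": "backup-model", "traffic_percentage": 10},
    ]
  },
}

# AI Gateway fragment: turn on automatic failover between served entities
gateway_config = {"fallback_config": {"enabled": True}}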

Inference Tables (Payload Logging)

Automatically log all requests and responses to Unity Catalog Delta tables:

  • request β€” Raw JSON request body
  • response β€” Raw JSON response
  • status_code β€” HTTP status
  • execution_duration_ms β€” Model inference time
  • databricks_request_id β€” Unique request ID
  • requester β€” User or SP who made the call
For AI agents, additional tables capture MLflow traces, assessment logs, and formatted request/response logs.
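
Because payloads land in a Delta table, inspecting them is an ordinary query. A sketch from a Databricks notebook, assuming the inference table was created as main.ai.my_endpoint_payload (catalog, schema, and table name are whatever you configured):

# Recent failed requests with their latency, straight from the inference table
failures = spark.sql("""
  SELECT request_time,
         databricks_request_id,
         requester,
         status_code,
         execution_duration_ms
  FROM main.ai.my_endpoint_payload
  WHERE status_code >= 400
  ORDER BY request_time DESC
  LIMIT 100
""")
failures.show(truncate=False)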

Usage Tracking (System Tables)

Monitor operational metrics and costs via system tables:

  • system.serving.endpoint_usage — Per-request token counts (input_token_count, output_token_count), request times, status codes, and usage_context
  • system.serving.served_entities — Endpoint and entity metadata: endpoint name, entity type, model configuration, task type, provider info
Cost Attribution: Pass a usage_context parameter (e.g., {"project": "proj1", "end_user": "abc123"}) with each request to attribute usage to end users or projects for chargeback.
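
usage_context is extra key-value metadata sent with the request. With the Databricks OpenAI client, one way to attach it is through extra_body, as sketched below (endpoint name and context keys are placeholders):

from databricks.sdk import WorkspaceClient

client = WorkspaceClient().serving_endpoints.get_open_ai_client()

# Tag the request so endpoint_usage records which project/user to bill
response = client.chat.completions.create(
  model="my-llm-endpoint",
  messages=[{"role": "user", "content": "..."}],
  extra_body={"usage_context": {"project": "proj1", "end_user": "abc123"}},
)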

Join these tables with inference tables for complete observability. Create dashboards, set alerts, and optimize model performance.
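
A sketch of the chargeback query this enables, run from a notebook and assuming usage_context is stored as a map column:

# Token consumption per project over the last 30 days
usage = spark.sql("""
  SELECT usage_context['project'] AS project,
         SUM(input_token_count)   AS input_tokens,
         SUM(output_token_count)  AS output_tokens,
         COUNT(*)                 AS requests
  FROM system.serving.endpoint_usage
  WHERE request_time >= date_sub(current_date(), 30)
  GROUP BY 1
  ORDER BY output_tokens DESC
""")
usage.show()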

Querying Agent Endpoints (API)

Deployed agents (Agent Bricks) are accessible as Model Serving endpoints with multiple query methods:

  • Databricks OpenAI Client β€” Recommended for new apps, native integration
  • MLflow Deployments Client β€” For existing MLflow workflows
  • REST API β€” OpenAI-compatible, language-agnostic
  • AI Functions (ai_query) β€” Query from SQL
ResponsesAgent vs ChatAgent: Use responses.create() for new ResponsesAgent, chat.completions.create() for legacy ChatAgent.

1. Databricks OpenAI Client

Recommended for new applications β€” native SDK integration with streaming support:

from databricks.sdk import WorkspaceClient

# Create an OpenAI-compatible client authenticated against your workspace
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Query a ResponsesAgent endpoint with streaming enabled
response = client.responses.create(
  model="my-agent-endpoint",
  input=[{"role": "user", "content": "..."}],
  stream=True
)

Pass custom_inputs and databricks_options via extra_body for additional parameters like return_trace.
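
For example, reusing the client from the snippet above (the custom_inputs key shown is hypothetical; an agent defines its own):

response = client.responses.create(
  model="my-agent-endpoint",
  input=[{"role": "user", "content": "..."}],
  extra_body={
    "custom_inputs": {"user_tier": "premium"},      # agent-defined input (hypothetical key)
    "databricks_options": {"return_trace": True},   # also return the MLflow trace
  },
)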

2. MLflow Deployments Client

Best for existing MLflow workflows and experiment tracking integration:

from mlflow.deployments import get_deploy_client

# Connect to the Databricks deployments target
client = get_deploy_client("databricks")

# Query the agent endpoint
response = client.predict(
  endpoint="my-agent-endpoint",
  inputs={
    "messages": [{"role": "user", "content": "..."}]
  }
)
πŸ’‘ MLflow client integrates with experiment tracking for logging predictions and model versions.

3. REST API (OpenAI-compatible)

Language-agnostic β€” use from any HTTP client (curl, requests, fetch):

POST /serving-endpoints/{endpoint}/invocations

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "..."}]}' \
  https://<workspace>/serving-endpoints/my-agent/invocations
πŸ”— OpenAI-compatible format β€” easily swap between Databricks agents and OpenAI models.

4. AI Functions (ai_query)

Query agents directly from SQL β€” great for data pipelines and notebooks:

SELECT ai_query(
  'my-agent-endpoint',
  'What is the summary of this document?'
) AS response

-- With structured input
SELECT ai_query(
  'my-agent-endpoint',
  named_struct('messages', array(
    named_struct('role', 'user', 'content', prompt_col)
  ))
) FROM my_table
πŸ“Š Process entire tables with AI β€” apply LLM to every row in a DataFrame.

AI Gateway in the Architecture

AI Gateway sits at the center of your AI infrastructure, providing:

  • Unified Interface β€” Single endpoint for multiple LLM providers
  • Centralized Governance β€” Consistent policies across all AI traffic
  • Observability β€” Complete audit trail in Unity Catalog
  • Production Readiness β€” Rate limits, fallbacks, guardrails
Best Practice: Route ALL external LLM traffic through AI Gateway to unify governance, tracking, and cost management.

[Diagram: external apps, agents, and notebooks send requests through Mosaic AI Gateway (rate limits, guardrails, payload logging, usage tracking) to Foundation, External, and Custom model serving endpoints, with all telemetry landing in Unity Catalog inference and system tables.]

Feature Support by Endpoint Type

Feature           | External | FM (PT) | FM (Pay-per-token) | Agents | Custom
------------------|----------|---------|--------------------|--------|-------
Rate Limiting     | ✓        | ✓       | ✓                  | —      | ✓
Payload Logging   | ✓        | ✓       | ✓                  | ✓      | ✓
Usage Tracking    | ✓        | ✓       | ✓                  | —      | ✓
AI Guardrails     | ✓        | ✓       | ✓                  | —      | —
Fallbacks         | ✓        | —       | —                  | —      | —
Traffic Splitting | ✓        | ✓       | —                  | —      | ✓
πŸ“ AI Gateway Inference Table Schema
databricks_request_id
STRING
Unique request ID
request
STRING
Raw JSON request
response
STRING
Raw JSON response
status_code
INT
HTTP status code
execution_duration_ms
BIGINT
Inference time
requester
STRING
User or SP ID
request_time
TIMESTAMP
When received
πŸ“Š endpoint_usage
β€’ input_token_count
β€’ output_token_count
β€’ request_time
β€’ status_code
β€’ usage_context
πŸ”§ served_entities
β€’ endpoint_name
β€’ entity_type
β€’ model config
β€’ task type
β€’ provider info
Cost Attribution with usage_context
{"project": "proj1", "end_user": "abc123"}

Quick Reference

πŸŒ‰ What is AI Gateway?

  • Centralized governance for AI model endpoints
  • Rate limiting, guardrails, logging, tracking
  • Unified interface for all LLM providers
  • All data logged to Unity Catalog Delta tables

πŸ”§ Supported Endpoints

  • External Models β€” Full feature support
  • Foundation Models β€” PT and pay-per-token
  • Custom Models β€” Most features
  • Deployed Agents β€” Payload logging

πŸ›‘οΈ Governance Features

  • Rate Limiting β€” QPM/TPM at multiple levels
  • AI Guardrails β€” Safety + PII detection
  • Traffic Splitting β€” Load balance across models
  • Fallbacks β€” Auto failover on errors

πŸ“Š Observability

  • Inference Tables β€” Request/response logging
  • System Tables β€” Usage and cost tracking
  • usage_context β€” Custom cost attribution
  • SQL queries, dashboards, alerts

πŸ”— External Providers

  • OpenAI, Anthropic, Cohere
  • Amazon Bedrock, Google Vertex AI
  • Azure OpenAI (with Entra ID)
  • Custom OpenAI-compatible providers

πŸ“‘ Query Methods

  • Databricks OpenAI Client β€” Recommended
  • MLflow Deployments β€” MLflow workflows
  • REST API β€” OpenAI-compatible
  • ai_query() β€” SQL-based