πŸŒ‰ AI Gateway Governance


Mosaic AI Gateway

Centralized governance, monitoring, and production readiness for all your AI model serving endpoints. Control access, enforce policies, and gain observability across Databricks and external LLMs.

Run, secure, and govern AI traffic to democratize and accelerate AI adoption


What is AI Gateway?

Mosaic AI Gateway is Databricks' centralized service for governing and monitoring access to generative AI models and their serving endpoints.

Key Capabilities:
Governance, monitoring, and production readiness for model serving endpointsβ€”whether serving Databricks-hosted models, external LLMs, or your own custom models/agents.

All telemetry data flows into Delta tables in Unity Catalog, enabling SQL queries, notebooks, dashboards, and alerts using the full Databricks platform.

AI Gateway Features

AI Gateway provides a comprehensive set of governance and monitoring features:

  • Permission & Rate Limiting β€” Control who has access and how much
  • Payload Logging β€” Monitor requests/responses via inference tables
  • Usage Tracking β€” Monitor operational usage and costs via system tables
  • AI Guardrails β€” Prevent unsafe/harmful content (Public Preview)
  • Fallbacks β€” Minimize outages with automatic model failover
  • Traffic Splitting β€” Load balance traffic across models
πŸ’° Paid features: Payload logging, usage tracking. Free features: Permissions, rate limiting, fallbacks, traffic splitting.

Which Endpoints Support AI Gateway?

AI Gateway can be configured on various Model Serving endpoint types, with different feature availability:

  • External Models β€” Full support (OpenAI, Anthropic, Cohere, Bedrock, Vertex)
  • Foundation Model APIs (PT) β€” Provisioned throughput endpoints
  • Foundation Model APIs (Pay-per-token) β€” On-demand endpoints
  • Deployed AI Agents β€” Payload logging supported
  • Custom Model Endpoints — Most features supported, except guardrails and fallbacks

External Models via AI Gateway

External models are third-party LLMs hosted outside Databricks. AI Gateway provides a unified interface for managing multiple providers:

  • OpenAI β€” GPT-4, GPT-4o, embeddings
  • Anthropic β€” Claude models
  • Cohere β€” Command, embeddings
  • Amazon Bedrock β€” Claude, Titan, Llama
  • Google Cloud Vertex AI β€” Gemini, PaLM
  • Azure OpenAI β€” With Entra ID support
  • Custom Providers β€” Any OpenAI-compatible endpoint
Centralized Credential Management: API keys stored securely in one location, never exposed in code or to end users.
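
The shape of an external-model endpoint is easiest to see as a config payload. Below is a rough sketch that creates an OpenAI-backed endpoint over the REST API with Python requests; the workspace URL, token, endpoint name, and secret scope are placeholders, and the exact config schema should be checked against the Model Serving API reference.

import os
import requests

# Placeholders: your workspace URL and a personal access token
host = "https://<workspace>"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Sketch of an external-model endpoint config: the provider API key is
# referenced from a Databricks secret, never embedded in code
payload = {
  "name": "my-gpt4o-endpoint",
  "config": {
    "served_entities": [{
      "name": "gpt-4o",
      "external_model": {
        "name": "gpt-4o",
        "provider": "openai",
        "task": "llm/v1/chat",
        "openai_config": {"openai_api_key": "{{secrets/my_scope/openai_key}}"},
      },
    }]
  },
}

resp = requests.post(f"{host}/api/2.0/serving-endpoints", headers=headers, json=payload)
resp.raise_for_status()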

Rate Limiting

Control request volume with Queries Per Minute (QPM) or Tokens Per Minute (TPM) limits at multiple levels:

  • Endpoint Level — Global maximum for all traffic (e.g., 10,000 TPM)
  • User Default — Per-user limit applied to every user (e.g., 1,000 TPM)
  • Custom User/SP — Specific limits for individual users or service principals that override the default (e.g., 5,000 TPM)
  • User Groups — Shared limit across group members (e.g., 3,000 TPM total)
If both QPM and TPM are specified, the more restrictive limit is enforced. Max 20 rate limits per endpoint.
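
Rate limits are attached to an existing endpoint through its AI Gateway configuration. The sketch below assumes the PUT /api/2.0/serving-endpoints/{name}/ai-gateway route and a rate_limits list keyed by principal; treat the field names as assumptions to verify against the Serving API reference.

import os
import requests

host = "https://<workspace>"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Sketch: one endpoint-wide limit plus a per-user default (queries per minute)
gateway_config = {
  "rate_limits": [
    {"calls": 500, "key": "endpoint", "renewal_period": "minute"},
    {"calls": 50, "key": "user", "renewal_period": "minute"},
  ]
}

resp = requests.put(
  f"{host}/api/2.0/serving-endpoints/my-agent-endpoint/ai-gateway",
  headers=headers,
  json=gateway_config,
)
resp.raise_for_status()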

AI Guardrails

Enforce data compliance and block harmful content at the endpoint level:

  • Safety Filtering β€” Block violent, self-harm, hate speech content (powered by Meta Llama Guard 2)
  • PII Detection β€” Detect or mask sensitive data:
    • Credit card numbers
    • Email addresses
    • Phone numbers (US)
    • Bank account numbers
    • Social security numbers
PII Options: Block (reject request), Mask (redact sensitive data), or None
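
Guardrails live in the same AI Gateway configuration. A minimal sketch of the payload only (PUT it to the ai-gateway route exactly as in the rate-limit example above); the guardrails, safety, and pii field names are assumptions to verify before relying on them.

# Sketch of a guardrails block for the AI Gateway config
guardrails_config = {
  "guardrails": {
    "input": {                        # checks applied to incoming prompts
      "safety": True,                 # block violent, self-harm, hate speech content
      "pii": {"behavior": "MASK"},    # MASK, BLOCK, or NONE
    },
    "output": {                       # checks applied to model responses
      "pii": {"behavior": "BLOCK"},
    },
  }
}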

Traffic Splitting & Fallbacks

Load balance and ensure high availability for external model endpoints:

Traffic Splitting: Route percentages of traffic to different models. Useful for A/B testing, gradual rollouts, or cost optimization.

Fallbacks automatically redirect on errors:

  • Triggered on 429 (rate limit) or 5XX errors
  • Falls back in order of served entities
  • Set 0% traffic for fallback-only models
  • Maximum of 2 fallback models
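
Traffic splitting is expressed in the endpoint's own config (each served entity gets a traffic percentage), while fallbacks are enabled in the AI Gateway config. The sketch below mirrors the 60/30/10 split pictured on the original page; traffic_config, routes, and fallback_config are assumed field names to verify against the API reference, and the external_model blocks are left empty for brevity.

# Endpoint config fragment: three served models with a 60/30/10 split.
# A route you want used only as a fallback would get traffic_percentage 0.
endpoint_config = {
  "served_entities": [
    {"name": "gpt-4o-primary", "external_model": {}},     # provider config omitted
    {"name": "claude-secondary", "external_model": {}},   # provider config omitted
    {"name": "backup-model", "external_model": {}},       # provider config omitted
  ],
  "traffic_config": {
    "routes": [
      {"served_model_name": "gpt-4o-primary", "traffic_percentage": 60},
      {"served_model_name": "claude-secondary", "traffic_percentage": 30},
      {"served_model_name": "backup-model", "traffic_percentage": 10},
    ]
  },
}

# AI Gateway fragment: turn on automatic failover between served entities
gateway_config = {"fallback_config": {"enabled": True}}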

Inference Tables (Payload Logging)

Automatically log all requests and responses to Unity Catalog Delta tables:

  • request β€” Raw JSON request body
  • response β€” Raw JSON response
  • status_code β€” HTTP status
  • execution_duration_ms β€” Model inference time
  • databricks_request_id β€” Unique request ID
  • requester β€” User or SP who made the call
For AI agents, additional tables capture MLflow traces, assessment logs, and formatted request/response logs.
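
Because payloads land in a Delta table, inspecting them is an ordinary query. A sketch from a Databricks notebook, assuming the inference table was created as main.ai.my_endpoint_payload (catalog, schema, and table name are whatever you configured):

# Recent failed requests with their latency, straight from the inference table
failures = spark.sql("""
  SELECT request_time,
         databricks_request_id,
         requester,
         status_code,
         execution_duration_ms
  FROM main.ai.my_endpoint_payload
  WHERE status_code >= 400
  ORDER BY request_time DESC
  LIMIT 100
""")
failures.show(truncate=False)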

Usage Tracking (System Tables)

Monitor operational metrics and costs via system tables:

  • system.serving.endpoint_usage — Per-request token counts (input_token_count, output_token_count), request times, status codes, and usage_context
  • system.serving.served_entities — Endpoint and entity metadata: endpoint name, entity type, model configuration, task type, provider info
Cost Attribution: Pass a usage_context parameter (e.g., {"project": "proj1", "end_user": "abc123"}) with each request to attribute usage to end users or projects for chargeback.
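
usage_context is extra key-value metadata sent with the request. With the Databricks OpenAI client, one way to attach it is through extra_body, as sketched below (endpoint name and context keys are placeholders):

from databricks.sdk import WorkspaceClient

client = WorkspaceClient().serving_endpoints.get_open_ai_client()

# Tag the request so endpoint_usage records which project/user to bill
response = client.chat.completions.create(
  model="my-llm-endpoint",
  messages=[{"role": "user", "content": "..."}],
  extra_body={"usage_context": {"project": "proj1", "end_user": "abc123"}},
)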

Join these tables with inference tables for complete observability. Create dashboards, set alerts, and optimize model performance.
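
A sketch of the chargeback query this enables, run from a notebook and assuming usage_context is stored as a map column:

# Token consumption per project over the last 30 days
usage = spark.sql("""
  SELECT usage_context['project'] AS project,
         SUM(input_token_count)   AS input_tokens,
         SUM(output_token_count)  AS output_tokens,
         COUNT(*)                 AS requests
  FROM system.serving.endpoint_usage
  WHERE request_time >= date_sub(current_date(), 30)
  GROUP BY 1
  ORDER BY output_tokens DESC
""")
usage.show()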

Querying Agent Endpoints (API)

Deployed agents (Agent Bricks) are accessible as Model Serving endpoints with multiple query methods:

  • Databricks OpenAI Client β€” Recommended for new apps, native integration
  • MLflow Deployments Client β€” For existing MLflow workflows
  • REST API β€” OpenAI-compatible, language-agnostic
  • AI Functions (ai_query) β€” Query from SQL
ResponsesAgent vs ChatAgent: Use responses.create() for new ResponsesAgent, chat.completions.create() for legacy ChatAgent.

1. Databricks OpenAI Client

Recommended for new applications β€” native SDK integration with streaming support:

from databricks.sdk import WorkspaceClient

# Create an OpenAI-compatible client authenticated against your workspace
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Query a ResponsesAgent endpoint with streaming enabled
response = client.responses.create(
  model="my-agent-endpoint",
  input=[{"role": "user", "content": "..."}],
  stream=True
)

Pass custom_inputs and databricks_options via extra_body for additional parameters like return_trace.
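
For example, reusing the client from the snippet above (the custom_inputs key shown is hypothetical; an agent defines its own):

response = client.responses.create(
  model="my-agent-endpoint",
  input=[{"role": "user", "content": "..."}],
  extra_body={
    "custom_inputs": {"user_tier": "premium"},      # agent-defined input (hypothetical key)
    "databricks_options": {"return_trace": True},   # also return the MLflow trace
  },
)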

2. MLflow Deployments Client

Best for existing MLflow workflows and experiment tracking integration:

from mlflow.deployments import get_deploy_client

# Connect to the Databricks deployments target
client = get_deploy_client("databricks")

# Query the agent endpoint
response = client.predict(
  endpoint="my-agent-endpoint",
  inputs={
    "messages": [{"role": "user", "content": "..."}]
  }
)
πŸ’‘ MLflow client integrates with experiment tracking for logging predictions and model versions.

3. REST API (OpenAI-compatible)

Language-agnostic β€” use from any HTTP client (curl, requests, fetch):

POST /serving-endpoints/{endpoint}/invocations

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "..."}]}' \
  https://<workspace>/serving-endpoints/my-agent/invocations
πŸ”— OpenAI-compatible format β€” easily swap between Databricks agents and OpenAI models.

4. AI Functions (ai_query)

Query agents directly from SQL β€” great for data pipelines and notebooks:

SELECT ai_query(
  'my-agent-endpoint',
  'What is the summary of this document?'
) AS response

-- With structured input
SELECT ai_query(
  'my-agent-endpoint',
  named_struct('messages', array(
    named_struct('role', 'user', 'content', prompt_col)
  ))
) FROM my_table
πŸ“Š Process entire tables with AI β€” apply LLM to every row in a DataFrame.

AI Gateway in the Architecture

AI Gateway sits at the center of your AI infrastructure, providing:

  • Unified Interface β€” Single endpoint for multiple LLM providers
  • Centralized Governance β€” Consistent policies across all AI traffic
  • Observability β€” Complete audit trail in Unity Catalog
  • Production Readiness β€” Rate limits, fallbacks, guardrails
Best Practice: Route ALL external LLM traffic through AI Gateway to unify governance, tracking, and cost management.

[Diagram: external apps, agents, and notebooks send requests through Mosaic AI Gateway (rate limits, guardrails, payload logging, usage tracking) to Foundation, External, and Custom model serving endpoints, with all telemetry landing in Unity Catalog inference and system tables.]

Feature Support by Endpoint Type

Feature           | External | FM (PT) | FM (Pay-per-token) | Agents | Custom
------------------|----------|---------|--------------------|--------|-------
Rate Limiting     | ✓        | ✓       | ✓                  | —      | ✓
Payload Logging   | ✓        | ✓       | ✓                  | ✓      | ✓
Usage Tracking    | ✓        | ✓       | ✓                  | —      | ✓
AI Guardrails     | ✓        | ✓       | ✓                  | —      | —
Fallbacks         | ✓        | —       | —                  | —      | —
Traffic Splitting | ✓        | ✓       | —                  | —      | ✓
πŸ“ AI Gateway Inference Table Schema
databricks_request_id
STRING
Unique request ID
request
STRING
Raw JSON request
response
STRING
Raw JSON response
status_code
INT
HTTP status code
execution_duration_ms
BIGINT
Inference time
requester
STRING
User or SP ID
request_time
TIMESTAMP
When received
πŸ“Š endpoint_usage
β€’ input_token_count
β€’ output_token_count
β€’ request_time
β€’ status_code
β€’ usage_context
πŸ”§ served_entities
β€’ endpoint_name
β€’ entity_type
β€’ model config
β€’ task type
β€’ provider info
Cost Attribution with usage_context
{"project": "proj1", "end_user": "abc123"}

Quick Reference

πŸŒ‰ What is AI Gateway?

  • Centralized governance for AI model endpoints
  • Rate limiting, guardrails, logging, tracking
  • Unified interface for all LLM providers
  • All data logged to Unity Catalog Delta tables

πŸ”§ Supported Endpoints

  • External Models β€” Full feature support
  • Foundation Models β€” PT and pay-per-token
  • Custom Models β€” Most features
  • Deployed Agents β€” Payload logging

πŸ›‘οΈ Governance Features

  • Rate Limiting β€” QPM/TPM at multiple levels
  • AI Guardrails β€” Safety + PII detection
  • Traffic Splitting β€” Load balance across models
  • Fallbacks β€” Auto failover on errors

πŸ“Š Observability

  • Inference Tables β€” Request/response logging
  • System Tables β€” Usage and cost tracking
  • usage_context β€” Custom cost attribution
  • SQL queries, dashboards, alerts

πŸ”— External Providers

  • OpenAI, Anthropic, Cohere
  • Amazon Bedrock, Google Vertex AI
  • Azure OpenAI (with Entra ID)
  • Custom OpenAI-compatible providers

πŸ“‘ Query Methods

  • Databricks OpenAI Client β€” Recommended
  • MLflow Deployments β€” MLflow workflows
  • REST API β€” OpenAI-compatible
  • ai_query() β€” SQL-based