Centralized governance, monitoring, and production readiness for all your AI model serving endpoints. Control access, enforce policies, and gain observability across Databricks and external LLMs.
Run, secure, and govern AI traffic to democratize and accelerate AI adoption
Mosaic AI Gateway is Databricks' centralized service for governing and monitoring access to generative AI models and their serving endpoints.
All telemetry data flows into Delta tables in Unity Catalog, enabling SQL queries, notebooks, dashboards, and alerts using the full Databricks platform.
AI Gateway provides a comprehensive set of governance and monitoring features, covering rate limiting, AI guardrails, traffic routing with fallbacks, payload logging, and usage tracking.
AI Gateway can be configured on various Model Serving endpoint types; feature availability differs by endpoint type.
External models are third-party LLMs hosted outside Databricks. AI Gateway provides a unified interface for managing multiple providers, such as OpenAI and Anthropic.
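As a minimal sketch, assuming a workspace URL and personal access token in environment variables, an external model endpoint can be created through the Serving REST API. The endpoint name, secret scope, and model choice below are illustrative:

```python
# Sketch: create an external model serving endpoint for an OpenAI model.
# Endpoint name, secret scope, and model are illustrative examples.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

payload = {
    "name": "my-openai-endpoint",  # hypothetical endpoint name
    "config": {
        "served_entities": [
            {
                "name": "gpt-4o",
                "external_model": {
                    "provider": "openai",
                    "name": "gpt-4o",
                    "task": "llm/v1/chat",
                    # Reference a Databricks secret instead of pasting a raw key
                    "openai_config": {"openai_api_key": "{{secrets/my_scope/openai_key}}"},
                },
            }
        ]
    },
}

resp = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
```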
Control request volume with Queries Per Minute (QPM) or Tokens Per Minute (TPM) limits at multiple levels, such as per user and per endpoint; a configuration sketch follows below.
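A hedged sketch of attaching QPM limits via the AI Gateway REST API; the endpoint name is carried over from the example above, and the exact payload shape should be verified against the API reference:

```python
# Sketch: set per-user and per-endpoint QPM limits on an endpoint's
# AI Gateway configuration. Field names follow the public ai-gateway
# REST API; verify against your workspace's API version.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/my-openai-endpoint/ai-gateway",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "rate_limits": [
            {"calls": 100, "key": "user", "renewal_period": "minute"},       # per-user QPM
            {"calls": 1000, "key": "endpoint", "renewal_period": "minute"},  # endpoint-wide QPM
        ]
    },
)
resp.raise_for_status()
```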
Enforce data compliance and block harmful content at the endpoint level:
Each guardrail can be set to Block (reject the request), Mask (redact sensitive data), or None (take no action).
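A sketch of enabling guardrails on the same AI Gateway config; the field names mirror the documented guardrails schema (safety filtering plus PII behaviors), but treat the exact payload as an assumption to verify:

```python
# Sketch: mask PII on the way in, block PII on the way out, and enable
# the safety filter. Payload shape mirrors the ai-gateway guardrails
# schema but should be verified against the current API reference.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/my-openai-endpoint/ai-gateway",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "guardrails": {
            "input": {
                "safety": True,               # filter harmful prompts
                "pii": {"behavior": "MASK"},  # redact PII before it reaches the model
            },
            "output": {
                "pii": {"behavior": "BLOCK"}  # reject responses containing PII
            },
        }
    },
)
resp.raise_for_status()
```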
Load balance and ensure high availability for external model endpoints:
Traffic can be split across multiple served models, and fallbacks automatically redirect requests to another served model on 429 (rate limit) or 5XX errors, as sketched below.
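A hedged sketch: the endpoint config below splits traffic 90/10 between two external models and turns on automatic fallbacks. The model names, secret scopes, and the fallback_config field are assumptions to verify against the Serving API reference:

```python
# Sketch: 90/10 traffic split across two external models, plus automatic
# fallback on 429/5XX via the AI Gateway fallback_config. Names and the
# exact payload shape are illustrative assumptions.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

config = {
    "served_entities": [
        {
            "name": "primary-gpt",
            "external_model": {
                "provider": "openai",
                "name": "gpt-4o",
                "task": "llm/v1/chat",
                "openai_config": {"openai_api_key": "{{secrets/my_scope/openai_key}}"},
            },
        },
        {
            "name": "backup-claude",
            "external_model": {
                "provider": "anthropic",
                "name": "claude-3-5-sonnet-20241022",
                "task": "llm/v1/chat",
                "anthropic_config": {"anthropic_api_key": "{{secrets/my_scope/anthropic_key}}"},
            },
        },
    ],
    "traffic_config": {
        "routes": [
            {"served_model_name": "primary-gpt", "traffic_percentage": 90},
            {"served_model_name": "backup-claude", "traffic_percentage": 10},
        ]
    },
}

requests.put(
    f"{host}/api/2.0/serving-endpoints/my-openai-endpoint/config",
    headers=headers, json=config,
).raise_for_status()

# Redirect to the other served model when the primary returns 429/5XX.
requests.put(
    f"{host}/api/2.0/serving-endpoints/my-openai-endpoint/ai-gateway",
    headers=headers, json={"fallback_config": {"enabled": True}},
).raise_for_status()
```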
Automatically log all requests and responses to Unity Catalog Delta tables (inference tables). Key columns include:
- `request` – Raw JSON request body
- `response` – Raw JSON response
- `status_code` – HTTP status
- `execution_duration_ms` – Model inference time
- `databricks_request_id` – Unique request ID
- `requester` – User or service principal who made the call
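For example, a quick look at recent failures in a notebook, assuming the inference table was configured to land at the hypothetical `main.ai_logs.my_endpoint_payload`:

```python
# Sketch: inspect recent failed requests from an inference table in a
# Databricks notebook (where `spark` is predefined). The table name is
# a hypothetical example of a configured inference table location.
failures = spark.sql("""
    SELECT databricks_request_id,
           requester,
           status_code,
           execution_duration_ms,
           request:messages[0].content AS first_user_message
    FROM main.ai_logs.my_endpoint_payload
    WHERE status_code >= 400
    ORDER BY execution_duration_ms DESC
    LIMIT 20
""")
failures.show(truncate=False)
```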
Monitor operational metrics and costs via system tables:

- `system.serving.endpoint_usage` – Token counts, request times, status codes
- `system.serving.served_entities` – Model metadata, configurations

Pass the `usage_context` parameter with each request, e.g. `{"project": "proj1", "end_user": "abc123"}`, to track end-user or project-specific usage for chargeback.
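A sketch of tagging a request with `usage_context`; the keys are free-form and later surface in the usage system table. The endpoint name here is illustrative:

```python
# Sketch: attach a free-form usage_context to a request for chargeback.
# Endpoint name and context keys are illustrative.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/serving-endpoints/my-agent-endpoint/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [{"role": "user", "content": "Summarize this document."}],
        "usage_context": {"project": "proj1", "end_user": "abc123"},
    },
)
resp.raise_for_status()
```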
Join these tables with inference tables for complete observability. Create dashboards, set alerts, and optimize model performance.
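A sketch of such a join in a notebook, lining up token counts from the usage system table with raw payloads; column names are taken from the fields described above and should be verified against the current system table schema:

```python
# Sketch: join the usage system table to an inference table on
# databricks_request_id. The inference table name is hypothetical.
joined = spark.sql("""
    SELECT u.request_time,
           u.usage_context,
           u.input_token_count,
           u.output_token_count,
           p.execution_duration_ms,
           p.status_code
    FROM system.serving.endpoint_usage AS u
    JOIN main.ai_logs.my_endpoint_payload AS p
      ON u.databricks_request_id = p.databricks_request_id
    WHERE u.request_time >= current_date() - INTERVAL 7 DAYS
""")
display(joined)  # notebook display; use joined.show() outside notebooks
```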
Deployed agents (Agent Bricks) are accessible as Model Serving endpoints with multiple query methods:
Use `responses.create()` for the new `ResponsesAgent` interface and `chat.completions.create()` for the legacy `ChatAgent` interface.
OpenAI client – recommended for new applications, with native SDK integration and streaming support:
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

response = client.responses.create(
    model="my-agent-endpoint",
    input=[{"role": "user", "content": "..."}],
    stream=True,
)
```
Pass `custom_inputs` and `databricks_options` via `extra_body` for additional parameters like `return_trace`.
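For example, a sketch in which the `custom_inputs` keys are whatever your agent defines:

```python
# Sketch: pass agent-specific fields through extra_body. The
# custom_inputs keys are defined by your agent; return_trace asks the
# endpoint to include the MLflow trace in the response.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

response = client.responses.create(
    model="my-agent-endpoint",
    input=[{"role": "user", "content": "..."}],
    extra_body={
        "custom_inputs": {"user_id": "abc123"},        # hypothetical agent input
        "databricks_options": {"return_trace": True},
    },
)
```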
MLflow Deployments client – best for existing MLflow workflows and experiment tracking integration:
```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

response = client.predict(
    endpoint="my-agent-endpoint",
    inputs={"messages": [{"role": "user", "content": "..."}]},
)
```
REST API – language-agnostic; use it from any HTTP client (curl, requests, fetch):
```bash
# POST /serving-endpoints/{endpoint}/invocations
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "..."}]}' \
  https://<workspace>/serving-endpoints/my-agent/invocations
```
Query agents directly from SQL with `ai_query()` – great for data pipelines and notebooks:
```sql
SELECT ai_query(
  'my-agent-endpoint',
  'What is the summary of this document?'
) AS response;

-- With structured input
SELECT ai_query(
  'my-agent-endpoint',
  named_struct('messages', array(
    named_struct('role', 'user', 'content', prompt_col)
  ))
) FROM my_table;
```
AI Gateway sits at the center of your AI infrastructure, providing centralized governance, unified access to Databricks-hosted and external models, and end-to-end observability through Unity Catalog.
{"project": "proj1", "end_user": "abc123"}