Purpose: Detailed traffic flow diagrams for Azure Databricks deployment patterns
This document provides comprehensive traffic flow diagrams showing how Databricks clusters communicate within different network architectures. Understanding these flows is essential for network design, NSG configuration, security reviews, cost estimation, and troubleshooting.
┌──────────────┐
│ User / API │
└──────┬───────┘
│
│ 1. Create Cluster (HTTPS)
│ POST /api/2.0/clusters/create
↓
┌─────────────────────────────────────────────────────────────────┐
│ Databricks Control Plane (Public - Azure Region) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cluster Manager │ │
│ │ - Validates request │ │
│ │ - Allocates cluster ID │ │
│ │ - Initiates provisioning │ │
│ └─────────────────────────────────────────────────────────────┘ │
└────────┬────────────────────────────────────────────────────────┘
│
│ 2. Provision VMs in customer VNet
│ (Azure Resource Manager API)
↓
┌─────────────────────────────────────────────────────────────────┐
│ Customer VNet (VNet Injection) │
│ │
│ ┌──────────────────────────┐ ┌──────────────────────────┐ │
│ │ Driver Node VM │ │ Worker Node VMs │ │
│ │ (Public Subnet) │ │ (Private Subnet) │ │
│ │ - No Public IP (NPIP) │ │ - No Public IP (NPIP) │ │
│ └──────────┬───────────────┘ └────────┬─────────────────┘ │
│ │ │ │
│ │ 3. Establish secure tunnel │ │
│ │ to Control Plane │ │
│ │ (Outbound HTTPS) │ │
│ └────────────────────────────┘ │
│ │ │
│ │ Via NAT Gateway │
│ ↓ │
│ ┌─────────────────────┐ │
│ │ NAT Gateway │ │
│ │ IP: 203.0.113.45 │ │
│ └─────────┬────────────┘ │
└────────────────────────────┼────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
│ 4a. Heartbeat │ 4b. Download │ 4c. Access Storage
│ to Control │ User Libs │ DBFS/UC/External
│ Plane │ (PyPI/Maven) │ + DBR Images
│ (NSG: AzureDB) │ (NAT Gateway) │ (NSG: Storage)
↓ ↓ ↓
┌──────────────────┐ ┌──────────────┐ ┌────────────────────────┐
│ Databricks │ │ Internet │ │ Azure Storage │
│ Control Plane │ │ - PyPI │ │ (via Service Endpoint) │
│ (NSG Service Tag)│ │ - Maven │ │ (NSG: Storage tag) │
│ - Receives │ │ - Custom │ │ │
│ heartbeats │ │ repos │ │ ┌────────────────────┐ │
│ - Sends commands │ │ │ │ │ DBFS Root Storage │ │
│ - Monitors state │ │ NAT Gateway │ │ │ - Init scripts │ │
│ - NO NAT used! │ │ ONLY for │ │ │ - Cluster logs │ │
└──────────────────┘ │ user libs! │ │ │ - Libraries │ │
└──────────────┘ │ └────────────────────┘ │
│ │
│ ┌────────────────────┐ │
│ │ UC Metastore │ │
┌──────────────────────────────┤ │ - Table metadata │ │
│ 5. Worker-to-Worker │ │ - Schemas │ │
│ Communication │ └────────────────────┘ │
│ (Within VNet) │ │
↓ │ ┌────────────────────┐ │
┌──────────────────────────┐ │ │ External Location │ │
│ Inter-Worker Traffic │ │ │ - User data │ │
│ - Shuffle operations │ │ │ - Delta tables │ │
│ - Data redistribution │ │ └────────────────────┘ │
│ - RPC communication │ │ │
│ - Stays within VNet │ │ ┌────────────────────┐ │
│ - No egress charges │ │ │ DBR Images │ │
└──────────────────────────┘ │ │ (Databricks-managed│ │
│ │ dbartifactsprod*) │ │
│ └────────────────────┘ │
└────────────────────────┘
Time: T+0s to T+5min (typical cluster startup)
Legend:
────> : Data/Control plane traffic
═════> : Storage traffic (Service Endpoints)
- - -> : Monitoring/heartbeat traffic
Flow: User → Databricks Control Plane
Details:
Protocol: HTTPS (TCP/443)
Authentication: Bearer token / Azure AD token
Direction: User browser/CLI → workspace URL (adb-<workspace-id>.<suffix>.azuredatabricks.net)
Payload: Cluster configuration JSON
- Node type: Standard_DS3_v2
- Worker count: 2-8 (autoscaling)
- Libraries: PyPI packages, Maven JARs
- Init scripts: Cloud storage paths
Response: Cluster ID and state (PENDING)
Latency: < 100ms
What Happens: The control plane validates the cluster spec, allocates a cluster ID (e.g., 0123-456789-abcd), and returns it with state PENDING; provisioning then begins. A sketch of this request follows.
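A minimal sketch of the request above using the Clusters REST API; the workspace URL, token, and DBR version are placeholders, not values from this deployment:

```python
# Sketch only: create a cluster matching the spec above (Standard_DS3_v2, autoscale 2-8).
# HOST and TOKEN are placeholders for your workspace URL and a PAT / Azure AD token.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token-or-aad-token>"

payload = {
    "cluster_name": "vnet-injected-demo",
    "spark_version": "14.3.x-scala2.12",          # example DBR version
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"cluster_id": "0123-456789-abcd"}; the cluster then shows state PENDING
```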
Flow: Control Plane → Azure Resource Manager → Customer VNet
Details:
API: Azure Resource Manager
Action: Create VM resources in customer subscription
Resources Created:
- Driver VM: 1x Standard_DS3_v2 (public subnet)
- Worker VMs: 2-8x Standard_DS3_v2 (private subnet)
- Managed Disks: OS disk + Data disks per VM
- NICs: No public IPs (NPIP enabled)
- NSG: Rules auto-applied by Databricks
Placement:
- Availability Set or Availability Zones (region-dependent)
- Same VNet as workspace configuration
- Subnet delegation: Microsoft.Databricks/workspaces
Encryption:
- Managed Disks: Azure Storage Service Encryption (default)
- CMK: If enabled, Disk Encryption Set applied
What Happens: Azure Resource Manager deploys the driver and worker VMs, managed disks, and NICs into the delegated Databricks subnets in the customer subscription, with NSG rules applied automatically by Databricks. A sketch for verifying the subnet delegation follows.
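A small verification sketch using the Azure SDK for Python; the subscription ID and resource names are placeholders:

```python
# Sketch: confirm a cluster subnet is delegated to Microsoft.Databricks/workspaces.
# All identifiers below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")
subnet = client.subnets.get("<resource-group>", "<vnet-name>", "<private-subnet-name>")

for delegation in subnet.delegations or []:
    print(delegation.name, "->", delegation.service_name)
# Expected output includes a delegation to Microsoft.Databricks/workspaces
```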
Flow: Driver/Worker VMs → Control Plane (via NSG Service Tag: AzureDatabricks)
Details:
Protocol: HTTPS (TCP/443)
Direction: Outbound only (initiated from VNet)
Source: Cluster VMs (no public IPs)
Routing: NSG Service Tag: AzureDatabricks (NOT NAT Gateway)
Destination: tunnel.{region}.azuredatabricks.net
Purpose:
- Control plane registration
- Command execution channel
- Monitoring and logging
Connection:
- Persistent WebSocket over HTTPS
- Heartbeat every 30 seconds
- Automatic reconnection on failure
Security:
- TLS 1.2+ encryption
- Mutual TLS authentication
- Databricks-signed certificates
- NSG allows outbound to AzureDatabricks service tag
What Happens: On startup, each VM opens an outbound TLS connection to tunnel.{region}.azuredatabricks.net and keeps it alive; all later commands, monitoring, and log uploads flow over this channel, so no inbound NSG rules are needed. A connectivity-check sketch follows.
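A quick outbound-connectivity sketch that can be run from inside the VNet (for example, a notebook cell or a jump VM); the region in the hostname is a placeholder:

```python
# Sketch: verify outbound TCP/443 + TLS to the control-plane relay.
# Replace the region with your workspace's region.
import socket
import ssl

host = "tunnel.westeurope.azuredatabricks.net"
context = ssl.create_default_context()

with socket.create_connection((host, 443), timeout=10) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
        print("TLS established:", tls_sock.version())
```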
Flow: Cluster VMs ↔ Control Plane (via NSG Service Tag: AzureDatabricks)
Details:
Routing: NSG Service Tag: AzureDatabricks (NOT NAT Gateway)
Protocol: HTTPS (TCP/443) over persistent tunnel
Heartbeats:
- Frequency: Every 30 seconds
- Payload: VM health, resource usage, state
- Timeout: 3 missed heartbeats = cluster unhealthy
Commands:
- Notebook execution
- Job runs
- Library installations
- Cluster resize operations
- Cluster termination
Metrics:
- CPU, memory, disk usage
- Spark metrics (tasks, stages, executors)
- Custom metrics from applications
Logs:
- Driver logs
- Executor logs
- Spark event logs
- Application logs
Traffic Volume: ~1-5 Mbps per cluster (low)
Important: Control plane communication does NOT go through NAT Gateway!
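The health state the control plane derives from these heartbeats is visible to users through the Clusters API; a minimal polling sketch (host, token, and cluster ID are placeholders):

```python
# Sketch: read the cluster state maintained by the control plane (PENDING, RUNNING, TERMINATED, ...).
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<token>"

resp = requests.get(
    f"{HOST}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": "0123-456789-abcd"},
    timeout=30,
)
resp.raise_for_status()
info = resp.json()
print(info["state"], info.get("state_message", ""))
```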
Flow: Cluster VMs → NAT Gateway → Internet
Important: This is the PRIMARY and ONLY use case for NAT Gateway!
Details:
Python Packages (PyPI):
- Source: pypi.org
- Protocol: HTTPS
- Examples: pandas, numpy, scikit-learn, tensorflow
- Size: Varies (10 MB - 1 GB)
- Installation: pip install -r requirements.txt
- Routing: NAT Gateway
Maven/Ivy (Java/Scala):
- Source: Maven Central (repo1.maven.org)
- Protocol: HTTPS
- Examples: spark-xml, delta-core, custom JARs
- Size: Varies (1 MB - 100 MB)
- Installation: spark.jars.packages
- Routing: NAT Gateway
Custom Repositories:
- Source: Customer-configured (e.g., Artifactory, Nexus)
- Protocol: HTTPS
- Authentication: If required
- Whitelisting: NAT Gateway IP (203.0.113.45)
- Routing: NAT Gateway
Databricks Runtime (DBR) Image:
- Source: Databricks-managed storage accounts (NOT Docker Hub!)
- Protocol: HTTPS
- Routing: NSG Service Tag "Storage" (Azure backbone)
- Size: ~2-5 GB per cluster
- Frequency: Once per cluster startup
- Caching: Cached on local disk
- Cost: $0 egress (uses Storage service tag)
- Reference: See Data Exfiltration blog for details
Important Notes:
- DBR images come from Microsoft-managed storage (dbartifactsprod*, dblogprod*)
- DBR download uses Storage service tag, NOT NAT Gateway
- NSG allows outbound to Storage service tag for DBR access
- See: https://learn.microsoft.com/en-us/azure/databricks/security/network/data-exfiltration-protection
Traffic Volume: 500 MB - 2 GB per cluster startup (user libraries only)
Cost Consideration: Data egress charges apply (first 100 GB free/month)
Critical Distinction: only user-initiated downloads (PyPI, Maven, custom repos, container images) traverse the NAT Gateway; DBR images and all other Azure service traffic use NSG service tags over the Azure backbone. A sketch of installing user libraries follows.
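A sketch of attaching user libraries that are fetched over the NAT Gateway path described above; the host, token, cluster ID, and package versions are illustrative:

```python
# Sketch: install a PyPI package and a Maven coordinate on a running cluster.
# The PyPI wheel and the Maven JAR are downloaded from the internet via the NAT Gateway.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder
TOKEN = "<token>"

requests.post(
    f"{HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": "0123-456789-abcd",
        "libraries": [
            {"pypi": {"package": "scikit-learn==1.4.2"}},
            {"maven": {"coordinates": "com.databricks:spark-xml_2.12:0.17.0"}},
        ],
    },
    timeout=30,
).raise_for_status()
```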
Flow: Cluster VMs → Service Endpoints → Azure Storage (via NSG Service Tag: Storage)
Details:
Path:
Cluster VMs → Service Endpoint (NSG: Storage tag) → Azure Storage (Azure backbone)
- Never leaves Azure network
- Optimized routing via NSG service tags
- No public internet traversal
- No NAT Gateway involved
Protocol: HTTPS (TCP/443)
Authentication: Managed Identity (Access Connector)
- No storage account keys exposed
- OAuth 2.0 token-based
- RBAC: Storage Blob Data Contributor
NSG Configuration:
- Outbound rule allows "Storage" service tag
- Traffic routed via Azure backbone network
- Service Endpoints enabled on subnets
DBFS Root Storage (Databricks-managed):
- Init scripts execution: Read from dbfs:/init-scripts/
- Library caching: Write to dbfs:/tmp/
- Cluster logs: Write to dbfs:/cluster-logs/
- Automatic cleanup: Logs deleted after 30 days
- Routing: Service Endpoint (Storage tag)
Unity Catalog Metastore Storage:
- Table metadata queries: Read table definitions
- Schema information: Read database/catalog schemas
- Permissions validation: Check GRANT/REVOKE rules
- Cached locally: Metadata cached on driver
- Routing: Service Endpoint (Storage tag)
External Location Storage (Customer-owned):
- User data access: Read/Write Delta tables, Parquet
- ACID transactions: Delta Lake transaction log
- Time travel: Access historical versions
- Optimize operations: Compaction, Z-ordering
- Routing: Service Endpoint (Storage tag)
Traffic Volume: Varies (depends on workload, typically GBs-TBs)
Cost: No egress charges (Service Endpoints keep traffic on Azure backbone via NSG service tags)
Key Point: Storage access uses NSG “Storage” service tag + Service Endpoints, NOT NAT Gateway!
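A short read/write sketch over this path from a notebook. The storage account, container, and table path are placeholders, and authentication is handled by the Access Connector's managed identity, so no keys appear in code (assumes the notebook-provided `spark` session):

```python
# Sketch: read a Delta table from an external location.
# Traffic flows over the Storage service endpoint / service tag, never the NAT Gateway.
path = "abfss://data@<storage-account>.dfs.core.windows.net/delta/events"

df = spark.read.format("delta").load(path)   # `spark` is the notebook's SparkSession
df.show(10)

# Writes take the same network path, e.g.:
# df.write.format("delta").mode("append").save(path)
```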
Flow: Worker VMs ↔ Worker VMs (Within VNet)
Details:
Purpose:
- Shuffle operations: Exchange data between partitions
- Broadcast variables: Distribute read-only data
- RPC communication: Spark executor coordination
- Task result collection: Gather results to driver
Protocol: TCP (custom Spark protocol)
Ports: Dynamic ports (ephemeral range)
NSG Rules: Automatically allowed (VirtualNetwork tag)
Latency: < 1ms (within same availability zone)
Bandwidth: Up to 25 Gbps (VM-dependent)
Security:
- Traffic stays within VNet
- No NAT Gateway traversal
- No public internet exposure
- Encryption in transit (Spark TLS if enabled)
Performance:
- Shuffle data is critical path
- Low latency = faster queries
- High bandwidth = better throughput
- Proximity = reduced network hops
Traffic Volume: Varies (depends on workload, can be TBs for large shuffles)
Cost: No egress charges (intra-VNet traffic is free)
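For illustration, a toy aggregation whose shuffle exchange is exactly the worker-to-worker traffic described above (the column names and row count are arbitrary; assumes the notebook-provided `spark` session):

```python
# Sketch: groupBy forces a shuffle; the exchanged partitions move between worker VMs
# inside the VNet, with no NAT Gateway traversal and no egress charges.
from pyspark.sql import functions as F

events = spark.range(0, 10_000_000).withColumn("key", F.col("id") % 1024)
counts = events.groupBy("key").agg(F.count("*").alias("cnt"))
counts.orderBy(F.desc("cnt")).show(5)
```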
| Traffic Type | Source | Destination | Path | Protocol | NSG Service Tag | Cost | Latency |
|---|---|---|---|---|---|---|---|
| Control Plane | Cluster VMs | Databricks Control Plane | AzureDatabricks tag | HTTPS/443 | AzureDatabricks | No egress | ~50ms |
| Package Downloads | Cluster VMs | PyPI/Maven/Docker Hub | NAT Gateway → Internet | HTTPS/443 | N/A | Egress | Varies |
| DBFS Access | Cluster VMs | DBFS Storage | Storage tag + Service Endpoint | HTTPS/443 | Storage | No egress | ~10ms |
| Unity Catalog | Cluster VMs | UC Storage | Storage tag + Service Endpoint | HTTPS/443 | Storage | No egress | ~10ms |
| External Data | Cluster VMs | External Location | Storage tag + Service Endpoint | HTTPS/443 | Storage | No egress | ~10ms |
| Event Hub (Logs) | Cluster VMs | Event Hub | EventHub tag | TCP/9093 (TLS) | EventHub | No egress | ~10ms |
| Worker-to-Worker | Worker VMs | Worker VMs | Within VNet | Spark/TCP | VirtualNetwork | No egress | < 1ms |
| Workspace UI | User Browser | Databricks UI | Direct | HTTPS/443 | N/A | N/A | ~50ms |
Key Insight: NAT Gateway is ONLY for user-initiated internet downloads. All Azure service communication (Databricks, Storage, Event Hub) uses NSG service tags!
Scenario: Interactive notebook execution
1. User opens notebook → Databricks UI (HTTPS)
2. User runs cell → Control Plane → Cluster (via tunnel)
3. Cluster executes code:
- Reads data from External Location (Service Endpoint)
- Performs computation (local + worker-to-worker)
- Writes results to External Location (Service Endpoint)
4. Results returned → Control Plane → UI
5. Logs written to DBFS (Service Endpoint)
Scenario: Scheduled job run
1. Scheduler triggers job → Control Plane API (HTTPS); a sketch of this call follows the list
2. Control Plane starts cluster (if not running)
3. Job notebook/JAR executed on cluster
4. Data flow:
- Source: External Location (read via Service Endpoint)
- Transform: In-memory + shuffle (worker-to-worker)
- Sink: External Location (write via Service Endpoint)
5. Job completion notification → Control Plane
6. Logs and metrics uploaded to DBFS
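A minimal sketch of step 1, triggering the run via the Jobs API; the host, token, and job ID are placeholders:

```python
# Sketch: trigger an existing job; the control plane then starts the cluster if needed.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder
TOKEN = "<token>"

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123456},   # placeholder job ID
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # {"run_id": ...}
```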
Scenario: ML model training (an MLflow logging sketch follows the list)
1. Data scientist runs training notebook
2. Cluster reads training data from External Location
3. Libraries downloaded (PyPI via NAT Gateway) - one-time
4. Training data loaded into memory/cache
5. Model training:
- Distributed: Worker-to-worker communication (high volume)
- Single-node: Local computation
6. Model artifacts written to External Location
7. MLflow tracking data → DBFS
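A sketch of steps 6-7, logging metrics and a model with MLflow from the training notebook; the model, data, and metric name are illustrative:

```python
# Sketch: MLflow tracking data goes to the workspace tracking store / DBFS,
# while model artifacts can target an external location, both over the Storage path.
import mlflow
from sklearn.linear_model import LogisticRegression

with mlflow.start_run(run_name="demo-training"):
    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])   # toy training data
    mlflow.log_metric("train_accuracy", 1.0)
    mlflow.sklearn.log_model(model, "model")
```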
| Connection | Typical Latency | Notes |
|---|---|---|
| User → Workspace UI | 50-100ms | Depends on user location |
| UI → Control Plane | 20-50ms | Within Azure region |
| Cluster → Control Plane | 30-80ms | Via AzureDatabricks service tag |
| Cluster → Storage (Service Endpoint) | 5-15ms | Same region, Azure backbone |
| Worker ↔ Worker (same AZ) | < 1ms | Within VNet |
| Worker ↔ Worker (cross AZ) | 1-3ms | Cross availability zone |
| Cluster → Internet (NAT) | 10-50ms | Destination-dependent |
| Connection | Typical Bandwidth | Notes |
|---|---|---|
| NAT Gateway | Up to 50 Gbps | Per NAT Gateway |
| VM Network | 1-25 Gbps | VM size-dependent |
| Storage (per VM) | 500 MB/s - 8 GB/s | VM and disk type dependent |
| Worker-to-Worker | Up to 25 Gbps | VM network interface |
| Destination | Pricing (first 100 GB) | Pricing (next 10 TB) | Use Case |
|---|---|---|---|
| Internet | Free | $0.087/GB | PyPI, Maven, Docker Hub |
| Same Region Storage | Free | Free | Service Endpoints |
| Cross-Region Storage | Free | $0.02/GB | Not typical |
| Within VNet | Free | Free | Worker-to-worker |
Monthly Egress Estimate (Non-PL Pattern):
| Traffic Type | Volume/Month | Routing | Cost/Month |
|---|---|---|---|
| Control Plane (heartbeats, commands) | ~5 GB | NSG: AzureDatabricks tag | $0 (Azure backbone) |
| DBR image downloads | ~50 GB | NSG: Storage tag | $0 (Azure backbone) |
| User library downloads (PyPI/Maven) | ~10 GB | NAT Gateway → Internet | $0 (< 100 GB free) |
| Storage access (DBFS/UC/External) | 1000+ GB | NSG: Storage tag | $0 (Azure backbone) |
| Event Hub (logs) | ~10 GB | NSG: EventHub tag | $0 (Azure backbone) |
| Worker-to-worker | 1000+ GB | Within VNet | $0 (intra-VNet) |
| TOTAL INTERNET EGRESS | ~10 GB | NAT Gateway → Internet | $0 (< 100 GB free) |
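As a worked check of the table above, a tiny sketch applying the internet-egress pricing shown earlier (first 100 GB free, then $0.087/GB):

```python
# Sketch: estimate internet egress cost for a given monthly volume.
def monthly_internet_egress_cost(gb: float, free_gb: float = 100.0, rate_per_gb: float = 0.087) -> float:
    return max(gb - free_gb, 0.0) * rate_per_gb

print(monthly_internet_egress_cost(10))    # 0.0  -> the ~10 GB of user library downloads above
print(monthly_internet_egress_cost(500))   # 34.8 -> only if internet downloads grew well beyond the free tier
```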
Key Takeaway: in the non-Private-Link pattern, only user library downloads (~10 GB/month here) leave Azure via the NAT Gateway; all other traffic rides the Azure backbone or stays inside the VNet, so typical internet egress cost is $0.
| Entry Point | Risk | Mitigation |
|---|---|---|
| Workspace UI | Public endpoint | IP Access Lists, Azure AD authentication, MFA |
| REST API | Public endpoint | Token authentication, Azure AD, IP restrictions |
| Cluster VMs | No public IPs (NPIP) | Not directly accessible from internet |
| Storage | Public endpoint | Service Endpoints, RBAC, Azure AD |
| NAT Gateway | Outbound only | Stateful firewall, no inbound connections |
| Flow | Encryption | Authentication | Authorization |
|---|---|---|---|
| User → UI | TLS 1.2+ | Azure AD / PAT | RBAC |
| Cluster → Control Plane | TLS 1.2+, Mutual TLS | Certificate-based | N/A |
| Cluster → Storage | TLS 1.2+ | Managed Identity | RBAC (Storage Blob Data Contributor) |
| Cluster → Internet | TLS 1.2+ | Varies | N/A |
| Worker ↔ Worker | Optional TLS | None (trusted VNet) | N/A |
Symptom: Cluster stuck in “Pending” or “Resizing”
Diagnosis:
# From a test VM in same VNet
curl -v https://tunnel.{region}.azuredatabricks.net
nslookup tunnel.{region}.azuredatabricks.net
traceroute tunnel.{region}.azuredatabricks.net
Common Causes:
- NSG outbound rules missing or blocking the AzureDatabricks, Storage, or EventHub service tags
- A user-defined route (UDR) sending control-plane traffic to a firewall that drops it
- DNS resolution failure for the tunnel endpoint
Symptom: pip install fails, “Connection timeout”
Diagnosis:
# From Databricks notebook
%sh curl -v https://pypi.org/simple/
%sh curl -sI https://files.pythonhosted.org/   # PyPI packages download from here; NAT Gateway forwards TCP/UDP only, so ICMP ping is not a reliable test
Common Causes:
- NAT Gateway not associated with the cluster subnets (no outbound internet path)
- A firewall or proxy blocking pypi.org / repo1.maven.org
- Custom repository not allowlisting the NAT Gateway IP (203.0.113.45)
Symptom: “403 Forbidden” or “Connection timeout” accessing ADLS
Diagnosis:
# From Databricks notebook
dbutils.fs.ls("abfss://...")
Common Causes:
- Service Endpoint for Microsoft.Storage not enabled on the cluster subnets
- Storage account firewall not allowing the workspace VNet/subnets
- Access Connector managed identity missing the Storage Blob Data Contributor role
Document Version: 1.0
Pattern Coverage: Non-PL (complete), Private Link (complete), Hub-Spoke (coming soon)