Understanding Databricks Network Architecture Across Clouds
Perfect for: Platform engineers, cloud architects, security teams
Databricks provides flexible networking options to meet your organization’s security, compliance, and operational requirements. This guide helps you understand how Databricks networking works and how to design networks that align with your cloud architecture.
Important: Databricks has two types of compute planes:
- Classic compute plane (customer VPC) - Covered in this guide
- Serverless compute plane (Databricks-managed) - Separate guide coming soon
This guide focuses on classic compute plane networking where you manage the VPC/VNet.
Note: This guide focuses on AWS classic compute first. Azure and GCP sections will be added in future updates.
Organizations choose customer-managed networking for several compelling reasons:
🔒 Security & Compliance
🏢 Enterprise Integration
💰 Operational Efficiency
🎯 Control & Governance
Databricks offers multiple deployment options to match your requirements:
| Deployment Model | Use Case | Setup Complexity |
|---|---|---|
| Databricks-Managed Network | Quick starts, dev/test, low compliance | Simple (minutes) |
| Customer-Managed Network | Production, compliance, enterprise | Moderate (hours) |
| Customer-Managed + PrivateLink | Highest security, air-gapped, zero-trust | Complex (days) |
Recommended Approach: For production workloads, customer-managed networking provides the control and flexibility enterprises need while maintaining deployment simplicity.
graph TD
A[Start: Deploying Databricks] --> B{Production Workload?}
B -->|No - POC/Dev| C[Databricks-Managed OK]
B -->|Yes| D{Compliance Requirements?}
D -->|Yes| E[Customer-Managed Required]
D -->|No| F{Need Integration?}
F -->|On-prem/VPN| E
F -->|Private Endpoints| E
F -->|Custom Security| E
F -->|Simple Setup| G{IP Space Concerns?}
G -->|Limited IPs| E
G -->|Plenty of IPs| H[Either Option Works]
E --> I{Zero-Trust Required?}
I -->|Yes| J[Add PrivateLink]
I -->|No| K[Customer-Managed VPC]
style E fill:#43A047
style J fill:#1E88E5
style K fill:#43A047
style C fill:#FDD835
Understanding these fundamentals will help you across all cloud providers.
Databricks architecture separates responsibilities between two planes: the Databricks-managed control plane and the compute plane that runs your workloads.
Before diving into the architecture, understand that Databricks offers two types of compute planes:
| Compute Type | Managed By | Network Location | This Guide Covers |
|---|---|---|---|
| Classic Compute | Customer | Your VPC/VNet | ✅ Yes (detailed) |
| Serverless Compute | Databricks | Databricks VPC | ❌ No (separate guide) |
Classic compute plane: Resources run in your cloud account, in your VPC/VNet. You control networking (subnets, security groups, routes). This guide covers classic compute networking.
Serverless compute plane: Resources run in Databricks-managed cloud account. Databricks manages networking. Connectivity to your resources uses Network Connectivity Configuration (NCC).
Note: This guide focuses exclusively on classic compute plane networking. Serverless networking will be covered in a separate guide.
graph TB
subgraph "Databricks Control Plane (Databricks-Managed)"
CP[Control Plane Services]
WEB[Web Application UI]
JOBS[Jobs Scheduler]
NOTEBOOK[Notebook Storage]
META[Hive Metastore Service]
CLUSTER_MGR[Cluster Manager]
end
subgraph "Customer Cloud Account"
subgraph "Classic Compute Plane (Your VPC)"
DRIVER[Driver Node]
WORKER1[Worker Node 1]
WORKER2[Worker Node N]
end
subgraph "Storage"
S3[Object Storage<br/>S3/ADLS/GCS]
DB[Database Services]
end
end
subgraph "Serverless Compute Plane (Databricks-Managed)"
SERVERLESS[Serverless Resources<br/>SQL Warehouses<br/>Jobs Compute<br/>Not Covered Here]
end
WEB --> CLUSTER_MGR
CLUSTER_MGR -.Secure Cluster<br/>Connectivity.-> DRIVER
JOBS --> DRIVER
JOBS -.-> SERVERLESS
NOTEBOOK -.Sync via HTTPS.-> DRIVER
DRIVER --> WORKER1
DRIVER --> WORKER2
WORKER1 --> S3
WORKER2 --> S3
DRIVER -.Legacy HMS<br/>(Optional).-> META
SERVERLESS -.Private Connection.-> S3
style CP fill:#1E88E5
style DRIVER fill:#43A047
style WORKER1 fill:#43A047
style WORKER2 fill:#43A047
style SERVERLESS fill:#FB8C00
What it does: hosts the workspace web application, notebook storage, the jobs scheduler, and the cluster manager.
Where it runs: Databricks-managed cloud account (you never see or manage this)
Security: All data encrypted at rest, TLS 1.3 in transit
What it does: runs the driver and worker nodes that execute your Spark workloads and read/write your data.
Where it runs: Your cloud account, in your VPC/VNet
You control: Network configuration, subnet placement, security rules, routing
Note: This is different from serverless compute plane, which runs in Databricks-managed infrastructure with different networking patterns. See Serverless Compute Networking for serverless details.
Understanding how traffic flows helps you design secure networks:
sequenceDiagram
participant User
participant CP as Control Plane<br/>(Databricks)
participant Driver as Driver Node<br/>(Your VPC)
participant Worker as Worker Nodes<br/>(Your VPC)
participant S3 as Data Sources<br/>(S3/RDS/etc)
Note over User,S3: Cluster Launch
User->>CP: Create Cluster Request (HTTPS)
CP->>Driver: Launch EC2 Instances
Driver->>CP: Establish SCC Connection<br/>(Outbound TLS 1.3)
Note over User,S3: Job Execution
User->>CP: Submit Job
CP->>Driver: Send Commands via SCC
Driver->>Worker: Distribute Tasks<br/>(Internal VPC Traffic)
Worker->>S3: Read/Write Data<br/>(Your Network Path)
Worker->>Driver: Return Results
Driver->>CP: Report Status (via 8443-8451)
CP->>User: Display Results
Note over User,S3: All traffic encrypted in transit<br/>Unity Catalog uses ports 8443-8451
Key Traffic Flows:
- User → control plane: HTTPS (UI, APIs, job submission)
- Control plane ↔ compute plane: commands over the Secure Cluster Connectivity tunnel, initiated outbound from your VPC
- Compute plane → data sources: reads and writes over your own network path
Important: Compute plane initiates connections to control plane (outbound). No inbound connections from internet to compute plane are required.
Note: Unity Catalog (recommended) uses ports 8443-8451 for metadata operations. Legacy Hive metastore (port 3306) is optional and can be disabled. See Disable legacy Hive metastore.
| Aspect | Databricks-Managed | Customer-Managed | Customer-Managed + PrivateLink |
|---|---|---|---|
| VPC/VNet Ownership | Databricks creates | You provide | You provide |
| Subnet Control | Automatic (/16) | Full control (/17-/26) | Full control (/17-/26) |
| Security Groups | Managed by Databricks | You configure | You configure |
| NAT Gateway | Included | You provide | You provide |
| IP Address Efficiency | Lower (larger subnets) | Higher (right-sized) | Higher (right-sized) |
| VPC Sharing | No | Yes (multiple workspaces) | Yes (multiple workspaces) |
| PrivateLink Support | No | Optional | Yes |
| Integration w/ Existing | Limited | Full | Full |
| Setup Time | Minutes | Hours | Days |
| AWS Permissions Needed | More (VPC creation) | Fewer (reference only) | Fewer (reference only) |
| Recommended For | POCs, development | Production workloads | High security, compliance |
This section provides detailed guidance for deploying Databricks classic compute plane in a customer-managed AWS VPC.
Customer-managed VPC enables you to deploy Databricks classic compute plane resources in your own AWS VPC. This gives you:
- Full control over subnets, security groups, and routing
- Right-sized IP allocation and the ability to share a VPC across workspaces
- Integration with existing network infrastructure and support for AWS PrivateLink
Scope: This section covers classic compute (all-purpose clusters, job clusters). For serverless compute (SQL warehouses, serverless jobs), see Serverless Compute Networking.
Reference: AWS Databricks Customer-Managed VPC Documentation
graph TB
subgraph "Databricks Control Plane"
DCP[Control Plane Services<br/>Databricks AWS Account]
SCC_RELAY[Secure Cluster Connectivity Relay]
end
subgraph "Your AWS Account"
subgraph "Customer VPC"
IGW[Internet Gateway]
subgraph "Public Subnet (NAT)"
NAT[NAT Gateway]
end
subgraph "Private Subnet 1 (AZ-A)"
SG1[Security Group]
DRIVER1[Driver Nodes<br/>Private IPs Only]
WORKER1[Worker Nodes<br/>Private IPs Only]
end
subgraph "Private Subnet 2 (AZ-B)"
SG2[Security Group]
DRIVER2[Driver Nodes<br/>Private IPs Only]
WORKER2[Worker Nodes<br/>Private IPs Only]
end
VPCE_S3[VPC Endpoint: S3<br/>Gateway Type<br/>Optional]
end
S3_ROOT[S3: Root Bucket<br/>Workspace Storage]
S3_DATA[S3: Data Lake<br/>Your Data]
end
IGW --> NAT
NAT --> DRIVER1
NAT --> WORKER1
NAT --> DRIVER2
NAT --> WORKER2
DRIVER1 -.->|TLS 1.3<br/>Outbound Only| SCC_RELAY
DRIVER2 -.->|TLS 1.3<br/>Outbound Only| SCC_RELAY
SCC_RELAY -.->|Commands| DRIVER1
SCC_RELAY -.->|Commands| DRIVER2
SG1 -.All TCP/UDP.-> SG1
SG2 -.All TCP/UDP.-> SG2
SG1 -.All TCP/UDP.-> SG2
DRIVER1 --> VPCE_S3
WORKER1 --> VPCE_S3
VPCE_S3 --> S3_ROOT
VPCE_S3 --> S3_DATA
style DCP fill:#1E88E5
style DRIVER1 fill:#43A047
style WORKER1 fill:#43A047
style DRIVER2 fill:#43A047
style WORKER2 fill:#43A047
style NAT fill:#FF6F00
Customer-managed VPC is available in all AWS regions where Databricks operates. See Databricks AWS Regions for current list.
Key Principle: Plan for growth and multiple workspaces.
- Subnet netmask must be between /17 (32,768 IPs) and /26 (64 IPs)

Example Sizing:
| Workspace Size | Nodes Needed | IPs Required | Subnet Size | Usable Nodes |
|---|---|---|---|---|
| Small (dev/test) | 10-20 | 40 | /26 (64 IPs) | 29 max |
| Medium (production) | 50-100 | 200 | /24 (256 IPs) | 125 max |
| Large (enterprise) | 200-500 | 1000 | /22 (1024 IPs) | 509 max |
| X-Large (multi-workspace) | 1000+ | 2000+ | /21 or larger | 1021+ |
IP Calculation Formula:
Databricks uses 2 IPs per node (management + Spark application)
AWS reserves 5 IPs per subnet
Usable IPs = (2^(32-netmask)) - 5
Max Databricks Nodes = Usable IPs / 2
Example /26: (2^6 - 5) / 2 = 59 / 2 = 29 nodes
Example /24: (2^8 - 5) / 2 = 251 / 2 = 125 nodes
Recommendation: Plan for 30-50% growth beyond immediate needs.
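To make the formula concrete, here is a minimal Terraform sketch that computes usable IPs and maximum nodes for a chosen netmask. The names and the /24 value are illustrative only, not part of any Databricks API:

# Hypothetical capacity calculation (AWS: 2 IPs per node, 5 IPs reserved per subnet)
locals {
  subnet_netmask = 24
  usable_ips     = pow(2, 32 - local.subnet_netmask) - 5
  max_nodes      = floor(local.usable_ips / 2)
}

output "max_databricks_nodes" {
  value = local.max_nodes # /24 -> 125 nodes
}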
- Subnet netmask range: /17 to /26

⚠️ Important: All subnets for a Databricks workspace must come from the same VPC CIDR block, not from secondary CIDR blocks.
Required VPC settings:
# Terraform example
resource "aws_vpc" "databricks" {
cidr_block = "<cidr>"
# Both required for Databricks
enable_dns_hostnames = true # Must be enabled
enable_dns_support   = true # Must be enabled (DNS resolution)
tags = {
Name = "databricks-vpc"
}
}
Why DNS matters:
- Cluster nodes must resolve control plane hostnames and regional AWS endpoints (S3, STS, EC2)
- Cluster launch fails if either DNS setting is disabled
Minimum configuration:
- At least 2 subnets, each in a different Availability Zone
- Each subnet netmask between /17 and /26
- All subnets drawn from the VPC's primary CIDR block
Multi-workspace patterns:
graph TB
subgraph "VPC: 10.0.0.0/16"
subgraph "Workspace 1"
WS1_A["Subnet A (AZ-1)<br/>10.0.1.0/24"]
WS1_B["Subnet B (AZ-2)<br/>10.0.2.0/24"]
end
subgraph "Workspace 2"
WS2_A["Subnet A (AZ-1)<br/>10.0.1.0/24<br/>(Shared!)"]
WS2_C["Subnet C (AZ-3)<br/>10.0.3.0/24"]
end
subgraph "Workspace 3"
WS3_D["Subnet D (AZ-1)<br/>10.0.4.0/24"]
WS3_E["Subnet E (AZ-2)<br/>10.0.5.0/24"]
end
end
style WS1_A fill:#43A047
style WS1_B fill:#43A047
style WS2_A fill:#FDD835
style WS2_C fill:#43A047
style WS3_D fill:#7CB342
style WS3_E fill:#7CB342
Note: Workspace 1 and 2 share Subnet A. This is supported but requires careful capacity planning.
Largest allowed: /17 (32,768 IPs → 16,381 usable nodes)
Smallest allowed: /26 (64 IPs → 29 usable nodes)
Common subnet sizes:
| Netmask | Total IPs | Usable IPs | Max Databricks Nodes | Use Case |
|---|---|---|---|---|
| /26 | 64 | 59 | 29 | Small dev/test |
| /25 | 128 | 123 | 61 | Medium dev |
| /24 | 256 | 251 | 125 | Production |
| /23 | 512 | 507 | 253 | Large production |
| /22 | 1024 | 1019 | 509 | Multi-workspace |
| /21 | 2048 | 2043 | 1021 | Enterprise scale |
Required routing:
# Route table for Databricks subnets
resource "aws_route_table" "databricks_private" {
vpc_id = aws_vpc.databricks.id
# Critical: Quad-zero route to NAT Gateway
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.databricks.id
}
tags = {
Name = "databricks-private-rt"
}
}
⚠️ Critical: The 0.0.0.0/0 route to NAT Gateway is required. Databricks needs outbound internet access to reach the control plane.
Route table for NAT Gateway subnet:
# Public subnet route table (for NAT Gateway)
resource "aws_route_table" "nat_gateway" {
vpc_id = aws_vpc.databricks.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.databricks.id
}
tags = {
Name = "databricks-nat-rt"
}
}
Security groups control traffic to and from Databricks cluster instances.
Egress (Outbound Rules):
| Type | Protocol | Port Range | Destination | Purpose |
|---|---|---|---|---|
| All traffic | TCP | All | Same security group | Cluster internal communication |
| All traffic | UDP | All | Same security group | Cluster internal communication |
| HTTPS | TCP | 443 | 0.0.0.0/0 | Control plane, AWS services, repos |
| Custom TCP | TCP | 6666 | 0.0.0.0/0 | Secure cluster connectivity (PrivateLink) |
| Custom TCP | TCP | 2443 | 0.0.0.0/0 | FIPS-compliant encryption |
| Custom TCP | TCP | 8443 | 0.0.0.0/0 | Control plane API |
| Custom TCP | TCP | 8444 | 0.0.0.0/0 | Unity Catalog lineage/logging |
| Custom TCP | TCP | 8445-8451 | 0.0.0.0/0 | Future use (extendability) |
| DNS | TCP | 53 | 0.0.0.0/0 | DNS resolution (if custom DNS) |
| MySQL | TCP | 3306 | 0.0.0.0/0 | Legacy Hive metastore (optional, not needed with Unity Catalog) |
Ingress (Inbound Rules):
| Type | Protocol | Port Range | Source | Purpose |
|---|---|---|---|---|
| All traffic | TCP | All | Same security group | Cluster internal communication |
| All traffic | UDP | All | Same security group | Cluster internal communication |
Note: Port 3306 (MySQL) was historically required for legacy Hive metastore access. With Unity Catalog (the recommended modern approach), this port is no longer required. Unity Catalog uses the control plane APIs (ports 8443-8451) for metadata management. See Disable legacy Hive metastore for migration guidance.
Why 0.0.0.0/0 in security groups? This allows Databricks to reach its control plane and AWS services. Use firewall or proxy appliances for fine-grained egress filtering, not security groups.
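As an illustration of the rules above, here is a minimal Terraform sketch of a workspace security group. Resource names are placeholders, only ports 443 and 8443-8451 are shown, and the remaining egress rules (2443, 6666, and optionally 3306) follow the same pattern:

# Workspace security group (sketch) - assumes aws_vpc.databricks from the earlier example
resource "aws_security_group" "databricks" {
  name   = "databricks-workspace-sg"
  vpc_id = aws_vpc.databricks.id

  # Intra-cluster: allow all TCP and UDP within the same security group
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "udp"
    self      = true
  }
  egress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }
  egress {
    from_port = 0
    to_port   = 65535
    protocol  = "udp"
    self      = true
  }

  # Outbound HTTPS to the control plane, AWS services, and package repositories
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  # Control plane APIs and Unity Catalog (8443-8451); add 2443/6666/3306 the same way if needed
  egress {
    from_port   = 8443
    to_port     = 8451
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "databricks-workspace-sg"
  }
}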
NAT Gateway provides outbound internet access for private subnets.
Deploy the NAT Gateway in a dedicated public subnet (minimum /28). For production workloads, deploy NAT Gateways in multiple Availability Zones for redundancy: if you rely on a single NAT Gateway and it (or its AZ) fails, clusters in every AZ lose internet connectivity.
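A minimal Terraform sketch of a single-AZ NAT Gateway follows; the public subnet name is an assumption, the Internet Gateway matches the earlier route table example, and you would repeat this per AZ for high availability:

# Elastic IP + NAT Gateway in a public subnet (sketch)
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "databricks" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public_nat.id # assumed public subnet with a route to the Internet Gateway
  depends_on    = [aws_internet_gateway.databricks]

  tags = {
    Name = "databricks-nat"
  }
}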
Subnet-level Network ACLs (NACLs) provide an additional security layer beyond security groups.
Key difference: NACLs are stateless, security groups are stateful.
Databricks cluster nodes have no public IP addresses. All connectivity is outbound-initiated:
- To the Databricks control plane (Secure Cluster Connectivity relay)
- To S3 and other AWS services
- To package repositories, if libraries are installed
Since NACLs are stateless, they can’t track these connections. When a cluster node makes an outbound HTTPS request to the control plane, the response comes back as inbound traffic on an ephemeral port. The NACL needs to allow this return traffic.
The correct security model:
- NACLs: permissive (allow 0.0.0.0/0 inbound) - required due to stateless nature
- Security Groups: primary access control (stateful)
- No public IPs: nodes cannot be reached from the internet
- Egress firewall/proxy: fine-grained outbound filtering

Security Note: Allowing 0.0.0.0/0 in NACLs is NOT a security risk because:
- Cluster nodes have no public IPs (cannot be reached from internet)
- It’s required for return traffic due to stateless NACL behavior
- Real security comes from Security Groups (stateful) and no public IPs
- This is specifically for intra-cluster traffic and return traffic from outbound connections
AWS Default NACLs allow all inbound and outbound traffic, which works perfectly for Databricks.
If you customize NACLs, you must follow these requirements:
Inbound Rules:
| Rule # | Type | Protocol | Port Range | Source | Action |
|---|---|---|---|---|---|
| 100 | All traffic | All | All | 0.0.0.0/0 | ALLOW |
⚠️ Critical: This rule must have the lowest rule number so it is evaluated first.

Why this is required: NACLs are stateless, so return traffic from outbound connections (control plane, S3) arrives on ephemeral ports and must be explicitly allowed.

Security is enforced by: stateful Security Groups, the absence of public IPs on cluster nodes, and Secure Cluster Connectivity - not by NACLs.
Outbound Rules:
| Rule # | Type | Protocol | Port Range | Destination | Action |
|---|---|---|---|---|---|
| 100 | All traffic | All | All | VPC CIDR | ALLOW |
| 110 | HTTPS | TCP | 443 | 0.0.0.0/0 | ALLOW |
| 120 | Custom TCP | TCP | 6666 | 0.0.0.0/0 | ALLOW |
| 130 | Custom TCP | TCP | 2443 | 0.0.0.0/0 | ALLOW |
| 140 | Custom TCP | TCP | 8443-8451 | 0.0.0.0/0 | ALLOW |
| 150 | DNS | TCP/UDP | 53 | 0.0.0.0/0 | ALLOW |
| 160 | Ephemeral | TCP | 1024-65535 | 0.0.0.0/0 | ALLOW |
| * | All traffic | All | All | 0.0.0.0/0 | DENY |
Important: NACLs are stateless and must allow 0.0.0.0/0 inbound for return traffic from outbound connections. Since cluster nodes have no public IPs, this is secure. Use Security Groups (stateful) as your primary security control, and use egress firewall or proxy appliances for fine-grained outbound filtering. See Configure a firewall and outbound access.
Why not use NACLs for security filtering? NACLs are stateless and operate at the subnet level. They require complex rules to allow return traffic and can break legitimate connections. Use Security Groups (stateful, instance-level) for access control instead.
Recommendation: Start with default NACLs (allow all). Only customize if organizational security policy requires it. The combination of no public IPs + security groups + Secure Cluster Connectivity provides strong security without NACL complexity.
After creating VPC, subnets, and security groups, register the configuration with Databricks.
Provide a network configuration name (for example <workspace-prefix>-network) and your VPC ID (for example vpc-xxxxx):

# Create network configuration
curl -X POST https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/networks \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"network_name": "<workspace-prefix>-network",
"vpc_id": "<vpc-id>",
"subnet_ids": ["<subnet-id-az1>", "<subnet-id-az2>"],
"security_group_ids": ["<security-group-id>"]
}'
Response includes network_id - use this when creating workspace.
Implementation Note: For complete working examples, see the Terraform modules in the awsdb4u/ folder of this repository.
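If you manage infrastructure with Terraform, the same registration can be expressed with the Databricks provider's databricks_mws_networks resource. The subnet and security group references below are assumptions for illustration:

# Account-level network registration (sketch)
resource "databricks_mws_networks" "this" {
  account_id         = var.databricks_account_id
  network_name       = "<workspace-prefix>-network"
  vpc_id             = aws_vpc.databricks.id
  subnet_ids         = [aws_subnet.az1.id, aws_subnet.az2.id] # assumed subnet resources
  security_group_ids = [aws_security_group.databricks.id]     # assumed security group resource
}

The resource exposes the resulting network_id, which you reference when creating the workspace, mirroring the API response above.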
VPC endpoints provide private connectivity to AWS services without traversing NAT Gateway.
Benefits:
- S3 traffic stays on the AWS network and bypasses the NAT Gateway (no data processing charges for that traffic)
- Gateway endpoints for S3 are free and improve performance

Setup:
- Create a Gateway endpoint for S3 and associate it with the route tables used by the Databricks subnets (a Terraform sketch follows below)
Endpoint Policy Example (optional - restrict S3 access):
Allow access only to specific buckets:
- Your workspace root bucket and data lake buckets
- databricks-prod-artifacts-<region> (Databricks runtime artifacts)

See VPC Endpoints documentation for implementation details.
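A minimal Terraform sketch of the S3 Gateway endpoint described above, attached to the private route table from the earlier example (substitute your region):

# S3 Gateway endpoint - no hourly charge, keeps S3 traffic off the NAT Gateway
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.databricks.id
  service_name      = "com.amazonaws.<region>.s3" # e.g. com.amazonaws.us-east-1.s3
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.databricks_private.id]

  tags = {
    Name = "databricks-s3-endpoint"
  }
}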
When using VPC endpoints, configure Spark to use regional endpoints:
Option 1: In notebook (per session):
# Python
spark.conf.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
spark.conf.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
// Scala
spark.conf.set("fs.s3a.endpoint", "https://s3.<region>.amazonaws.com")
spark.conf.set("fs.s3a.stsAssumeRole.stsEndpoint", "https://sts.<region>.amazonaws.com")
Option 2: Cluster configuration (all jobs on cluster):
spark.hadoop.fs.s3a.endpoint https://s3.<region>.amazonaws.com
spark.hadoop.fs.s3a.stsAssumeRole.stsEndpoint https://sts.<region>.amazonaws.com
Option 3: Cluster policy (enforce across all clusters):
{
"spark_conf.fs.s3a.endpoint": {
"type": "fixed",
"value": "https://s3.<region>.amazonaws.com"
},
"spark_conf.fs.s3a.stsAssumeRole.stsEndpoint": {
"type": "fixed",
"value": "https://sts.<region>.amazonaws.com"
}
}
⚠️ Important: Regional endpoint configuration blocks cross-region S3 access. Only apply if all S3 buckets are in the same region.
Restrict S3 bucket access to specific sources for enhanced security.
Your bucket policy must allow access from:
- The Databricks control plane (regional NAT egress IPs)
- Your compute plane VPC or its S3 VPC endpoint
- Trusted corporate networks (for example, a VPN egress IP)
Required buckets to allow:
- databricks-prod-artifacts-<region> (Databricks runtime artifacts)

Where to find IPs: See Databricks Control Plane IPs for your region.
Scenario: Restrict access to control plane, compute plane VPC endpoint, and corporate VPN.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowDatabricksAndTrustedAccess",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/*"
],
"Condition": {
"NotIpAddressIfExists": {
"aws:SourceIp": [
"<control-plane-nat-ip>",
"<corporate-vpn-ip>"
]
},
"StringNotEqualsIfExists": {
"aws:sourceVpce": "<vpc-endpoint-id>",
"aws:SourceVpc": "<vpc-id>"
}
}
}
]
}
How it works:
- Deny with NotIpAddressIfExists - only the listed source IPs can access
- StringNotEqualsIfExists exempts requests arriving through your VPC endpoint or VPC

Control outbound traffic using firewall or proxy appliances.
Option 1: AWS Network Firewall
Option 2: Third-party firewall appliances
Option 3: Proxy appliances
Databricks clusters must reach these destinations. See Databricks Firewall Documentation for complete list:
Databricks Control Plane:
- *.cloud.databricks.com (port 443, 8443-8451)

AWS Services:
- s3.<region>.amazonaws.com (port 443)
- sts.<region>.amazonaws.com (port 443)
- ec2.<region>.amazonaws.com (port 443)
- kinesis.<region>.amazonaws.com (port 443)

Package Repositories (if downloading libraries):
- pypi.org, files.pythonhosted.org (Python/PyPI)
- repo1.maven.org, maven.apache.org (Java/Maven)
- cran.r-project.org (R/CRAN)

Note: Legacy Hive metastore endpoints (port 3306) are no longer required when using Unity Catalog. See Disable legacy Hive metastore for migration guidance.
Alternative: Host internal mirrors of package repositories to avoid internet access for library downloads.
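If you choose AWS Network Firewall (Option 1), an allow-list can be expressed as a stateful domain rule group. The sketch below is illustrative only; the domain list mirrors the destinations above and <region> is a placeholder:

# Stateful domain allow-list for Databricks egress (sketch)
resource "aws_networkfirewall_rule_group" "databricks_egress_allowlist" {
  capacity = 100
  name     = "databricks-egress-allowlist"
  type     = "STATEFUL"

  rule_group {
    rules_source {
      rules_source_list {
        generated_rules_type = "ALLOWLIST"
        target_types         = ["TLS_SNI", "HTTP_HOST"]
        targets = [
          ".cloud.databricks.com",
          "s3.<region>.amazonaws.com",
          "sts.<region>.amazonaws.com",
          "ec2.<region>.amazonaws.com",
          "kinesis.<region>.amazonaws.com",
          "pypi.org",
          ".pythonhosted.org",
        ]
      }
    }
  }
}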
AWS PrivateLink provides private connectivity to Databricks control plane without internet traversal.
graph TB
subgraph "Your AWS Account"
subgraph "VPC"
CLUSTER[Databricks Clusters<br/>Private IPs]
VPCE_CONTROL[VPC Endpoint<br/>Control Plane]
VPCE_SCC[VPC Endpoint<br/>Secure Cluster Connectivity]
end
end
subgraph "Databricks AWS Account"
subgraph "Control Plane Services"
CONTROL_LB[Control Plane<br/>Endpoint Service]
SCC_LB[SCC Relay<br/>Endpoint Service]
end
end
CLUSTER -.Private Connection.-> VPCE_CONTROL
CLUSTER -.Private Connection.-> VPCE_SCC
VPCE_CONTROL -.AWS PrivateLink.-> CONTROL_LB
VPCE_SCC -.AWS PrivateLink.-> SCC_LB
style CLUSTER fill:#43A047
style VPCE_CONTROL fill:#1E88E5
style VPCE_SCC fill:#1E88E5
style CONTROL_LB fill:#FF6F00
style SCC_LB fill:#FF6F00
Benefits:
Requirements:
When to use:
Note: PrivateLink setup is complex. See AWS PrivateLink Documentation for detailed implementation guide.
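For orientation only, a Terraform sketch of the two interface endpoints is shown below. The service names are region-specific values published in the Databricks PrivateLink documentation, so the placeholders must be replaced, and the subnet/security group references are assumptions:

# Interface endpoints for the workspace (REST APIs) and the SCC relay (sketch)
resource "aws_vpc_endpoint" "databricks_workspace" {
  vpc_id              = aws_vpc.databricks.id
  service_name        = "<databricks-workspace-vpce-service-name>" # region-specific, see Databricks docs
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.az1.id, aws_subnet.az2.id]
  security_group_ids  = [aws_security_group.databricks.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "databricks_scc_relay" {
  vpc_id              = aws_vpc.databricks.id
  service_name        = "<databricks-scc-relay-vpce-service-name>" # region-specific, see Databricks docs
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.az1.id, aws_subnet.az2.id]
  security_group_ids  = [aws_security_group.databricks.id]
  private_dns_enabled = true
}

The endpoints must also be registered with your Databricks account as part of the PrivateLink configuration; see the linked documentation for that step.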
Use this worksheet to plan your network:
| Parameter | Your Value | Notes |
|---|---|---|
| Expected max nodes | _____ | Peak usage estimate |
| Growth factor | 1.5x | Recommend 50% buffer |
| Total nodes needed | _____ | Max nodes × growth factor |
| IPs required | _____ | Total nodes × 2 |
| Recommended netmask | _____ | Use table below |
| IPs Needed | Recommended Netmask | Usable IPs | Max Nodes |
|---|---|---|---|
| < 60 | /26 | 59 | 29 |
| 60-120 | /25 | 123 | 61 |
| 121-250 | /24 | 251 | 125 |
| 251-500 | /23 | 507 | 253 |
| 501-1000 | /22 | 1019 | 509 |
| 1001-2000 | /21 | 2043 | 1021 |
| 2001+ | /20 or larger | 4091+ | 2045+ |
VPC CIDR: ___________________ (e.g., 10.100.0.0/16)
Databricks Subnet 1 (AZ-A): ___________________ (e.g., 10.100.1.0/24)
Databricks Subnet 2 (AZ-B): ___________________ (e.g., 10.100.2.0/24)
NAT Gateway Subnet 1: ___________________ (e.g., 10.100.0.0/28)
NAT Gateway Subnet 2: ___________________ (e.g., 10.100.0.16/28)
Number of workspaces in VPC: _____
Shared or unique subnets: _____
Before registering VPC with Databricks, verify:
VPC Configuration:
- DNS hostnames enabled
- DNS resolution (DNS support) enabled

Subnet Configuration:
- At least 2 subnets in different Availability Zones
- Each subnet netmask between /17 and /26
- All subnets from the VPC's primary CIDR block

Security Groups:
- All TCP/UDP allowed within the same security group (ingress and egress)
- Egress TCP 443 to 0.0.0.0/0
- Egress TCP 2443 to 0.0.0.0/0
- Egress TCP 6666 to 0.0.0.0/0
- Egress TCP 53 to 0.0.0.0/0 (if custom DNS)
- Egress TCP 8443-8451 to 0.0.0.0/0 (Unity Catalog)
- Egress TCP 3306 to 0.0.0.0/0 (optional - only if using legacy Hive metastore)

NAT Gateway:
- NAT Gateway deployed in a public subnet
- Private route tables send 0.0.0.0/0 to NAT

Network ACLs:
- Default NACL, or custom NACL allowing inbound 0.0.0.0/0 (priority rule 100)

Optional Components:
- S3 Gateway VPC endpoint attached to the private route tables
- PrivateLink endpoints, if required

Databricks Registration:
- Network configuration registered via the account API or console; note the returned network_id
This section provides detailed guidance for deploying Databricks classic compute plane in a customer-managed Azure VNet.
VNet injection (Azure’s term for customer-managed networking) enables you to deploy Databricks classic compute plane resources in your own Azure Virtual Network (VNet). This gives you:
- Full control over NSGs, routing, and address space
- Integration with existing network infrastructure (ExpressRoute, VPN, Azure Firewall)
- Support for Azure Private Link and private storage connectivity
Scope: This section covers classic compute (all-purpose clusters, job clusters). For serverless compute (SQL warehouses, serverless jobs), networking is managed differently by Databricks.
graph TB
subgraph "Databricks Control Plane"
DCP[Control Plane Services<br/>Databricks Azure Account]
SCC_RELAY[Secure Cluster Connectivity]
end
subgraph "Customer Azure Subscription"
subgraph "Customer VNet"
subgraph "Public Subnet (for infrastructure)"
NSG_PUB[NSG - Public]
INFRA[Databricks Infrastructure<br/>Load Balancers, etc.]
end
subgraph "Private Subnet (for clusters)"
NSG_PRIV[NSG - Private]
DRIVER[Driver Nodes<br/>Private IPs Only]
WORKER[Worker Nodes<br/>Private IPs Only]
end
NAT[NAT Gateway<br/>or Azure Firewall]
end
STORAGE[Storage Account<br/>DBFS Root]
DATA[Data Lake<br/>ADLS Gen2]
end
NAT --> DRIVER
NAT --> WORKER
DRIVER -.TLS<br/>Outbound Only.-> SCC_RELAY
WORKER -.TLS<br/>Outbound Only.-> SCC_RELAY
SCC_RELAY -.Commands.-> DRIVER
NSG_PRIV -.All TCP/UDP.-> NSG_PRIV
DRIVER --> STORAGE
WORKER --> DATA
INFRA --> NSG_PUB
style DCP fill:#1E88E5
style DRIVER fill:#43A047
style WORKER fill:#43A047
style NAT fill:#FF6F00
VNet injection is available in all Azure regions where Databricks operates. See Databricks Azure Regions for current list.
Key Principle: Plan for growth and multiple workspaces.
Azure Databricks requires two subnets:
- A public (host) subnet for Databricks-managed infrastructure
- A private (container) subnet for cluster nodes
Subnet sizing:
| Workspace Size | Nodes Needed | Recommended Subnet Size | Usable IPs | Max Nodes |
|---|---|---|---|---|
| Small (dev/test) | 10-20 | /26 | 59 | 59 nodes |
| Medium (production) | 50-100 | /24 | 251 | 251 nodes |
| Large (enterprise) | 200-500 | /22 | 1019 | 1019 nodes |
IP Calculation:
Azure reserves 5 IPs per subnet
Databricks uses 1 IP per node (unlike AWS which uses 2)
Usable IPs = (2^(32-netmask)) - 5
Max Databricks Nodes = Usable IPs (1 IP per node)
Example /26: 64 - 5 = 59 usable IPs → 59 nodes
Example /24: 256 - 5 = 251 usable IPs → 251 nodes
Note: Azure uses 1 IP per node (simpler than AWS). However, you still need TWO subnets (public and private).
Minimum configuration:
- One public (host) subnet and one private (container) subnet in the same VNet

Subnet delegation:
- Both subnets must be delegated to Microsoft.Databricks/workspaces

Address space:
- Public subnet: minimum /26 (59 usable IPs)
- Private subnet: /24 or larger for production workloads

Multi-workspace patterns:
graph TB
subgraph "VNet: 10.0.0.0/16"
subgraph "Workspace 1"
WS1_PUB["Public Subnet<br/>10.0.1.0/26"]
WS1_PRIV["Private Subnet<br/>10.0.2.0/24"]
end
subgraph "Workspace 2"
WS2_PUB["Public Subnet<br/>10.0.3.0/26"]
WS2_PRIV["Private Subnet<br/>10.0.4.0/24"]
end
subgraph "Workspace 3"
WS3_PUB["Public Subnet<br/>10.0.5.0/26"]
WS3_PRIV["Private Subnet<br/>10.0.6.0/24"]
end
end
style WS1_PUB fill:#FDD835
style WS1_PRIV fill:#43A047
style WS2_PUB fill:#FDD835
style WS2_PRIV fill:#43A047
style WS3_PUB fill:#7CB342
style WS3_PRIV fill:#7CB342
Recommendation: Use unique subnets per workspace for isolation and easier troubleshooting.
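A minimal Terraform sketch of the delegated private (container) subnet described above; the resource group, VNet, and CIDR values are placeholders, and the public (host) subnet is defined the same way with its own prefix:

# Private (container) subnet delegated to Azure Databricks (sketch)
resource "azurerm_subnet" "databricks_private" {
  name                 = "databricks-private"
  resource_group_name  = azurerm_resource_group.this.name
  virtual_network_name = azurerm_virtual_network.this.name
  address_prefixes     = ["10.0.2.0/24"]

  delegation {
    name = "databricks"
    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }
}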
NSGs control traffic to and from Databricks cluster instances. Azure requires NSG rules on both public and private subnets.
Public Subnet NSG:
Private Subnet NSG:
Inbound Rules:
| Priority | Name | Source | Source Ports | Destination | Dest Ports | Protocol | Action |
|---|---|---|---|---|---|---|---|
| 100 | AllowVnetInBound | VirtualNetwork | * | VirtualNetwork | * | Any | Allow |
| 65000 | AllowAzureLoadBalancerInBound | AzureLoadBalancer | * | * | * | Any | Allow |
Outbound Rules:
| Priority | Name | Source | Source Ports | Destination | Dest Ports | Protocol | Action |
|---|---|---|---|---|---|---|---|
| 100 | AllowVnetOutBound | VirtualNetwork | * | VirtualNetwork | * | Any | Allow |
| 110 | AllowControlPlaneOutBound | VirtualNetwork | * | AzureDatabricks | 443 | TCP | Allow |
| 120 | AllowStorageOutBound | VirtualNetwork | * | Storage | 443 | TCP | Allow |
| 130 | AllowSqlOutBound | VirtualNetwork | * | Sql | 3306 | TCP | Allow |
| 140 | AllowEventHubOutBound | VirtualNetwork | * | EventHub | 9093 | TCP | Allow |
Note: Azure uses Service Tags (AzureDatabricks, Storage, Sql, EventHub) which automatically resolve to the correct IP ranges for your region. This is simpler than AWS security groups, which require 0.0.0.0/0.
Important: Port 3306 (Sql service tag) is for legacy Hive metastore. With Unity Catalog (recommended), this outbound rule is optional.
Why the VirtualNetwork service tag? This allows all traffic within the VNet, which includes intra-cluster communication between driver and worker nodes. It’s secure because it’s limited to your VNet only.
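As an illustration, here is a Terraform sketch of one of the outbound rules from the table, using the AzureDatabricks service tag. The NSG and resource group names are placeholders, and the remaining rules follow the same pattern:

# NSG + one service-tag-based outbound rule (sketch)
resource "azurerm_network_security_group" "databricks" {
  name                = "databricks-nsg"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
}

resource "azurerm_network_security_rule" "control_plane_outbound" {
  name                        = "AllowControlPlaneOutBound"
  priority                    = 110
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefix       = "VirtualNetwork"
  destination_address_prefix  = "AzureDatabricks" # service tag resolves to regional control plane IPs
  resource_group_name         = azurerm_resource_group.this.name
  network_security_group_name = azurerm_network_security_group.databricks.name
}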
Azure Databricks clusters need outbound internet connectivity to reach the control plane.
Option 1: Azure NAT Gateway

Benefits:
- Managed service with static outbound public IPs

Setup:
- Create a NAT Gateway and associate it with the Databricks subnets

Option 2: Azure Firewall

When to use:
- When you need fine-grained egress (FQDN) filtering

Setup:
- Add a user-defined route sending 0.0.0.0/0 from the Databricks subnets to Azure Firewall

If using Azure Firewall for egress filtering, allow these destinations:
Databricks Control Plane:
- *.databricks.azure.net (port 443), or use the AzureDatabricks service tag

Azure Services:
- Azure Storage - service tag Storage
- Azure SQL (legacy Hive metastore) - service tag Sql, port 3306
- Azure Event Hub - service tag EventHub, port 9093

Package Repositories (if downloading libraries):
- pypi.org, files.pythonhosted.org (Python/PyPI)
- repo1.maven.org (Java/Maven)
- cran.r-project.org (R/CRAN)

Note: Service tags automatically include all required IPs for Azure services in your region. Much simpler than maintaining IP allow lists!
- Public (host) subnet at least /26
- Both subnets delegated to Microsoft.Databricks/workspaces

Implementation Note: See the adb4u/ folder in this repository for production-ready Terraform templates that implement these patterns.
Azure Private Link provides private connectivity to Databricks control plane without internet traversal.
graph TB
subgraph "Customer VNet"
CLUSTER[Databricks Clusters<br/>Private IPs]
PE_UI[Private Endpoint<br/>Workspace UI]
PE_BACKEND[Private Endpoint<br/>Backend Services]
end
subgraph "Databricks Azure Subscription"
PLS_UI[Private Link Service<br/>Workspace Frontend]
PLS_BACKEND[Private Link Service<br/>Backend Services]
end
CLUSTER -.Private Connection.-> PE_UI
CLUSTER -.Private Connection.-> PE_BACKEND
PE_UI -.Azure Backbone.-> PLS_UI
PE_BACKEND -.Azure Backbone.-> PLS_BACKEND
style CLUSTER fill:#43A047
style PE_UI fill:#1E88E5
style PE_BACKEND fill:#1E88E5
style PLS_UI fill:#FF6F00
style PLS_BACKEND fill:#FF6F00
Benefits:
Requirements:
When to use:
Note: Private Link setup is complex. See Azure Databricks Private Link Documentation for detailed implementation.
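For orientation only, a Terraform sketch of a front-end private endpoint; the workspace and subnet references are assumptions, and a back-end endpoint uses the same resource with the appropriate subresource:

# Private endpoint for the workspace UI/REST API (sketch)
resource "azurerm_private_endpoint" "databricks_ui" {
  name                = "databricks-ui-pe"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  subnet_id           = azurerm_subnet.private_endpoints.id # assumed dedicated private endpoint subnet

  private_service_connection {
    name                           = "databricks-ui"
    private_connection_resource_id = azurerm_databricks_workspace.this.id
    subresource_names              = ["databricks_ui_api"]
    is_manual_connection           = false
  }
}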
Option 1: Service Endpoints (Recommended for simplicity)
Option 2: Private Endpoints (Maximum security)
Option 3: Storage Firewall
| Component | Address Space | Notes |
|---|---|---|
| VNet CIDR | <cidr> | Must accommodate all subnets |
| Workspace 1 - Public | <cidr> | Min /26, typically /26 is sufficient |
| Workspace 1 - Private | <cidr> | Size based on node count |
| Workspace 2 - Public | <cidr> | If sharing VNet |
| Workspace 2 - Private | <cidr> | If sharing VNet |
Capacity Calculation:
Usable IPs = (2^(32-netmask)) - 5 (Azure reserved)
Max Databricks Nodes = Usable IPs
Example /26: 64 - 5 = 59 nodes
Example /24: 256 - 5 = 251 nodes
Example /22: 1024 - 5 = 1019 nodes
VNet Configuration:
Subnet Configuration:
- Both subnets delegated to Microsoft.Databricks/workspaces

Network Security Groups:
- VirtualNetwork inbound
- AzureLoadBalancer inbound
- VirtualNetwork outbound
- AzureDatabricks outbound (443)
- Storage outbound (443)
- EventHub outbound (9093)
- Sql outbound (3306) - optional if using Unity Catalog

Outbound Connectivity:
Optional Components:
Databricks Workspace:
This section provides detailed guidance for deploying Databricks classic compute plane in a customer-managed GCP VPC.
Customer-managed VPC enables you to deploy Databricks classic compute plane resources in your own Google Cloud VPC. This gives you:
- Full control over subnets, firewall rules, and routing
- Integration with Shared VPC and existing network infrastructure
- Support for Private Google Access, Private Service Connect, and VPC Service Controls
Scope: This section covers classic compute (all-purpose clusters, job clusters). For serverless compute (SQL warehouses, serverless jobs), networking is managed differently by Databricks.
Reference: GCP Databricks Customer-Managed VPC Documentation
graph TB
subgraph "Databricks Control Plane"
DCP[Control Plane Services<br/>Databricks GCP Project]
SCC_RELAY[Secure Cluster Connectivity]
end
subgraph "Customer GCP Project"
subgraph "Customer VPC"
subgraph "Databricks Subnet"
DRIVER[Driver Nodes<br/>Private IPs Only]
WORKER[Worker Nodes<br/>Private IPs Only]
end
CLOUD_NAT[Cloud NAT<br/>Egress Appliance]
CLOUD_ROUTER[Cloud Router]
end
GCS_WS[GCS Bucket<br/>Workspace Storage]
GCS_DATA[GCS Buckets<br/>Data Lake]
GAR[Artifact Registry<br/>Runtime Images]
end
CLOUD_ROUTER --> CLOUD_NAT
CLOUD_NAT --> DRIVER
CLOUD_NAT --> WORKER
DRIVER -.TLS<br/>Outbound Only.-> SCC_RELAY
WORKER -.TLS<br/>Outbound Only.-> SCC_RELAY
SCC_RELAY -.Commands.-> DRIVER
DRIVER -.Private Google Access.-> GAR
WORKER -.Private Google Access.-> GCS_DATA
DRIVER --> GCS_WS
style DCP fill:#1E88E5
style DRIVER fill:#43A047
style WORKER fill:#43A047
style CLOUD_NAT fill:#FF6F00
Customer-managed VPC is available in all GCP regions where Databricks operates. See Databricks GCP Regions for current list.
Key Principle: GCP requires only 1 subnet per workspace (simpler than AWS or Azure).
Subnet sizing:
| Workspace Size | Nodes Needed | Recommended Subnet Size | Usable IPs | Max Nodes |
|---|---|---|---|---|
| Small (dev/test) | 10-20 | /28 | 12 | 12 nodes |
| Medium (production) | 50-100 | /25 | 124 | 124 nodes |
| Large (enterprise) | 200-500 | /23 | 508 | 508 nodes |
IP Calculation:
GCP reserves 4 IPs per subnet (network, gateway, broadcast, future)
Databricks uses 1 IP per node
Usable IPs = (2^(32-netmask)) - 4
Max Databricks Nodes = Usable IPs
Example /28: 16 - 4 = 12 usable IPs → 12 nodes
Example /26: 64 - 4 = 60 usable IPs → 60 nodes
Example /24: 256 - 4 = 252 usable IPs → 252 nodes
Note: GCP uses 1 IP per node (same as Azure). Only 1 subnet needed per workspace (simpler than AWS/Azure which need 2).
Minimum configuration:
- Subnet size can range from /29 (smallest) to /9 (largest)
- Recommended: /26 or /25 for most workspaces

Multi-workspace patterns:
graph TB
subgraph "VPC: 10.0.0.0/16"
subgraph "Workspace 1"
WS1["Subnet A<br/>10.0.1.0/26"]
end
subgraph "Workspace 2"
WS2["Subnet B<br/>10.0.2.0/26"]
end
subgraph "Workspace 3 (Shared)"
WS3["Subnet A<br/>10.0.1.0/26<br/>(Shared with WS1)"]
end
end
style WS1 fill:#43A047
style WS2 fill:#7CB342
style WS3 fill:#FDD835
Note: Workspaces can share a subnet (unlike Azure where each workspace needs unique subnets). Plan capacity if sharing.
Option 1: Standalone VPC (same project for VPC and workspace)
Option 2: Shared VPC (cross-project networking)
Terminology: Google calls this a “Shared VPC” or “Cross Project Network (XPN)”. Don’t confuse with whether multiple workspaces share a VPC - both standalone and Shared VPCs can host multiple workspaces.
GCP uses VPC-level firewall rules (unlike AWS Security Groups which are instance-level).
GCP default VPC includes:
For Databricks, default egress rules are sufficient.
If you customize firewall rules, ensure these are allowed:
Egress (Outbound):
| Priority | Direction | Action | Source | Destination | Protocol | Ports | Purpose |
|---|---|---|---|---|---|---|---|
| 1000 | Egress | Allow | Subnet CIDR | 0.0.0.0/0 | TCP | 443 | Databricks control plane, GCS, GAR |
| 1000 | Egress | Allow | Subnet CIDR | 0.0.0.0/0 | TCP | 3306 | Legacy Hive metastore (optional) |
| 1000 | Egress | Allow | Subnet CIDR | 0.0.0.0/0 | TCP | 53 | DNS resolution |
| 1000 | Egress | Allow | Subnet CIDR | Subnet CIDR | All | All | Intra-cluster communication |
Ingress (Inbound):
| Priority | Direction | Action | Source | Destination | Protocol | Ports | Purpose |
|---|---|---|---|---|---|---|---|
| 1000 | Ingress | Allow | Subnet CIDR | Subnet CIDR | All | All | Intra-cluster communication |
Note: Port 3306 (Hive metastore) is optional with Unity Catalog (recommended). Unity Catalog uses port 443 for metadata operations.
Why allow from subnet to itself? This enables communication between driver and worker nodes within the cluster.
You can use network tags to apply firewall rules to specific instances:
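For example, here is a Terraform sketch of an intra-cluster rule scoped by a network tag. The tag name databricks and the CIDR are placeholders, and the tag must actually be present on the target instances for the rule to apply:

# Intra-cluster firewall rule scoped with a network tag (sketch)
resource "google_compute_firewall" "databricks_internal" {
  name    = "databricks-internal"
  network = google_compute_network.databricks.name

  direction     = "INGRESS"
  source_ranges = ["10.0.1.0/26"] # Databricks subnet CIDR (placeholder)
  target_tags   = ["databricks"]  # applies only to instances carrying this tag

  allow {
    protocol = "tcp"
  }
  allow {
    protocol = "udp"
  }
  allow {
    protocol = "icmp"
  }
}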
GCP Databricks clusters need outbound internet connectivity to reach the control plane.
Benefits:
Setup:
Components:
VPC → Subnet → Cloud Router → Cloud NAT → Internet
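A minimal Terraform sketch of those components; names and the region variable are placeholders:

# Cloud Router + Cloud NAT for outbound access (sketch)
resource "google_compute_router" "databricks" {
  name    = "databricks-router"
  region  = var.region
  network = google_compute_network.databricks.id
}

resource "google_compute_router_nat" "databricks" {
  name                               = "databricks-nat"
  router                             = google_compute_router.databricks.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}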
What it is: Private Google Access lets instances with only private IPs reach Google APIs and services (GCS, Artifact Registry, BigQuery) without external IPs or NAT.
Benefits:
How it works:
Cluster Nodes (Private IP) → Private Google Access → GCS/GAR/BigQuery
(No internet traversal)
Important: Private Google Access is enabled per subnet. Ensure it’s enabled for Databricks subnets.
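Enabling it in Terraform is a single flag on the subnet; the sketch below uses placeholder names and CIDR:

# Subnet with Private Google Access enabled (sketch)
resource "google_compute_subnetwork" "databricks" {
  name                     = "databricks-subnet"
  ip_cidr_range            = "10.0.1.0/26"
  region                   = var.region
  network                  = google_compute_network.databricks.id
  private_ip_google_access = true # lets private-IP nodes reach Google APIs
}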
If using egress filtering (firewall appliance), allow these destinations:
Databricks Control Plane:
- *.gcp.databricks.com (port 443)

Google Services:
- storage.googleapis.com (GCS)
- *.pkg.dev (Artifact Registry)
- bigquery.googleapis.com (if using BigQuery)

Package Repositories (if downloading libraries):
- pypi.org, files.pythonhosted.org (Python/PyPI)
- repo1.maven.org (Java/Maven)
- cran.r-project.org (R/CRAN)

Alternative: Use Private Google Access for Google services. Only need NAT for control plane and external repos.
Implementation Note: See the gcpdb4u/ folder in this repository for production-ready Terraform templates that implement these patterns.
Private Service Connect (PSC) provides private connectivity to Databricks control plane without internet.
graph TB
subgraph "Customer VPC"
CLUSTER[Databricks Clusters<br/>Private IPs]
PSC_ENDPOINT[PSC Endpoint<br/>Private IP]
end
subgraph "Databricks GCP Project"
PSC_SERVICE[PSC Service Attachment<br/>Control Plane]
end
CLUSTER -.Private Connection.-> PSC_ENDPOINT
PSC_ENDPOINT -.Google Backbone.-> PSC_SERVICE
style CLUSTER fill:#43A047
style PSC_ENDPOINT fill:#1E88E5
style PSC_SERVICE fill:#FF6F00
Benefits:
Requirements:
When to use:
VPC Service Controls provides an additional security perimeter around Google Cloud resources.
Purpose: Prevent data exfiltration by creating security perimeters around Google Cloud services.
How it works:
Use cases:
Complexity: VPC-SC is more complex to set up and manage than basic networking. See VPC-SC documentation for detailed guidance.
Note: VPC-SC is a Google Cloud security feature that works with Databricks but requires careful planning and configuration.
| Component | Address Space | Notes |
|---|---|---|
| VPC CIDR | <cidr> | Plan for multiple workspaces |
| Workspace 1 - Subnet | <cidr> | Min /29, recommend /26 or /25 |
| Workspace 2 - Subnet | <cidr> | Can share subnet with WS1 |
| Workspace 3 - Subnet | <cidr> | Or use unique subnet |
Capacity Calculation:
Usable IPs = (2^(32-netmask)) - 4 (GCP reserved)
Max Databricks Nodes = Usable IPs
Example /28: 16 - 4 = 12 nodes
Example /26: 64 - 4 = 60 nodes
Example /24: 256 - 4 = 252 nodes
VPC Configuration:
Subnet Configuration:
Cloud NAT:
Firewall Rules:
IAM Permissions:
Databricks Registration:
| Feature | AWS | Azure | GCP |
|---|---|---|---|
| Virtual Network | VPC | VNet | VPC |
| Customer-Managed Term | Customer-Managed VPC | VNet Injection | Customer-Managed VPC |
| Subnet Requirements | 2+ (different AZs) | 2 (public + private, delegated) | 1 |
| Network Security | Security Groups | NSGs (Network Security Groups) | Firewall Rules |
| Security Rules Model | Stateful | Stateful | Stateful |
| Service Tags/Labels | No (use 0.0.0.0/0) | Yes (AzureDatabricks, Storage, etc.) | Partial |
| Private Connectivity | PrivateLink | Private Link | VPC-SC/PSC |
| NAT Solution | NAT Gateway | NAT Gateway or Azure Firewall | Cloud NAT |
| Outbound Required | Yes (0.0.0.0/0) | Yes (service tags) | Yes |
| IPs per Node | 2 (management + Spark) | 1 | 1 |
| Subnet Delegation | No | Yes (private subnet) | No |
| Aspect | AWS | Azure | GCP |
|---|---|---|---|
| Minimum Subnets | 2 | 2 (different purposes) | 1 |
| Security Rule Complexity | Medium (requires 0.0.0.0/0) | Lower (uses service tags) | Medium |
| NAT Setup | NAT Gateway + IGW | NAT Gateway or Azure Firewall | Cloud NAT |
| DNS Configuration | Must enable on VPC | Automatic | Automatic |
| Private Link Setup | Complex | Medium | VPC-SC (Complex) |
| Multi-Workspace Sharing | Yes | Yes | Yes |
AWS:
- 0.0.0.0/0 egress required in security groups (no service tags)
- 2 subnets in different AZs; 2 IPs per node

Azure:
- Service tags simplify rules (AzureDatabricks, Storage, Sql)
- Both subnets must be delegated to Microsoft.Databricks/workspaces

GCP:
- Single subnet per workspace; VPC-level firewall rules and Cloud NAT
Have questions about Databricks networking? Check out our comprehensive Common Questions & Answers Guide.
The Q&A guide is organized by topic and cloud provider for easy navigation.
✅ Two compute types: Classic (your VPC) and Serverless (Databricks-managed)
✅ This guide covers classic compute plane networking only
✅ Control plane (Databricks-managed) and classic compute plane are separate
✅ Classic compute plane initiates outbound connections to control plane - no inbound required
✅ Customer-managed networking is recommended for production classic compute
✅ Databricks offers flexible deployment options to match your requirements
✅ Networking choice is permanent - set during workspace creation
✅ Minimum: 2 subnets in different Availability Zones
✅ IP allocation: 2 IPs per Databricks node (management + Spark)
✅ Subnet sizing: Between /17 (large) and /26 (small)
✅ Security groups: Allow all TCP/UDP within same SG
✅ Outbound access: 0.0.0.0/0 required in security groups (filter at firewall)
✅ NAT Gateway: Required for internet access (or PrivateLink for Databricks-only)
✅ DNS: Both DNS hostnames and DNS resolution must be enabled
✅ VPC Endpoints: S3 Gateway Endpoint recommended (free, better performance)
✅ Minimum: 2 subnets (host + container) delegated to Databricks
✅ IP allocation: Host subnet (/26 min), Container subnet (/23 to /26)
✅ Subnet delegation: Both subnets must be delegated to Microsoft.Databricks/workspaces
✅ NSG rules: Inbound/outbound to AzureDatabricks Service Tag + internal communication
✅ Outbound access: Internet or Azure NAT Gateway required (or Private Link)
✅ Service Tags: Use AzureDatabricks Service Tag to simplify NSG rules
✅ Private Link: Front-end (UI/REST API) and back-end (compute) connections
✅ Storage access: Service Endpoints or Private Endpoints for Azure storage
✅ Minimum: 1 subnet with 2 secondary IP ranges (pods + services)
✅ IP allocation: Primary range for nodes, secondary for pods/services
✅ Subnet sizing: /23 for primary, /17 pods, /21 services (minimum)
✅ Firewall rules: Allow internal communication (all TCP/UDP within subnet)
✅ Outbound access: Cloud NAT required for internet access
✅ Private Google Access: Enable for GCS and Google APIs access
✅ Private Service Connect: Optional for private connectivity to control plane
✅ VPC-SC: Optional perimeter for data exfiltration protection
✅ Capacity formula (AWS): Usable IPs / 2 = Max Databricks nodes (Azure and GCP use 1 IP per node)
✅ Growth buffer: Add 30-50% extra IP capacity
✅ Multi-workspace: Can share VPC, but plan capacity accordingly
✅ One SG per workspace: Recommended for isolation
✅ Document CIDRs: Avoid conflicts with existing networks
✅ Encryption: All traffic encrypted with TLS 1.3
✅ Private connectivity: Use AWS PrivateLink for highest security
✅ S3 bucket policies: Restrict access to specific VPCs/IPs
✅ Egress filtering: Use firewall/proxy for fine-grained control
✅ VPC Flow Logs: Enable for traffic monitoring
✅ Required ports: 443, 53, 6666, 2443, 8443-8451 (plus 3306 only for legacy Hive metastore)
❌ Don’t: Forget to enable DNS hostnames and DNS resolution
❌ Don’t: Use subnets outside /17 to /26 range
❌ Don’t: Block 0.0.0.0/0 in security groups (filter at firewall instead)
❌ Don’t: Block 0.0.0.0/0 in NACLs inbound rules (required by Databricks)
❌ Don’t: Use NACLs for egress filtering (use firewall/proxy instead)
❌ Don’t: Reuse same subnet across multiple Availability Zones
❌ Don’t: Skip high availability for NAT Gateway in production
❌ Don’t: Assume you can migrate from Databricks-managed to customer-managed later
❌ Don’t: Under-provision IP capacity (always add growth buffer)
❌ Don’t: Mix subnets from primary and secondary CIDR blocks
| Scenario | Recommended Approach |
|---|---|
| Quick POC or demo | Databricks-managed networking |
| Production workloads | Customer-managed VPC |
| Compliance requirements | Customer-managed VPC |
| Need AWS PrivateLink | Customer-managed VPC (required) |
| Tight IP address space | Customer-managed VPC (smaller subnets) |
| On-premises integration | Customer-managed VPC |
| Multiple workspaces | Customer-managed VPC (share VPC) |
| Air-gapped environment | Customer-managed VPC + PrivateLink |
| Simple dev/test | Databricks-managed networking |
| Port | Protocol | Purpose | Required |
|---|---|---|---|
| 443 | TCP | HTTPS - Control plane, AWS services, repos | ✅ Yes |
| 8443 | TCP | Control plane API | ✅ Yes |
| 8444 | TCP | Unity Catalog logging/lineage | ✅ Yes (recommended) |
| 8445-8451 | TCP | Future extendability | ✅ Yes |
| 53 | TCP | DNS resolution | ✅ Yes (if custom DNS) |
| 6666 | TCP | Secure Cluster Connectivity (PrivateLink) | ✅ Yes (if PrivateLink) |
| 2443 | TCP | FIPS-compliant encryption | ✅ Yes (if FIPS) |
| 3306 | TCP | Legacy Hive metastore | ⚠️ Optional (not needed with Unity Catalog) |
| All | TCP/UDP | Within same security group | ✅ Yes |
Modern Approach: Unity Catalog (ports 8443-8451) is the recommended metadata management solution. Legacy Hive metastore (port 3306) is optional and can be disabled.
This section provides cloud-specific troubleshooting guidance:
Symptom: Cluster stuck in “Pending” or fails to start
Common causes:
Error: "Subnet does not have available IP addresses"
Check:
aws ec2 describe-subnets --subnet-ids <subnet-id> \
--query 'Subnets[0].AvailableIpAddressCount'
Fix: Use larger subnet or add more subnets
Error: "Security group rules do not allow required traffic"
Check: Verify egress rules include 443, 2443, 6666, and 8443-8451 to 0.0.0.0/0 (plus 3306 only if using the legacy Hive metastore)
Fix: Update security group rules per requirements
Error: "Cannot reach Databricks control plane"
Check:
aws ec2 describe-nat-gateways --nat-gateway-ids <nat-gw-id>
# Verify state = "available"
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=<subnet-id>"
# Verify 0.0.0.0/0 route points to NAT Gateway
Fix: Ensure NAT Gateway is running and route table correct
Error: "DNS resolution failed"
Check:
aws ec2 describe-vpc-attribute --vpc-id <vpc-id> --attribute enableDnsHostnames
aws ec2 describe-vpc-attribute --vpc-id <vpc-id> --attribute enableDnsSupport
Fix: Enable both DNS settings on VPC
Error: "Connection timeout" or "Network unreachable"
Check:
aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=<subnet-id>"
# Look for inbound rule allowing 0.0.0.0/0
Fix:
0.0.0.0/0 (required for return traffic)Symptom: Cluster starts but can’t reach services
Common causes:
S3 access denied
Check: IAM instance profile permissions
aws iam get-role --role-name <instance-profile-role>
aws iam list-attached-role-policies --role-name <instance-profile-role>
Fix: Attach policy with S3 read/write permissions
Bucket policy blocking access
Check: S3 bucket policy allows VPC endpoint or NAT IP
aws s3api get-bucket-policy --bucket <bucket-name>
Fix: Update bucket policy to allow aws:sourceVpce or aws:SourceIp
VPC Endpoint misconfiguration
Check: Endpoint attached to route tables
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids <endpoint-id>
Fix: Attach endpoint to correct route tables
Regional endpoint issues
Symptom: Cross-region S3 access fails
Fix: Remove regional endpoint Spark configuration or ensure all S3 buckets in same region
Symptom: Intermittent connectivity or specific ports blocked
Network ACLs are stateless and can cause connectivity issues if misconfigured. Understanding why Databricks requires permissive NACLs is key.
Why NACLs must be permissive:
Databricks cluster nodes have no public IPs. All connections are outbound-initiated (to control plane, S3, etc.). However, NACLs are stateless - they don’t track connections. When a cluster makes an outbound HTTPS request:
0.0.0.0/0This is fundamentally different from Security Groups (stateful), which automatically allow return traffic.
Check: NACL rules allow required traffic
aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=<subnet-id>"
Common NACL issues:
- Custom NACLs missing the inbound allow rule for 0.0.0.0/0 (blocks return traffic on ephemeral ports)
Fix:
Option 1 (Strongly Recommended): Use default NACL
# Associate subnet with default NACL
aws ec2 replace-network-acl-association \
--association-id <assoc-id> \
--network-acl-id <default-nacl-id>
Why default NACL is best:
Option 2: Fix custom NACL rules (only if required by policy)
- Ensure inbound rules allow 0.0.0.0/0 (for return traffic - not a security risk)
- Ensure outbound rules allow 443, 2443, 6666, and 8443-8451 to 0.0.0.0/0
- Ensure outbound ephemeral ports (1024-65535) are allowed to 0.0.0.0/0

Security Model: The proper security model is:
- NACLs: Permissive (stateless, allow return traffic)
- Security Groups: Primary control (stateful, connection-aware)
- No Public IPs: Nodes cannot be reached from internet
- Egress Firewall: Fine-grained outbound filtering at proper layer
This provides strong security without fighting against stateless NACL behavior.
Test connectivity from within VPC:
Launch EC2 instance in same subnet with same security group, then:
# Test control plane connectivity
curl -I https://accounts.cloud.databricks.com
# Test S3 regional endpoint
curl -I https://s3.<region>.amazonaws.com
# Test DNS resolution
nslookup accounts.cloud.databricks.com
nslookup s3.<region>.amazonaws.com
# Test routing
traceroute 8.8.8.8 # Should go through NAT Gateway
# Check instance metadata (validates IAM role)
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
Analyze VPC Flow Logs:
# Enable VPC Flow Logs
aws ec2 create-flow-logs \
--resource-type Subnet \
--resource-ids <subnet-id> \
--traffic-type ALL \
--log-destination-type cloud-watch-logs \
--log-group-name /aws/vpc/flowlogs
# Query logs (requires CloudWatch Logs Insights)
# Look for REJECT entries indicating blocked traffic
Symptom: Cluster stuck in “Pending” or fails to start
Common causes:
Error: "No available IP addresses in subnet"
Check:
az network vnet subnet show \
--resource-group <rg-name> \
--vnet-name <vnet-name> \
--name <subnet-name> \
--query "addressPrefix"
Fix:
- Use a larger public (host) subnet - minimum /26
- Size the private (container) subnet between /23 and /26 based on node count

Error: "Network security group denies required traffic"
Check: Verify NSG rules include AzureDatabricks Service Tag
az network nsg rule list \
--resource-group <rg-name> \
--nsg-name <nsg-name> \
--output table
Fix: Add required NSG rules:
- Outbound to the AzureDatabricks Service Tag (port 443)
- Inbound from the AzureDatabricks Service Tag (where required)

Error: "Subnet must be delegated to Microsoft.Databricks/workspaces"
Check:
az network vnet subnet show \
--resource-group <rg-name> \
--vnet-name <vnet-name> \
--name <subnet-name> \
--query "delegations"
Fix: Delegate both host and container subnets
az network vnet subnet update \
--resource-group <rg-name> \
--vnet-name <vnet-name> \
--name <subnet-name> \
--delegations Microsoft.Databricks/workspaces
Error: "Cannot reach Azure Databricks control plane"
Check:
az network nat gateway show \
--resource-group <rg-name> \
--name <nat-gateway-name>
Fix: Ensure a NAT Gateway (or Azure Firewall with a 0.0.0.0/0 user-defined route) provides outbound connectivity for the Databricks subnets
Symptom: Cluster starts but can’t reach storage or services
Common causes:
Storage account access denied
Check: Verify storage firewall and network rules
az storage account show \
--resource-group <rg-name> \
--name <storage-account-name> \
--query "networkRuleSet"
Fix: Add the Databricks VNet/subnets to the storage account network rules, or use a Service Endpoint or Private Endpoint
Private Endpoint DNS resolution failing
Symptom: Storage FQDN resolves to public IP instead of private IP
Check:
nslookup <storage-account-name>.blob.core.windows.net
# Should resolve to 10.x.x.x (private IP)
Fix:
- Link the privatelink.blob.core.windows.net private DNS zone to your VNet

Service Tag not updated
Symptom: Connection fails after Databricks infrastructure update
Fix: Service Tags are automatically updated by Azure, but NSG rules may need refresh
# List current Service Tag IP ranges for your region
az network list-service-tags --location <region> \
  --query "values[?name=='AzureDatabricks']"
Symptom: Intermittent connectivity or specific ports blocked
Check: NSG flow logs
# Enable NSG flow logs
az network watcher flow-log create \
--resource-group <rg-name> \
--nsg <nsg-name> \
--name <flow-log-name> \
--location <region> \
--storage-account <storage-account-id>
# View NSG effective rules
az network nic list-effective-nsg \
--resource-group <rg-name> \
--name <nic-name>
Common NSG issues:
- Missing outbound rule to the AzureDatabricks service tag

Test connectivity from within VNet:
Launch VM in same subnet, then:
# Test control plane connectivity
curl -I https://<workspace-url>.azuredatabricks.net
# Test Azure storage
curl -I https://<storage-account>.blob.core.windows.net
# Test DNS resolution
nslookup <workspace-url>.azuredatabricks.net
nslookup <storage-account>.blob.core.windows.net
# Check effective routes
az network nic show-effective-route-table \
--resource-group <rg-name> \
--name <nic-name> \
--output table
Analyze NSG Flow Logs:
# Query flow logs (Azure Monitor)
# Look for "Deny" action in NSG flow log events
# Filter by source/destination IP and port
Symptom: Cluster stuck in “Pending” or fails to start
Common causes:
Error: "Insufficient IP addresses in subnet secondary range"
Check:
gcloud compute networks subnets describe <subnet-name> \
--region=<region> \
--format="value(secondaryIpRanges)"
Fix:
- Primary range: at least /23
- Pods secondary range: at least /17
- Services secondary range: at least /21

Error: "Firewall rules deny required traffic"
Check: List firewall rules
gcloud compute firewall-rules list \
--filter="network:<vpc-name>" \
--format="table(name,direction,allowed,sourceRanges)"
Fix: Add required firewall rules
# Allow internal communication
gcloud compute firewall-rules create databricks-internal \
--network=<vpc-name> \
--direction=INGRESS \
--action=ALLOW \
--rules=tcp,udp,icmp \
--source-ranges=<subnet-cidr>,<pods-cidr>,<services-cidr> \
--target-tags=databricks
Error: "Cannot reach external services"
Check:
gcloud compute routers nats list \
--router=<router-name> \
--region=<region>
Fix: Create Cloud NAT
gcloud compute routers nats create databricks-nat \
--router=<router-name> \
--region=<region> \
--nat-all-subnet-ip-ranges \
--auto-allocate-nat-external-ips
Error: "Cannot reach GCS or Google APIs"
Check:
gcloud compute networks subnets describe <subnet-name> \
--region=<region> \
--format="value(privateIpGoogleAccess)"
Fix: Enable Private Google Access
gcloud compute networks subnets update <subnet-name> \
--region=<region> \
--enable-private-ip-google-access
Symptom: Cluster starts but can’t reach GCS or services
Common causes:
GCS bucket IAM permissions missing
Check: Verify service account has GCS access
gcloud storage buckets get-iam-policy gs://<bucket-name>
Fix: Grant service account Storage Object Admin role
gcloud storage buckets add-iam-policy-binding gs://<bucket-name> \
--member="serviceAccount:<sa-email>" \
--role="roles/storage.objectAdmin"
VPC-SC perimeter blocking access
Symptom: Requests to GCS or other Google services are denied
Check:
gcloud access-context-manager perimeters list \
--policy=<policy-id>
Fix:
Private Service Connect endpoint misconfigured
Symptom: Cannot reach Databricks control plane via PSC
Check:
gcloud compute forwarding-rules list \
--filter="target:serviceAttachments"
Fix: Verify PSC endpoint configuration and DNS
Symptom: Specific ports or protocols blocked
Check: Firewall logs
# Enable firewall logging
gcloud compute firewall-rules update <rule-name> \
--enable-logging
# View logs
gcloud logging read "resource.type=gce_subnetwork AND logName=projects/<project-id>/logs/compute.googleapis.com%2Ffirewall" \
--limit 50 \
--format json
Common firewall issues:
- Firewall rules scoped to a network tag (for example databricks) that is not present on the cluster instances - verify whether your deployment applies the tag automatically

Test connectivity from within VPC:
Launch Compute Engine VM in same subnet, then:
# Test control plane connectivity
curl -I https://<workspace-id>.gcp.databricks.com
# Test GCS
curl -I https://storage.googleapis.com
# Test DNS resolution
nslookup <workspace-id>.gcp.databricks.com
nslookup storage.googleapis.com
# Check effective routes
gcloud compute instances describe <instance-name> \
--zone=<zone> \
--format="value(networkInterfaces[0].networkIP)"
Analyze VPC Flow Logs:
# Enable VPC Flow Logs
gcloud compute networks subnets update <subnet-name> \
--region=<region> \
--enable-flow-logs
# Query logs (Cloud Logging)
gcloud logging read "resource.type=gce_subnetwork AND logName=projects/<project-id>/logs/compute.googleapis.com%2Fflow" \
--limit 50 \
--format json
Plan for growth
Use Infrastructure as Code
Implement high availability
Separate environments
Use private connectivity
Implement defense in depth
Enable logging and monitoring
Follow least privilege
Document everything
Implement monitoring
Test before production
Automate operations
Right-size subnets
- Use /24 or /23 for most workspaces
- Reserve /17 only for very large deployments

Optimize data transfer
NAT Gateway costs
Share resources
Monitor and optimize
Classic Compute Networking (This Guide):
Serverless Compute Networking (Separate Guide):
General:
Official Providers:
Example Modules:
- awsdb4u/ folder of this repository
- Terraform Registry (search for databricks or databricks-aws)

Found something confusing or have suggestions for improvement? We’d love to hear from you!