The Problem: You’re confused about how Databricks actually gets into your AWS/Azure/GCP account to create workspaces and access storage. Terms like “cross-account role,” “first-party app,” and “service account” are flying around and you just want to understand what’s happening.
This Guide: Explains in simple terms how Databricks identities work across clouds, why each cloud is different, and how your data stays secure.
Think of Databricks as a construction company that needs to build a house (workspace) on your property (cloud account).
The Question: How do you give them access without handing over the keys to your entire property?
The Answer: Each cloud has a different way of handling this:
Let’s break down each one…
flowchart TB
subgraph "Your Decision"
Q{Which cloud are you using?}
end
Q -->|AWS| AWS[AWS: Cross-Account Role<br/>Databricks assumes temporary credentials]
Q -->|Azure| AZ[Azure: First-Party App<br/>Databricks has built-in access]
Q -->|GCP| GCP[GCP: Service Account<br/>Databricks creates identity in your account]
AWS --> AWSDetails[You create IAM role<br/>Databricks assumes it<br/>You control via trust policy]
AZ --> AZDetails[Azure trusts Databricks automatically<br/>You control via RBAC<br/>No setup needed]
GCP --> GCPDetails[Databricks creates GSA<br/>You grant permissions<br/>You control via IAM]
style AWS fill:#FF9900,color:#000
style AZ fill:#0078D4,color:#fff
style GCP fill:#4285F4,color:#fff
Imagine you hire a contractor. Instead of giving them a permanent key, you check their ID at the door and hand them a temporary badge that expires after an hour; on the next visit they have to be verified all over again. That is exactly how AssumeRole works:
sequenceDiagram
participant You as You<br/>(Your AWS Account)
participant DB as Databricks<br/>(Account 414351767826)
participant IAM as AWS IAM
participant Resources as AWS Resources<br/>(EC2, VPC, S3)
Note over You,DB: Setup Phase
You->>IAM: 1. Create Cross-Account Role
You->>IAM: 2. Set trust policy:<br/>"Only Databricks can use this"
You->>IAM: 3. Attach permissions:<br/>"Can launch EC2, access S3"
Note over You,DB: Workspace Creation
DB->>IAM: 4. AssumeRole (with external ID)
IAM->>IAM: Verify: Is this really Databricks?
IAM->>DB: ✅ Here are temporary credentials<br/>(valid for 1 hour)
DB->>Resources: 5. Create workspace using temp creds
DB->>Resources: Launch EC2 instances
DB->>Resources: Configure VPC/subnets
DB->>Resources: Access S3 buckets
Note over DB,Resources: Credentials expire automatically
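Databricks makes this AssumeRole call itself from its control plane, but the mechanics are easy to see with boto3. The sketch below is illustrative only; the role ARN, session name, and external ID are placeholders for your own values.

```python
# Sketch of the AssumeRole handshake using boto3. Databricks performs this call
# from its own control plane; the ARN and external ID below are placeholders.
import boto3

sts = boto3.client("sts")

response = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/databricks-cross-account",  # role in YOUR account
    RoleSessionName="databricks-workspace-setup",
    ExternalId="your-databricks-account-id",  # must match the trust policy condition
    DurationSeconds=3600,                     # credentials expire after one hour
)

creds = response["Credentials"]
print("Temporary key:", creds["AccessKeyId"])
print("Expires at:   ", creds["Expiration"])  # no permanent keys are ever handed out

# The temporary credentials are then used to act inside your account, for example:
ec2 = boto3.client(
    "ec2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```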
You create multiple roles for different purposes:
flowchart TD
subgraph "Databricks Control Plane<br/>Account: 414351767826"
DB[Databricks SaaS]
end
subgraph "Your AWS Account"
subgraph "Workspace Management"
CROSS[Cross-Account Role<br/>Manages workspace infrastructure]
STORAGE[Storage Config Role<br/>Accesses DBFS root bucket]
end
subgraph "Unity Catalog Management"
UCMETA[UC Metastore Role<br/>Shared across workspaces]
UCEXT[UC External Role<br/>Per-workspace storage]
end
subgraph "Cluster Access"
INSTANCE[Instance Profile<br/>Data access for clusters]
end
end
DB -.->|AssumeRole| CROSS
DB -.->|AssumeRole| STORAGE
DB -.->|AssumeRole| UCMETA
DB -.->|AssumeRole| UCEXT
CROSS -->|Launch/Configure| EC2[EC2 Instances]
STORAGE -->|Read/Write| S3DBFS[DBFS Root S3 Bucket]
UCMETA -->|Read/Write| S3META[Metastore S3 Bucket]
UCEXT -->|Read/Write| S3EXT[External S3 Buckets]
INSTANCE -->|Attached to| EC2
style DB fill:#FF3621,color:#fff
style CROSS fill:#FF9900,color:#000
style STORAGE fill:#FF9900,color:#000
style UCMETA fill:#1B72E8,color:#fff
style UCEXT fill:#1B72E8,color:#fff
style INSTANCE fill:#34A853,color:#fff
| Role | Purpose | Who Uses It | When Created |
|---|---|---|---|
| Cross-Account Role | Creates workspace, launches clusters, manages VPC | Databricks control plane | Before workspace |
| Storage Config Role | Accesses DBFS root bucket (workspace files) | Databricks control plane | Before workspace |
| UC Metastore Role | Accesses Unity Catalog shared storage | Databricks control plane | Before metastore |
| UC External Role | Accesses workspace-specific catalog storage | Databricks control plane | During UC setup |
| Instance Profile | Gives clusters access to your data | Cluster VMs | Before clusters run |
Every role has a trust policy that says “who can use this role”:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::414351767826:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "your-databricks-account-id"
}
}
}]
}
What this means in English: only the Databricks control-plane account (414351767826) can assume this role, and only when it presents your Databricks account ID as the external ID.
Why it's secure: the trust policy names exactly one trusted AWS account, the external ID condition blocks confused-deputy attacks, and the credentials handed out are temporary and expire on their own.
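If you script this setup instead of using Terraform, the same trust policy can be supplied when the role is created. Here is a minimal boto3 sketch, assuming a hypothetical role name; the permissions policy from step 3 would then be attached separately (for example with put_role_policy).

```python
# Minimal sketch of creating the cross-account role with the trust policy shown
# above, using boto3 instead of Terraform. Names and IDs are placeholders.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::414351767826:root"},  # Databricks control plane
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "your-databricks-account-id"}},
    }],
}

iam.create_role(
    RoleName="databricks-cross-account",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Lets the Databricks control plane manage workspace infrastructure",
)
```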
Unity Catalog uses a special pattern with an external_id for extra security:
sequenceDiagram
participant You as Your AWS Account
participant TF as Terraform
participant DB as Databricks
participant S3 as S3 Bucket
Note over You,S3: Setup Flow
You->>TF: 1. Create storage credential<br/>(with placeholder role)
TF->>DB: 2. Register credential
DB->>DB: 3. Generate unique external_id
DB->>TF: 4. Return external_id
TF->>You: 5. Create IAM role with trust policy<br/>(including external_id)
TF->>You: 6. Attach S3 permissions
TF->>DB: 7. Update credential with real role ARN
Note over You,S3: Access Flow
DB->>You: 8. AssumeRole (with external_id)
You->>You: Verify external_id matches
You->>DB: ✅ Temporary credentials
DB->>S3: 9. Access S3 data
Why the external_id? It proves the request is being made on behalf of your Databricks account specifically. Without it, another tenant who learned your role's ARN could point their own Databricks account at it and have Databricks assume your role for them (the "confused deputy" attack covered in the Security Model section below).
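Steps 5 and 6 are the part you (or your Terraform) perform. Below is a minimal boto3 sketch of tightening the trust policy once Databricks has returned the generated external_id. The role name, principal, and external_id value are placeholders, and the exact principal to trust for Unity Catalog roles comes from the Databricks documentation.

```python
# Sketch of step 5: after Databricks returns the generated external_id for the
# storage credential, update the IAM role's trust policy to require it.
import json
import boto3

iam = boto3.client("iam")

external_id = "databricks-generated-external-id"  # placeholder: returned by the storage credential

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::414351767826:root"},  # placeholder principal
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
    }],
}

# Replaces the role's existing trust policy with the tightened version
iam.update_assume_role_policy(
    RoleName="databricks-uc-external-location",
    PolicyDocument=json.dumps(trust_policy),
)
```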
Azure Databricks is built into Azure - it’s not a separate company accessing your account. Think of it like this:
Imagine Azure is an apartment building: Azure Databricks is the building's own maintenance staff. You never hand them a key, because building management already trusts them; you simply decide which rooms (resources) they are allowed to enter, and that is what RBAC controls.
sequenceDiagram
participant You as You<br/>(Azure Subscription)
participant SP as Service Principal<br/>(Your automation)
participant ARM as Azure Resource Manager
participant DB as Azure Databricks<br/>(First-party service)
participant Resources as Azure Resources<br/>(VNet, Storage)
Note over You,DB: You create workspace (not Databricks)
You->>SP: 1. Give SP permissions<br/>(Contributor role)
SP->>ARM: 2. Create Databricks workspace
ARM->>ARM: 3. Provision managed resource group
ARM->>DB: 4. Deploy Databricks control plane
DB->>Resources: 5. Create managed resources<br/>(automatically trusted)
Note over DB,Resources: Databricks accesses resources<br/>using Azure's built-in trust
DB->>Resources: Configure VNet injection
DB->>Resources: Access storage accounts
DB->>Resources: Create private endpoints
| Aspect | AWS | Azure |
|---|---|---|
| Who creates workspace | Databricks (via API) | You (via Azure Portal/Terraform) |
| Identity setup | You create cross-account role | Nothing - it’s built-in |
| Trust mechanism | AssumeRole with external ID | First-party app (trusted by Azure) |
| Resource management | Databricks manages in its account | Resources in YOUR subscription |
| Managed resource group | Doesn’t exist | Auto-created by Azure |
flowchart TD
subgraph "Your Subscription"
subgraph "Resources You Control"
RG[Resource Group]
WS[Databricks Workspace]
VNET[VNet/Subnets]
STORAGE[Storage Accounts]
end
subgraph "Managed Resource Group<br/>(Azure creates, Databricks uses)"
NSG[Network Security Groups]
LB[Load Balancers]
VMS[Worker VMs]
DISKS[Managed Disks]
end
subgraph "Service Principal<br/>(Your automation identity)"
SP[Service Principal<br/>Used by Terraform]
end
end
SP -->|Contributor role| RG
RG -->|Contains| WS
RG -->|Contains| VNET
RG -->|Contains| STORAGE
WS -->|Auto-creates| NSG
WS -->|Auto-creates| LB
WS -->|Launches| VMS
WS -->|Creates| DISKS
style SP fill:#0078D4,color:#fff
style WS fill:#FF3621,color:#fff
Azure uses Managed Identities instead of service principals for UC:
sequenceDiagram
participant You as Your Subscription
participant WS as Databricks Workspace
participant MI as Managed Identity<br/>(Auto-created)
participant Storage as Storage Account
Note over You,Storage: Setup
You->>WS: 1. Create Unity Catalog<br/>storage credential
WS->>MI: 2. Create managed identity<br/>(access connector)
WS->>You: 3. Return MI principal ID
You->>Storage: 4. Grant MI permissions<br/>(Storage Blob Data Contributor)
Note over You,Storage: Access
WS->>MI: 5. Request token
MI->>Storage: 6. Access blob storage<br/>(using token)
Why Managed Identity? There is no secret for you to store or rotate: Azure AD issues and refreshes the tokens automatically, so no credential ever leaves Azure.
The Access Connector is a special Azure resource that holds the managed identity:
Access Connector = Managed Identity Wrapper
It lets Databricks request Azure AD tokens for that identity and present them to storage, without any secret for you to manage.
You create it once and can reuse it across multiple storage accounts and external locations.
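To see what the token flow looks like in practice, here is a small sketch using the azure-identity and azure-storage-blob SDKs from code running with the connector's managed identity. The storage account URL is a placeholder; Databricks performs the equivalent steps internally for Unity Catalog.

```python
# Sketch of the managed-identity token flow using azure-identity and
# azure-storage-blob. The account URL is a placeholder.
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobServiceClient

# System-assigned identity; pass client_id=... for a user-assigned identity.
# Azure AD issues short-lived tokens for it, so there is no secret to rotate.
credential = ManagedIdentityCredential()

# Request a token scoped to Azure Storage (steps 5 and 6 in the sequence above)
token = credential.get_token("https://storage.azure.com/.default")
print("Token expires at:", token.expires_on)

# Use the same credential to read blobs once the identity has been granted
# Storage Blob Data Contributor on the storage account.
blob_service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential=credential,
)
for container in blob_service.list_containers():
    print(container.name)
```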
GCP has a unique model where Databricks creates a service account IN YOUR PROJECT:
Imagine hiring a new employee: Databricks creates the new hire's ID badge (a service account) inside your company (your GCP project), and you decide which doors that badge opens (the IAM roles you grant).
sequenceDiagram
participant You as Your GCP Project
participant DB as Databricks<br/>(accounts.gcp.databricks.com)
participant GSA as Service Account<br/>(Created by Databricks)
participant Resources as GCP Resources<br/>(GCS, Compute)
Note over You,DB: Workspace Creation
You->>DB: 1. Request workspace creation
DB->>You: 2. Create service account<br/>databricks-compute@project.iam
DB->>You: 3. This GSA represents Databricks
You->>GSA: 4. Grant permissions<br/>(Compute Admin, Storage Admin)
Note over DB,Resources: Databricks Uses GSA
DB->>GSA: 5. Impersonate GSA
GSA->>Resources: 6. Create Compute Engine VMs
GSA->>Resources: 7. Access GCS buckets
GSA->>Resources: 8. Configure VPC
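Service-account impersonation (step 5) is a standard GCP mechanism rather than anything Databricks-specific. Here is a small google-auth sketch of what impersonating the compute GSA looks like; the project and GSA email are placeholders, and the caller needs the Service Account Token Creator role on the target GSA.

```python
# Sketch of service-account impersonation using google-auth and google-cloud-storage.
# The target principal is a placeholder for the compute GSA in your project.
import google.auth
from google.auth import impersonated_credentials
from google.cloud import storage

# Your own credentials (for example your Terraform/automation GSA)
source_credentials, project_id = google.auth.default()

# Short-lived credentials that act as the target service account.
# Requires roles/iam.serviceAccountTokenCreator on the target GSA.
target = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal="databricks-compute@your-project.iam.gserviceaccount.com",
    target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
    lifetime=3600,  # seconds; like AWS temp creds, these expire automatically
)

# Any client library accepts the impersonated credentials
client = storage.Client(project="your-project", credentials=target)
for bucket in client.list_buckets():
    print(bucket.name)
```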
GCP has MULTIPLE service accounts working together - some in YOUR project, some in Databricks’ project:
flowchart TD
subgraph DBXProj[Databricks Control Plane Project]
WS_SA[Workspace SA<br/>Manages workspace infrastructure]
DELEGATE[Delegate SA<br/>Launches GCE clusters]
UC_SA[Unity Catalog Storage SA<br/>UC data access]
end
subgraph YourProj[Your GCP Project]
subgraph CustomerGSA[Customer-Created GSAs]
YOUR_SA[Terraform GSA<br/>You create this for Terraform]
end
subgraph DBXCreatedGSA[Databricks-Created GSAs]
COMPUTE[Compute SA<br/>Attached to cluster VMs]
end
subgraph Resources
VPC[VPC Networks]
VM[GCE Instances]
GCS[GCS Buckets]
end
end
YOUR_SA -->|Impersonated by Databricks provider| WS_SA
WS_SA -->|Validates/Creates| COMPUTE
WS_SA -->|Configures| VPC
DELEGATE -->|Launches| VM
VM -->|Has attached| COMPUTE
COMPUTE -->|Accesses| GCS
UC_SA -->|Accesses UC-managed storage| GCS
style WS_SA fill:#FF3621,color:#fff
style DELEGATE fill:#FF9900,color:#000
style UC_SA fill:#1B72E8,color:#fff
style YOUR_SA fill:#FBBC04,color:#000
style COMPUTE fill:#4285F4,color:#fff
The key to understanding GCP is knowing which GSA lives where and who creates it:
1. Workspace SA - db-{workspaceid}@prod-gcp-{region}.iam.gserviceaccount.com
2. Delegate SA - delegate-sa@prod-gcp-{region}.iam.gserviceaccount.com
3. Unity Catalog Storage SA - db-uc-storage-UUID@uc-{region}.iam.gserviceaccount.com
4. Compute SA - databricks-compute@{your-project}.iam.gserviceaccount.com (roles you grant: roles/logging.logWriter + roles/monitoring.metricWriter)
5. Terraform/Automation SA - terraform-automation@{your-project}.iam.gserviceaccount.com (roles/owner on workspace and VPC projects is simplest, or a custom role with workspace creator permissions)

Unlike AWS (where YOU create all IAM roles) or Azure (where it's built-in), GCP uses a hybrid model:
sequenceDiagram
participant You as Your Terraform<br/>(using your GSA)
participant DB as Databricks<br/>Control Plane
participant WS_SA as Workspace SA<br/>(Databricks project)
participant Your_Proj as Your GCP Project
participant Compute_SA as Compute SA<br/>(Your project)
Note over You,Your_Proj: Workspace Creation
You->>DB: 1. Create workspace<br/>(impersonate your GSA)
DB->>WS_SA: 2. Create Workspace SA<br/>(in Databricks project)
WS_SA->>Your_Proj: 3. Validate project access
WS_SA->>Compute_SA: 4. Create Compute SA<br/>(in your project)
WS_SA->>Your_Proj: 5. Grant Compute SA roles<br/>(log writer, metric writer)
WS_SA->>Your_Proj: 6. Validate VPC configuration
DB->>You: 7. Workspace ready!
Note over You,Your_Proj: Cluster Launch
You->>DB: 8. Launch cluster
WS_SA->>Your_Proj: 9. Create GCE instances
Your_Proj->>Compute_SA: 10. Attach Compute SA to VMs
| Service Account | Lives In | Created By | Purpose | Permissions You Grant |
|---|---|---|---|---|
| Workspace SA<br/>db-{wsid}@prod-gcp-{region}.iam | Databricks project | Databricks | Manage workspace infra | Custom roles (auto-created):<br/>- Databricks Project Role v2<br/>- Databricks Resource Role v2 |
| Delegate SA<br/>delegate-sa@prod-gcp-{region}.iam | Databricks project | Databricks (pre-existing) | Launch GCE clusters | Permissions granted via custom roles |
| Compute SA<br/>databricks-compute@project.iam | Your project | Databricks Workspace SA | Cluster VM identity | Roles you grant:<br/>- Logging Log Writer<br/>- Monitoring Metric Writer |
| UC Storage SA<br/>db-uc-storage-UUID@uc-{region}.iam | Databricks UC project | Databricks | Unity Catalog storage | Storage Object Admin on UC buckets |
| Your Terraform SA<br/>terraform-automation@project.iam | Your project | You | Terraform automation | Owner or custom role with workspace creation permissions |
GCP Unity Catalog uses a Databricks-managed GSA pattern:
sequenceDiagram
participant You as Your GCP Project
participant WS as Databricks Workspace
participant DBGSA as Databricks GSA<br/>(Databricks-owned account)
participant GCS as GCS Bucket
Note over You,GCS: Setup
You->>WS: 1. Create storage credential
WS->>WS: 2. Generate Databricks-managed GSA email
WS->>You: 3. Return GSA email
You->>GCS: 4. Grant GSA permissions<br/>(Storage Object Admin)
Note over You,GCS: Access
WS->>DBGSA: 5. Use Databricks GSA
DBGSA->>GCS: 6. Access bucket
Note over DBGSA,GCS: The GSA lives in Databricks' GCP<br/>project, not yours
Key Point: For Unity Catalog, the service account is created by Databricks in their GCP project, but you grant it permissions in your project.
This is different from workspace compute, where the GSA lives in YOUR project.
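Granting that permission (step 4 in the diagram) is an ordinary bucket IAM binding. Here is a minimal google-cloud-storage sketch; the bucket name and GSA email are placeholders for the values Databricks gives you.

```python
# Sketch of granting the Databricks-managed GSA access to a bucket in your
# project with the google-cloud-storage client. Names are placeholders.
from google.cloud import storage

client = storage.Client(project="your-project")
bucket = client.bucket("my-uc-external-bucket")

# Read-modify-write of the bucket's IAM policy
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectAdmin",
    "members": {"serviceAccount:db-uc-storage-UUID@uc-us-east1.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)
```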
If you use Shared VPC (common in enterprises), there’s another layer:
flowchart TD
subgraph "Host Project<br/>(Centralized networking)"
SHARED_VPC[Shared VPC]
SUBNETS[Subnets]
end
subgraph "Service Project<br/>(Your Databricks workspace)"
GSA[databricks-compute GSA]
VM[Compute VMs]
end
subgraph "Permissions Required"
HOST_PERMS[Host project IAM:<br/>Compute Network User]
SERVICE_PERMS[Service project IAM:<br/>Compute Admin]
end
GSA -->|Needs| HOST_PERMS
GSA -->|Needs| SERVICE_PERMS
GSA -->|Attached to| VM
VM -->|Uses| SHARED_VPC
VM -->|Deploys into| SUBNETS
style SHARED_VPC fill:#4285F4,color:#fff
style GSA fill:#34A853,color:#fff
Unity Catalog works the same across all clouds, but accesses storage differently:
flowchart LR
subgraph "1. Identity"
AWS_ROLE[AWS: IAM Role]
AZURE_MI[Azure: Managed Identity]
GCP_GSA[GCP: Service Account]
end
subgraph "2. Storage Credential"
SC[Storage Credential<br/>Links identity to Databricks]
end
subgraph "3. External Location"
EL[External Location<br/>Maps to actual storage path]
end
subgraph "4. Catalog/Schema/Table"
CAT[Catalog]
SCHEMA[Schema]
TABLE[Table]
end
AWS_ROLE -->|Registered in| SC
AZURE_MI -->|Registered in| SC
GCP_GSA -->|Registered in| SC
SC -->|Used by| EL
EL -->|Storage for| CAT
CAT -->|Contains| SCHEMA
SCHEMA -->|Contains| TABLE
style SC fill:#FF6B35,color:#fff
style EL fill:#004E89,color:#fff
style CAT fill:#1B9AAA,color:#fff
sequenceDiagram
participant UC as Unity Catalog
participant Role as IAM Role<br/>(In your account)
participant S3 as S3 Bucket
UC->>Role: 1. AssumeRole<br/>(with external_id)
Role->>Role: 2. Verify external_id
Role->>UC: 3. Temporary credentials
UC->>S3: 4. Access data
Created by: You (via Terraform)
Trust: AssumeRole with external_id
Permissions: S3 GetObject, PutObject, ListBucket
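As a sketch of what those permissions look like when attached as an inline policy with boto3 (role, policy, and bucket names are placeholders):

```python
# Sketch of the S3 permissions listed above, attached to the Unity Catalog role
# as an inline policy. Role and bucket names are placeholders.
import json
import boto3

iam = boto3.client("iam")

s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-uc-external-bucket",    # ListBucket applies here
            "arn:aws:s3:::my-uc-external-bucket/*",  # Get/PutObject apply here
        ],
    }],
}

iam.put_role_policy(
    RoleName="databricks-uc-external-location",
    PolicyName="uc-s3-access",
    PolicyDocument=json.dumps(s3_policy),
)
```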
sequenceDiagram
participant UC as Unity Catalog
participant MI as Managed Identity<br/>(Access Connector)
participant ADLS as ADLS Gen2
UC->>MI: 1. Request token
MI->>MI: 2. Azure AD authenticates
MI->>UC: 3. Access token
UC->>ADLS: 4. Access data<br/>(with token)
Created by: You (via Terraform/Portal)
Trust: Azure AD (automatic)
Permissions: Storage Blob Data Contributor
sequenceDiagram
participant UC as Unity Catalog
participant DBGSA as Databricks GSA<br/>(Databricks project)
participant GCS as GCS Bucket<br/>(Your project)
UC->>DBGSA: 1. Use Databricks GSA
DBGSA->>GCS: 2. Access bucket<br/>(cross-project)
Note over DBGSA,GCS: You grant permissions to<br/>Databricks' GSA in your project
Created by: Databricks (in their project)
Trust: GCP IAM (cross-project)
Permissions: Storage Object Admin
Each identity only gets permissions it needs:
flowchart TD
subgraph "AWS Example"
CROSS[Cross-Account Role]
CROSS_PERMS[Can: Launch EC2, Configure VPC<br/>Cannot: Access S3, Delete resources]
STORAGE[Storage Config Role]
STORAGE_PERMS[Can: Read/Write DBFS bucket<br/>Cannot: Launch EC2, Access other buckets]
end
subgraph "Azure Example"
SP[Service Principal]
SP_PERMS[Can: Create workspace, Deploy resources<br/>Cannot: Access data, Modify other resources]
MI[Managed Identity]
MI_PERMS[Can: Read/Write assigned storage<br/>Cannot: Create resources, Access other storage]
end
subgraph "GCP Example"
COMPUTE_GSA[databricks-compute GSA]
COMPUTE_PERMS[Can: Access assigned buckets<br/>Cannot: Create VMs, Modify IAM]
UC_GSA[UC Storage GSA]
UC_PERMS[Can: Read/Write catalog storage<br/>Cannot: Access compute, Modify network]
end
CROSS --> CROSS_PERMS
STORAGE --> STORAGE_PERMS
SP --> SP_PERMS
MI --> MI_PERMS
COMPUTE_GSA --> COMPUTE_PERMS
UC_GSA --> UC_PERMS
style CROSS_PERMS fill:#90EE90,color:#000
style STORAGE_PERMS fill:#90EE90,color:#000
style SP_PERMS fill:#90EE90,color:#000
style MI_PERMS fill:#90EE90,color:#000
style COMPUTE_PERMS fill:#90EE90,color:#000
style UC_PERMS fill:#90EE90,color:#000
AWS credentials expire automatically:
AssumeRole → Get temp creds → Use for 1 hour → Expired → Request new ones
You never give Databricks permanent access keys.
The Attack Without External ID: an attacker who learns or guesses your role's ARN configures their own Databricks account to use it. Because the AssumeRole call comes from the trusted Databricks account (414351767826), AWS would hand out credentials for your account; this is the classic "confused deputy" attack.
How External ID Prevents This:
sequenceDiagram
participant Attacker
participant DB as Databricks
participant Your_Role as Your IAM Role
Attacker->>DB: Use role arn:aws:iam::YOUR-ACCT:role/your-role
DB->>Your_Role: AssumeRole<br/>(with Attacker's external_id)
Your_Role->>Your_Role: Check: external_id = YOUR-ACCOUNT-ID?
Your_Role->>DB: ❌ Access Denied<br/>(external_id mismatch)
Note over Attacker,Your_Role: Attacker cannot access your data
Every access is logged:
| Cloud | Log Service | What Gets Logged |
|---|---|---|
| AWS | CloudTrail | Every AssumeRole call, API calls made with temp creds |
| Azure | Activity Log | Resource creation, managed identity token requests |
| GCP | Cloud Audit Logs | Service account usage, API calls, permission grants |
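On AWS, for example, you can pull the AssumeRole events straight out of CloudTrail. A small boto3 sketch follows; the region and filters are placeholders, and Azure's Activity Log and GCP's Cloud Audit Logs can be queried in similar ways.

```python
# Sketch of auditing the AWS side: list recent AssumeRole events in CloudTrail.
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "AssumeRole"}],
    MaxResults=50,
)

for event in events["Events"]:
    # Each record shows who assumed which role and when
    print(event["EventTime"], event.get("Username"), event["EventName"])
```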
Does Databricks get unrestricted access to my account?
AWS: No. The IAM role you create defines exact permissions. Databricks can only access what you explicitly allow in the role policy.
Azure: No. The managed identity has NO permissions until you grant them (e.g., Storage Blob Data Contributor).
GCP: No. The service account has NO permissions until you grant them via IAM bindings.
What happens if I delete the role or revoke the permissions?
AWS: Workspace stops working immediately. Databricks can’t AssumeRole, so it can’t launch clusters or access storage.
Azure: Workspace may continue running (resources already created), but you can’t create new resources or access storage.
GCP: Workspace stops working. Compute VMs can’t be created, storage can’t be accessed.
Does Databricks ever hold my permanent credentials?
AWS: No. Databricks uses AssumeRole to get temporary credentials from AWS directly. Your permanent credentials stay with you.
Azure: No. Databricks uses Azure’s token service. Your service principal secret is only used by your Terraform, not shared with Databricks.
GCP: No. Databricks impersonates service accounts via GCP’s APIs. No credential files are shared.
Why are there so many identities? Each one serves a different purpose with different permissions: infrastructure roles launch and configure compute, storage roles touch data, and Unity Catalog roles are scoped to specific buckets or containers.
This separation means a compromised or over-permissioned identity can only affect its own narrow job, and every identity shows up separately in the audit logs.
Could you use a single broad identity for everything? Technically yes, but don’t: you would lose least-privilege scoping, blur the audit trail, and widen the blast radius of any compromise.
| Aspect | AWS | Azure | GCP |
|---|---|---|---|
| Who creates workspace identity? | You (IAM role) | Azure (first-party trust) | Databricks (GSA in your project) |
| Credential type | Temporary (AssumeRole) | Token (Managed Identity) | Service Account |
| Trust mechanism | External ID | First-party app | Service account impersonation |
| UC storage identity | IAM role (you create) | Managed Identity (you create) | Databricks GSA (they create) |
| Setup complexity | Medium (multiple roles) | Low (mostly automatic) | Medium (permission grants) |
| Number of identities | 4-5 IAM roles | 1-2 managed identities | 2-3 service accounts |
| Credential rotation | Automatic (temp creds) | Automatic (Azure handles) | Automatic (GCP handles) |
| Least privilege | Via IAM policies | Via RBAC roles | Via IAM bindings |
| Audit logs | CloudTrail | Activity Log | Cloud Audit Logs |
Each cloud has a distinct identity flow pattern. Here’s how each works:
flowchart LR
A[1. You create<br/>Cross-Account Role] --> B[2. Add trust policy<br/>for Databricks account]
B --> C[3. Databricks calls<br/>AWS STS AssumeRole]
C --> D[4. AWS returns<br/>temporary credentials]
D --> E[5. Databricks accesses<br/>your AWS resources]
style A fill:#FF9900,color:#000
style B fill:#FF9900,color:#000
style C fill:#FF9900,color:#000
style D fill:#FF9900,color:#000
style E fill:#FF9900,color:#000
Key Point: YOU create all IAM roles upfront. Databricks never creates anything in your AWS account.
flowchart LR
A[1. Azure trusts<br/>Databricks natively] --> B[2. You deploy<br/>workspace resource]
B --> C[3. Azure creates<br/>managed identity]
C --> D[4. You grant RBAC<br/>permissions]
D --> E[5. Databricks accesses<br/>your Azure resources]
style A fill:#0078D4,color:#fff
style B fill:#0078D4,color:#fff
style C fill:#0078D4,color:#fff
style D fill:#0078D4,color:#fff
style E fill:#0078D4,color:#fff
Key Point: Azure and Databricks are “partners” - Azure automatically handles identity creation.
flowchart LR
A[1. Databricks creates<br/>Workspace GSA] --> B[2. Your Terraform GSA<br/>grants permissions]
B --> C[3. Workspace GSA creates<br/>Compute GSA in your project]
C --> D[4. You grant GSA<br/>IAM roles]
D --> E[5. Databricks accesses<br/>your GCP resources]
style A fill:#4285F4,color:#fff
style B fill:#4285F4,color:#fff
style C fill:#4285F4,color:#fff
style D fill:#4285F4,color:#fff
style E fill:#4285F4,color:#fff
Key Point: Hybrid model - Databricks creates GSAs in BOTH their project and yours.
Now that you understand how identities work, check out:
Remember the analogy: on AWS you hand the contractor a temporary key that expires, on Azure the building’s own maintenance staff is already trusted, and on GCP Databricks creates a new employee badge inside your company and you decide which doors it opens.
All three are secure, just different approaches based on each cloud’s design philosophy.