
Greenfield Modern Data Platform — Azure Infrastructure Deployment Guide

Companion to: Modern Data Platform High-Level Reference Architecture v8.0 Classification: Internal Confidential | Date: March 2026


1. Azure Landing Zone Architecture

1.1 Design Philosophy

The Greenfield Modern Data Platform deploys within an Azure Landing Zone aligned with Microsoft's Cloud Adoption Framework (CAF) Enterprise-Scale architecture. The landing zone provides the secure, governed foundation required for a regulated financial institution operating under AMF, OSFI, and Law 25 requirements. Every design decision traces back to the reference architecture's principles: Security by Design, Right Tool for the Right Workload, and Unified Governance with Federated Execution.

1.2 Hub-Spoke Topology

The deployment follows a hub-spoke network topology — the standard pattern for enterprise-scale Azure deployments in regulated financial services. Shared network services (firewall, DNS, on-premises connectivity) are centralized in a hub virtual network, while each workload environment operates in its own spoke VNet peered to the hub.

| Component | VNet Name | Purpose |
| --- | --- | --- |
| Connectivity Hub | vnet-hub-canadacentral | Azure Firewall (Premium), ExpressRoute gateway to on-premises data centers, Azure Bastion for secure admin access, centralized Private DNS Zones, VPN gateway for backup connectivity |
| Data Platform Production | vnet-data-prod-cc | Databricks workspaces (VNet-injected), ADLS Gen2 private endpoints, Databricks SQL Warehouse endpoints, Purview private endpoints, Key Vault private endpoints |
| Data Platform Non-Production | vnet-data-nonprod-cc | Development and staging Databricks workspaces, test storage accounts, sandbox environments for data scientists |
| SAS Viya | vnet-sas-prod-cc | SAS Viya Compute Server pods on AKS (or VMs), JDBC connectivity to Databricks SQL Warehouses routed through hub firewall |
| Management | vnet-mgmt-cc | Azure DevOps self-hosted agents, Terraform runners, monitoring infrastructure, Manta lineage engine deployment (if IaaS) |
| Fabric (Overlay) | N/A (PaaS) | Microsoft Fabric is fully managed PaaS — no VNet required. Connectivity to ADLS Gen2 is through OneLake shortcuts and Microsoft-managed private endpoints from the Fabric tenant |

All spoke VNets peer to the hub. Inter-spoke traffic routes through Azure Firewall for inspection. No direct spoke-to-spoke peering is permitted — this ensures all cross-environment traffic is logged and inspectable.

1.3 Hub Services Detail

Azure Firewall (Premium SKU). Centralized egress filtering with TLS inspection for outbound traffic, FQDN-based application rules, network rules for port-level control, and threat intelligence-based filtering. All spoke VNets route their default route (0.0.0.0/0) through the hub firewall via user-defined routes (UDRs). The Premium SKU is required for the TLS inspection and IDPS capabilities mandated by OSFI guidelines.

ExpressRoute Gateway. Dedicated, private connectivity between Greenfield's on-premises data centers (Lévis/Montréal) and Azure. The ExpressRoute circuit terminates in the hub VNet with route propagation to all spokes. This is the primary path for source system connectivity — core banking (DB2, Oracle), insurance policy administration, mainframe extracts, and CRM systems all feed data through this circuit into ADF and Auto Loader ingestion pipelines.

Private DNS Zones. Centralized in the hub subscription and linked to every spoke VNet. Zones include privatelink.dfs.core.windows.net, privatelink.blob.core.windows.net, privatelink.vaultcore.azure.net, privatelink.azuredatabricks.net, privatelink.purview.azure.com, privatelink.servicebus.windows.net, and others. This ensures all workloads resolve private endpoint FQDNs to their private IPs regardless of which spoke they run in.

Azure Bastion. Browser-based RDP/SSH access to management VMs and jump boxes without exposing public IPs. This is the only permitted path for interactive administrative access to infrastructure resources.

1.4 Region Strategy

| | Primary | Secondary |
| --- | --- | --- |
| Region | Canada Central (Toronto) | Canada East (Québec City) |
| Role | All production and non-production workloads | Disaster recovery, geo-redundant storage replication target |
| Rationale | Lower latency to majority of Azure services; broader service availability | Data residency compliance (Law 25, OSFI); geographic separation for DR |

Critical constraint: All data must remain within Canadian Azure regions. No replication, backup, or compute spillover to non-Canadian regions is permitted. This constraint is enforced via Azure Policy at the management group level (allowedLocations = canadacentral, canadaeast).
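The residency guardrail is simple enough to sanity-check in code. The sketch below mirrors the allowedLocations logic for illustration only; the actual enforcement is the Azure Policy assignment at the management group scope, and `is_compliant` is a hypothetical helper name.

```python
# Mirrors the allowedLocations guardrail described above (illustrative;
# real enforcement is Azure Policy at the management group scope).
ALLOWED_LOCATIONS = {"canadacentral", "canadaeast"}

def is_compliant(resource_location: str) -> bool:
    """True only if the resource would stay in a Canadian Azure region."""
    return resource_location.strip().lower() in ALLOWED_LOCATIONS
```

Any deployment request targeting, say, eastus fails this check and is denied before the resource is created.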


2. Resource Organization

2.1 Management Group Hierarchy

The data platform subscriptions inherit organization-wide Azure Policies from Greenfield's root management group (allowed regions, required tags, prohibited resource types, mandatory encryption). The platform-specific hierarchy:

Greenfield (Root MG)
└── Data & AI Platform (MG)
    ├── Production (MG)
    │   ├── sub-data-platform-prod
    │   ├── sub-data-sas-prod
    │   └── sub-data-fabric-prod
    ├── Non-Production (MG)
    │   └── sub-data-platform-nonprod
    └── Connectivity & Shared Services (MG)
        ├── sub-data-connectivity
        └── sub-data-management

Azure Policies applied at the "Data & AI Platform" management group level include: deny public endpoint creation on storage/Key Vault/Purview, require specific tags on all resources, enforce diagnostic settings on all supported resources, deny creation of unmanaged disks, and require TLS 1.2 minimum.

2.2 Subscription Layout

| Subscription | Purpose | Key Resources |
| --- | --- | --- |
| sub-data-connectivity | Hub networking, shared services | Azure Firewall, ExpressRoute gateway, Bastion, Private DNS Zones, VPN gateway |
| sub-data-platform-prod | Production data platform workloads | Databricks workspaces (prod), ADLS Gen2 accounts (bronze/silver/gold/staging), Key Vault, Purview account, Event Hub namespaces |
| sub-data-platform-nonprod | Development, staging, sandbox | Databricks workspaces (dev/stg/sandbox), ADLS Gen2 (dev/stg), Key Vault (non-prod) |
| sub-data-sas-prod | SAS Viya Compute Server | AKS cluster (or VM scale sets), SAS-specific staging storage, SAS license server |
| sub-data-fabric-prod | Microsoft Fabric capacity | Fabric F-SKU capacity resources, Fabric admin workspace for capacity management |
| sub-data-management | Platform operations & DevOps | Azure DevOps self-hosted agents, Terraform state storage account, Log Analytics workspace, Manta deployment (if IaaS), Azure Monitor resources, Microsoft Sentinel |

Rationale for subscription isolation: Separate subscriptions provide independent RBAC boundaries (SAS team cannot accidentally modify Databricks resources), independent Azure Cost Management scopes for precise chargeback, independent quota management, and separate blast radii for misconfigurations.

2.3 Resource Group Strategy

Within sub-data-platform-prod:

| Resource Group | Contents |
| --- | --- |
| rg-databricks-prod-cc | Databricks workspaces, managed resource groups (auto-created by Databricks), associated NSGs |
| rg-storage-prod-cc | ADLS Gen2 storage accounts (bronze, silver, gold, staging, archive), storage private endpoints |
| rg-governance-prod-cc | Microsoft Purview account, Purview managed storage, Purview private endpoints, Purview managed Event Hub |
| rg-keyvault-prod-cc | Azure Key Vault instances (platform secrets + encryption keys), Key Vault private endpoints |
| rg-monitoring-prod-cc | Log Analytics workspace (prod), diagnostic settings resources, Azure Monitor action groups, alert rules, workbooks |
| rg-networking-prod-cc | VNet, subnets, NSGs, route tables (UDRs), private endpoint NICs |
| rg-ingestion-prod-cc | Azure Data Factory instance, Event Hub namespaces (for streaming ingestion), ADF private endpoints |

3. Networking Architecture

3.1 VNet & Subnet Design — Production Data Platform

The production spoke VNet (vnet-data-prod-cc) uses a /16 address space (e.g., 10.10.0.0/16) to accommodate Databricks' substantial IP requirements (VNet injection demands large subnets) and future growth.

| Subnet | CIDR | Purpose | Delegation / Notes |
| --- | --- | --- | --- |
| snet-dbx-host-prod | 10.10.0.0/22 | Databricks cluster host VMs (VNet injection) | Delegation: Microsoft.Databricks/workspaces; NSG: nsg-dbx-host-prod |
| snet-dbx-container-prod | 10.10.4.0/22 | Databricks cluster container network (VNet injection) | Delegation: Microsoft.Databricks/workspaces; NSG: nsg-dbx-container-prod |
| snet-private-endpoints | 10.10.8.0/24 | Private endpoints for ADLS, Key Vault, Purview, Event Hub | No delegation; NSG restricts inbound to platform VNets only |
| snet-sqlwarehouse-prod | 10.10.9.0/24 | Databricks Serverless SQL Warehouse connectivity | Network Connectivity Config (NCC) for serverless compute |
| snet-adf-prod | 10.10.10.0/24 | ADF Integration Runtime (self-hosted, if needed for on-prem sources) | VMs running IR software |
| snet-services-prod | 10.10.11.0/24 | Supporting services, internal load balancers, utility VMs | General-purpose |

Sizing note: Databricks VNet injection requires /22 or larger subnets for host and container in production to support auto-scaling across multiple concurrent clusters (data engineering, SQL warehouses, ML training). Each running node consumes two IPs (one in the host subnet, one in the container subnet). A /22 provides 1,019 usable IPs per subnet (Azure reserves five addresses in every subnet), supporting roughly 1,000 concurrent nodes, which leaves ample headroom for Greenfield's production workload.
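The subnet arithmetic can be checked with the standard library. This is a sketch under the stated assumption that Azure withholds five addresses per subnet; `usable_ips` and `AZURE_RESERVED` are hypothetical names.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet: the network address, the
# broadcast address, and 3 platform addresses.
AZURE_RESERVED = 5

def usable_ips(cidr: str) -> int:
    """Assignable IPs in an Azure subnet of the given CIDR."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED

# Each Databricks node takes one IP in the host subnet and one in the
# container subnet, so the node ceiling is the smaller of the two.
max_nodes = min(usable_ips("10.10.0.0/22"), usable_ips("10.10.4.0/22"))
```

`usable_ips("10.10.0.0/22")` evaluates to 1,019, so the /22 pair caps out just above 1,000 concurrent nodes.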

3.2 Private Endpoints

Every platform service that supports private endpoints must be configured with public network access disabled. This is non-negotiable for a regulated financial institution.

| Service | Private Endpoint Sub-Resource | Private DNS Zone |
| --- | --- | --- |
| ADLS Gen2 (each account × 2) | dfs, blob | privatelink.dfs.core.windows.net, privatelink.blob.core.windows.net |
| Azure Key Vault (each instance) | vault | privatelink.vaultcore.azure.net |
| Microsoft Purview | account, portal | privatelink.purview.azure.com, privatelink.purviewstudio.azure.com |
| Databricks Workspace | databricks_ui_api | privatelink.azuredatabricks.net |
| Azure Event Hub | namespace | privatelink.servicebus.windows.net |
| Azure Data Factory | dataFactory, portal | privatelink.datafactory.azure.net, privatelink.adf.azure.com |
| Azure Container Registry (for SAS) | registry | privatelink.azurecr.io |
| Databricks Unity Catalog Metastore | n/a (via workspace PE) | Covered by Databricks workspace PE |

Total private endpoints (production): approximately 25–30 PEs across all services. Each PE creates a NIC in snet-private-endpoints with a private IP, plus a DNS A record in the corresponding Private DNS Zone.

3.3 Network Security Groups (NSGs)

NSGs are applied at the subnet level with a default deny-all posture. Explicit allow rules are created only for required traffic flows.

Databricks host & container subnets:

  • Allow all inbound/outbound between snet-dbx-host-prod and snet-dbx-container-prod (required for Spark executor ↔ driver communication)
  • Allow outbound to snet-private-endpoints on port 443 (HTTPS for ADLS and Key Vault) and port 1433 (if Azure SQL is used)
  • Allow outbound to Databricks control plane IPs (published by Databricks, region-specific) on port 443
  • Allow outbound to Azure Service Tags (AzureActiveDirectory, AzureMonitor) for authentication and telemetry
  • Deny all inbound from internet
  • Deny all other outbound (forced through Azure Firewall via UDR)

Private endpoint subnet:

  • Allow inbound from snet-dbx-host-prod, snet-dbx-container-prod, snet-adf-prod, snet-services-prod, and vnet-sas-prod-cc (via peering) on service-specific ports
  • Deny inbound from all other sources
  • No outbound rules needed (PEs are inbound-only destinations)
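The deny-by-default evaluation order described above can be sketched as follows. This is illustrative only; the `Rule` type and `evaluate` helper are hypothetical, and the real Azure NSG engine also matches on direction, protocol, and address prefixes.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    priority: int   # lower number = evaluated first
    source: str     # source subnet name, or "*" for any
    port: int       # destination port, or 0 for any
    allow: bool

def evaluate(rules: list[Rule], source: str, port: int) -> bool:
    """First matching rule wins; no match falls through to implicit deny."""
    for r in sorted(rules, key=lambda r: r.priority):
        if r.source in (source, "*") and r.port in (port, 0):
            return r.allow
    return False  # implicit deny-all posture

# Hypothetical allow rules for snet-private-endpoints:
pe_rules = [
    Rule(100, "snet-dbx-host-prod", 443, True),
    Rule(110, "snet-dbx-container-prod", 443, True),
    Rule(120, "snet-adf-prod", 443, True),
]
```

With this rule set, HTTPS from the Databricks subnets reaches the private endpoints, while any unlisted source (or port) hits the implicit deny.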

3.4 Azure Firewall Rules

The hub Azure Firewall controls all egress from spoke VNets. Key application rule collections:

| Rule Collection | FQDN Targets | Protocol | Purpose |
| --- | --- | --- | --- |
| rc-databricks-control | *.azuredatabricks.net, Databricks control plane IPs | HTTPS (443) | Databricks workspace communication with Azure-managed control plane |
| rc-azure-services | login.microsoftonline.com, management.azure.com, *.monitor.azure.com | HTTPS (443) | Entra ID authentication, ARM management, Azure Monitor telemetry |
| rc-package-repos | pypi.org, files.pythonhosted.org, repo.anaconda.com, conda.anaconda.org | HTTPS (443) | Python/Conda package installation for Databricks clusters and SAS |
| rc-databricks-artifacts | dbartifactsprodcac.blob.core.windows.net (region-specific) | HTTPS (443) | Databricks runtime artifacts, libraries, Spark distributions |
| rc-sas-licensing | SAS licensing endpoints (provided by SAS) | HTTPS (443) | SAS Viya license validation |
| rc-deny-all | * | Any | Default deny — all egress not matching an explicit rule is blocked and logged |

3.5 Connectivity to On-Premises

Source systems (core banking on DB2, Oracle, SQL Server; insurance policy administration; CRM; mainframe extracts) connect to the Azure data platform through ExpressRoute. ADF pipelines and Databricks Auto Loader access on-premises sources through the hub VNet's ExpressRoute gateway, with traffic routing controlled by BGP route propagation and UDRs. Self-hosted Integration Runtimes (in snet-adf-prod) are deployed for sources that require a local agent.


4. Identity & Access Management

4.1 Microsoft Entra ID as the Identity Foundation

All platform access is brokered through Microsoft Entra ID (formerly Azure Active Directory) with mandatory multi-factor authentication enforced via Conditional Access policies. Conditional Access requires MFA for all users, a compliant device, access from the Greenfield corporate network or approved VPN locations, and session lifetime controls (re-authentication every 12 hours for interactive sessions).

| Component | Identity Integration | Authentication Method |
| --- | --- | --- |
| Databricks Workspaces | Entra ID SSO via SCIM provisioning; Entra security groups synced to Unity Catalog for RBAC | OAuth 2.0 / SAML; MFA via Conditional Access |
| Microsoft Fabric | Native Entra ID integration via Microsoft 365 tenant | Entra ID SSO; MFA enforced |
| SAS Viya | Entra ID integration via SAML 2.0 federation or direct OIDC (depending on SAS version) | SAML 2.0 federated with Entra ID; MFA enforced |
| Microsoft Purview | Native Entra ID; Purview collection-level roles mapped to Entra security groups | Entra ID SSO |
| Azure Data Factory | Managed identity for pipeline execution; Entra groups for authoring RBAC | System-assigned Managed Identity |
| ADLS Gen2 | Entra ID RBAC at Azure resource level (Storage Blob Data roles) + Unity Catalog ACLs at the data layer | Managed Identity / Service Principal |

4.2 RBAC Model — Azure Resource Layer

Azure RBAC provides coarse-grained access at the resource level. Data-level fine-grained access (RLS, CLS, DDM) is enforced by Unity Catalog at the compute layer.

| Entra Security Group | Azure RBAC Role | Scope |
| --- | --- | --- |
| sg-data-platform-admins | Contributor | sub-data-platform-prod, sub-data-platform-nonprod |
| sg-data-engineers | Databricks Contributor, Storage Blob Data Contributor | Databricks workspaces, ADLS Gen2 accounts |
| sg-data-scientists | Databricks Contributor (sandbox workspace only), Storage Blob Data Reader | Sandbox workspace, Silver/Gold ADLS (read) |
| sg-data-analysts | Reader | Databricks analytics workspace (SQL Warehouse access via Unity Catalog grants) |
| sg-governance-admins | Purview Data Curator, Key Vault Administrator | Purview account, Key Vault |
| sg-sas-developers | Contributor | sub-data-sas-prod |
| sg-fabric-admins | Fabric Capacity Administrator | Fabric capacity resource |
| sg-finops | Cost Management Reader, Billing Reader | All data platform subscriptions |

4.3 Service Principals & Managed Identities

Automated workloads use managed identities (preferred) or service principals (where managed identities are not supported). Shared secrets and embedded credentials are explicitly prohibited per the reference architecture.

| Workload | Identity Type | Scope / Role Assignments |
| --- | --- | --- |
| Databricks workspace (system) | System-assigned Managed Identity | Storage Blob Data Contributor on ADLS Gen2 accounts; Key Vault Secrets User on kv-data-platform-prod |
| ADF pipelines | System-assigned Managed Identity | Storage Blob Data Contributor on ADLS; Databricks workspace access via linked service token |
| SAS Viya Compute Server | Service Principal (sp-sas-compute-prod) | Storage Blob Data Reader on authorized ADLS paths only (non-sensitive datasets); JDBC access to Databricks SQL Warehouses scoped via Unity Catalog grants |
| Purview scanners | System-assigned Managed Identity | Storage Blob Data Reader on all ADLS accounts; Databricks metastore reader for Unity Catalog metadata harvesting |
| Manta lineage engine | Service Principal (sp-manta-prod) | Read-only access to Databricks repos, ADF metadata APIs, SAS code repositories; Purview Data Curator for lineage publishing |
| CI/CD pipelines (Terraform) | Service Principal (sp-terraform-prod) | Contributor on data platform subscriptions; User Access Administrator for RBAC assignments |

SAS read-path security enforcement: Because SAS Compute Server accesses ADLS files using its service principal's Azure RBAC permissions (bypassing Unity Catalog's fine-grained RLS/CLS/DDM), the sp-sas-compute-prod service principal is granted Storage Blob Data Reader only on non-sensitive, pre-authorized ADLS paths. Those paths are registered as Unity Catalog external locations with restricted scope. For sensitive datasets, SAS must read through JDBC LIBNAME to Databricks SQL Warehouses, where Unity Catalog enforcement applies at query time.
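That routing decision can be sketched in a few lines. The paths and helper names below are hypothetical illustrations; the real controls are the Azure RBAC scope granted to sp-sas-compute-prod and the Unity Catalog grants on the SQL Warehouse side.

```python
# Pre-authorized, non-sensitive prefixes the SAS service principal can
# read directly via Azure RBAC (example paths, hypothetical).
AUTHORIZED_SAS_PREFIXES = (
    "abfss://gold@stadlsgoldprod.dfs.core.windows.net/gold/risk_features/",
    "abfss://staging@stadlsstagingprod.dfs.core.windows.net/staging/sas_writeback/",
)

def route_sas_read(path: str) -> str:
    """Direct ADLS read only for authorized prefixes; everything else
    must go through JDBC, where Unity Catalog applies RLS/CLS/DDM."""
    if path.startswith(AUTHORIZED_SAS_PREFIXES):
        return "direct-adls"
    return "jdbc-sql-warehouse"
```

A read against a sensitive Silver path would fall through to the JDBC route, so fine-grained enforcement is never bypassed.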


5. Compute Sizing & Configuration

5.1 Databricks Workspaces

| Workspace | Environment | Purpose | Configuration |
| --- | --- | --- | --- |
| dbw-data-eng-prod | Production | Data engineering: Bronze→Silver→Gold DLT pipelines, Auto Loader streaming ingestion, data quality workflows | Unity Catalog enabled; VNet injected; cluster policies enforce approved instance types (Standard_DS4_v2 8 vCPU/28 GB as default workers, Standard_DS5_v2 16 vCPU/56 GB for memory-intensive), auto-terminate 30 min, spot instances (60/40 spot/on-demand) for non-critical batch jobs; Photon enabled for DLT pipelines |
| dbw-analytics-prod | Production | SQL analytics, Databricks AI/BI Dashboards, Genie NL querying, ad-hoc analyst queries | Serverless SQL Warehouses (preferred); Pro SQL Warehouses for workloads requiring Classic compute; auto-suspend 10 min |
| dbw-mlops-prod | Production | ML/AI: MLflow experiments, model training, Feature Store serving, Model Serving endpoints, GenAI workloads | GPU instance pools (Standard_NC6s_v3 6 vCPU/112 GB/1× V100 for training; Standard_NC4as_T4_v3 for inference); CPU pools for feature engineering; Model Serving auto-scale (min replicas: 0 for dev, 1 for prod SLA endpoints) |
| dbw-data-eng-dev | Non-Prod | Development and testing of data engineering pipelines | Reduced cluster sizes (max 4 workers); mandatory auto-terminate 15 min; restricted to Standard_DS3_v2; Unity Catalog dev metastore (separate from prod) |
| dbw-sandbox | Non-Prod | Data science exploration, POCs, experimentation | Read-only access to Silver/Gold via Unity Catalog; no production write access; budget cap enforced ($X/month per user via tag-based cost alerts); auto-terminate 10 min |

Cluster policy governance: All workspaces enforce cluster policies that prevent creation of non-compliant clusters. Policies define: allowed instance types (approved VM families only), maximum workers per cluster (10 for dev, 50 for prod data eng, 20 for ML), mandatory auto-termination, mandatory Unity Catalog mode (no legacy clusters), spot instance minimum ratio, custom tags (CostCenter, DataDomain required), and prohibited features (no local file system access in prod, no public IP assignment).
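The same constraints can be expressed as a small validation sketch. This is illustrative only; real enforcement happens inside Databricks cluster policies, and the `POLICY` limits below simply paraphrase the governance rules in this section.

```python
# Paraphrase of the prod data-engineering cluster policy (illustrative).
POLICY = {
    "allowed_node_types": {"Standard_DS4_v2", "Standard_DS5_v2"},
    "max_workers": 50,                  # prod data engineering ceiling
    "autotermination_minutes_max": 30,  # must be set and <= 30
    "required_tags": {"CostCenter", "DataDomain"},
}

def violations(cluster: dict) -> list[str]:
    """Return the list of policy violations for a cluster spec."""
    v = []
    if cluster.get("node_type_id") not in POLICY["allowed_node_types"]:
        v.append("node type not approved")
    if cluster.get("num_workers", 0) > POLICY["max_workers"]:
        v.append("too many workers")
    auto = cluster.get("autotermination_minutes")
    if not auto or auto > POLICY["autotermination_minutes_max"]:
        v.append("auto-termination missing or too long")
    if not POLICY["required_tags"] <= set(cluster.get("custom_tags", {})):
        v.append("missing required tags")
    return v
```

A compliant spec returns an empty list; anything else is rejected at cluster-creation time rather than discovered on the bill.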

5.2 Databricks SQL Warehouses

| SQL Warehouse | Size (DBU) | Purpose | Configuration |
| --- | --- | --- | --- |
| sqlwh-bi-serving | Medium (48 DBU base) | JDBC endpoint consumed by SAS Compute Server (JDBC LIBNAME) and indirectly by Power BI (via OneLake shortcut refresh queries) | Serverless; auto-suspend 10 min; scaling policy: economy (cost-optimized, queues queries rather than scaling aggressively); spot instances enabled; query result caching enabled |
| sqlwh-analytics | Small–Medium (2–8 DBU) | Ad-hoc analyst queries via SQL editor, JDBC/ODBC for downstream applications, notebook SQL exploration | Serverless; auto-suspend 10 min; query queuing enabled; intelligent workload management |
| sqlwh-etl-support | Medium–Large (4–16 DBU) | Data quality validation queries run during pipeline execution, data contract SLA checks (freshness, completeness), Gold layer aggregation queries | Pro; scheduled scaling aligned with pipeline windows (04:00–08:00 EST peak for T+1 processing); auto-suspend after pipeline completion |

5.3 SAS Viya Compute Server

SAS Viya runs on the Compute Server engine (not CAS — this is AD-04). Processing is sequential and batch-oriented, so compute is sized for single-threaded throughput with high memory for in-process data manipulation, not for horizontal parallelism.

Deployment option: AKS (recommended)

| Component | VM Size / Pod Spec | Quantity | Notes |
| --- | --- | --- | --- |
| SAS Compute Server pods | Standard_E16s_v5 (16 vCPU, 128 GB RAM) | 2–4 pods | Primary workhorses for actuarial model development, risk model validation, moderate batch scoring; memory-optimized for SAS DATA step and PROC SQL in-memory operations |
| SAS Programming Runtime (heavy) | Standard_E32s_v5 (32 vCPU, 256 GB RAM) | 1–2 pods | Large actuarial reserving models (IFRS 17, IBNR) requiring extended memory; scheduled availability during nightly batch windows to control cost |
| SAS License Server | Standard_D4s_v5 (4 vCPU, 16 GB RAM) | 1 (HA pair) | SAS license management service; always-on |
| SAS Model Manager | Standard_D8s_v5 (8 vCPU, 32 GB RAM) | 1–2 pods | Model governance, model registration, champion/challenger tracking for regulatory model management |
| AKS System Node Pool | Standard_D4s_v5 | 3 nodes (across AZs) | AKS system services (CoreDNS, kube-proxy, metrics-server) |

AKS configuration: Private AKS cluster (no public API endpoint); Azure CNI networking for VNet integration in vnet-sas-prod-cc; Entra ID integration for RBAC; node pool auto-scaling (min 2 / max 6 for compute pods); Azure Disk CSI driver for persistent volumes (SAS work directories); Azure Files for shared SAS configuration.

5.4 Microsoft Fabric Capacity

Fabric capacity is sized strictly for the BI serving workload (AD-03). Any request to increase capacity for data engineering or warehousing triggers an architecture review per the reference architecture's anti-pattern rules.

| Capacity SKU | CU | Use Case | Governance |
| --- | --- | --- | --- |
| F64 (Production) | 64 Capacity Units | Power BI Direct Lake semantic models serving 55,000 users, report rendering, paginated reports for regulatory distribution, Copilot-assisted analytics | Auto-pause during non-business hours (22:00–06:00 EST); smoothing enabled for burst absorption; capacity alerts at 70% and 90% utilization; Fabric admin monitors for non-BI workloads |
| F32 (Non-Production) | 32 CU | Development and testing of Power BI semantic models, report prototyping, Direct Lake connectivity validation | Auto-pause aggressive (off outside business hours); BI workloads only; no Fabric notebooks or Spark |
| F16 (Fabric IQ POC — Horizon 2) | 16 CU | Ontology POC on Customer 360 domain, Data Agent evaluation with controlled user group (per Section 4.5 prerequisites) | Isolated capacity; time-limited (6-month evaluation); requires Architecture Review Board approval before provisioning |

Cost note: F64 at list price is approximately $8,000–9,000 USD/month. This is significantly less than the Databricks DBU cost that would be incurred if 55,000 Power BI users queried Databricks SQL Warehouses via DirectQuery — the entire economic justification for Fabric in this architecture (AD-03).
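A back-of-envelope version of that economic argument follows. Every number below except the cited F64 range is an assumption for illustration; actual DBU rates, concurrency, and warehouse-hours would come from Greenfield's own telemetry.

```python
# Cited in this section: F64 list price of roughly $8,000-9,000/month.
FABRIC_F64_MONTHLY_USD = 8_500       # midpoint of the cited range

# Assumed figures (illustrative only):
DBU_RATE_USD = 0.70                  # assumed $/DBU for serverless SQL
WAREHOUSE_DBU_PER_HOUR = 48          # assumed aggregate warehouse size
ACTIVE_HOURS_PER_MONTH = 400         # assumed busy warehouse-hours/month

directquery_monthly_usd = (
    DBU_RATE_USD * WAREHOUSE_DBU_PER_HOUR * ACTIVE_HOURS_PER_MONTH
)
```

Under these assumptions, DirectQuery compute alone lands around $13,440/month, already above the F64 price, before counting the extra warehouses a very large DirectQuery user base would force.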


6. Storage Architecture

6.1 ADLS Gen2 Account Layout

ADLS Gen2 is the shared storage substrate across all three platforms (AD-07). All accounts have hierarchical namespace (HNS) enabled for Delta Lake compatibility and are configured with private endpoints only (public access disabled at the account level).

| Storage Account | Access Tier | Container Structure | Access Pattern |
| --- | --- | --- | --- |
| stadlsbronzeprod | Hot | /bronze/{source_system}/{entity}/ (e.g., /bronze/core_banking/accounts/, /bronze/insurance/claims/) | Write: ADF managed identity, Auto Loader managed identity; Read: Databricks data engineering workspace MI, Purview scanner MI |
| stadlssilverprod | Hot | /silver/{domain}/{entity}/ (e.g., /silver/customer/individual/, /silver/claims/claim_header/) | Write: Databricks DLT pipelines; Read: Databricks analytics & ML workspaces, SAS via JDBC (Unity Catalog enforced) |
| stadlsgoldprod | Hot | /gold/{data_product}/{entity}/ (e.g., /gold/customer_360/member_profile/, /gold/risk_features/credit_scores/) | Write: Databricks DLT/SQL pipelines; Read: Power BI (OneLake shortcut → Direct Lake), Databricks SQL, SAS JDBC LIBNAME, ML Feature Store, REST API services |
| stadlsstagingprod | Hot | /staging/ingestion/{source}/, /staging/sas_writeback/{model_domain}/{model_name}/{run_date}/, /staging/purview_dq/ | Write: ADF (ingestion landing), SAS ADLS LIBNAME (writeback staging); Read: Purview DQ scanner (pre-Bronze assessment), Databricks promotion pipelines |
| stadlsarchiveprod | Cool → Archive | /archive/{domain}/{entity}/{year}/ | Write: Lifecycle policy tiering; Read: Rare, on-demand rehydration for regulatory retrieval |

Why separate storage accounts (not just separate containers)? Separate accounts provide independent RBAC boundaries (the SAS service principal has no RBAC on stadlsbronzeprod), independent throughput limits (each account gets its own IOPS/bandwidth quota), independent private endpoints (enabling subnet-level NSG control), independent lifecycle policies, and independent diagnostic logging.

6.2 Delta Lake Configuration

All data within Bronze, Silver, and Gold containers is stored in Delta Lake format (AD-01).

Time Travel retention: 90 days minimum for Bronze (regulatory auditability per AMF requirements); 30 days for Silver and Gold (sufficient for pipeline replay and debugging), extended to 90 days for regulatory domain Gold tables (Financial Aggregates, Risk Features).

VACUUM & OPTIMIZE schedule (via Databricks Workflows):

  • Bronze: VACUUM retaining 90 days, runs weekly (low urgency, append-only)
  • Silver: VACUUM retaining 30 days, runs nightly; OPTIMIZE with ZORDER BY on business keys, runs nightly
  • Gold: VACUUM retaining 30 days, runs nightly; OPTIMIZE with LIQUID CLUSTERING on high-cardinality filter columns (date, business_unit, product_code, member_id), runs nightly post-refresh; table statistics maintained via ANALYZE TABLE for SQL Warehouse query optimizer
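The schedule above amounts to generating layer-specific maintenance SQL, sketched here under stated assumptions: table names are placeholders, and a Databricks Workflow would run the emitted statements on the cadence described.

```python
# Delta Time Travel retention per layer, in days (from this section).
RETENTION_DAYS = {"bronze": 90, "silver": 30, "gold": 30}

def vacuum_sql(layer: str, table: str) -> str:
    """VACUUM keeps files needed for Time Travel within the window."""
    return f"VACUUM {table} RETAIN {RETENTION_DAYS[layer] * 24} HOURS"

def optimize_sql(table: str, zorder_cols: tuple[str, ...] = ()) -> str:
    """OPTIMIZE, optionally with ZORDER BY on business keys (Silver)."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt
```

For a Bronze table this emits a 2,160-hour (90-day) retention window; Silver and Gold get 720 hours.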

Schema evolution: mergeSchema enabled on Bronze tables (Auto Loader schema inference accommodates upstream changes without pipeline failures). Silver and Gold tables enforce strict schemas via DLT expectations and data contracts — schema-breaking changes require a data product lifecycle change request.

6.3 Lifecycle Policies

| Storage Account | Policy Rule | Action / Rationale |
| --- | --- | --- |
| stadlsbronzeprod | Blobs not modified in 180 days → Cool tier; not modified in 365 days → Archive tier | Cost optimization for aged raw data while maintaining regulatory access; Delta Time Travel handles recent rollback needs |
| stadlsstagingprod | Ingestion staging blobs > 30 days → delete; SAS writeback blobs > 14 days after promotion validation → delete | Prevent staging zone bloat; staging data is ephemeral by design |
| stadlsarchiveprod | Archive tier by default; rehydrate on demand (Standard priority, up to 15 hours) | Lowest cost for long-term regulatory retention; rare access pattern |
| All accounts | Soft delete enabled (14-day retention); blob versioning enabled; container soft delete (7 days) | Accidental deletion recovery baseline |
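The bronze rule reduces to a simple age threshold, sketched below. This is illustrative; the actual mechanism is an Azure Storage lifecycle management policy evaluated by the platform, not application code.

```python
def bronze_target_tier(days_since_modified: int) -> str:
    """Target tier for stadlsbronzeprod per the lifecycle rule above."""
    if days_since_modified >= 365:
        return "Archive"
    if days_since_modified >= 180:
        return "Cool"
    return "Hot"
```

A blob untouched for seven months lands in Cool; past a year it tiers down to Archive.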

6.4 Encryption

All storage accounts use encryption at rest with Microsoft-managed keys (MMK) as the baseline. For data classified as Restricted (per Purview auto-classification — e.g., NAS/SIN numbers, certain financial details), customer-managed keys (CMK) stored in Azure Key Vault (kv-data-encryption-prod) are applied at the storage account level. In practice, this means stadlssilverprod and stadlsgoldprod use CMK (these contain processed, queryable sensitive data), while stadlsbronzeprod uses MMK (raw data is further protected by restricted RBAC — only data engineering roles have access).

Encryption in transit: TLS 1.2 minimum enforced on all storage endpoints. Secure transfer required = true on all accounts.


7. Security Architecture

7.1 Azure Key Vault

| Key Vault Instance | Purpose | Access Policy |
| --- | --- | --- |
| kv-data-platform-prod | Platform operational secrets: database connection strings, SAS license keys, API keys for external data feeds, ADF linked service credentials | Databricks MI, ADF MI, SAS SP → Key Vault Secrets User role; CI/CD SP → Key Vault Administrator (for secret rotation automation) |
| kv-data-encryption-prod | Customer-managed encryption keys: ADLS CMK, Databricks workspace CMK (managed services encryption, DBFS encryption) | Storage account MI, Databricks workspace MI → Key Vault Crypto User role; Key Vault admin group → Key Vault Administrator |
| kv-data-platform-nonprod | Non-production secrets (same structure, isolated values) | Non-prod workspace MIs and SPs |

Hardening:

  • Key Vault Firewall: enabled; access restricted to private endpoint only (from snet-private-endpoints)
  • Purge protection: enabled on all instances (prevents permanent deletion, even by admins, for 90 days)
  • Soft delete: enabled with 90-day retention
  • Diagnostic settings: all Key Vault operations streamed to Log Analytics; alert rules configured for unexpected access patterns (access from unknown service principals, bulk secret reads, key deletion attempts)
  • Secret rotation: automated rotation policies for secrets with 90-day expiry; CI/CD pipeline rotates service principal credentials quarterly
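The 90-day check that drives the last bullet can be sketched as follows (illustrative; in practice the expiry date set on each Key Vault secret is what triggers the CI/CD rotation job, and `rotation_due` is a hypothetical helper):

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)  # 90-day expiry policy from above

def rotation_due(last_rotated: date, today: date) -> bool:
    """True once a secret has gone a full rotation period unchanged."""
    return today - last_rotated >= ROTATION_PERIOD
```

The pipeline would list secrets, evaluate this predicate, and rotate any credential that comes back true.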

7.2 Private Endpoints — Defense in Depth

All platform services operate exclusively through private endpoints. Public endpoints are disabled at the resource level and blocked by Azure Policy. Network traffic between services never traverses the public internet.

The full private endpoint inventory (approximately 25–30 PEs in production) is detailed in Section 3.2. Each PE creates: a NIC with private IP in snet-private-endpoints, a DNS A record in the corresponding Private DNS Zone, and NSG rules governing which subnets can reach it.

7.3 Data Encryption — Layers

| Layer | Encryption | Key Management |
| --- | --- | --- |
| At rest — ADLS Gen2 | AES-256; MMK (default) or CMK for Restricted data | CMK in kv-data-encryption-prod; auto-rotation annually |
| At rest — Databricks DBFS | AES-256 with CMK | CMK in kv-data-encryption-prod |
| At rest — Databricks managed services | CMK for notebook results, job results, ML artifacts | Workspace-level CMK configuration |
| In transit — all services | TLS 1.2 minimum | Azure-managed certificates; Databricks enforces TLS for all cluster ↔ storage and cluster ↔ control plane communication |
| In transit — ExpressRoute | MACsec (Layer 2 encryption at peering) | Optional but recommended for ExpressRoute Direct circuits |

7.4 Databricks Workspace Security Configuration

| Security Control | Configuration Detail |
|---|---|
| Workspace encryption | CMK from kv-data-encryption-prod for managed services and DBFS |
| Cluster policies | Mandatory: enforce approved instance types, auto-termination, spot ratios, max workers, Unity Catalog mode, custom tags; prohibit legacy table ACLs, local file download, init scripts from untrusted sources |
| Unity Catalog | External metastore on dedicated ADLS storage; all data access routed through UC; legacy Hive metastore disabled; default catalog = none (users must specify catalog explicitly) |
| Token management | Personal Access Tokens (PATs) disabled in production workspaces; service principals use OAuth 2.0 M2M (client credentials flow) |
| IP access lists | Workspace accessible only from Greenfield corporate network IP ranges (via ExpressRoute) and Azure Bastion subnet |
| Audit logging | All workspace audit logs streamed to Log Analytics via diagnostic settings; captures: data access events, cluster lifecycle, admin configuration changes, SQL query history, secrets access |
| Secret scopes | All Databricks secret scopes backed by Azure Key Vault (kv-data-platform-prod); Databricks-native secret storage disabled |
| Secure cluster connectivity (No Public IP) | Enabled — cluster nodes have no public IPs; all communication via VNet and private links |
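A cluster policy enforcing a few of these controls might be sketched with the Databricks Terraform provider as follows. Instance types, limits, and the policy name are illustrative assumptions:

```hcl
# Sketch: a Databricks cluster policy enforcing approved instance types,
# auto-termination, a worker cap, UC access mode, and a mandatory tag.
# All concrete values are illustrative.
resource "databricks_cluster_policy" "standard_jobs" {
  name = "standard-jobs-prod"
  definition = jsonencode({
    "autotermination_minutes" = { type = "range", minValue = 10, maxValue = 60 }
    "node_type_id"            = { type = "allowlist", values = ["Standard_D8ds_v5", "Standard_D16ds_v5"] }
    "num_workers"             = { type = "range", maxValue = 20 }
    "data_security_mode"      = { type = "fixed", value = "USER_ISOLATION" } # Unity Catalog mode
    "custom_tags.CostCenter"  = { type = "unlimited", isOptional = false }   # tag must be supplied
  })
}
```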

7.5 Data Loss Prevention & Exfiltration Controls

  • Egress restriction: All compute cluster outbound traffic routes through Azure Firewall; only whitelisted FQDNs are permitted (Section 3.4)
  • DBFS download restriction: Local file download from Databricks notebooks disabled in production via workspace admin settings
  • Purview DLP: Microsoft Purview DLP policies integrated with M365 prevent sensitive data from leaving via email, Teams, or SharePoint
  • Clipboard/export controls: Fabric Power BI tenant settings restrict data export from reports (no CSV export of Restricted-classified datasets)

8. Monitoring & Observability

8.1 Centralized Monitoring Stack

All monitoring data converges on a central Log Analytics workspace (law-data-platform-prod in sub-data-management) to provide a single-pane-of-glass view.

| Component | Technology | Data Collected |
|---|---|---|
| Infrastructure | Azure Monitor metrics + Log Analytics | VM metrics (CPU, memory, disk), AKS node/pod metrics, storage account metrics (transactions, latency, capacity), network flow logs (NSG flow logs v2), Azure Firewall logs (application + network rules) |
| Databricks | Databricks audit logs → Diagnostic Settings → Log Analytics; Databricks system tables (system.billing.usage, system.compute.clusters, system.query.history) | Cluster utilization & idle time, job run durations/statuses/failures, SQL Warehouse query performance (duration, bytes scanned, cache hit ratio), DBU consumption by workspace/cluster/job/user, Unity Catalog audit trail (who accessed what data) |
| Data pipelines | ADF monitoring → Log Analytics; Databricks Workflows run logs | Pipeline run status, duration, row counts, error details; DLT pipeline quality metrics (rows passed/failed expectations); ingestion audit log (source, timestamp, row counts, schema version) |
| Data quality | Purview DQ scores (Tier 1) + DLT expectation metrics (Tiers 2-3) → unified DQ dashboard (Power BI) | Quality scores by domain, source system, data product; quarantine volumes; SLA compliance trends; CDE completeness tracking |
| Security / SIEM | Microsoft Sentinel connected to Log Analytics | Security events aggregated from all services; anomalous data access patterns; DLP alerts; Key Vault access audit; failed authentication attempts; service principal credential anomalies |
| Cost | Azure Cost Management + Databricks Account Console | Subscription-level cost by resource group and tag; Databricks DBU consumption by workspace, cluster type, job, and user; Fabric CU consumption; storage growth trends |
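Wiring a service into the central workspace follows the same diagnostic-setting pattern everywhere; a sketch for a Databricks workspace is below (log category names vary by service and are illustrative here):

```hcl
# Sketch: streaming Databricks audit categories to law-data-platform-prod.
# Category names and resource references are illustrative.
resource "azurerm_monitor_diagnostic_setting" "dbw_audit" {
  name                       = "diag-dbw-data-eng-prod"
  target_resource_id         = azurerm_databricks_workspace.data_eng.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.platform.id # law-data-platform-prod

  enabled_log { category = "accounts" }
  enabled_log { category = "clusters" }
  enabled_log { category = "jobs" }
  enabled_log { category = "secrets" }
}
```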

8.2 Key Dashboards

| Dashboard | Audience | Content |
|---|---|---|
| Platform Health | Platform engineering team | Cluster availability, pipeline success rates, SQL Warehouse queue times, storage latency, private endpoint health, AKS node status |
| Data Quality Governance | CDO office, data stewards | Unified DQ scores (all three tiers), quarantine volumes, SLA compliance by data product, CDE completeness trends, steward action items |
| FinOps & Cost | FinOps team, CDO office | DBU spend by team/domain, Fabric CU utilization, storage growth, reserved vs. on-demand ratio, cost anomaly detection, budget burn rate |
| Security & Compliance | Security team, CISO office | Sentinel incident summary, data access audit summary, DLP alert trends, service principal activity, Key Vault operations |

8.3 Alerting Strategy

| Alert Category | Trigger | Severity | Action |
|---|---|---|---|
| Pipeline failure | Production ingestion or DLT pipeline fails | Sev 1 (critical domains: Customer 360, Financial Aggregates, Risk Features) / Sev 2 (others) | Page on-call data engineer; auto-retry with exponential backoff (3 retries); escalate to lead if all retries fail |
| Data quality SLA breach | Gold data product fails freshness SLA (e.g., T+1 by 06:00 EST not met) or completeness threshold (<99.5% on CDEs) | Sev 1 | Notify data product owner + steward; defer Gold refresh; CDO office escalation if unresolved within 4 hours |
| Security anomaly | Unexpected SP data access; bulk data download; Key Vault access from unknown identity | Sev 1 | Sentinel incident auto-created; SOC investigation; auto-block identity if high-confidence threat |
| Cluster over-provisioning | Databricks cluster idle >30 min or CPU utilization <20% for sustained period | Sev 3 | Notify FinOps team; recommend right-sizing; auto-terminate if policy allows |
| Fabric capacity saturation | CU utilization >90% sustained for 1 hour | Sev 2 | Notify Fabric admin; evaluate workload scheduling or capacity burst |
| Storage anomaly | Storage account growth rate exceeds 2× historical 7-day baseline | Sev 3 | Notify data engineering lead; investigate potential staging bloat or data duplication |
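A Sev 1 alert such as the pipeline-failure rule can be sketched as a scheduled-query alert against the central workspace. The KQL, thresholds, and action group name below are illustrative assumptions:

```hcl
# Sketch: Sev 1 scheduled-query alert on failed production ADF activity runs.
# Query, windowing, and action group are illustrative.
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "pipeline_failure" {
  name                 = "alert-prod-pipeline-failure"
  resource_group_name  = azurerm_resource_group.monitoring.name
  location             = "canadacentral"
  scopes               = [azurerm_log_analytics_workspace.platform.id]
  severity             = 1
  evaluation_frequency = "PT5M"
  window_duration      = "PT15M"

  criteria {
    query                   = "ADFActivityRun | where Status == \"Failed\""
    time_aggregation_method = "Count"
    operator                = "GreaterThan"
    threshold               = 0
  }

  action {
    action_groups = [azurerm_monitor_action_group.oncall_data_eng.id]
  }
}
```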

9. Disaster Recovery & Business Continuity

9.1 Recovery Objectives

| Tier | Workloads | RPO | RTO | Strategy |
|---|---|---|---|---|
| Tier 1 — Critical | Gold data products (Customer 360, Financial Aggregates, Risk Features), regulatory reporting pipelines, active ML model serving endpoints | ≤ 1 hour | ≤ 4 hours | GRS/RA-GRS storage replication to Canada East; Databricks workspace deployable via IaC in <2 hours; pre-provisioned standby networking in Canada East; SQL Warehouse endpoints re-creatable from IaC |
| Tier 2 — Important | Silver layer, ingestion pipelines (ADF + Auto Loader), SAS Viya actuarial models, Feature Store | ≤ 4 hours | ≤ 8 hours | GRS storage replication; IaC-based compute rebuild in secondary region; SAS model code in version control (Git); ADF pipeline definitions exported and version-controlled |
| Tier 3 — Standard | Bronze layer, development/sandbox environments, non-production workloads | ≤ 24 hours | ≤ 24 hours | GRS storage (async replication); rebuild from IaC; accept data loss up to last successful replication point |

9.2 Storage Replication

All production ADLS Gen2 accounts are configured with Geo-Redundant Storage (GRS), replicating data asynchronously to Canada East. For Tier 1 storage accounts (stadlsgoldprod), Read-Access GRS (RA-GRS) is enabled, allowing read access from the secondary region during a regional outage without waiting for Microsoft-initiated failover.
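In Terraform, the replication mode is a single attribute on the storage account; a sketch for the Tier 1 Gold account (resource references illustrative):

```hcl
# Sketch: a Tier 1 ADLS Gen2 account with read-access geo-redundancy.
resource "azurerm_storage_account" "adls_gold" {
  name                     = "stadlsgoldprod"
  resource_group_name      = azurerm_resource_group.data_prod.name
  location                 = "canadacentral"
  account_tier             = "Standard"
  account_kind             = "StorageV2"
  account_replication_type = "RAGRS" # async replication to Canada East + readable secondary endpoint
  is_hns_enabled           = true    # hierarchical namespace = ADLS Gen2
}
```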

Delta Lake's transaction log (_delta_log/) replicates along with the data files. However, GRS replicates blobs asynchronously and independently, with no ordering guarantee across blobs, so an unplanned failover can leave the transaction log referencing data files that have not yet arrived in Canada East. Committed transactions that fully replicated before the outage remain readable, but after failover Tier 1 Delta tables are validated before being declared consistent, for example by restoring to the last fully replicated version with RESTORE (time travel) or running FSCK REPAIR TABLE to remove references to missing files.

9.3 Compute Recovery

All infrastructure is defined as code (Terraform — see Section 12). In a DR scenario:

  1. Networking: Standby hub-spoke VNets in Canada East are pre-provisioned (warm standby) with peering, NSGs, and route tables ready. ExpressRoute circuit has a secondary connection to Canada East.
  2. Databricks: Workspace configuration, cluster policies, Unity Catalog metastore settings, and secret scope bindings are all in Terraform. A terraform apply targeting the Canada East module deploys a functional workspace in ~60-90 minutes.
  3. SAS Viya: AKS cluster definition is in Terraform; SAS container images are in Azure Container Registry (geo-replicated to Canada East). Deployment time ~2-3 hours including SAS configuration restore.
  4. Fabric: Fabric capacity is region-specific. A standby F-SKU in Canada East can be provisioned on-demand (manual, ~30 minutes). Power BI semantic models would need to be re-pointed to the secondary ADLS endpoints.
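The Databricks step above amounts to instantiating the existing module against the Canada East standby network. The module path and variable names in this sketch are illustrative assumptions about the repo layout, not the actual interface:

```hcl
# Sketch: DR instantiation of terraform-module-databricks pinned to Canada East.
# Module path and variable names are hypothetical.
module "databricks_dr" {
  source = "../modules/terraform-module-databricks"

  location         = "canadaeast"
  environment      = "dr"
  vnet_id          = module.networking_dr.spoke_vnet_id # pre-provisioned warm standby VNet
  key_vault_id     = module.keyvault_dr.kv_id
  log_analytics_id = module.monitoring.workspace_id
}
```

In a declared DR event, a targeted apply (e.g., `terraform apply -target=module.databricks_dr`, or a dedicated DR root module) executes this instantiation.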

9.4 Backup Strategy

| Component | Backup Mechanism | Retention |
|---|---|---|
| Delta Lake data | Delta Time Travel (point-in-time recovery within retention window) + GRS replication | 90 days Bronze, 30 days Silver/Gold |
| Azure Key Vault | Soft delete (90 days) + purge protection; secrets/keys recoverable even after deletion | 90 days |
| Unity Catalog metastore | Databricks account-level configuration backup + IaC definitions | Recoverable from IaC |
| Purview configuration | Glossary, classification schemas, and policy definitions exported periodically to version control (JSON export) | Git history (indefinite) |
| Pipeline definitions | ADF ARM templates in Git; Databricks notebooks in Git (Repos); DLT pipeline code in Git | Git history (indefinite) |
| SAS code & models | SAS programs, macros, model code in Git; SAS Model Manager metadata exported periodically | Git history (indefinite) |
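The Key Vault soft-delete and purge-protection settings in the table map directly onto vault attributes; a sketch (SKU choice and resource references are illustrative):

```hcl
# Sketch: Key Vault with soft delete (90 days) and purge protection enabled.
resource "azurerm_key_vault" "data_encryption" {
  name                          = "kv-data-encryption-prod"
  location                      = "canadacentral"
  resource_group_name           = azurerm_resource_group.security.name
  tenant_id                     = data.azurerm_client_config.current.tenant_id
  sku_name                      = "premium" # HSM-backed keys for CMK scenarios
  soft_delete_retention_days    = 90
  purge_protection_enabled      = true # deletions cannot be made permanent early
  public_network_access_enabled = false
}
```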

DR testing cadence: Tabletop exercise quarterly; partial failover test (storage failover + compute redeploy for one domain) semi-annually; full DR simulation annually.


10. Cost Management & FinOps

10.1 Cost Allocation Architecture

Cost visibility is built into the infrastructure from day one through mandatory tagging (enforced by Azure Policy), subscription-level isolation, and platform-native cost tools.

Mandatory tags (enforced via Azure Policy — deny if missing):

| Tag Key | Purpose | Example Values |
|---|---|---|
| Environment | Environment segregation | prod, nonprod, sandbox, dr |
| CostCenter | Chargeback to business unit | CC-CDO-1234, CC-INS-5678, CC-RISK-9012 |
| Platform | Technology pillar identification | databricks, fabric, sas, governance, shared |
| Owner | Operational contact | team-data-eng@Greenfield.com |
| DataDomain | Domain-level cost attribution | customer, claims, risk, finance, shared |
| DataClassification | Align cost with sensitivity tier | public, internal, confidential, restricted |
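The deny-if-missing enforcement can be sketched as a custom policy definition per tag key (assigned at management group scope by terraform-module-policy). The definition name and scope are illustrative:

```hcl
# Sketch: deny-effect policy for one mandatory tag (CostCenter).
# In practice, one definition or initiative per tag key.
resource "azurerm_policy_definition" "require_costcenter" {
  name         = "require-tag-costcenter"
  display_name = "Require CostCenter tag on all resources"
  policy_type  = "Custom"
  mode         = "Indexed" # only resource types that support tags and location

  policy_rule = jsonencode({
    if = {
      field  = "tags['CostCenter']"
      exists = "false"
    }
    then = { effect = "deny" }
  })
}
```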

Recommended tags (enforced via Azure Policy — audit mode):

| Tag Key | Purpose | Example Values |
|---|---|---|
| ManagedBy | IaC tool tracking | terraform, manual, bicep |
| Horizon | Implementation phase tracking | h1, h2, h3 |
| ExpiryDate | Temporary resource cleanup | 2026-09-30 (for POCs, sandboxes) |

10.2 Budget Controls

Subscription-level budgets: Set for each subscription with alerts at 50%, 75%, 90%, and 100% of monthly budget. Breaches above 100% trigger automatic notification to CDO office and FinOps team.
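The threshold ladder maps onto a subscription budget with one notification block per threshold; a sketch (amount, start date, and contact addresses are illustrative):

```hcl
# Sketch: monthly subscription budget with 50/75/90/100% alert thresholds.
# Amount, dates, and recipients are illustrative.
resource "azurerm_consumption_budget_subscription" "data_prod" {
  name            = "budget-data-prod-monthly"
  subscription_id = data.azurerm_subscription.data_prod.id
  amount          = 100000
  time_grain      = "Monthly"

  time_period {
    start_date = "2026-04-01T00:00:00Z"
  }

  dynamic "notification" {
    for_each = [50, 75, 90, 100]
    content {
      enabled        = true
      operator       = "GreaterThanOrEqualTo"
      threshold      = notification.value
      contact_emails = ["finops@Greenfield.com", "cdo-office@Greenfield.com"]
    }
  }
}
```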

Databricks cost controls:

  • Cluster policies enforce max cluster sizes, auto-termination, and spot instance ratios
  • Databricks Account Console provides DBU consumption dashboards by workspace, cluster type, job, and user
  • Databricks budgets feature (if available) configured per workspace with alerts
  • SQL Warehouses configured with economy scaling to minimize DBU for BI serving workload

Fabric capacity governance:

  • Capacity sized strictly for BI serving only (AD-03)
  • Auto-pause enabled (22:00-06:00 EST) — saves ~33% of capacity cost
  • Smoothing enabled for burst absorption without over-provisioning
  • Any request to increase capacity for non-BI workloads triggers architecture review
  • Monthly Fabric utilization review by FinOps + Fabric admin

Storage cost controls:

  • Lifecycle policies (Section 6.3) auto-tier aged data to lower-cost tiers
  • Storage growth alerts (Section 8.3) detect anomalous growth
  • Monthly review of staging zone sizes to prevent bloat
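The lifecycle tiering referenced above is expressed as a storage management policy; a sketch for the Bronze account (the thresholds here are illustrative — Section 6.3 holds the authoritative values):

```hcl
# Sketch: auto-tiering aged Bronze data to cool and archive.
# Day thresholds and resource references are illustrative.
resource "azurerm_storage_management_policy" "bronze_tiering" {
  storage_account_id = azurerm_storage_account.adls_bronze.id

  rule {
    name    = "tier-aged-bronze"
    enabled = true
    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["bronze/"]
    }
    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 90
        tier_to_archive_after_days_since_modification_greater_than = 365
      }
    }
  }
}
```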

10.3 Reserved Capacity & Savings Plans

| Resource | Commitment Type | Term | Estimated Savings |
|---|---|---|---|
| Databricks DBU | Databricks Committed Use Discount (DBCU) | 1-year (based on 6-month usage baseline after Horizon 1 stabilization) | 20-35% vs. pay-as-you-go |
| Azure VMs (SAS Viya AKS nodes) | Azure Savings Plan for Compute | 1-year | 15-25% vs. pay-as-you-go |
| ADLS Gen2 Storage | Azure Reserved Capacity (hot tier) | 1-year (based on projected growth model) | Up to 30% on hot tier |
| Fabric Capacity | Fabric reservation (evaluate availability) | Evaluate after 6 months of usage data | TBD at GA pricing |
| ExpressRoute | ExpressRoute circuit commitment | Already committed (enterprise) | N/A — existing circuit |

10.4 Chargeback Model

Cost is allocated to business units via CostCenter tags. The CDO's FinOps practice produces monthly chargeback reports:

  • Direct costs (Databricks DBU, Fabric CU, SAS compute) → allocated to the business unit whose workloads consumed them
  • Shared costs (storage, networking, governance, monitoring) → allocated proportionally by data volume and query consumption per domain
  • Platform overhead (hub networking, Key Vault, Purview, Manta) → allocated as a shared infrastructure charge across all consuming business units

11. DevOps & Infrastructure as Code

11.1 Terraform — Primary IaC Tool

All Azure infrastructure is managed through Terraform. Resources are organized as reusable modules stored in a central Git repository with a clear module hierarchy.

| Terraform Module | Scope | State Backend |
|---|---|---|
| terraform-module-networking | Hub-spoke VNets, subnets, NSGs, route tables, Private DNS Zones, Azure Firewall, ExpressRoute, VNet peering | Remote state in Azure Storage (sub-data-management); state locking via blob lease |
| terraform-module-databricks | Databricks workspaces, cluster policies, Unity Catalog metastore, secret scopes, workspace configuration, IP access lists | Remote state per environment (prod/nonprod); separate state file per workspace |
| terraform-module-storage | ADLS Gen2 accounts, containers, lifecycle policies, private endpoints, RBAC role assignments, CMK configuration | Remote state per environment |
| terraform-module-governance | Purview account, Purview collections, scan rules, Purview private endpoints, diagnostic settings | Remote state (governance) |
| terraform-module-keyvault | Key Vault instances, access policies, private endpoints, diagnostic settings, purge protection | Remote state (security) |
| terraform-module-sas | AKS cluster, node pools, AKS networking, SAS service principal, SAS storage | Remote state (SAS) |
| terraform-module-monitoring | Log Analytics workspace, diagnostic settings (applied to all resources), alert rules, action groups, Sentinel workspace | Remote state (monitoring) |
| terraform-module-fabric | Fabric capacity resource, Fabric admin configuration | Remote state (fabric) |
| terraform-module-policy | Azure Policy definitions and assignments at management group level (required tags, allowed regions, deny public endpoints) | Remote state (governance) |

State management: All Terraform state files are stored in a dedicated storage account in sub-data-management with: blob lease-based state locking (prevents concurrent applies), versioning enabled (state history for rollback), encryption at rest (MMK), access restricted to CI/CD service principal only.
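Each module declares the azurerm backend along these lines (storage account and container names here are illustrative; the key is unique per module and environment):

```hcl
# Sketch: remote state backend used by every module.
# Names are illustrative; one state key per module/environment.
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate-prod"
    storage_account_name = "sttfstateprod"
    container_name       = "tfstate"
    key                  = "networking/prod.tfstate"
    use_azuread_auth     = true # CI/CD service principal auth; no storage account keys
  }
}
```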

11.2 CI/CD Pipeline Design

Pipelines run in Azure DevOps (Greenfield standard). Self-hosted agents run in vnet-mgmt-cc to access private endpoints. The pipeline follows a 4-stage model:

Stage 1 — Validate (on every Pull Request):

  • terraform fmt -check (formatting compliance)
  • terraform validate (syntax and configuration validation)
  • tflint (Terraform linting for best practices)
  • tfsec / Checkov security scanning: validates that all resources comply with security policies (public endpoints disabled, encryption enabled, required tags present, no unmanaged secrets, private endpoints configured)
  • Results posted as PR comment; PR blocked if critical findings

Stage 2 — Plan (on PR approval):

  • terraform plan generates execution plan against target environment
  • Plan output posted to PR as a formatted comment for human review
  • No changes applied — review gate ensures human approval of all changes

Stage 3 — Apply Non-Production (on merge to develop branch):

  • terraform apply deploys changes to non-production environment
  • Automated smoke tests: verify VNet connectivity, private endpoint DNS resolution, Databricks workspace health, Key Vault accessibility, storage account reachability
  • If smoke tests fail, automatic rollback via terraform apply with previous state

Stage 4 — Apply Production (on merge to main branch + manual approval):

  • Manual approval gate (requires two approvals from platform engineering leads)
  • terraform apply deploys to production
  • Post-deployment validation: resource health checks, Databricks workspace connectivity, pipeline reachability, monitoring integration verification
  • Deployment tagged in Git with version number and timestamp

11.3 Data Engineering CI/CD (Separate from Infrastructure)

Data engineering artifacts follow their own CI/CD lifecycle, independent of infrastructure IaC:

| Artifact Type | Version Control | CI/CD Tool | Deployment Method |
|---|---|---|---|
| DLT pipeline definitions | Databricks Repos (Git-backed) | Azure DevOps pipeline | Databricks Asset Bundles (DABs) — databricks bundle deploy to prod workspace |
| Databricks notebooks | Databricks Repos | Azure DevOps | DABs or Repos-based deployment on merge to main |
| dbt models (if used) | Git repository | Azure DevOps | dbt run executed by Databricks Workflow job |
| ADF pipelines | ARM template export in Git | Azure DevOps | ARM template deployment to ADF instance |
| Great Expectations suites | Git repository | Azure DevOps | Deployed as part of pipeline package |
| SAS programs & macros | Git repository (external to SAS) | Azure DevOps | File deployment to SAS Compute Server shared filesystem; SAS Model Manager registration |
| Power BI semantic models | Power BI Desktop files in Git | Azure DevOps + Fabric REST API | Automated deployment via Fabric deployment pipelines (dev → test → prod) |

Branching strategy: Gitflow — feature branches → develop (integration) → release (staging validation) → main (production). Hotfixes branch from main and merge back to both main and develop.

Testing strategy for data pipelines:

  • Unit tests: DLT expectations serve as row-level quality tests; Great Expectations suites validate cross-table invariants
  • Integration tests: End-to-end pipeline run in non-prod workspace against sample datasets (masked production data); validates Bronze → Silver → Gold flow, quality gates, and data contract compliance
  • Regression tests: Compare Gold output schema and row counts against baseline after pipeline changes
  • Performance tests: Benchmark pipeline duration and DBU consumption against historical baselines; flag regressions exceeding 20% threshold

Appendix A — Resource Naming Convention

General pattern: {type}-{workload}-{environment}-{region}

| Resource Type | Prefix | Example |
|---|---|---|
| Resource Group | rg- | rg-databricks-prod-cc |
| Virtual Network | vnet- | vnet-data-prod-cc |
| Subnet | snet- | snet-dbx-host-prod |
| NSG | nsg- | nsg-dbx-host-prod |
| Storage Account (ADLS) | stadls | stadlsgoldprod (no hyphens — Azure restriction) |
| Key Vault | kv- | kv-data-platform-prod |
| Databricks Workspace | dbw- | dbw-data-eng-prod |
| SQL Warehouse | sqlwh- | sqlwh-bi-serving |
| Log Analytics | law- | law-data-platform-prod |
| Azure Data Factory | adf- | adf-data-platform-prod |
| AKS Cluster | aks- | aks-sas-viya-prod |
| Private Endpoint | pe- | pe-stadlsgoldprod-dfs |
| Managed Identity | id- | id-databricks-prod |
| Service Principal | sp- | sp-sas-compute-prod |
| Purview Account | pv- | pv-data-governance-prod |
| Fabric Capacity | fc- | fc-bi-serving-prod |
| Azure Firewall | afw- | afw-hub-canadacentral |

Region abbreviations: cc = Canada Central, ce = Canada East.
