# Greenfield Modern Data Platform — Azure Infrastructure Deployment Guide
**Companion to:** Modern Data Platform High-Level Reference Architecture v8.0
**Classification:** Internal Confidential | **Date:** March 2026
---
## 1. Azure Landing Zone Architecture
### 1.1 Design Philosophy
The Greenfield Modern Data Platform deploys within an Azure Landing Zone aligned with Microsoft's Cloud Adoption Framework (CAF) Enterprise-Scale architecture. The landing zone provides the secure, governed foundation required for a regulated financial institution operating under AMF, OSFI, and Law 25 requirements. Every design decision traces back to the reference architecture's principles: Security by Design, Right Tool for the Right Workload, and Unified Governance with Federated Execution.
### 1.2 Hub-Spoke Topology
The deployment follows a hub-spoke network topology — the standard pattern for enterprise-scale Azure deployments in regulated financial services. Shared network services (firewall, DNS, on-premises connectivity) are centralized in a hub virtual network, while each workload environment operates in its own spoke VNet peered to the hub.
| Component | VNet Name | Purpose |
|---|---|---|
| **Connectivity Hub** | `vnet-hub-canadacentral` | Azure Firewall (Premium), ExpressRoute gateway to on-premises data centers, Azure Bastion for secure admin access, centralized Private DNS Zones, VPN gateway for backup connectivity |
| **Data Platform Production** | `vnet-data-prod-cc` | Databricks workspaces (VNet-injected), ADLS Gen2 private endpoints, Databricks SQL Warehouse endpoints, Purview private endpoints, Key Vault private endpoints |
| **Data Platform Non-Production** | `vnet-data-nonprod-cc` | Development and staging Databricks workspaces, test storage accounts, sandbox environments for data scientists |
| **SAS Viya** | `vnet-sas-prod-cc` | SAS Viya Compute Server pods on AKS (or VMs), JDBC connectivity to Databricks SQL Warehouses routed through hub firewall |
| **Management** | `vnet-mgmt-cc` | Azure DevOps self-hosted agents, Terraform runners, monitoring infrastructure, Manta lineage engine deployment (if IaaS) |
| **Fabric (Overlay)** | N/A (PaaS) | Microsoft Fabric is fully managed PaaS — no VNet required. Connectivity to ADLS Gen2 is through OneLake shortcuts and Microsoft-managed private endpoints from the Fabric tenant |
All spoke VNets peer to the hub. Inter-spoke traffic routes through Azure Firewall for inspection. No direct spoke-to-spoke peering is permitted — this ensures all cross-environment traffic is logged and inspectable.
### 1.3 Hub Services Detail
**Azure Firewall (Premium SKU).** Centralized egress filtering with TLS inspection for outbound traffic, FQDN-based application rules, network rules for port-level control, and threat intelligence-based filtering. All spoke VNets route their default route (0.0.0.0/0) through the hub firewall via user-defined routes (UDRs). The Premium SKU is required for TLS inspection and IDPS capabilities mandated by OSFI guidelines.
**ExpressRoute Gateway.** Dedicated, private connectivity between Greenfield's on-premises data centers (Lévis/Montréal) and Azure. The ExpressRoute circuit terminates in the hub VNet with route propagation to all spokes. This is the primary path for source system connectivity — core banking (DB2, Oracle), insurance policy administration, mainframe extracts, and CRM systems all feed data through this circuit into ADF and Auto Loader ingestion pipelines.
**Private DNS Zones.** Centralized in the hub subscription and linked to every spoke VNet. Zones include `privatelink.dfs.core.windows.net`, `privatelink.blob.core.windows.net`, `privatelink.vaultcore.azure.net`, `privatelink.azuredatabricks.net`, `privatelink.purview.azure.com`, `privatelink.servicebus.windows.net`, and others. This ensures all workloads resolve private endpoint FQDNs to their private IPs regardless of which spoke they run in.
**Azure Bastion.** Browser-based RDP/SSH access to management VMs and jump boxes without exposing public IPs. This is the only permitted path for interactive administrative access to infrastructure resources.
### 1.4 Region Strategy
| | Primary | Secondary |
|---|---|---|
| **Region** | Canada Central (Toronto) | Canada East (Québec City) |
| **Role** | All production and non-production workloads | Disaster recovery, geo-redundant storage replication target |
| **Rationale** | Lower latency to majority of Azure services; broader service availability | Data residency compliance (Law 25, OSFI); geographic separation for DR |
**Critical constraint:** All data must remain within Canadian Azure regions. No replication, backup, or compute spillover to non-Canadian regions is permitted. This constraint is enforced via Azure Policy at the management group level (`allowedLocations` = `canadacentral`, `canadaeast`).
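A minimal sketch of the rule shape behind that assignment, using Azure Policy's standard allowed-locations deny pattern. The region list is inlined here for illustration; the real assignment in Greenfield's tenant parameterizes it.

```python
# Azure Policy rule shape for an allowed-locations deny (standard pattern);
# the inlined region list stands in for the parameterized assignment.
ALLOWED_LOCATIONS_POLICY = {
    "if": {
        "not": {
            "field": "location",
            "in": ["canadacentral", "canadaeast"],
        }
    },
    "then": {"effect": "deny"},
}

def deployment_denied(location: str) -> bool:
    """Evaluate the rule the way Azure Policy would for a new resource."""
    allowed = ALLOWED_LOCATIONS_POLICY["if"]["not"]["in"]
    return location.lower() not in allowed

print(deployment_denied("canadaeast"))  # False (allowed)
print(deployment_denied("eastus"))      # True (blocked)
```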
---
## 2. Resource Organization
### 2.1 Management Group Hierarchy
The data platform subscriptions inherit organization-wide Azure Policies from Greenfield's root management group (allowed regions, required tags, prohibited resource types, mandatory encryption). The platform-specific hierarchy:
```
Greenfield (Root MG)
└── Data & AI Platform (MG)
    ├── Production (MG)
    │   ├── sub-data-platform-prod
    │   ├── sub-data-sas-prod
    │   └── sub-data-fabric-prod
    ├── Non-Production (MG)
    │   └── sub-data-platform-nonprod
    └── Connectivity & Shared Services (MG)
        ├── sub-data-connectivity
        └── sub-data-management
```
Azure Policies applied at the "Data & AI Platform" management group level include: deny public endpoint creation on storage/Key Vault/Purview, require specific tags on all resources, enforce diagnostic settings on all supported resources, deny creation of unmanaged disks, and require TLS 1.2 minimum.
### 2.2 Subscription Layout
| Subscription | Purpose | Key Resources |
|---|---|---|
| `sub-data-connectivity` | Hub networking, shared services | Azure Firewall, ExpressRoute gateway, Bastion, Private DNS Zones, VPN gateway |
| `sub-data-platform-prod` | Production data platform workloads | Databricks workspaces (prod), ADLS Gen2 accounts (bronze/silver/gold/staging), Key Vault, Purview account, Event Hub namespaces |
| `sub-data-platform-nonprod` | Development, staging, sandbox | Databricks workspaces (dev/stg/sandbox), ADLS Gen2 (dev/stg), Key Vault (non-prod) |
| `sub-data-sas-prod` | SAS Viya Compute Server | AKS cluster (or VM scale sets), SAS-specific staging storage, SAS license server |
| `sub-data-fabric-prod` | Microsoft Fabric capacity | Fabric F-SKU capacity resources, Fabric admin workspace for capacity management |
| `sub-data-management` | Platform operations & DevOps | Azure DevOps self-hosted agents, Terraform state storage account, Log Analytics workspace, Manta deployment (if IaaS), Azure Monitor resources, Microsoft Sentinel |
**Rationale for subscription isolation:** Separate subscriptions provide independent RBAC boundaries (SAS team cannot accidentally modify Databricks resources), independent Azure Cost Management scopes for precise chargeback, independent quota management, and separate blast radii for misconfigurations.
### 2.3 Resource Group Strategy
Within `sub-data-platform-prod`:
| Resource Group | Contents |
|---|---|
| `rg-databricks-prod-cc` | Databricks workspaces, managed resource groups (auto-created by Databricks), associated NSGs |
| `rg-storage-prod-cc` | ADLS Gen2 storage accounts (bronze, silver, gold, staging, archive), storage private endpoints |
| `rg-governance-prod-cc` | Microsoft Purview account, Purview managed storage, Purview private endpoints, Purview managed Event Hub |
| `rg-keyvault-prod-cc` | Azure Key Vault instances (platform secrets + encryption keys), Key Vault private endpoints |
| `rg-monitoring-prod-cc` | Log Analytics workspace (prod), diagnostic settings resources, Azure Monitor action groups, alert rules, workbooks |
| `rg-networking-prod-cc` | VNet, subnets, NSGs, route tables (UDRs), private endpoint NICs |
| `rg-ingestion-prod-cc` | Azure Data Factory instance, Event Hub namespaces (for streaming ingestion), ADF private endpoints |
---
## 3. Networking Architecture
### 3.1 VNet & Subnet Design — Production Data Platform
The production spoke VNet (`vnet-data-prod-cc`) uses a /16 address space (e.g., 10.10.0.0/16) to accommodate Databricks' substantial IP requirements (VNet injection demands large subnets) and future growth.
| Subnet | CIDR | Purpose | Delegation / Notes |
|---|---|---|---|
| `snet-dbx-host-prod` | 10.10.0.0/22 | Databricks cluster host VMs (VNet injection) | Delegation: `Microsoft.Databricks/workspaces`; NSG: `nsg-dbx-host-prod` |
| `snet-dbx-container-prod` | 10.10.4.0/22 | Databricks cluster container network (VNet injection) | Delegation: `Microsoft.Databricks/workspaces`; NSG: `nsg-dbx-container-prod` |
| `snet-private-endpoints` | 10.10.8.0/24 | Private endpoints for ADLS, Key Vault, Purview, Event Hub | No delegation; NSG restricts inbound to platform VNets only |
| `snet-sqlwarehouse-prod` | 10.10.9.0/24 | Databricks Serverless SQL Warehouse connectivity | Network Connectivity Config (NCC) for serverless compute |
| `snet-adf-prod` | 10.10.10.0/24 | ADF Integration Runtime (self-hosted, if needed for on-prem sources) | VMs running IR software |
| `snet-services-prod` | 10.10.11.0/24 | Supporting services, internal load balancers, utility VMs | General-purpose |
**Sizing note:** Databricks VNet injection requires /22 or larger subnets for host and container in production to support auto-scaling across multiple concurrent clusters (data engineering, SQL warehouses, ML training). Each running node consumes one IP in the host subnet and one in the container subnet. A /22 provides 1,019 usable IPs per subnet (Azure reserves five addresses in every subnet), so the subnet pair supports roughly 1,000 concurrent nodes, giving Greenfield's production workload ample headroom.
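The subnet arithmetic can be checked with the standard library; the only Azure-specific fact is the five-address reservation in every subnet.

```python
import ipaddress

def usable_azure_ips(cidr: str) -> int:
    # Azure reserves 5 addresses per subnet (network, broadcast,
    # default gateway, and two for internal DNS).
    return ipaddress.ip_network(cidr).num_addresses - 5

host = usable_azure_ips("10.10.0.0/22")       # host subnet
container = usable_azure_ips("10.10.4.0/22")  # container subnet

# Each Databricks node takes one IP from each subnet, so the smaller
# subnet bounds cluster concurrency.
print(min(host, container))  # 1019
```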
### 3.2 Private Endpoints
Every platform service that supports private endpoints **must** be configured with public network access disabled. This is non-negotiable for a regulated financial institution.
| Service | Private Endpoint Sub-Resource | Private DNS Zone |
|---|---|---|
| ADLS Gen2 (each account × 2) | `dfs`, `blob` | `privatelink.dfs.core.windows.net`, `privatelink.blob.core.windows.net` |
| Azure Key Vault (each instance) | `vault` | `privatelink.vaultcore.azure.net` |
| Microsoft Purview | `account`, `portal` | `privatelink.purview.azure.com`, `privatelink.purviewstudio.azure.com` |
| Databricks Workspace | `databricks_ui_api` | `privatelink.azuredatabricks.net` |
| Azure Event Hub | `namespace` | `privatelink.servicebus.windows.net` |
| Azure Data Factory | `dataFactory`, `portal` | `privatelink.datafactory.azure.net`, `privatelink.adf.azure.com` |
| Azure Container Registry (for SAS) | `registry` | `privatelink.azurecr.io` |
| Databricks Unity Catalog Metastore | (via workspace PE) | Covered by Databricks workspace PE |
**Total private endpoints (production):** Approximately 25–30 PEs across all services. Each PE creates a NIC in `snet-private-endpoints` with a private IP, plus a DNS A record in the corresponding Private DNS Zone.
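For CI validation, the table above can be encoded as a lookup from (service, private-link sub-resource) to the required zone. This helper is a hypothetical sketch; it keys on both values because Purview and ADF each expose a `portal` sub-resource that maps to a different zone.

```python
# Hypothetical lookup from (service, groupId) to the Private DNS Zone
# that must exist in the hub before the PE is created.
PRIVATE_DNS_ZONES = {
    ("storage", "dfs"): "privatelink.dfs.core.windows.net",
    ("storage", "blob"): "privatelink.blob.core.windows.net",
    ("keyvault", "vault"): "privatelink.vaultcore.azure.net",
    ("databricks", "databricks_ui_api"): "privatelink.azuredatabricks.net",
    ("eventhub", "namespace"): "privatelink.servicebus.windows.net",
    ("purview", "account"): "privatelink.purview.azure.com",
    ("purview", "portal"): "privatelink.purviewstudio.azure.com",
    ("datafactory", "dataFactory"): "privatelink.datafactory.azure.net",
    ("datafactory", "portal"): "privatelink.adf.azure.com",
    ("acr", "registry"): "privatelink.azurecr.io",
}

def required_zone(service: str, group_id: str) -> str:
    """Fail fast if a PE is requested before its zone is linked to the spokes."""
    key = (service, group_id)
    if key not in PRIVATE_DNS_ZONES:
        raise ValueError(f"No Private DNS Zone registered for {key}")
    return PRIVATE_DNS_ZONES[key]

print(required_zone("storage", "dfs"))  # privatelink.dfs.core.windows.net
```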
### 3.3 Network Security Groups (NSGs)
NSGs are applied at the subnet level with a default deny-all posture. Explicit allow rules are created only for required traffic flows.
**Databricks host & container subnets:**
- Allow all inbound/outbound between `snet-dbx-host-prod` and `snet-dbx-container-prod` (required for Spark executor ↔ driver communication)
- Allow outbound to `snet-private-endpoints` on port 443 (ADLS and Key Vault over HTTPS) and port 1433 (if Azure SQL is used)
- Allow outbound to Databricks control plane IPs (published by Databricks, region-specific) on port 443
- Allow outbound to Azure Service Tags (`AzureActiveDirectory`, `AzureMonitor`) for authentication and telemetry
- Deny all inbound from internet
- Deny all other outbound (forced through Azure Firewall via UDR)
**Private endpoint subnet:**
- Allow inbound from `snet-dbx-host-prod`, `snet-dbx-container-prod`, `snet-adf-prod`, `snet-services-prod`, and `vnet-sas-prod-cc` (via peering) on service-specific ports
- Deny inbound from all other sources
- No outbound rules needed (PEs are inbound-only destinations)
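The evaluation order of those rules can be sketched the way Azure applies them: ascending priority, first match wins, implicit deny if nothing matches. Rule names, priorities, and the rule subset below are illustrative, not the deployed values.

```python
from dataclasses import dataclass

@dataclass
class NsgRule:
    priority: int   # lower number is evaluated first, as in Azure NSGs
    action: str     # "Allow" or "Deny"
    source: str     # source subnet name, or "*" for any
    dest_port: int  # destination port, or 0 for any

# Illustrative subset of the private-endpoint subnet's inbound rules.
rules = [
    NsgRule(100, "Allow", "snet-dbx-host-prod", 443),
    NsgRule(110, "Allow", "snet-adf-prod", 443),
    NsgRule(4096, "Deny", "*", 0),  # explicit default deny
]

def evaluate(source: str, port: int) -> str:
    """First matching rule by ascending priority decides."""
    for r in sorted(rules, key=lambda r: r.priority):
        if r.source in ("*", source) and r.dest_port in (0, port):
            return r.action
    return "Deny"  # implicit deny if no rule matches

print(evaluate("snet-dbx-host-prod", 443))  # Allow
print(evaluate("internet", 443))            # Deny
```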
### 3.4 Azure Firewall Rules
The hub Azure Firewall controls all egress from spoke VNets. Key application rule collections:
| Rule Collection | FQDN Targets | Protocol | Purpose |
|---|---|---|---|
| `rc-databricks-control` | `*.azuredatabricks.net`, Databricks control plane IPs | HTTPS (443) | Databricks workspace communication with Azure-managed control plane |
| `rc-azure-services` | `login.microsoftonline.com`, `management.azure.com`, `*.monitor.azure.com` | HTTPS (443) | Entra ID authentication, ARM management, Azure Monitor telemetry |
| `rc-package-repos` | `pypi.org`, `files.pythonhosted.org`, `repo.anaconda.com`, `conda.anaconda.org` | HTTPS (443) | Python/Conda package installation for Databricks clusters and SAS |
| `rc-databricks-artifacts` | `dbartifactsprodcac.blob.core.windows.net` (region-specific) | HTTPS (443) | Databricks runtime artifacts, libraries, Spark distributions |
| `rc-sas-licensing` | SAS licensing endpoints (provided by SAS) | HTTPS (443) | SAS Viya license validation |
| `rc-deny-all` | `*` | Any | Default deny — all egress not matching an explicit rule is blocked and logged |
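Application-rule FQDN matching can be approximated with shell-style wildcards, which is how Azure Firewall FQDN patterns behave. The collections below are a trimmed, illustrative subset of the table.

```python
import fnmatch

# Illustrative subset of the application rule collections above;
# FQDN patterns use the same wildcard form as Azure Firewall.
RULE_COLLECTIONS = [
    ("rc-databricks-control", ["*.azuredatabricks.net"]),
    ("rc-azure-services", ["login.microsoftonline.com",
                           "management.azure.com",
                           "*.monitor.azure.com"]),
    ("rc-package-repos", ["pypi.org", "files.pythonhosted.org"]),
]

def egress_decision(fqdn: str) -> str:
    """Return the first collection that permits the FQDN, else the default deny."""
    for name, patterns in RULE_COLLECTIONS:
        if any(fnmatch.fnmatch(fqdn, p) for p in patterns):
            return name
    return "rc-deny-all"  # default: blocked and logged

print(egress_decision("adb-123.azuredatabricks.net"))  # rc-databricks-control
print(egress_decision("example.com"))                  # rc-deny-all
```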
### 3.5 Connectivity to On-Premises
Source systems (core banking on DB2, Oracle, SQL Server; insurance policy administration; CRM; mainframe extracts) connect to the Azure data platform through ExpressRoute. ADF pipelines and Databricks Auto Loader access on-premises sources through the hub VNet's ExpressRoute gateway, with traffic routing controlled by BGP route propagation and UDRs. Self-hosted Integration Runtimes (in `snet-adf-prod`) are deployed for sources that require a local agent.
---
## 4. Identity & Access Management
### 4.1 Microsoft Entra ID as the Identity Foundation
All platform access is brokered through Microsoft Entra ID (formerly Azure Active Directory) with mandatory multi-factor authentication enforced via Conditional Access policies. Conditional Access policies require: MFA for all users, a compliant device, access from the Greenfield corporate network or approved VPN locations, and session lifetime controls (re-authentication every 12 hours for interactive sessions).
| Component | Identity Integration | Authentication Method |
|---|---|---|
| Databricks Workspaces | Entra ID SSO via SCIM provisioning; Entra security groups synced to Unity Catalog for RBAC | OAuth 2.0 / SAML; MFA via Conditional Access |
| Microsoft Fabric | Native Entra ID integration via Microsoft 365 tenant | Entra ID SSO; MFA enforced |
| SAS Viya | Entra ID integration via SAML 2.0 federation or direct OIDC (depending on SAS version) | SAML 2.0 federated with Entra ID; MFA enforced |
| Microsoft Purview | Native Entra ID; Purview collection-level roles mapped to Entra security groups | Entra ID SSO |
| Azure Data Factory | Managed identity for pipeline execution; Entra groups for authoring RBAC | System-assigned Managed Identity |
| ADLS Gen2 | Entra ID RBAC at Azure resource level (Storage Blob Data roles) + Unity Catalog ACLs at the data layer | Managed Identity / Service Principal |
### 4.2 RBAC Model — Azure Resource Layer
Azure RBAC provides coarse-grained access at the resource level. Data-level fine-grained access (RLS, CLS, DDM) is enforced by Unity Catalog at the compute layer.
| Entra Security Group | Azure RBAC Role | Scope |
|---|---|---|
| `sg-data-platform-admins` | Contributor | `sub-data-platform-prod`, `sub-data-platform-nonprod` |
| `sg-data-engineers` | Databricks Contributor, Storage Blob Data Contributor | Databricks workspaces, ADLS Gen2 accounts |
| `sg-data-scientists` | Databricks Contributor (sandbox workspace only), Storage Blob Data Reader | Sandbox workspace, Silver/Gold ADLS (read) |
| `sg-data-analysts` | Reader | Databricks analytics workspace (SQL Warehouse access via Unity Catalog grants) |
| `sg-governance-admins` | Purview Data Curator, Key Vault Administrator | Purview account, Key Vault |
| `sg-sas-developers` | Contributor | `sub-data-sas-prod` |
| `sg-fabric-admins` | Fabric Capacity Administrator | Fabric capacity resource |
| `sg-finops` | Cost Management Reader, Billing Reader | All data platform subscriptions |
### 4.3 Service Principals & Managed Identities
Automated workloads use managed identities (preferred) or service principals (where managed identities are not supported). **Shared secrets and embedded credentials are explicitly prohibited** per the reference architecture.
| Workload | Identity Type | Scope / Role Assignments |
|---|---|---|
| Databricks workspace (system) | System-assigned Managed Identity | Storage Blob Data Contributor on ADLS Gen2 accounts; Key Vault Secrets User on `kv-data-platform-prod` |
| ADF pipelines | System-assigned Managed Identity | Storage Blob Data Contributor on ADLS; Databricks workspace access via linked service token |
| SAS Viya Compute Server | Service Principal (`sp-sas-compute-prod`) | Storage Blob Data Reader on **authorized ADLS paths only** (non-sensitive datasets); JDBC access to Databricks SQL Warehouses scoped via Unity Catalog grants |
| Purview scanners | System-assigned Managed Identity | Storage Blob Data Reader on all ADLS accounts; Databricks metastore reader for Unity Catalog metadata harvesting |
| Manta lineage engine | Service Principal (`sp-manta-prod`) | Read-only access to Databricks repos, ADF metadata APIs, SAS code repositories; Purview Data Curator for lineage publishing |
| CI/CD pipelines (Terraform) | Service Principal (`sp-terraform-prod`) | Contributor on data platform subscriptions; User Access Administrator for RBAC assignments |
**SAS read-path security enforcement:** Because SAS Compute Server accesses ADLS files using its service principal's Azure RBAC permissions (bypassing Unity Catalog's fine-grained RLS/CLS/DDM), the `sp-sas-compute-prod` service principal is granted Storage Blob Data Reader **only** on non-sensitive, pre-authorized ADLS paths. Those paths are registered as Unity Catalog external locations with restricted scope. For sensitive datasets, SAS **must** read through JDBC LIBNAME to Databricks SQL Warehouses, where Unity Catalog enforcement applies at query time.
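That routing rule reduces to a small decision function. The authorized path below is a made-up example, and in practice the sensitivity flag would come from Purview classifications rather than a boolean argument.

```python
# Illustrative authorized direct-read prefixes for sp-sas-compute-prod.
AUTHORIZED_DIRECT_PATHS = (
    "abfss://gold@stadlsgoldprod.dfs.core.windows.net/gold/public_reference/",
)

def sas_read_path(dataset_path: str, is_sensitive: bool) -> str:
    """Pick the SAS read path per the policy above: sensitive data always
    goes through JDBC so Unity Catalog RLS/CLS/DDM applies at query time."""
    if is_sensitive:
        return "jdbc"
    if dataset_path.startswith(AUTHORIZED_DIRECT_PATHS):
        return "adls_direct"  # SP RBAC on a pre-authorized, non-sensitive path
    return "jdbc"  # default to the governed path when in doubt

print(sas_read_path(
    "abfss://gold@stadlsgoldprod.dfs.core.windows.net/gold/risk_features/credit_scores/",
    True,
))  # jdbc
```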
---
## 5. Compute Sizing & Configuration
### 5.1 Databricks Workspaces
| Workspace | Environment | Purpose | Configuration |
|---|---|---|---|
| `dbw-data-eng-prod` | Production | Data engineering: Bronze→Silver→Gold DLT pipelines, Auto Loader streaming ingestion, data quality workflows | Unity Catalog enabled; VNet injected; cluster policies enforce approved instance types (`Standard_DS4_v2` 8 vCPU/28 GB as default workers, `Standard_DS5_v2` 16 vCPU/56 GB for memory-intensive), auto-terminate 30 min, spot instances (60/40 spot/on-demand) for non-critical batch jobs; Photon enabled for DLT pipelines |
| `dbw-analytics-prod` | Production | SQL analytics, Databricks AI/BI Dashboards, Genie NL querying, ad-hoc analyst queries | Serverless SQL Warehouses (preferred); Pro SQL Warehouses for workloads requiring Classic compute; auto-suspend 10 min |
| `dbw-mlops-prod` | Production | ML/AI: MLflow experiments, model training, Feature Store serving, Model Serving endpoints, GenAI workloads | GPU instance pools (`Standard_NC6s_v3` 6 vCPU/112 GB/1× V100 for training; `Standard_NC4as_T4_v3` for inference); CPU pools for feature engineering; Model Serving auto-scale (min replicas: 0 for dev, 1 for prod SLA endpoints) |
| `dbw-data-eng-dev` | Non-Prod | Development and testing of data engineering pipelines | Reduced cluster sizes (max 4 workers); mandatory auto-terminate 15 min; restricted to `Standard_DS3_v2`; Unity Catalog dev metastore (separate from prod) |
| `dbw-sandbox` | Non-Prod | Data science exploration, POCs, experimentation | Read-only access to Silver/Gold via Unity Catalog; no production write access; budget cap enforced ($X/month per user via tag-based cost alerts); auto-terminate 10 min |
**Cluster policy governance:** All workspaces enforce cluster policies that prevent creation of non-compliant clusters. Policies define: allowed instance types (approved VM families only), maximum workers per cluster (10 for dev, 50 for prod data eng, 20 for ML), mandatory auto-termination, mandatory Unity Catalog mode (no legacy clusters), spot instance minimum ratio, custom tags (CostCenter, DataDomain required), and prohibited features (no local file system access in prod, no public IP assignment).
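A partial sketch of such a policy in Databricks' cluster-policy definition format (attribute paths constrained by `fixed`, `allowlist`, and `range` types). The specific values are illustrative, not the deployed policy.

```python
import json

# Partial Databricks cluster policy expressing the controls above;
# values are illustrative.
prod_data_eng_policy = {
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS4_v2", "Standard_DS5_v2"],
    },
    "autoscale.max_workers": {"type": "range", "maxValue": 50},
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "custom_tags.CostCenter": {"type": "unlimited", "isOptional": False},
    "custom_tags.DataDomain": {"type": "unlimited", "isOptional": False},
    "data_security_mode": {"type": "fixed", "value": "USER_ISOLATION"},
}

# Policies are submitted to the workspace as a JSON definition string.
policy_json = json.dumps(prod_data_eng_policy, indent=2)
print(policy_json)
```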
### 5.2 Databricks SQL Warehouses
| SQL Warehouse | Size (DBU) | Purpose | Configuration |
|---|---|---|---|
| `sqlwh-bi-serving` | Medium (48 DBU base) | JDBC endpoint consumed by SAS Compute Server (JDBC LIBNAME) and indirectly by Power BI (via OneLake shortcut refresh queries) | Serverless; auto-suspend 10 min; scaling policy: economy (cost-optimized, queues queries rather than scaling aggressively); spot instances enabled; query result caching enabled |
| `sqlwh-analytics` | Small–Medium (2–8 DBU) | Ad-hoc analyst queries via SQL editor, JDBC/ODBC for downstream applications, notebook SQL exploration | Serverless; auto-suspend 10 min; query queuing enabled; intelligent workload management |
| `sqlwh-etl-support` | Medium–Large (4–16 DBU) | Data quality validation queries run during pipeline execution, data contract SLA checks (freshness, completeness), Gold layer aggregation queries | Pro; scheduled scaling aligned with pipeline windows (04:00–08:00 EST peak for T+1 processing); auto-suspend after pipeline completion |
### 5.3 SAS Viya Compute Server
SAS Viya runs on Compute Server engine (not CAS — this is AD-04). Processing is sequential and batch-oriented, so compute is sized for single-threaded throughput with high memory for in-process data manipulation, not for horizontal parallelism.
**Deployment option: AKS (recommended)**
| Component | VM Size / Pod Spec | Quantity | Notes |
|---|---|---|---|
| SAS Compute Server pods | `Standard_E16s_v5` (16 vCPU, 128 GB RAM) | 2–4 pods | Primary workhorses for actuarial model development, risk model validation, moderate batch scoring; memory-optimized for SAS DATA step and PROC SQL in-memory operations |
| SAS Programming Runtime (heavy) | `Standard_E32s_v5` (32 vCPU, 256 GB RAM) | 1–2 pods | Large actuarial reserving models (IFRS 17, IBNR) requiring extended memory; scheduled availability during nightly batch windows to control cost |
| SAS License Server | `Standard_D4s_v5` (4 vCPU, 16 GB RAM) | 1 (HA pair) | SAS license management service; always-on |
| SAS Model Manager | `Standard_D8s_v5` (8 vCPU, 32 GB RAM) | 1–2 pods | Model governance, model registration, champion/challenger tracking for regulatory model management |
| AKS System Node Pool | `Standard_D4s_v5` | 3 nodes (across AZs) | AKS system services (CoreDNS, kube-proxy, metrics-server) |
**AKS configuration:** Private AKS cluster (no public API endpoint); Azure CNI networking for VNet integration in `vnet-sas-prod-cc`; Entra ID integration for RBAC; node pool auto-scaling (min 2 / max 6 for compute pods); Azure Disk CSI driver for persistent volumes (SAS work directories); Azure Files for shared SAS configuration.
### 5.4 Microsoft Fabric Capacity
Fabric capacity is sized **strictly for the BI serving workload** (AD-03). Any request to increase capacity for data engineering or warehousing triggers an architecture review per the reference architecture's anti-pattern rules.
| Capacity SKU | CU | Use Case | Governance |
|---|---|---|---|
| F64 (Production) | 64 Capacity Units | Power BI Direct Lake semantic models serving 55,000 users, report rendering, paginated reports for regulatory distribution, Copilot-assisted analytics | Auto-pause during non-business hours (22:00–06:00 EST); smoothing enabled for burst absorption; capacity alerts at 70% and 90% utilization; Fabric admin monitors for non-BI workloads |
| F32 (Non-Production) | 32 CU | Development and testing of Power BI semantic models, report prototyping, Direct Lake connectivity validation | Auto-pause aggressive (off outside business hours); BI workloads only; no Fabric notebooks or Spark |
| F16 (Fabric IQ POC — Horizon 2) | 16 CU | Ontology POC on Customer 360 domain, Data Agent evaluation with controlled user group (per Section 4.5 prerequisites) | Isolated capacity; time-limited (6-month evaluation); requires Architecture Review Board approval before provisioning |
**Cost note:** F64 at list price is approximately $8,000–9,000 USD/month. This is significantly less than the Databricks DBU cost that would be incurred if 55,000 Power BI users queried Databricks SQL Warehouses via DirectQuery — the entire economic justification for Fabric in this architecture (AD-03).
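The comparison can be made concrete with back-of-envelope arithmetic. All Databricks-side rates below are assumptions for illustration (actual DBU rates and per-user query volumes vary by region and agreement); only the shape of the calculation comes from the document.

```python
# ASSUMED rates for illustration only; real pricing varies.
fabric_f64_monthly_usd = 8_500       # midpoint of the list-price range above
users = 55_000                       # Power BI population from the document
assumed_dbus_per_user_month = 2.0    # light DirectQuery load (assumption)
assumed_dbu_rate_usd = 0.70          # serverless SQL DBU rate (assumption)

directquery_monthly_usd = users * assumed_dbus_per_user_month * assumed_dbu_rate_usd
print(directquery_monthly_usd)                           # 77000.0
print(directquery_monthly_usd > fabric_f64_monthly_usd)  # True
```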
---
## 6. Storage Architecture
### 6.1 ADLS Gen2 Account Layout
ADLS Gen2 is the shared storage substrate across all three platforms (AD-07). All accounts have hierarchical namespace (HNS) enabled for Delta Lake compatibility and are configured with private endpoints only (public access disabled at the account level).
| Storage Account | Access Tier | Container Structure | Access Pattern |
|---|---|---|---|
| `stadlsbronzeprod` | Hot | `/bronze/{source_system}/{entity}/` (e.g., `/bronze/core_banking/accounts/`, `/bronze/insurance/claims/`) | **Write:** ADF managed identity, Auto Loader managed identity; **Read:** Databricks data engineering workspace MI, Purview scanner MI |
| `stadlssilverprod` | Hot | `/silver/{domain}/{entity}/` (e.g., `/silver/customer/individual/`, `/silver/claims/claim_header/`) | **Write:** Databricks DLT pipelines; **Read:** Databricks analytics & ML workspaces, SAS via JDBC (Unity Catalog enforced) |
| `stadlsgoldprod` | Hot | `/gold/{data_product}/{entity}/` (e.g., `/gold/customer_360/member_profile/`, `/gold/risk_features/credit_scores/`) | **Write:** Databricks DLT/SQL pipelines; **Read:** Power BI (OneLake shortcut → Direct Lake), Databricks SQL, SAS JDBC LIBNAME, ML Feature Store, REST API services |
| `stadlsstagingprod` | Hot | `/staging/ingestion/{source}/`, `/staging/sas_writeback/{model_domain}/{model_name}/{run_date}/`, `/staging/purview_dq/` | **Write:** ADF (ingestion landing), SAS ADLS LIBNAME (writeback staging); **Read:** Purview DQ scanner (pre-Bronze assessment), Databricks promotion pipelines |
| `stadlsarchiveprod` | Cool → Archive | `/archive/{domain}/{entity}/{year}/` | **Write:** Lifecycle policy tiering; **Read:** Rare, on-demand rehydration for regulatory retrieval |
**Why separate storage accounts (not just separate containers)?** Separate accounts provide independent RBAC boundaries (the SAS service principal has no RBAC on `stadlsbronzeprod`), independent throughput limits (each account gets its own IOPS/bandwidth quota), independent private endpoints (enabling subnet-level NSG control), independent lifecycle policies, and independent diagnostic logging.
### 6.2 Delta Lake Configuration
All data within Bronze, Silver, and Gold containers is stored in Delta Lake format (AD-01).
**Time Travel retention:** 90 days minimum for Bronze (regulatory auditability per AMF requirements); 30 days for Silver and Gold (sufficient for pipeline replay and debugging), extended to 90 days for regulatory domain Gold tables (Financial Aggregates, Risk Features).
**VACUUM & OPTIMIZE schedule (via Databricks Workflows):**
- Bronze: `VACUUM` retaining 90 days, runs weekly (low urgency, append-only)
- Silver: `VACUUM` retaining 30 days, runs nightly; `OPTIMIZE` with `ZORDER BY` on business keys, runs nightly
- Gold: `VACUUM` retaining 30 days, runs nightly; `OPTIMIZE` with `LIQUID CLUSTERING` on high-cardinality filter columns (`date`, `business_unit`, `product_code`, `member_id`), runs nightly post-refresh; table statistics maintained via `ANALYZE TABLE` for SQL Warehouse query optimizer
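The nightly Gold maintenance can be expressed as generated SQL run from a Databricks Workflow; the table name below is illustrative. Note that `VACUUM`'s `RETAIN` clause takes hours (30 days = 720 hours), and liquid-clustering tables take a plain `OPTIMIZE` with no `ZORDER BY` clause.

```python
def gold_maintenance(table: str, retain_days: int = 30) -> list:
    """Generate the nightly maintenance statements for one Gold table."""
    return [
        f"OPTIMIZE {table}",  # liquid clustering: no ZORDER BY clause
        f"VACUUM {table} RETAIN {retain_days * 24} HOURS",
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
    ]

for stmt in gold_maintenance("gold.customer_360.member_profile"):
    print(stmt)
```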
**Schema evolution:** `mergeSchema` enabled on Bronze tables (Auto Loader schema inference accommodates upstream changes without pipeline failures). Silver and Gold tables enforce strict schemas via DLT expectations and data contracts — schema-breaking changes require a data product lifecycle change request.
### 6.3 Lifecycle Policies
| Storage Account | Policy Rule | Action / Rationale |
|---|---|---|
| `stadlsbronzeprod` | Blobs not modified in 180 days → Cool tier; not modified in 365 days → Archive tier | Cost optimization for aged raw data while maintaining regulatory access; Delta Time Travel handles recent rollback needs |
| `stadlsstagingprod` | Ingestion staging blobs > 30 days → delete; SAS writeback blobs > 14 days after promotion validation → delete | Prevent staging zone bloat; staging data is ephemeral by design |
| `stadlsarchiveprod` | Archive tier by default; rehydrate on demand (Standard priority, up to 15 hours) | Lowest cost for long-term regulatory retention; rare access pattern |
| All accounts | Soft delete enabled (14-day retention); blob versioning enabled; container soft delete (7 days) | Accidental deletion recovery baseline |
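The Bronze aging rule maps directly onto Azure Storage's management-policy JSON schema. A sketch, with an illustrative rule name:

```python
import json

# Lifecycle rule for stadlsbronzeprod in the standard Azure Storage
# management-policy schema; the rule name is illustrative.
bronze_lifecycle = {
    "rules": [
        {
            "enabled": True,
            "name": "age-out-bronze",
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 180},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}

print(json.dumps(bronze_lifecycle, indent=2))
```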
### 6.4 Encryption
All storage accounts use encryption at rest with **Microsoft-managed keys (MMK)** as the baseline. For data classified as **Restricted** (per Purview auto-classification — e.g., NAS/SIN numbers, certain financial details), **customer-managed keys (CMK)** stored in Azure Key Vault (`kv-data-encryption-prod`) are applied at the storage account level. In practice, this means `stadlssilverprod` and `stadlsgoldprod` use CMK (these contain processed, queryable sensitive data), while `stadlsbronzeprod` uses MMK (raw data is further protected by restricted RBAC — only data engineering roles have access).
Encryption in transit: TLS 1.2 minimum enforced on all storage endpoints. `Secure transfer required` = true on all accounts.
---
## 7. Security Architecture
### 7.1 Azure Key Vault
| Key Vault Instance | Purpose | Access Policy |
|---|---|---|
| `kv-data-platform-prod` | Platform operational secrets: database connection strings, SAS license keys, API keys for external data feeds, ADF linked service credentials | Databricks MI, ADF MI, SAS SP → Key Vault Secrets User role; CI/CD SP → Key Vault Administrator (for secret rotation automation) |
| `kv-data-encryption-prod` | Customer-managed encryption keys: ADLS CMK, Databricks workspace CMK (managed services encryption, DBFS encryption) | Storage account MI, Databricks workspace MI → Key Vault Crypto User role; Key Vault admin group → Key Vault Administrator |
| `kv-data-platform-nonprod` | Non-production secrets (same structure, isolated values) | Non-prod workspace MIs and SPs |
**Hardening:**
- Key Vault Firewall: enabled; access restricted to private endpoint only (from `snet-private-endpoints`)
- Purge protection: enabled on all instances (prevents permanent deletion, even by admins, for 90 days)
- Soft delete: enabled with 90-day retention
- Diagnostic settings: all Key Vault operations streamed to Log Analytics; alert rules configured for unexpected access patterns (access from unknown service principals, bulk secret reads, key deletion attempts)
- Secret rotation: automated rotation policies for secrets with 90-day expiry; CI/CD pipeline rotates service principal credentials quarterly
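The 90-day check the rotation automation performs is simple date arithmetic; a sketch (how the secret inventory is enumerated is out of scope here):

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)

def rotation_due(last_rotated: date, today: date) -> bool:
    """True once a secret's age reaches the 90-day rotation policy."""
    return today - last_rotated >= ROTATION_PERIOD

print(rotation_due(date(2026, 1, 1), date(2026, 4, 15)))  # True
print(rotation_due(date(2026, 1, 1), date(2026, 3, 1)))   # False
```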
### 7.2 Private Endpoints — Defense in Depth
All platform services operate exclusively through private endpoints. Public endpoints are disabled at the resource level and blocked by Azure Policy. Network traffic between services never traverses the public internet.
The full private endpoint inventory (approximately 25–30 PEs in production) is detailed in Section 3.2. Each PE creates: a NIC with private IP in `snet-private-endpoints`, a DNS A record in the corresponding Private DNS Zone, and NSG rules governing which subnets can reach it.
### 7.3 Data Encryption — Layers
| Layer | Encryption | Key Management |
|---|---|---|
| **At rest — ADLS Gen2** | AES-256; MMK (default) or CMK for Restricted data | CMK in `kv-data-encryption-prod`; auto-rotation annually |
| **At rest — Databricks DBFS** | AES-256 with CMK | CMK in `kv-data-encryption-prod` |
| **At rest — Databricks managed services** | CMK for notebook results, job results, ML artifacts | Workspace-level CMK configuration |
| **In transit — all services** | TLS 1.2 minimum | Azure-managed certificates; Databricks enforces TLS for all cluster ↔ storage and cluster ↔ control plane communication |
| **In transit — ExpressRoute** | MACsec (Layer 2 encryption at peering) | Optional but recommended for ExpressRoute Direct circuits |
### 7.4 Databricks Workspace Security Configuration
| Security Control | Configuration Detail |
|---|---|
| **Workspace encryption** | CMK from `kv-data-encryption-prod` for managed services and DBFS |
| **Cluster policies** | Mandatory: enforce approved instance types, auto-termination, spot ratios, max workers, Unity Catalog mode, custom tags; prohibit legacy table ACLs, local file download, init scripts from untrusted sources |
| **Unity Catalog** | External metastore on dedicated ADLS storage; all data access routed through UC; legacy Hive metastore disabled; default catalog = none (users must specify catalog explicitly) |
| **Token management** | Personal Access Tokens (PATs) **disabled** in production workspaces; service principals use OAuth 2.0 M2M (client credentials flow) |
| **IP access lists** | Workspace accessible only from Greenfield corporate network IP ranges (via ExpressRoute) and Azure Bastion subnet |
| **Audit logging** | All workspace audit logs streamed to Log Analytics via diagnostic settings; captures: data access events, cluster lifecycle, admin configuration changes, SQL query history, secrets access |
| **Secret scopes** | All Databricks secret scopes backed by Azure Key Vault (`kv-data-platform-prod`); Databricks-native secret storage disabled |
| **Secure cluster connectivity (No Public IP)** | Enabled — cluster nodes have no public IPs; all communication via VNet and private links |
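The mandatory cluster-policy controls in the table could be expressed with the Databricks Terraform provider; a hedged sketch in which the instance types, limits, and tag names are illustrative, not prescribed values:

```hcl
resource "databricks_cluster_policy" "standard_jobs" {
  name = "standard-job-policy"

  definition = jsonencode({
    # Enforce auto-termination and cap cluster size
    "autotermination_minutes" = { type = "range", maxValue = 60, defaultValue = 30 }
    "num_workers"             = { type = "range", maxValue = 10 }

    # Approved instance types only (illustrative values)
    "node_type_id" = { type = "allowlist", values = ["Standard_D8ds_v5", "Standard_E8ds_v5"] }

    # Unity Catalog security mode; precludes legacy table ACLs
    "data_security_mode" = { type = "fixed", value = "USER_ISOLATION" }

    # Mandatory cost-attribution tag
    "custom_tags.CostCenter" = { type = "unlimited", isOptional = false }
  })
}
```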
### 7.5 Data Loss Prevention & Exfiltration Controls
- **Egress restriction:** All compute cluster outbound traffic routes through Azure Firewall; only whitelisted FQDNs are permitted (Section 3.4)
- **DBFS download restriction:** Local file download from Databricks notebooks disabled in production via workspace admin settings
- **Purview DLP:** Microsoft Purview DLP policies integrated with M365 prevent sensitive data from leaving via email, Teams, or SharePoint
- **Clipboard/export controls:** Fabric Power BI tenant settings restrict data export from reports (no CSV export of Restricted-classified datasets)
---
## 8. Monitoring & Observability
### 8.1 Centralized Monitoring Stack
All monitoring data converges on a central Log Analytics workspace (`law-data-platform-prod` in `sub-data-management`) to provide a single-pane-of-glass view.
| Component | Technology | Data Collected |
|---|---|---|
| **Infrastructure** | Azure Monitor metrics + Log Analytics | VM metrics (CPU, memory, disk), AKS node/pod metrics, storage account metrics (transactions, latency, capacity), network flow logs (NSG flow logs v2), Azure Firewall logs (application + network rules) |
| **Databricks** | Databricks audit logs → Diagnostic Settings → Log Analytics; Databricks system tables (`system.billing.usage`, `system.compute.clusters`, `system.query.history`) | Cluster utilization & idle time, job run durations/statuses/failures, SQL Warehouse query performance (duration, bytes scanned, cache hit ratio), DBU consumption by workspace/cluster/job/user, Unity Catalog audit trail (who accessed what data) |
| **Data pipelines** | ADF monitoring → Log Analytics; Databricks Workflows run logs | Pipeline run status, duration, row counts, error details; DLT pipeline quality metrics (rows passed/failed expectations); ingestion audit log (source, timestamp, row counts, schema version) |
| **Data quality** | Purview DQ scores (Tier 1) + DLT expectation metrics (Tiers 2–3) → unified DQ dashboard (Power BI) | Quality scores by domain, source system, data product; quarantine volumes; SLA compliance trends; CDE completeness tracking |
| **Security / SIEM** | Microsoft Sentinel connected to Log Analytics | Security events aggregated from all services; anomalous data access patterns; DLP alerts; Key Vault access audit; failed authentication attempts; service principal credential anomalies |
| **Cost** | Azure Cost Management + Databricks Account Console | Subscription-level cost by resource group and tag; Databricks DBU consumption by workspace, cluster type, job, and user; Fabric CU consumption; storage growth trends |
### 8.2 Key Dashboards
| Dashboard | Audience | Content |
|---|---|---|
| **Platform Health** | Platform engineering team | Cluster availability, pipeline success rates, SQL Warehouse queue times, storage latency, private endpoint health, AKS node status |
| **Data Quality Governance** | CDO office, data stewards | Unified DQ scores (all three tiers), quarantine volumes, SLA compliance by data product, CDE completeness trends, steward action items |
| **FinOps & Cost** | FinOps team, CDO office | DBU spend by team/domain, Fabric CU utilization, storage growth, reserved vs. on-demand ratio, cost anomaly detection, budget burn rate |
| **Security & Compliance** | Security team, CISO office | Sentinel incident summary, data access audit summary, DLP alert trends, service principal activity, Key Vault operations |
### 8.3 Alerting Strategy
| Alert Category | Trigger | Severity | Action |
|---|---|---|---|
| **Pipeline failure** | Production ingestion or DLT pipeline fails | Sev 1 (critical domains: Customer 360, Financial Aggregates, Risk Features) / Sev 2 (others) | Page on-call data engineer; auto-retry with exponential backoff (3 retries); escalate to lead if all retries fail |
| **Data quality SLA breach** | Gold data product fails freshness SLA (e.g., T+1 by 06:00 EST not met) or completeness threshold (<99.5% on CDEs) | Sev 1 | Notify data product owner + steward; defer Gold refresh; CDO office escalation if unresolved within 4 hours |
| **Security anomaly** | Unexpected SP data access; bulk data download; Key Vault access from unknown identity | Sev 1 | Sentinel incident auto-created; SOC investigation; auto-block identity if high-confidence threat |
| **Cluster over-provisioning** | Databricks cluster idle >30 min or CPU utilization <20% for sustained period | Sev 3 | Notify FinOps team; recommend right-sizing; auto-terminate if policy allows |
| **Fabric capacity saturation** | CU utilization >90% sustained for 1 hour | Sev 2 | Notify Fabric admin; evaluate workload scheduling or capacity burst |
| **Storage anomaly** | Storage account growth rate exceeds 2× historical 7-day baseline | Sev 3 | Notify data engineering lead; investigate potential staging bloat or data duplication |
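The storage-anomaly alert, for instance, maps naturally onto an Azure Monitor dynamic-threshold metric alert, which learns the baseline rather than hard-coding the 2× multiplier. A sketch, assuming `azurerm` and hypothetical storage and action-group references:

```hcl
resource "azurerm_monitor_metric_alert" "storage_growth" {
  name                = "alert-storage-growth-anomaly"
  resource_group_name = azurerm_resource_group.monitoring.name   # hypothetical
  scopes              = [azurerm_storage_account.gold.id]        # hypothetical
  description         = "Storage capacity exceeds learned historical baseline"
  severity            = 3            # Sev 3 per the alerting table
  frequency           = "PT1H"
  window_size         = "PT6H"

  # Dynamic thresholds learn normal behavior instead of a fixed value
  dynamic_criteria {
    metric_namespace  = "Microsoft.Storage/storageAccounts"
    metric_name       = "UsedCapacity"
    aggregation       = "Average"
    operator          = "GreaterThan"
    alert_sensitivity = "Medium"
  }

  action {
    action_group_id = azurerm_monitor_action_group.data_eng.id   # hypothetical
  }
}
```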
---
## 9. Disaster Recovery & Business Continuity
### 9.1 Recovery Objectives
| Tier | Workloads | RPO | RTO | Strategy |
|---|---|---|---|---|
| **Tier 1 — Critical** | Gold data products (Customer 360, Financial Aggregates, Risk Features), regulatory reporting pipelines, active ML model serving endpoints | ≤ 1 hour | ≤ 4 hours | GRS/RA-GRS storage replication to Canada East; Databricks workspace deployable via IaC in <2 hours; pre-provisioned standby networking in Canada East; SQL Warehouse endpoints re-creatable from IaC |
| **Tier 2 — Important** | Silver layer, ingestion pipelines (ADF + Auto Loader), SAS Viya actuarial models, Feature Store | 4 hours | 8 hours | GRS storage replication; IaC-based compute rebuild in secondary region; SAS model code in version control (Git); ADF pipeline definitions exported and version-controlled |
| **Tier 3 — Standard** | Bronze layer, development/sandbox environments, non-production workloads | 24 hours | 24 hours | GRS storage (async replication); rebuild from IaC; accept data loss up to last successful replication point |
### 9.2 Storage Replication
All production ADLS Gen2 accounts are configured with **Geo-Redundant Storage (GRS)**, replicating data asynchronously to Canada East. For Tier 1 storage accounts (`stadlsgoldprod`), **Read-Access GRS (RA-GRS)** is enabled, allowing read access from the secondary region during a regional outage without waiting for Microsoft-initiated failover.
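In Terraform, the Tier 1 account's replication and hardening settings reduce to a few attributes (a sketch; attribute names follow azurerm 3.x, and the resource group reference is an assumption):

```hcl
resource "azurerm_storage_account" "gold" {
  name                      = "stadlsgoldprod"
  resource_group_name       = azurerm_resource_group.storage.name   # hypothetical
  location                  = "canadacentral"
  account_tier              = "Standard"
  account_kind              = "StorageV2"
  account_replication_type  = "RAGRS"    # async geo-replication + read access in Canada East
  is_hns_enabled            = true       # ADLS Gen2 hierarchical namespace
  min_tls_version           = "TLS1_2"
  enable_https_traffic_only = true       # "Secure transfer required"
}
```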
Delta Lake's transaction log (`_delta_log/`) is included in GRS replication. After failover, replicated Delta tables are query-consistent: the transaction log ensures that only committed transactions are visible, even if replication was mid-flight during the outage.
### 9.3 Compute Recovery
All infrastructure is defined as code (Terraform; see Section 11). In a DR scenario:
1. **Networking:** Standby hub-spoke VNets in Canada East are pre-provisioned (warm standby) with peering, NSGs, and route tables ready. ExpressRoute circuit has a secondary connection to Canada East.
2. **Databricks:** Workspace configuration, cluster policies, Unity Catalog metastore settings, and secret scope bindings are all in Terraform. A `terraform apply` targeting the Canada East module deploys a functional workspace in ~60–90 minutes.
3. **SAS Viya:** AKS cluster definition is in Terraform; SAS container images are in Azure Container Registry (geo-replicated to Canada East). Deployment time ~2–3 hours including SAS configuration restore.
4. **Fabric:** Fabric capacity is region-specific. A standby F-SKU in Canada East can be provisioned on-demand (manual, ~30 minutes). Power BI semantic models would need to be re-pointed to the secondary ADLS endpoints.
### 9.4 Backup Strategy
| Component | Backup Mechanism | Retention |
|---|---|---|
| **Delta Lake data** | Delta Time Travel (point-in-time recovery within retention window) + GRS replication | 90 days Bronze, 30 days Silver/Gold |
| **Azure Key Vault** | Soft delete (90 days) + purge protection; secrets/keys recoverable even after deletion | 90 days |
| **Unity Catalog metastore** | Databricks account-level configuration backup + IaC definitions | Recoverable from IaC |
| **Purview configuration** | Glossary, classification schemas, and policy definitions exported periodically to version control (JSON export) | Git history (indefinite) |
| **Pipeline definitions** | ADF ARM templates in Git; Databricks notebooks in Git (Repos); DLT pipeline code in Git | Git history (indefinite) |
| **SAS code & models** | SAS programs, macros, model code in Git; SAS Model Manager metadata exported periodically | Git history (indefinite) |
**DR testing cadence:** Tabletop exercise quarterly; partial failover test (storage failover + compute redeploy for one domain) semi-annually; full DR simulation annually.
---
## 10. Cost Management & FinOps
### 10.1 Cost Allocation Architecture
Cost visibility is built into the infrastructure from day one through mandatory tagging (enforced by Azure Policy), subscription-level isolation, and platform-native cost tools.
**Mandatory tags (enforced via Azure Policy — deny if missing):**
| Tag Key | Purpose | Example Values |
|---|---|---|
| `Environment` | Environment segregation | `prod`, `nonprod`, `sandbox`, `dr` |
| `CostCenter` | Chargeback to business unit | `CC-CDO-1234`, `CC-INS-5678`, `CC-RISK-9012` |
| `Platform` | Technology pillar identification | `databricks`, `fabric`, `sas`, `governance`, `shared` |
| `Owner` | Operational contact | `team-data-eng@Greenfield.com` |
| `DataDomain` | Domain-level cost attribution | `customer`, `claims`, `risk`, `finance`, `shared` |
| `DataClassification` | Align cost with sensitivity tier | `public`, `internal`, `confidential`, `restricted` |
**Recommended tags (enforced via Azure Policy — audit mode):**
| Tag Key | Purpose | Example Values |
|---|---|---|
| `ManagedBy` | IaC tool tracking | `terraform`, `manual`, `bicep` |
| `Horizon` | Implementation phase tracking | `h1`, `h2`, `h3` |
| `ExpiryDate` | Temporary resource cleanup | `2026-09-30` (for POCs, sandboxes) |
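A deny-mode tag policy of the kind described above can be written as a custom Azure Policy in Terraform; a sketch for one mandatory tag (the others would follow the same pattern, and audit-mode tags would swap the effect):

```hcl
resource "azurerm_policy_definition" "require_costcenter" {
  name         = "require-costcenter-tag"
  policy_type  = "Custom"
  mode         = "Indexed"               # evaluate only resources that support tags
  display_name = "Require CostCenter tag on resources"

  policy_rule = jsonencode({
    if = {
      field  = "tags['CostCenter']"
      exists = "false"
    }
    then = { effect = "deny" }           # recommended tags would use "audit" instead
  })
}
```

Assignment at the management group level would use a separate `azurerm_management_group_policy_assignment` resource, consistent with `terraform-module-policy` in Section 11.1.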
### 10.2 Budget Controls
**Subscription-level budgets:** Set for each subscription with alerts at 50%, 75%, 90%, and 100% of monthly budget. Breaches above 100% trigger automatic notification to CDO office and FinOps team.
**Databricks cost controls:**
- Cluster policies enforce max cluster sizes, auto-termination, and spot instance ratios
- Databricks Account Console provides DBU consumption dashboards by workspace, cluster type, job, and user
- Databricks budgets feature (if available) configured per workspace with alerts
- SQL Warehouses configured with economy scaling to minimize DBU for BI serving workload
**Fabric capacity governance:**
- Capacity sized strictly for BI serving only (AD-03)
- Auto-pause enabled (22:00–06:00 EST) saves ~33% of capacity cost
- Smoothing enabled for burst absorption without over-provisioning
- Any request to increase capacity for non-BI workloads triggers architecture review
- Monthly Fabric utilization review by FinOps + Fabric admin
**Storage cost controls:**
- Lifecycle policies (Section 6.3) auto-tier aged data to lower-cost tiers
- Storage growth alerts (Section 8.3) detect anomalous growth
- Monthly review of staging zone sizes to prevent bloat
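A lifecycle rule of the kind referenced above might look like this in Terraform (a sketch; the storage account reference, path prefix, and day thresholds are illustrative assumptions):

```hcl
resource "azurerm_storage_management_policy" "bronze_tiering" {
  storage_account_id = azurerm_storage_account.bronze.id   # hypothetical

  rule {
    name    = "tier-aged-bronze"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["bronze/"]         # illustrative container/path
    }

    actions {
      base_blob {
        # Illustrative thresholds; actual values come from the Section 6.3 policies
        tier_to_cool_after_days_since_modification_greater_than    = 90
        tier_to_archive_after_days_since_modification_greater_than = 365
      }
    }
  }
}
```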
### 10.3 Reserved Capacity & Savings Plans
| Resource | Commitment Type | Term | Estimated Savings |
|---|---|---|---|
| Databricks DBU | Databricks Committed Use Discount (DBCU) | 1-year (based on 6-month usage baseline after Horizon 1 stabilization) | 20–35% vs. pay-as-you-go |
| Azure VMs (SAS Viya AKS nodes) | Azure Savings Plan for Compute | 1-year | 15–25% vs. pay-as-you-go |
| ADLS Gen2 Storage | Azure Reserved Capacity (hot tier) | 1-year (based on projected growth model) | Up to 30% on hot tier |
| Fabric Capacity | Fabric reservation (evaluate availability) | Evaluate after 6 months of usage data | TBD at GA pricing |
| ExpressRoute | ExpressRoute circuit commitment | Already committed (enterprise) | N/A (existing circuit) |
### 10.4 Chargeback Model
Cost is allocated to business units via `CostCenter` tags. The CDO's FinOps practice produces monthly chargeback reports:
- **Direct costs** (Databricks DBU, Fabric CU, SAS compute) allocated to the business unit whose workloads consumed them
- **Shared costs** (storage, networking, governance, monitoring) allocated proportionally by data volume and query consumption per domain
- **Platform overhead** (hub networking, Key Vault, Purview, Manta) allocated as a shared infrastructure charge across all consuming business units
---
## 11. DevOps & Infrastructure as Code
### 11.1 Terraform — Primary IaC Tool
All Azure infrastructure is managed through Terraform. Resources are organized as reusable modules stored in a central Git repository with a clear module hierarchy.
| Terraform Module | Scope | State Backend |
|---|---|---|
| `terraform-module-networking` | Hub-spoke VNets, subnets, NSGs, route tables, Private DNS Zones, Azure Firewall, ExpressRoute, VNet peering | Remote state in Azure Storage (`sub-data-management`); state locking via blob lease |
| `terraform-module-databricks` | Databricks workspaces, cluster policies, Unity Catalog metastore, secret scopes, workspace configuration, IP access lists | Remote state per environment (prod/nonprod); separate state file per workspace |
| `terraform-module-storage` | ADLS Gen2 accounts, containers, lifecycle policies, private endpoints, RBAC role assignments, CMK configuration | Remote state per environment |
| `terraform-module-governance` | Purview account, Purview collections, scan rules, Purview private endpoints, diagnostic settings | Remote state (governance) |
| `terraform-module-keyvault` | Key Vault instances, access policies, private endpoints, diagnostic settings, purge protection | Remote state (security) |
| `terraform-module-sas` | AKS cluster, node pools, AKS networking, SAS service principal, SAS storage | Remote state (SAS) |
| `terraform-module-monitoring` | Log Analytics workspace, diagnostic settings (applied to all resources), alert rules, action groups, Sentinel workspace | Remote state (monitoring) |
| `terraform-module-fabric` | Fabric capacity resource, Fabric admin configuration | Remote state (fabric) |
| `terraform-module-policy` | Azure Policy definitions and assignments at management group level (required tags, allowed regions, deny public endpoints) | Remote state (governance) |
**State management:** All Terraform state files are stored in a dedicated storage account in `sub-data-management` with: blob lease-based state locking (prevents concurrent applies), versioning enabled (state history for rollback), encryption at rest (MMK), access restricted to CI/CD service principal only.
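Each module's remote state configuration follows the same shape; a sketch with hypothetical names (the state resource group, storage account, and key are placeholders, not the actual values):

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate-mgmt-cc"       # hypothetical
    storage_account_name = "sttfstateprod"            # hypothetical (no hyphens allowed)
    container_name       = "tfstate"
    key                  = "networking/prod.tfstate"  # one state file per module/environment
    use_azuread_auth     = true                       # CI/CD SP via Entra ID; no account keys
  }
}
```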
### 11.2 CI/CD Pipeline Design
Pipelines run in Azure DevOps (Greenfield standard). Self-hosted agents run in `vnet-mgmt-cc` to access private endpoints. The pipeline follows a 4-stage model:
**Stage 1 — Validate (on every Pull Request):**
- `terraform fmt -check` (formatting compliance)
- `terraform validate` (syntax and configuration validation)
- `tflint` (Terraform linting for best practices)
- `tfsec` / `Checkov` security scanning: validates that all resources comply with security policies (public endpoints disabled, encryption enabled, required tags present, no unmanaged secrets, private endpoints configured)
- Results posted as PR comment; PR blocked if critical findings
**Stage 2 — Plan (on PR approval):**
- `terraform plan` generates execution plan against target environment
- Plan output posted to PR as a formatted comment for human review
- No changes applied; the review gate ensures human approval of all changes
**Stage 3 — Apply Non-Production (on merge to `develop` branch):**
- `terraform apply` deploys changes to non-production environment
- Automated smoke tests: verify VNet connectivity, private endpoint DNS resolution, Databricks workspace health, Key Vault accessibility, storage account reachability
- If smoke tests fail, automatic rollback via `terraform apply` with previous state
**Stage 4 — Apply Production (on merge to `main` branch + manual approval):**
- Manual approval gate (requires two approvals from platform engineering leads)
- `terraform apply` deploys to production
- Post-deployment validation: resource health checks, Databricks workspace connectivity, pipeline reachability, monitoring integration verification
- Deployment tagged in Git with version number and timestamp
### 11.3 Data Engineering CI/CD (Separate from Infrastructure)
Data engineering artifacts follow their own CI/CD lifecycle, independent of infrastructure IaC:
| Artifact Type | Version Control | CI/CD Tool | Deployment Method |
|---|---|---|---|
| DLT pipeline definitions | Databricks Repos (Git-backed) | Azure DevOps pipeline | Databricks Asset Bundles (DABs): `databricks bundle deploy` to prod workspace |
| Databricks notebooks | Databricks Repos | Azure DevOps | DABs or Repos-based deployment on merge to main |
| dbt models (if used) | Git repository | Azure DevOps | `dbt run` executed by Databricks Workflow job |
| ADF pipelines | ARM template export in Git | Azure DevOps | ARM template deployment to ADF instance |
| Great Expectations suites | Git repository | Azure DevOps | Deployed as part of pipeline package |
| SAS programs & macros | Git repository (external to SAS) | Azure DevOps | File deployment to SAS Compute Server shared filesystem; SAS Model Manager registration |
| Power BI semantic models | Power BI Desktop files in Git | Azure DevOps + Fabric REST API | Automated deployment via Fabric deployment pipelines (dev → test → prod) |
**Branching strategy:** Gitflow with feature branches → `develop` (integration) → `release` (staging validation) → `main` (production). Hotfixes branch from `main` and merge back to both `main` and `develop`.
**Testing strategy for data pipelines:**
- **Unit tests:** DLT expectations serve as row-level quality tests; Great Expectations suites validate cross-table invariants
- **Integration tests:** End-to-end pipeline run in non-prod workspace against sample datasets (masked production data); validates Bronze → Silver → Gold flow, quality gates, and data contract compliance
- **Regression tests:** Compare Gold output schema and row counts against baseline after pipeline changes
- **Performance tests:** Benchmark pipeline duration and DBU consumption against historical baselines; flag regressions exceeding 20% threshold
---
## Appendix A — Resource Naming Convention
General pattern: `{type}-{workload}-{environment}-{region}`
| Resource Type | Prefix | Example |
|---|---|---|
| Resource Group | `rg-` | `rg-databricks-prod-cc` |
| Virtual Network | `vnet-` | `vnet-data-prod-cc` |
| Subnet | `snet-` | `snet-dbx-host-prod` |
| NSG | `nsg-` | `nsg-dbx-host-prod` |
| Storage Account (ADLS) | `stadls` | `stadlsgoldprod` (no hyphens; Azure restriction) |
| Key Vault | `kv-` | `kv-data-platform-prod` |
| Databricks Workspace | `dbw-` | `dbw-data-eng-prod` |
| SQL Warehouse | `sqlwh-` | `sqlwh-bi-serving` |
| Log Analytics | `law-` | `law-data-platform-prod` |
| Azure Data Factory | `adf-` | `adf-data-platform-prod` |
| AKS Cluster | `aks-` | `aks-sas-viya-prod` |
| Private Endpoint | `pe-` | `pe-stadlsgoldprod-dfs` |
| Managed Identity | `id-` | `id-databricks-prod` |
| Service Principal | `sp-` | `sp-sas-compute-prod` |
| Purview Account | `pv-` | `pv-data-governance-prod` |
| Fabric Capacity | `fc-` | `fc-bi-serving-prod` |
| Azure Firewall | `afw-` | `afw-hub-canadacentral` |
Region abbreviations: `cc` = Canada Central, `ce` = Canada East.
---