# Greenfield Modern Data Platform — Azure Infrastructure Deployment Guide

**Companion to:** Modern Data Platform – High-Level Reference Architecture v8.0

**Classification:** Internal – Confidential | **Date:** March 2026

---
## 1. Azure Landing Zone Architecture

### 1.1 Design Philosophy

The Greenfield Modern Data Platform deploys within an Azure Landing Zone aligned with Microsoft's Cloud Adoption Framework (CAF) Enterprise-Scale architecture. The landing zone provides the secure, governed foundation required for a regulated financial institution operating under AMF, OSFI, and Law 25 requirements. Every design decision traces back to the reference architecture's principles: Security by Design, Right Tool for the Right Workload, and Unified Governance with Federated Execution.
### 1.2 Hub-Spoke Topology

The deployment follows a hub-spoke network topology — the standard pattern for enterprise-scale Azure deployments in regulated financial services. Shared network services (firewall, DNS, on-premises connectivity) are centralized in a hub virtual network, while each workload environment operates in its own spoke VNet peered to the hub.

| Component | VNet Name | Purpose |
|---|---|---|
| **Connectivity Hub** | `vnet-hub-canadacentral` | Azure Firewall (Premium), ExpressRoute gateway to on-premises data centers, Azure Bastion for secure admin access, centralized Private DNS Zones, VPN gateway for backup connectivity |
| **Data Platform – Production** | `vnet-data-prod-cc` | Databricks workspaces (VNet-injected), ADLS Gen2 private endpoints, Databricks SQL Warehouse endpoints, Purview private endpoints, Key Vault private endpoints |
| **Data Platform – Non-Production** | `vnet-data-nonprod-cc` | Development and staging Databricks workspaces, test storage accounts, sandbox environments for data scientists |
| **SAS Viya** | `vnet-sas-prod-cc` | SAS Viya Compute Server pods on AKS (or VMs), JDBC connectivity to Databricks SQL Warehouses routed through hub firewall |
| **Management** | `vnet-mgmt-cc` | Azure DevOps self-hosted agents, Terraform runners, monitoring infrastructure, Manta lineage engine deployment (if IaaS) |
| **Fabric (Overlay)** | N/A (PaaS) | Microsoft Fabric is fully managed PaaS — no VNet required. Connectivity to ADLS Gen2 is through OneLake shortcuts and Microsoft-managed private endpoints from the Fabric tenant |

All spoke VNets peer to the hub. Inter-spoke traffic routes through Azure Firewall for inspection. No direct spoke-to-spoke peering is permitted — this ensures all cross-environment traffic is logged and inspectable.
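The peering and forced-tunneling pattern described above can be sketched in Terraform (azurerm provider). Resource names, address ranges, and references such as `azurerm_virtual_network.hub` are illustrative, not the actual deployment values:

```hcl
# Spoke-to-hub peering; no spoke-to-spoke peerings are ever declared.
resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                      = "peer-data-prod-to-hub"
  resource_group_name       = "rg-networking-prod-cc"
  virtual_network_name      = "vnet-data-prod-cc"
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true # hub firewall forwards inter-spoke traffic
  use_remote_gateways       = true # reach on-premises via the hub ExpressRoute gateway
}

# Default route forces all spoke egress through the hub Azure Firewall.
resource "azurerm_route_table" "spoke_default" {
  name                = "rt-data-prod-default"
  location            = "canadacentral"
  resource_group_name = "rg-networking-prod-cc"

  route {
    name                   = "default-via-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = azurerm_firewall.hub.ip_configuration[0].private_ip_address
  }
}
```

A mirrored hub-to-spoke peering with `allow_gateway_transit = true` completes the pair, and the route table is associated with each spoke subnet.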
### 1.3 Hub Services Detail

**Azure Firewall (Premium SKU).** Centralized egress filtering with TLS inspection for outbound traffic, FQDN-based application rules, network rules for port-level control, and threat intelligence–based filtering. All spoke VNets route their default route (0.0.0.0/0) through the hub firewall via user-defined routes (UDRs). The Premium SKU is required for TLS inspection and IDPS capabilities mandated by OSFI guidelines.

**ExpressRoute Gateway.** Dedicated, private connectivity between Greenfield's on-premises data centers (Lévis/Montréal) and Azure. The ExpressRoute circuit terminates in the hub VNet with route propagation to all spokes. This is the primary path for source system connectivity — core banking (DB2, Oracle), insurance policy administration, mainframe extracts, and CRM systems all feed data through this circuit into ADF and Auto Loader ingestion pipelines.

**Private DNS Zones.** Centralized in the hub subscription and linked to every spoke VNet. Zones include `privatelink.dfs.core.windows.net`, `privatelink.blob.core.windows.net`, `privatelink.vaultcore.azure.net`, `privatelink.azuredatabricks.net`, `privatelink.purview.azure.com`, `privatelink.servicebus.windows.net`, and others. This ensures all workloads resolve private endpoint FQDNs to their private IPs regardless of which spoke they run in.

**Azure Bastion.** Browser-based RDP/SSH access to management VMs and jump boxes without exposing public IPs. This is the only permitted path for interactive administrative access to infrastructure resources.
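The centralized-DNS pattern above can be expressed as a minimal Terraform sketch — the hub resource group name and the `azurerm_virtual_network.data_prod` reference are illustrative assumptions:

```hcl
# One centralized Private DNS Zone per private-link service, hosted in the hub.
resource "azurerm_private_dns_zone" "dfs" {
  name                = "privatelink.dfs.core.windows.net"
  resource_group_name = "rg-hub-dns" # illustrative hub resource group
}

# Linked to every spoke VNet so all workloads resolve PE FQDNs to private IPs.
resource "azurerm_private_dns_zone_virtual_network_link" "dfs_data_prod" {
  name                  = "link-dfs-data-prod"
  resource_group_name   = "rg-hub-dns"
  private_dns_zone_name = azurerm_private_dns_zone.dfs.name
  virtual_network_id    = azurerm_virtual_network.data_prod.id
  registration_enabled  = false # resolution only; records are written by private endpoints
}
```

One link resource per zone-VNet pair; the same pattern repeats for the other `privatelink.*` zones listed above.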
### 1.4 Region Strategy

| | Primary | Secondary |
|---|---|---|
| **Region** | Canada Central (Toronto) | Canada East (Québec City) |
| **Role** | All production and non-production workloads | Disaster recovery, geo-redundant storage replication target |
| **Rationale** | Lower latency to majority of Azure services; broader service availability | Data residency compliance (Law 25, OSFI); geographic separation for DR |

**Critical constraint:** All data must remain within Canadian Azure regions. No replication, backup, or compute spillover to non-Canadian regions is permitted. This constraint is enforced via Azure Policy at the management group level (`allowedLocations` = `canadacentral`, `canadaeast`).

---
## 2. Resource Organization

### 2.1 Management Group Hierarchy

The data platform subscriptions inherit organization-wide Azure Policies from Greenfield's root management group (allowed regions, required tags, prohibited resource types, mandatory encryption). The platform-specific hierarchy:

```
Greenfield (Root MG)
└── Data & AI Platform (MG)
    ├── Production (MG)
    │   ├── sub-data-platform-prod
    │   ├── sub-data-sas-prod
    │   └── sub-data-fabric-prod
    ├── Non-Production (MG)
    │   └── sub-data-platform-nonprod
    └── Connectivity & Shared Services (MG)
        ├── sub-data-connectivity
        └── sub-data-management
```

Azure Policies applied at the "Data & AI Platform" management group level include: deny public endpoint creation on storage/Key Vault/Purview, require specific tags on all resources, enforce diagnostic settings on all supported resources, deny creation of unmanaged disks, and require TLS 1.2 minimum.
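The region constraint from Section 1.4 is applied at this management group scope. A hedged Terraform sketch — the management group reference is illustrative, and the definition ID shown is the well-known built-in "Allowed locations" policy:

```hcl
resource "azurerm_management_group_policy_assignment" "allowed_locations" {
  name                = "allowed-locations"
  management_group_id = azurerm_management_group.data_ai_platform.id

  # Built-in "Allowed locations" policy definition.
  policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/e56962a6-4747-49cd-b67b-bf8b01975c4c"

  parameters = jsonencode({
    listOfAllowedLocations = {
      value = ["canadacentral", "canadaeast"]
    }
  })
}
```

The deny-public-endpoint, required-tag, and diagnostic-settings policies listed above follow the same assignment pattern, each pointing at its own built-in or custom definition.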
### 2.2 Subscription Layout

| Subscription | Purpose | Key Resources |
|---|---|---|
| `sub-data-connectivity` | Hub networking, shared services | Azure Firewall, ExpressRoute gateway, Bastion, Private DNS Zones, VPN gateway |
| `sub-data-platform-prod` | Production data platform workloads | Databricks workspaces (prod), ADLS Gen2 accounts (bronze/silver/gold/staging), Key Vault, Purview account, Event Hub namespaces |
| `sub-data-platform-nonprod` | Development, staging, sandbox | Databricks workspaces (dev/stg/sandbox), ADLS Gen2 (dev/stg), Key Vault (non-prod) |
| `sub-data-sas-prod` | SAS Viya Compute Server | AKS cluster (or VM scale sets), SAS-specific staging storage, SAS license server |
| `sub-data-fabric-prod` | Microsoft Fabric capacity | Fabric F-SKU capacity resources, Fabric admin workspace for capacity management |
| `sub-data-management` | Platform operations & DevOps | Azure DevOps self-hosted agents, Terraform state storage account, Log Analytics workspace, Manta deployment (if IaaS), Azure Monitor resources, Microsoft Sentinel |

**Rationale for subscription isolation:** Separate subscriptions provide independent RBAC boundaries (SAS team cannot accidentally modify Databricks resources), independent Azure Cost Management scopes for precise chargeback, independent quota management, and separate blast radii for misconfigurations.
### 2.3 Resource Group Strategy

Within `sub-data-platform-prod`:

| Resource Group | Contents |
|---|---|
| `rg-databricks-prod-cc` | Databricks workspaces, managed resource groups (auto-created by Databricks), associated NSGs |
| `rg-storage-prod-cc` | ADLS Gen2 storage accounts (bronze, silver, gold, staging, archive), storage private endpoints |
| `rg-governance-prod-cc` | Microsoft Purview account, Purview managed storage, Purview private endpoints, Purview managed Event Hub |
| `rg-keyvault-prod-cc` | Azure Key Vault instances (platform secrets + encryption keys), Key Vault private endpoints |
| `rg-monitoring-prod-cc` | Log Analytics workspace (prod), diagnostic settings resources, Azure Monitor action groups, alert rules, workbooks |
| `rg-networking-prod-cc` | VNet, subnets, NSGs, route tables (UDRs), private endpoint NICs |
| `rg-ingestion-prod-cc` | Azure Data Factory instance, Event Hub namespaces (for streaming ingestion), ADF private endpoints |

---
## 3. Networking Architecture

### 3.1 VNet & Subnet Design — Production Data Platform

The production spoke VNet (`vnet-data-prod-cc`) uses a /16 address space (e.g., 10.10.0.0/16) to accommodate Databricks' substantial IP requirements (VNet injection demands large subnets) and future growth.

| Subnet | CIDR | Purpose | Delegation / Notes |
|---|---|---|---|
| `snet-dbx-host-prod` | 10.10.0.0/22 | Databricks cluster host VMs (VNet injection) | Delegation: `Microsoft.Databricks/workspaces`; NSG: `nsg-dbx-host-prod` |
| `snet-dbx-container-prod` | 10.10.4.0/22 | Databricks cluster container network (VNet injection) | Delegation: `Microsoft.Databricks/workspaces`; NSG: `nsg-dbx-container-prod` |
| `snet-private-endpoints` | 10.10.8.0/24 | Private endpoints for ADLS, Key Vault, Purview, Event Hub | No delegation; NSG restricts inbound to platform VNets only |
| `snet-sqlwarehouse-prod` | 10.10.9.0/24 | Databricks Serverless SQL Warehouse connectivity | Network Connectivity Config (NCC) for serverless compute |
| `snet-adf-prod` | 10.10.10.0/24 | ADF Integration Runtime (self-hosted, if needed for on-prem sources) | VMs running IR software |
| `snet-services-prod` | 10.10.11.0/24 | Supporting services, internal load balancers, utility VMs | General-purpose |

**Sizing note:** Databricks VNet injection requires /22 or larger subnets for host and container in production to support auto-scaling across multiple concurrent clusters (data engineering, SQL warehouses, ML training). Each running node consumes one IP in the host subnet and one in the container subnet. Azure reserves five addresses per subnet, so a /22 provides 1,019 usable IPs per subnet — supporting roughly 1,000 concurrent nodes, sufficient for Greenfield's production workload with ample headroom.
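The delegated host subnet from the table above can be sketched in Terraform as follows (the container subnet is identical apart from name and CIDR); the delegation `actions` list mirrors what the azurerm provider documents for Databricks VNet injection:

```hcl
# Host subnet for Databricks VNet injection, delegated to the Databricks service.
resource "azurerm_subnet" "dbx_host" {
  name                 = "snet-dbx-host-prod"
  resource_group_name  = "rg-networking-prod-cc"
  virtual_network_name = "vnet-data-prod-cc"
  address_prefixes     = ["10.10.0.0/22"]

  delegation {
    name = "databricks"
    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }
}
```

The workspace resource then references the host and container subnet names plus their NSG associations in its `custom_parameters`.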
### 3.2 Private Endpoints

Every platform service that supports private endpoints **must** be configured with public network access disabled. This is non-negotiable for a regulated financial institution.

| Service | Private Endpoint Sub-Resource | Private DNS Zone |
|---|---|---|
| ADLS Gen2 (each account × 2) | `dfs`, `blob` | `privatelink.dfs.core.windows.net`, `privatelink.blob.core.windows.net` |
| Azure Key Vault (each instance) | `vault` | `privatelink.vaultcore.azure.net` |
| Microsoft Purview | `account`, `portal` | `privatelink.purview.azure.com`, `privatelink.purviewstudio.azure.com` |
| Databricks Workspace | `databricks_ui_api` | `privatelink.azuredatabricks.net` |
| Azure Event Hub | `namespace` | `privatelink.servicebus.windows.net` |
| Azure Data Factory | `dataFactory`, `portal` | `privatelink.datafactory.azure.net`, `privatelink.adf.azure.com` |
| Azure Container Registry (for SAS) | `registry` | `privatelink.azurecr.io` |
| Databricks Unity Catalog Metastore | (via workspace PE) | Covered by Databricks workspace PE |

**Total private endpoints (production):** Approximately 25–30 PEs across all services. Each PE creates a NIC in `snet-private-endpoints` with a private IP, plus a DNS A record in the corresponding Private DNS Zone.
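One row of the inventory above, expressed as a Terraform sketch — the `dfs` endpoint for the Gold storage account. The resource references (`azurerm_subnet.private_endpoints`, `azurerm_storage_account.gold`, `azurerm_private_dns_zone.dfs`) are illustrative:

```hcl
resource "azurerm_private_endpoint" "adls_gold_dfs" {
  name                = "pe-stadlsgoldprod-dfs"
  location            = "canadacentral"
  resource_group_name = "rg-networking-prod-cc"
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-stadlsgoldprod-dfs"
    private_connection_resource_id = azurerm_storage_account.gold.id
    subresource_names              = ["dfs"]
    is_manual_connection           = false
  }

  # Writes the A record into the hub-managed Private DNS Zone automatically.
  private_dns_zone_group {
    name                 = "dns-dfs"
    private_dns_zone_ids = [azurerm_private_dns_zone.dfs.id]
  }
}
```

Each of the 25–30 production PEs follows this shape, varying only the target resource, sub-resource name, and DNS zone.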
### 3.3 Network Security Groups (NSGs)

NSGs are applied at the subnet level with a default deny-all posture. Explicit allow rules are created only for required traffic flows.

**Databricks host & container subnets:**

- Allow all inbound/outbound between `snet-dbx-host-prod` and `snet-dbx-container-prod` (required for Spark executor ↔ driver communication)
- Allow outbound to `snet-private-endpoints` on port 443 (ADLS and Key Vault HTTPS) and port 1433 (if Azure SQL is used)
- Allow outbound to Databricks control plane IPs (published by Databricks, region-specific) on port 443
- Allow outbound to Azure Service Tags (`AzureActiveDirectory`, `AzureMonitor`) for authentication and telemetry
- Deny all inbound from internet
- Deny all other outbound (forced through Azure Firewall via UDR)

**Private endpoint subnet:**

- Allow inbound from `snet-dbx-host-prod`, `snet-dbx-container-prod`, `snet-adf-prod`, `snet-services-prod`, and `vnet-sas-prod-cc` (via peering) on service-specific ports
- Deny inbound from all other sources
- No outbound rules needed (PEs are inbound-only destinations)
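A minimal Terraform sketch of one allow rule from the private endpoint subnet's posture above — the NSG name `nsg-private-endpoints-prod` is an illustrative assumption:

```hcl
# Subnet-level NSG rule: allow HTTPS from the Databricks subnets to the PE subnet.
resource "azurerm_network_security_rule" "pe_inbound_from_dbx" {
  name                        = "Allow-Dbx-To-PE-443"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefixes     = ["10.10.0.0/22", "10.10.4.0/22"] # dbx host + container
  destination_address_prefix  = "10.10.8.0/24"                   # snet-private-endpoints
  resource_group_name         = "rg-networking-prod-cc"
  network_security_group_name = "nsg-private-endpoints-prod"
}
```

A low-priority deny-all rule beneath the explicit allows completes the default-deny posture.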
### 3.4 Azure Firewall Rules

The hub Azure Firewall controls all egress from spoke VNets. Key application rule collections:

| Rule Collection | FQDN Targets | Protocol | Purpose |
|---|---|---|---|
| `rc-databricks-control` | `*.azuredatabricks.net`, Databricks control plane IPs | HTTPS (443) | Databricks workspace communication with Azure-managed control plane |
| `rc-azure-services` | `login.microsoftonline.com`, `management.azure.com`, `*.monitor.azure.com` | HTTPS (443) | Entra ID authentication, ARM management, Azure Monitor telemetry |
| `rc-package-repos` | `pypi.org`, `files.pythonhosted.org`, `repo.anaconda.com`, `conda.anaconda.org` | HTTPS (443) | Python/Conda package installation for Databricks clusters and SAS |
| `rc-databricks-artifacts` | `dbartifactsprodcac.blob.core.windows.net` (region-specific) | HTTPS (443) | Databricks runtime artifacts, libraries, Spark distributions |
| `rc-sas-licensing` | SAS licensing endpoints (provided by SAS) | HTTPS (443) | SAS Viya license validation |
| `rc-deny-all` | `*` | Any | Default deny — all egress not matching an explicit rule is blocked and logged |
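One of the collections above (`rc-package-repos`) sketched as a Firewall Policy rule collection group in Terraform; the policy reference and priorities are illustrative assumptions:

```hcl
resource "azurerm_firewall_policy_rule_collection_group" "egress" {
  name               = "rcg-data-platform-egress"
  firewall_policy_id = azurerm_firewall_policy.hub.id
  priority           = 200

  application_rule_collection {
    name     = "rc-package-repos"
    priority = 300
    action   = "Allow"

    rule {
      name             = "python-repos"
      source_addresses = ["10.10.0.0/16"] # production data spoke
      protocols {
        type = "Https"
        port = 443
      }
      destination_fqdns = [
        "pypi.org",
        "files.pythonhosted.org",
        "repo.anaconda.com",
        "conda.anaconda.org",
      ]
    }
  }
}
```

Each remaining collection in the table is an additional `application_rule_collection` block in the same group; anything unmatched falls through to the default deny and is logged.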
### 3.5 Connectivity to On-Premises

Source systems (core banking on DB2, Oracle, SQL Server; insurance policy administration; CRM; mainframe extracts) connect to the Azure data platform through ExpressRoute. ADF pipelines and Databricks Auto Loader access on-premises sources through the hub VNet's ExpressRoute gateway, with traffic routing controlled by BGP route propagation and UDRs. Self-hosted Integration Runtimes (in `snet-adf-prod`) are deployed for sources that require a local agent.

---
## 4. Identity & Access Management

### 4.1 Microsoft Entra ID as the Identity Foundation

All platform access is brokered through Microsoft Entra ID (formerly Azure Active Directory) with mandatory multi-factor authentication enforced via Conditional Access policies. Conditional Access policies require: MFA for all users, compliant device, access from Greenfield corporate network or approved VPN locations, and session lifetime controls (re-authentication every 12 hours for interactive sessions).

| Component | Identity Integration | Authentication Method |
|---|---|---|
| Databricks Workspaces | Entra ID SSO via SCIM provisioning; Entra security groups synced to Unity Catalog for RBAC | OAuth 2.0 / SAML; MFA via Conditional Access |
| Microsoft Fabric | Native Entra ID integration via Microsoft 365 tenant | Entra ID SSO; MFA enforced |
| SAS Viya | Entra ID integration via SAML 2.0 federation or direct OIDC (depending on SAS version) | SAML 2.0 federated with Entra ID; MFA enforced |
| Microsoft Purview | Native Entra ID; Purview collection-level roles mapped to Entra security groups | Entra ID SSO |
| Azure Data Factory | Managed identity for pipeline execution; Entra groups for authoring RBAC | System-assigned Managed Identity |
| ADLS Gen2 | Entra ID RBAC at Azure resource level (Storage Blob Data roles) + Unity Catalog ACLs at the data layer | Managed Identity / Service Principal |
### 4.2 RBAC Model — Azure Resource Layer

Azure RBAC provides coarse-grained access at the resource level. Data-level fine-grained access (RLS, CLS, DDM) is enforced by Unity Catalog at the compute layer.

| Entra Security Group | Azure RBAC Role | Scope |
|---|---|---|
| `sg-data-platform-admins` | Contributor | `sub-data-platform-prod`, `sub-data-platform-nonprod` |
| `sg-data-engineers` | Databricks Contributor, Storage Blob Data Contributor | Databricks workspaces, ADLS Gen2 accounts |
| `sg-data-scientists` | Databricks Contributor (sandbox workspace only), Storage Blob Data Reader | Sandbox workspace, Silver/Gold ADLS (read) |
| `sg-data-analysts` | Reader | Databricks analytics workspace (SQL Warehouse access via Unity Catalog grants) |
| `sg-governance-admins` | Purview Data Curator, Key Vault Administrator | Purview account, Key Vault |
| `sg-sas-developers` | Contributor | `sub-data-sas-prod` |
| `sg-fabric-admins` | Fabric Capacity Administrator | Fabric capacity resource |
| `sg-finops` | Cost Management Reader, Billing Reader | All data platform subscriptions |
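Two rows of the mapping above, sketched as Terraform role assignments. Group object IDs would normally come from the azuread provider; the variables and resource references shown are illustrative:

```hcl
# Data engineers: contributor access at the storage account scope.
resource "azurerm_role_assignment" "data_engineers_silver" {
  scope                = azurerm_storage_account.silver.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.sg_data_engineers_object_id
}

# Least-privilege variant: a reader role granted at container scope rather than
# account scope, the pattern used wherever access must be limited to specific paths.
resource "azurerm_role_assignment" "scoped_reader" {
  scope                = "${azurerm_storage_account.gold.id}/blobServices/default/containers/gold"
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = var.sg_data_scientists_object_id
}
```

Azure RBAC scopes nest (management group → subscription → resource group → resource → container), so assignments are always made at the narrowest scope that satisfies the table.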
### 4.3 Service Principals & Managed Identities

Automated workloads use managed identities (preferred) or service principals (where managed identities are not supported). **Shared secrets and embedded credentials are explicitly prohibited** per the reference architecture.

| Workload | Identity Type | Scope / Role Assignments |
|---|---|---|
| Databricks workspace (system) | System-assigned Managed Identity | Storage Blob Data Contributor on ADLS Gen2 accounts; Key Vault Secrets User on `kv-data-platform-prod` |
| ADF pipelines | System-assigned Managed Identity | Storage Blob Data Contributor on ADLS; Databricks workspace access via linked service token |
| SAS Viya Compute Server | Service Principal (`sp-sas-compute-prod`) | Storage Blob Data Reader on **authorized ADLS paths only** (non-sensitive datasets); JDBC access to Databricks SQL Warehouses scoped via Unity Catalog grants |
| Purview scanners | System-assigned Managed Identity | Storage Blob Data Reader on all ADLS accounts; Databricks metastore reader for Unity Catalog metadata harvesting |
| Manta lineage engine | Service Principal (`sp-manta-prod`) | Read-only access to Databricks repos, ADF metadata APIs, SAS code repositories; Purview Data Curator for lineage publishing |
| CI/CD pipelines (Terraform) | Service Principal (`sp-terraform-prod`) | Contributor on data platform subscriptions; User Access Administrator for RBAC assignments |

**SAS read-path security enforcement:** Because SAS Compute Server accesses ADLS files using its service principal's Azure RBAC permissions (bypassing Unity Catalog's fine-grained RLS/CLS/DDM), the `sp-sas-compute-prod` service principal is granted Storage Blob Data Reader **only** on non-sensitive, pre-authorized ADLS paths. Those paths are registered as Unity Catalog external locations with restricted scope. For sensitive datasets, SAS **must** read through JDBC LIBNAME to Databricks SQL Warehouses, where Unity Catalog enforcement applies at query time.

---
## 5. Compute Sizing & Configuration

### 5.1 Databricks Workspaces

| Workspace | Environment | Purpose | Configuration |
|---|---|---|---|
| `dbw-data-eng-prod` | Production | Data engineering: Bronze→Silver→Gold DLT pipelines, Auto Loader streaming ingestion, data quality workflows | Unity Catalog enabled; VNet injected; cluster policies enforce approved instance types (`Standard_DS4_v2` 8 vCPU/28 GB as default workers, `Standard_DS5_v2` 16 vCPU/56 GB for memory-intensive), auto-terminate 30 min, spot instances (60/40 spot/on-demand) for non-critical batch jobs; Photon enabled for DLT pipelines |
| `dbw-analytics-prod` | Production | SQL analytics, Databricks AI/BI Dashboards, Genie NL querying, ad-hoc analyst queries | Serverless SQL Warehouses (preferred); Pro SQL Warehouses for workloads requiring Classic compute; auto-suspend 10 min |
| `dbw-mlops-prod` | Production | ML/AI: MLflow experiments, model training, Feature Store serving, Model Serving endpoints, GenAI workloads | GPU instance pools (`Standard_NC6s_v3` 6 vCPU/112 GB/1× V100 for training; `Standard_NC4as_T4_v3` for inference); CPU pools for feature engineering; Model Serving auto-scale (min replicas: 0 for dev, 1 for prod SLA endpoints) |
| `dbw-data-eng-dev` | Non-Prod | Development and testing of data engineering pipelines | Reduced cluster sizes (max 4 workers); mandatory auto-terminate 15 min; restricted to `Standard_DS3_v2`; Unity Catalog dev metastore (separate from prod) |
| `dbw-sandbox` | Non-Prod | Data science exploration, POCs, experimentation | Read-only access to Silver/Gold via Unity Catalog; no production write access; budget cap enforced ($X/month per user via tag-based cost alerts); auto-terminate 10 min |

**Cluster policy governance:** All workspaces enforce cluster policies that prevent creation of non-compliant clusters. Policies define: allowed instance types (approved VM families only), maximum workers per cluster (10 for dev, 50 for prod data eng, 20 for ML), mandatory auto-termination, mandatory Unity Catalog mode (no legacy clusters), spot instance minimum ratio, custom tags (CostCenter, DataDomain required), and prohibited features (no local file system access in prod, no public IP assignment).
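The governance rules above translate into cluster policy definitions. A hedged sketch using the Databricks Terraform provider — the policy name and the exact set of controlled attributes are illustrative, not the production policy:

```hcl
resource "databricks_cluster_policy" "data_eng_prod" {
  name = "policy-data-eng-prod"

  # Policy definition: each key constrains a cluster attribute.
  definition = jsonencode({
    "node_type_id"            = { "type" = "allowlist", "values" = ["Standard_DS4_v2", "Standard_DS5_v2"] },
    "autoscale.max_workers"   = { "type" = "range", "maxValue" = 50 },
    "autotermination_minutes" = { "type" = "range", "maxValue" = 30, "defaultValue" = 30 },
    "custom_tags.CostCenter"  = { "type" = "unlimited", "isOptional" = false },
    "custom_tags.DataDomain"  = { "type" = "unlimited", "isOptional" = false },
    # Enforces Unity Catalog shared-access mode; legacy clusters cannot be created.
    "data_security_mode"      = { "type" = "fixed", "value" = "USER_ISOLATION" }
  })
}
```

Users are then granted `CAN_USE` on the policy rather than unrestricted cluster-creation rights, so non-compliant configurations are rejected at creation time.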
### 5.2 Databricks SQL Warehouses

| SQL Warehouse | Size (DBU) | Purpose | Configuration |
|---|---|---|---|
| `sqlwh-bi-serving` | Medium (4–8 DBU base) | JDBC endpoint consumed by SAS Compute Server (JDBC LIBNAME) and indirectly by Power BI (via OneLake shortcut refresh queries) | Serverless; auto-suspend 10 min; scaling policy: economy (cost-optimized, queues queries rather than scaling aggressively); spot instances enabled; query result caching enabled |
| `sqlwh-analytics` | Small–Medium (2–8 DBU) | Ad-hoc analyst queries via SQL editor, JDBC/ODBC for downstream applications, notebook SQL exploration | Serverless; auto-suspend 10 min; query queuing enabled; intelligent workload management |
| `sqlwh-etl-support` | Medium–Large (4–16 DBU) | Data quality validation queries run during pipeline execution, data contract SLA checks (freshness, completeness), Gold layer aggregation queries | Pro; scheduled scaling aligned with pipeline windows (04:00–08:00 EST peak for T+1 processing); auto-suspend after pipeline completion |
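The first warehouse above, sketched with the Databricks Terraform provider; the cluster-count ceiling is an illustrative assumption:

```hcl
resource "databricks_sql_endpoint" "bi_serving" {
  name                      = "sqlwh-bi-serving"
  cluster_size              = "Medium"
  auto_stop_mins            = 10               # auto-suspend per the table above
  warehouse_type            = "PRO"            # serverless requires the PRO warehouse type
  enable_serverless_compute = true
  max_num_clusters          = 2                # economy posture: queue before scaling out
  spot_instance_policy      = "COST_OPTIMIZED"
}
```

SAS and Power BI consumers then connect to the endpoint's JDBC/ODBC connection details, with Unity Catalog grants controlling what each identity can query.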
### 5.3 SAS Viya Compute Server

SAS Viya runs on the Compute Server engine (not CAS — this is AD-04). Processing is sequential and batch-oriented, so compute is sized for single-threaded throughput with high memory for in-process data manipulation, not for horizontal parallelism.

**Deployment option: AKS (recommended)**

| Component | VM Size / Pod Spec | Quantity | Notes |
|---|---|---|---|
| SAS Compute Server pods | `Standard_E16s_v5` (16 vCPU, 128 GB RAM) | 2–4 pods | Primary workhorses for actuarial model development, risk model validation, moderate batch scoring; memory-optimized for SAS DATA step and PROC SQL in-memory operations |
| SAS Programming Runtime (heavy) | `Standard_E32s_v5` (32 vCPU, 256 GB RAM) | 1–2 pods | Large actuarial reserving models (IFRS 17, IBNR) requiring extended memory; scheduled availability during nightly batch windows to control cost |
| SAS License Server | `Standard_D4s_v5` (4 vCPU, 16 GB RAM) | 1 (HA pair) | SAS license management service; always-on |
| SAS Model Manager | `Standard_D8s_v5` (8 vCPU, 32 GB RAM) | 1–2 pods | Model governance, model registration, champion/challenger tracking for regulatory model management |
| AKS System Node Pool | `Standard_D4s_v5` | 3 nodes (across AZs) | AKS system services (CoreDNS, kube-proxy, metrics-server) |

**AKS configuration:** Private AKS cluster (no public API endpoint); Azure CNI networking for VNet integration in `vnet-sas-prod-cc`; Entra ID integration for RBAC; node pool auto-scaling (min 2 / max 6 for compute pods); Azure Disk CSI driver for persistent volumes (SAS work directories); Azure Files for shared SAS configuration.
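The AKS configuration above can be sketched in Terraform — the cluster name, resource group, and subnet reference are illustrative assumptions:

```hcl
resource "azurerm_kubernetes_cluster" "sas" {
  name                    = "aks-sas-prod-cc"
  location                = "canadacentral"
  resource_group_name     = "rg-sas-prod-cc"
  dns_prefix              = "sasprod"
  private_cluster_enabled = true # no public API endpoint

  # System node pool for AKS services, spread across availability zones.
  default_node_pool {
    name           = "system"
    node_count     = 3
    vm_size        = "Standard_D4s_v5"
    vnet_subnet_id = azurerm_subnet.sas_nodes.id
    zones          = ["1", "2", "3"]
  }

  network_profile {
    network_plugin = "azure" # Azure CNI for full VNet integration
  }

  identity {
    type = "SystemAssigned"
  }
}
```

The SAS compute pods run on a separate `azurerm_kubernetes_cluster_node_pool` of `Standard_E16s_v5` nodes with `enable_auto_scaling = true`, `min_count = 2`, `max_count = 6`, matching the sizing table.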
### 5.4 Microsoft Fabric Capacity

Fabric capacity is sized **strictly for the BI serving workload** (AD-03). Any request to increase capacity for data engineering or warehousing triggers an architecture review per the reference architecture's anti-pattern rules.

| Capacity SKU | CU | Use Case | Governance |
|---|---|---|---|
| F64 (Production) | 64 Capacity Units | Power BI Direct Lake semantic models serving 55,000 users, report rendering, paginated reports for regulatory distribution, Copilot-assisted analytics | Auto-pause during non-business hours (22:00–06:00 EST); smoothing enabled for burst absorption; capacity alerts at 70% and 90% utilization; Fabric admin monitors for non-BI workloads |
| F32 (Non-Production) | 32 CU | Development and testing of Power BI semantic models, report prototyping, Direct Lake connectivity validation | Auto-pause aggressive (off outside business hours); BI workloads only; no Fabric notebooks or Spark |
| F16 (Fabric IQ POC — Horizon 2) | 16 CU | Ontology POC on Customer 360 domain, Data Agent evaluation with controlled user group (per Section 4.5 prerequisites) | Isolated capacity; time-limited (6-month evaluation); requires Architecture Review Board approval before provisioning |

**Cost note:** F64 at list price is approximately $8,000–9,000 USD/month. This is significantly less than the Databricks DBU cost that would be incurred if 55,000 Power BI users queried Databricks SQL Warehouses via DirectQuery — the entire economic justification for Fabric in this architecture (AD-03).

---
## 6. Storage Architecture

### 6.1 ADLS Gen2 Account Layout

ADLS Gen2 is the shared storage substrate across all three platforms (AD-07). All accounts have hierarchical namespace (HNS) enabled for Delta Lake compatibility and are configured with private endpoints only (public access disabled at the account level).

| Storage Account | Access Tier | Container Structure | Access Pattern |
|---|---|---|---|
| `stadlsbronzeprod` | Hot | `/bronze/{source_system}/{entity}/` (e.g., `/bronze/core_banking/accounts/`, `/bronze/insurance/claims/`) | **Write:** ADF managed identity, Auto Loader managed identity; **Read:** Databricks data engineering workspace MI, Purview scanner MI |
| `stadlssilverprod` | Hot | `/silver/{domain}/{entity}/` (e.g., `/silver/customer/individual/`, `/silver/claims/claim_header/`) | **Write:** Databricks DLT pipelines; **Read:** Databricks analytics & ML workspaces, SAS via JDBC (Unity Catalog enforced) |
| `stadlsgoldprod` | Hot | `/gold/{data_product}/{entity}/` (e.g., `/gold/customer_360/member_profile/`, `/gold/risk_features/credit_scores/`) | **Write:** Databricks DLT/SQL pipelines; **Read:** Power BI (OneLake shortcut → Direct Lake), Databricks SQL, SAS JDBC LIBNAME, ML Feature Store, REST API services |
| `stadlsstagingprod` | Hot | `/staging/ingestion/{source}/`, `/staging/sas_writeback/{model_domain}/{model_name}/{run_date}/`, `/staging/purview_dq/` | **Write:** ADF (ingestion landing), SAS ADLS LIBNAME (writeback staging); **Read:** Purview DQ scanner (pre-Bronze assessment), Databricks promotion pipelines |
| `stadlsarchiveprod` | Cool → Archive | `/archive/{domain}/{entity}/{year}/` | **Write:** Lifecycle policy tiering; **Read:** Rare, on-demand rehydration for regulatory retrieval |

**Why separate storage accounts (not just separate containers)?** Separate accounts provide independent RBAC boundaries (the SAS service principal has no RBAC on `stadlsbronzeprod`), independent throughput limits (each account gets its own IOPS/bandwidth quota), independent private endpoints (enabling subnet-level NSG control), independent lifecycle policies, and independent diagnostic logging.
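A minimal Terraform sketch of one account from the layout above, with the hardening settings this section and Sections 6.3–6.4 call for. The replication type is an illustrative assumption (GZRS replicates to the paired region, which for Canada Central is Canada East):

```hcl
resource "azurerm_storage_account" "gold" {
  name                          = "stadlsgoldprod"
  resource_group_name           = "rg-storage-prod-cc"
  location                      = "canadacentral"
  account_tier                  = "Standard"
  account_replication_type      = "GZRS"  # illustrative; geo-redundancy per DR strategy
  is_hns_enabled                = true    # hierarchical namespace for Delta Lake
  public_network_access_enabled = false   # private endpoints only
  min_tls_version               = "TLS1_2"

  blob_properties {
    versioning_enabled = true
    delete_retention_policy { days = 14 }          # blob soft delete
    container_delete_retention_policy { days = 7 } # container soft delete
  }
}
```

The bronze, silver, staging, and archive accounts differ only in name, tier, replication, and (for silver/gold) the CMK configuration described in Section 6.4.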
### 6.2 Delta Lake Configuration

All data within Bronze, Silver, and Gold containers is stored in Delta Lake format (AD-01).

**Time Travel retention:** 90 days minimum for Bronze (regulatory auditability per AMF requirements); 30 days for Silver and Gold (sufficient for pipeline replay and debugging), extended to 90 days for regulatory domain Gold tables (Financial Aggregates, Risk Features).

**VACUUM & OPTIMIZE schedule (via Databricks Workflows):**

- Bronze: `VACUUM` retaining 90 days, runs weekly (low urgency, append-only)
- Silver: `VACUUM` retaining 30 days, runs nightly; `OPTIMIZE` with `ZORDER BY` on business keys, runs nightly
- Gold: `VACUUM` retaining 30 days, runs nightly; `OPTIMIZE` on tables configured with liquid clustering (`CLUSTER BY` on high-cardinality filter columns: `date`, `business_unit`, `product_code`, `member_id`), runs nightly post-refresh; table statistics maintained via `ANALYZE TABLE` for the SQL Warehouse query optimizer

**Schema evolution:** `mergeSchema` enabled on Bronze tables (Auto Loader schema inference accommodates upstream changes without pipeline failures). Silver and Gold tables enforce strict schemas via DLT expectations and data contracts — schema-breaking changes require a data product lifecycle change request.
### 6.3 Lifecycle Policies

| Storage Account | Policy Rule | Action / Rationale |
|---|---|---|
| `stadlsbronzeprod` | Blobs not modified in 180 days → Cool tier; not modified in 365 days → Archive tier | Cost optimization for aged raw data while maintaining regulatory access; Delta Time Travel handles recent rollback needs |
| `stadlsstagingprod` | Ingestion staging blobs > 30 days → delete; SAS writeback blobs > 14 days after promotion validation → delete | Prevent staging zone bloat; staging data is ephemeral by design |
| `stadlsarchiveprod` | Archive tier by default; rehydrate on demand (Standard priority, 15 hours) | Lowest cost for long-term regulatory retention; rare access pattern |
| All accounts | Soft delete enabled (14-day retention); blob versioning enabled; container soft delete (7 days) | Accidental deletion recovery baseline |
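The Bronze tiering rule from the table above, as a Terraform sketch — the account reference and the `bronze/` prefix filter are illustrative assumptions:

```hcl
resource "azurerm_storage_management_policy" "bronze_tiering" {
  storage_account_id = azurerm_storage_account.bronze.id

  rule {
    name    = "tier-aged-raw-data"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["bronze/"] # illustrative container/path prefix
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 180
        tier_to_archive_after_days_since_modification_greater_than = 365
      }
    }
  }
}
```

The staging-deletion rule follows the same shape with a `delete_after_days_since_modification_greater_than` action instead of tiering.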
### 6.4 Encryption

All storage accounts use encryption at rest with **Microsoft-managed keys (MMK)** as the baseline. For data classified as **Restricted** (per Purview auto-classification — e.g., NAS/SIN numbers, certain financial details), **customer-managed keys (CMK)** stored in Azure Key Vault (`kv-data-encryption-prod`) are applied at the storage account level. In practice, this means `stadlssilverprod` and `stadlsgoldprod` use CMK (these contain processed, queryable sensitive data), while `stadlsbronzeprod` uses MMK (raw data is further protected by restricted RBAC — only data engineering roles have access).

Encryption in transit: TLS 1.2 minimum enforced on all storage endpoints. `Secure transfer required` = true on all accounts.

---
## 7. Security Architecture
|
|
|
|
|
|
### 7.1 Azure Key Vault
|
|
|
|
|
|
| Key Vault Instance | Purpose | Access Policy |
|
|
|
|---|---|---|
|
|
|
| `kv-data-platform-prod` | Platform operational secrets: database connection strings, SAS license keys, API keys for external data feeds, ADF linked service credentials | Databricks MI, ADF MI, SAS SP → Key Vault Secrets User role; CI/CD SP → Key Vault Administrator (for secret rotation automation) |
|
|
|
| `kv-data-encryption-prod` | Customer-managed encryption keys: ADLS CMK, Databricks workspace CMK (managed services encryption, DBFS encryption) | Storage account MI, Databricks workspace MI → Key Vault Crypto Service Encryption User role; Key Vault admin group → Key Vault Administrator |
|
|
|
| `kv-data-platform-nonprod` | Non-production secrets (same structure, isolated values) | Non-prod workspace MIs and SPs |
|
|
|
|
|
|
**Hardening:**
|
|
|
- Key Vault Firewall: enabled; access restricted to private endpoint only (from `snet-private-endpoints`)
|
|
|
- Purge protection: enabled on all instances (blocks permanent deletion of soft-deleted items, even by admins, for the full 90-day retention period)
|
|
|
- Soft delete: enabled with 90-day retention
|
|
|
- Diagnostic settings: all Key Vault operations streamed to Log Analytics; alert rules configured for unexpected access patterns (access from unknown service principals, bulk secret reads, key deletion attempts)
|
|
|
- Secret rotation: automated rotation policies for secrets with 90-day expiry; CI/CD pipeline rotates service principal credentials quarterly
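The 90-day rotation cadence above is simple date arithmetic, but getting the lead time right matters (rotate *before* expiry, not at it). A minimal sketch, pure Python with no Key Vault SDK (function names and the 7-day lead are illustrative):

```python
from datetime import date, timedelta

ROTATION_PERIOD_DAYS = 90  # matches the 90-day expiry policy above

def next_rotation(last_rotated: date, lead_days: int = 7) -> date:
    """Date the CI/CD rotation job should run: lead_days before expiry."""
    return last_rotated + timedelta(days=ROTATION_PERIOD_DAYS - lead_days)

def is_overdue(last_rotated: date, today: date) -> bool:
    """True once the secret has reached or passed its expiry date."""
    return today >= last_rotated + timedelta(days=ROTATION_PERIOD_DAYS)
```

In the real pipeline this logic would drive a scheduled job that calls the Key Vault API; the expiry itself is also set on each secret so Key Vault can alert independently.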
|
|
|
|
|
|
### 7.2 Private Endpoints — Defense in Depth
|
|
|
|
|
|
All platform services operate exclusively through private endpoints. Public endpoints are disabled at the resource level and blocked by Azure Policy. Network traffic between services never traverses the public internet.
|
|
|
|
|
|
The full private endpoint inventory (approximately 25–30 PEs in production) is detailed in Section 3.2. Each PE creates: a NIC with private IP in `snet-private-endpoints`, a DNS A record in the corresponding Private DNS Zone, and NSG rules governing which subnets can reach it.
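To illustrate the DNS piece: each PE's A record lives in a `privatelink.*` zone keyed by the target sub-resource. A hedged sketch of the mapping (the zone names follow Microsoft's published private-link DNS zone list; the helper function is illustrative):

```python
# Private DNS zones used by this platform's private endpoints.
PRIVATE_DNS_ZONES = {
    "dfs":   "privatelink.dfs.core.windows.net",   # ADLS Gen2
    "blob":  "privatelink.blob.core.windows.net",  # Blob endpoint
    "vault": "privatelink.vaultcore.azure.net",    # Key Vault
}

def private_fqdn(resource_name: str, subresource: str) -> str:
    """FQDN that resolves to the PE's private IP,
    e.g. for pe-stadlsgoldprod-dfs."""
    return f"{resource_name}.{PRIVATE_DNS_ZONES[subresource]}"
```

When a client inside the VNet resolves the public FQDN, a CNAME hands it off to this privatelink name, which the Private DNS Zone answers with the PE's private IP.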
|
|
|
|
|
|
### 7.3 Data Encryption — Layers
|
|
|
|
|
|
| Layer | Encryption | Key Management |
|
|
|
|---|---|---|
|
|
|
| **At rest — ADLS Gen2** | AES-256; MMK (default) or CMK for Restricted data | CMK in `kv-data-encryption-prod`; auto-rotation annually |
|
|
|
| **At rest — Databricks DBFS** | AES-256 with CMK | CMK in `kv-data-encryption-prod` |
|
|
|
| **At rest — Databricks managed services** | CMK for notebook results, job results, ML artifacts | Workspace-level CMK configuration |
|
|
|
| **In transit — all services** | TLS 1.2 minimum | Azure-managed certificates; Databricks enforces TLS for all cluster ↔ storage and cluster ↔ control plane communication |
|
|
|
| **In transit — ExpressRoute** | MACsec (Layer 2 encryption at peering) | Optional but recommended for ExpressRoute Direct circuits |
|
|
|
|
|
|
### 7.4 Databricks Workspace Security Configuration
|
|
|
|
|
|
| Security Control | Configuration Detail |
|
|
|
|---|---|
|
|
|
| **Workspace encryption** | CMK from `kv-data-encryption-prod` for managed services and DBFS |
|
|
|
| **Cluster policies** | Mandatory: enforce approved instance types, auto-termination, spot ratios, max workers, Unity Catalog mode, custom tags; prohibit legacy table ACLs, local file download, init scripts from untrusted sources |
|
|
|
| **Unity Catalog** | External metastore on dedicated ADLS storage; all data access routed through UC; legacy Hive metastore disabled; default catalog = none (users must specify catalog explicitly) |
|
|
|
| **Token management** | Personal Access Tokens (PATs) **disabled** in production workspaces; service principals use OAuth 2.0 M2M (client credentials flow) |
|
|
|
| **IP access lists** | Workspace accessible only from Greenfield corporate network IP ranges (via ExpressRoute) and Azure Bastion subnet |
|
|
|
| **Audit logging** | All workspace audit logs streamed to Log Analytics via diagnostic settings; captures: data access events, cluster lifecycle, admin configuration changes, SQL query history, secrets access |
|
|
|
| **Secret scopes** | All Databricks secret scopes backed by Azure Key Vault (`kv-data-platform-prod`); Databricks-native secret storage disabled |
|
|
|
| **Secure cluster connectivity (No Public IP)** | Enabled — cluster nodes have no public IPs; all communication via VNet and private links |
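The cluster-policy controls in the table translate into Databricks policy-definition JSON. A hedged sketch of a fragment (attribute paths follow the Databricks cluster-policy schema; instance types and values are illustrative, not the platform's actual policy):

```python
import json

# Fragment of a production cluster policy: "fixed" values cannot be
# overridden by users; "allowlist" restricts choices; "range" caps size.
policy = {
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "num_workers": {"type": "range", "maxValue": 20},
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_D8ds_v5", "Standard_D16ds_v5"],  # illustrative
    },
    "custom_tags.CostCenter": {"type": "fixed", "value": "CC-CDO-1234"},
    "data_security_mode": {"type": "fixed", "value": "USER_ISOLATION"},  # UC mode
}

print(json.dumps(policy, indent=2))
```

Because the policy is plain JSON, it is version-controlled and deployed through the same Terraform module as the workspace itself.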
|
|
|
|
|
|
### 7.5 Data Loss Prevention & Exfiltration Controls
|
|
|
|
|
|
- **Egress restriction:** All compute cluster outbound traffic routes through Azure Firewall; only allow-listed FQDNs are permitted (Section 3.4)
|
|
|
- **DBFS download restriction:** Local file download from Databricks notebooks disabled in production via workspace admin settings
|
|
|
- **Purview DLP:** Microsoft Purview DLP policies integrated with M365 prevent sensitive data from leaving via email, Teams, or SharePoint
|
|
|
- **Clipboard/export controls:** Fabric Power BI tenant settings restrict data export from reports (no CSV export of Restricted-classified datasets)
|
|
|
|
|
|
---
|
|
|
|
|
|
## 8. Monitoring & Observability
|
|
|
|
|
|
### 8.1 Centralized Monitoring Stack
|
|
|
|
|
|
All monitoring data converges on a central Log Analytics workspace (`law-data-platform-prod` in `sub-data-management`) to provide a single-pane-of-glass view.
|
|
|
|
|
|
| Component | Technology | Data Collected |
|
|
|
|---|---|---|
|
|
|
| **Infrastructure** | Azure Monitor metrics + Log Analytics | VM metrics (CPU, memory, disk), AKS node/pod metrics, storage account metrics (transactions, latency, capacity), network flow logs (NSG flow logs v2), Azure Firewall logs (application + network rules) |
|
|
|
| **Databricks** | Databricks audit logs → Diagnostic Settings → Log Analytics; Databricks system tables (`system.billing.usage`, `system.compute.clusters`, `system.query.history`) | Cluster utilization & idle time, job run durations/statuses/failures, SQL Warehouse query performance (duration, bytes scanned, cache hit ratio), DBU consumption by workspace/cluster/job/user, Unity Catalog audit trail (who accessed what data) |
|
|
|
| **Data pipelines** | ADF monitoring → Log Analytics; Databricks Workflows run logs | Pipeline run status, duration, row counts, error details; DLT pipeline quality metrics (rows passed/failed expectations); ingestion audit log (source, timestamp, row counts, schema version) |
|
|
|
| **Data quality** | Purview DQ scores (Tier 1) + DLT expectation metrics (Tier 2–3) → unified DQ dashboard (Power BI) | Quality scores by domain, source system, data product; quarantine volumes; SLA compliance trends; CDE completeness tracking |
|
|
|
| **Security / SIEM** | Microsoft Sentinel connected to Log Analytics | Security events aggregated from all services; anomalous data access patterns; DLP alerts; Key Vault access audit; failed authentication attempts; service principal credential anomalies |
|
|
|
| **Cost** | Azure Cost Management + Databricks Account Console | Subscription-level cost by resource group and tag; Databricks DBU consumption by workspace, cluster type, job, and user; Fabric CU consumption; storage growth trends |
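To illustrate how the Databricks cost view is derived, a hedged sketch that aggregates `system.billing.usage`-style rows by workspace (column names simplified and illustrative; the real system table has more dimensions):

```python
from collections import defaultdict

# Simplified rows in the shape of system.billing.usage
usage = [
    {"workspace": "dbw-data-eng-prod", "sku": "JOBS", "dbus": 120.0},
    {"workspace": "dbw-data-eng-prod", "sku": "SQL",  "dbus": 80.0},
    {"workspace": "dbw-ds-prod",       "sku": "JOBS", "dbus": 40.0},
]

def dbus_by_workspace(rows):
    """Sum DBU consumption per workspace for a chargeback roll-up."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["workspace"]] += r["dbus"]
    return dict(totals)

print(dbus_by_workspace(usage))
```

In production the same aggregation runs as a SQL query against the system tables and feeds the FinOps dashboard; this sketch only shows the shape of the roll-up.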
|
|
|
|
|
|
### 8.2 Key Dashboards
|
|
|
|
|
|
| Dashboard | Audience | Content |
|
|
|
|---|---|---|
|
|
|
| **Platform Health** | Platform engineering team | Cluster availability, pipeline success rates, SQL Warehouse queue times, storage latency, private endpoint health, AKS node status |
|
|
|
| **Data Quality Governance** | CDO office, data stewards | Unified DQ scores (all three tiers), quarantine volumes, SLA compliance by data product, CDE completeness trends, steward action items |
|
|
|
| **FinOps & Cost** | FinOps team, CDO office | DBU spend by team/domain, Fabric CU utilization, storage growth, reserved vs. on-demand ratio, cost anomaly detection, budget burn rate |
|
|
|
| **Security & Compliance** | Security team, CISO office | Sentinel incident summary, data access audit summary, DLP alert trends, service principal activity, Key Vault operations |
|
|
|
|
|
|
### 8.3 Alerting Strategy
|
|
|
|
|
|
| Alert Category | Trigger | Severity | Action |
|
|
|
|---|---|---|---|
|
|
|
| **Pipeline failure** | Production ingestion or DLT pipeline fails | Sev 1 (critical domains: Customer 360, Financial Aggregates, Risk Features) / Sev 2 (others) | Page on-call data engineer; auto-retry with exponential backoff (3 retries); escalate to lead if all retries fail |
|
|
|
| **Data quality SLA breach** | Gold data product fails freshness SLA (e.g., T+1 by 06:00 EST not met) or completeness threshold (<99.5% on CDEs) | Sev 1 | Notify data product owner + steward; defer Gold refresh; CDO office escalation if unresolved within 4 hours |
|
|
|
| **Security anomaly** | Unexpected SP data access; bulk data download; Key Vault access from unknown identity | Sev 1 | Sentinel incident auto-created; SOC investigation; auto-block identity if high-confidence threat |
|
|
|
| **Cluster over-provisioning** | Databricks cluster idle >30 min or CPU utilization <20% for sustained period | Sev 3 | Notify FinOps team; recommend right-sizing; auto-terminate if policy allows |
|
|
|
| **Fabric capacity saturation** | CU utilization >90% sustained for 1 hour | Sev 2 | Notify Fabric admin; evaluate workload scheduling or capacity burst |
|
|
|
| **Storage anomaly** | Storage account growth rate exceeds 2× historical 7-day baseline | Sev 3 | Notify data engineering lead; investigate potential staging bloat or data duplication |
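The storage-anomaly rule in the last row can be sketched as follows (the 2× factor comes from the table; the function name and GB units are illustrative):

```python
def storage_growth_anomalous(daily_growth_gb, baseline_7d_gb, factor=2.0):
    """True if today's growth exceeds factor x the 7-day average baseline."""
    if not baseline_7d_gb:
        return False  # no baseline yet; cannot judge anomaly
    baseline = sum(baseline_7d_gb) / len(baseline_7d_gb)
    return daily_growth_gb > factor * baseline
```

In Azure this check is typically expressed as a scheduled Log Analytics alert rule over storage capacity metrics rather than application code; the sketch only makes the threshold logic explicit.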
|
|
|
|
|
|
---
|
|
|
|
|
|
## 9. Disaster Recovery & Business Continuity
|
|
|
|
|
|
### 9.1 Recovery Objectives
|
|
|
|
|
|
| Tier | Workloads | RPO | RTO | Strategy |
|
|
|
|---|---|---|---|---|
|
|
|
| **Tier 1 — Critical** | Gold data products (Customer 360, Financial Aggregates, Risk Features), regulatory reporting pipelines, active ML model serving endpoints | ≤ 1 hour | ≤ 4 hours | GRS/RA-GRS storage replication to Canada East; Databricks workspace deployable via IaC in <2 hours; pre-provisioned standby networking in Canada East; SQL Warehouse endpoints re-creatable from IaC |
|
|
|
| **Tier 2 — Important** | Silver layer, ingestion pipelines (ADF + Auto Loader), SAS Viya actuarial models, Feature Store | ≤ 4 hours | ≤ 8 hours | GRS storage replication; IaC-based compute rebuild in secondary region; SAS model code in version control (Git); ADF pipeline definitions exported and version-controlled |
|
|
|
| **Tier 3 — Standard** | Bronze layer, development/sandbox environments, non-production workloads | ≤ 24 hours | ≤ 24 hours | GRS storage (async replication); rebuild from IaC; accept data loss up to last successful replication point |
|
|
|
|
|
|
### 9.2 Storage Replication
|
|
|
|
|
|
All production ADLS Gen2 accounts are configured with **Geo-Redundant Storage (GRS)**, replicating data asynchronously to Canada East. For the Tier 1 storage account (`stadlsgoldprod`), **Read-Access GRS (RA-GRS)** is enabled, allowing read access from the secondary region during a regional outage without waiting for Microsoft-initiated failover.
|
|
|
|
|
|
Delta Lake's transaction log (`_delta_log/`) is replicated alongside the data files as part of GRS. Because GRS replication is asynchronous and does not guarantee ordering across blobs, a table whose replication was mid-flight during the outage may have commit entries that reference data files not yet present in the secondary region. After failover, Tier 1 Delta tables are therefore validated (the latest fully replicated commit is identified, and the table is restored to that version if necessary) before being declared recovered.
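One pragmatic post-failover sanity check is to confirm that the replicated `_delta_log/` is contiguous up to its highest commit. A hedged sketch (Delta commit files are the zero-padded `<version>.json` entries in the log directory; the helper is illustrative and ignores checkpoints):

```python
import os
import re

COMMIT_RE = re.compile(r"^(\d{20})\.json$")  # Delta commit file pattern

def last_contiguous_commit(delta_log_dir: str) -> int:
    """Highest commit version v such that versions 0..v are all present.
    Returns -1 if no commit files exist."""
    versions = sorted(
        int(m.group(1))
        for name in os.listdir(delta_log_dir)
        if (m := COMMIT_RE.match(name))
    )
    last = -1
    for v in versions:
        if v == last + 1:
            last = v
        else:
            break  # gap found: later commits replicated ahead of earlier ones
    return last
```

A gap (e.g., versions 0, 1, 3 present but 2 missing) indicates replication lag at failover time; the table would be restored to the last contiguous version.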
|
|
|
|
|
|
### 9.3 Compute Recovery
|
|
|
|
|
|
All infrastructure is defined as code (Terraform — see Section 11). In a DR scenario:
|
|
|
|
|
|
1. **Networking:** Standby hub-spoke VNets in Canada East are pre-provisioned (warm standby) with peering, NSGs, and route tables ready. ExpressRoute circuit has a secondary connection to Canada East.
|
|
|
2. **Databricks:** Workspace configuration, cluster policies, Unity Catalog metastore settings, and secret scope bindings are all in Terraform. A `terraform apply` targeting the Canada East module deploys a functional workspace in ~60–90 minutes.
|
|
|
3. **SAS Viya:** AKS cluster definition is in Terraform; SAS container images are in Azure Container Registry (geo-replicated to Canada East). Deployment time ~2–3 hours including SAS configuration restore.
|
|
|
4. **Fabric:** Fabric capacity is region-specific. A standby F-SKU in Canada East can be provisioned on-demand (manual, ~30 minutes). Power BI semantic models would need to be re-pointed to the secondary ADLS endpoints.
|
|
|
|
|
|
### 9.4 Backup Strategy
|
|
|
|
|
|
| Component | Backup Mechanism | Retention |
|
|
|
|---|---|---|
|
|
|
| **Delta Lake data** | Delta Time Travel (point-in-time recovery within retention window) + GRS replication | 90 days Bronze, 30 days Silver/Gold |
|
|
|
| **Azure Key Vault** | Soft delete (90 days) + purge protection; secrets/keys recoverable even after deletion | 90 days |
|
|
|
| **Unity Catalog metastore** | Databricks account-level configuration backup + IaC definitions | Recoverable from IaC |
|
|
|
| **Purview configuration** | Glossary, classification schemas, and policy definitions exported periodically to version control (JSON export) | Git history (indefinite) |
|
|
|
| **Pipeline definitions** | ADF ARM templates in Git; Databricks notebooks in Git (Repos); DLT pipeline code in Git | Git history (indefinite) |
|
|
|
| **SAS code & models** | SAS programs, macros, model code in Git; SAS Model Manager metadata exported periodically | Git history (indefinite) |
|
|
|
|
|
|
**DR testing cadence:** Tabletop exercise quarterly; partial failover test (storage failover + compute redeploy for one domain) semi-annually; full DR simulation annually.
|
|
|
|
|
|
---
|
|
|
|
|
|
## 10. Cost Management & FinOps
|
|
|
|
|
|
### 10.1 Cost Allocation Architecture
|
|
|
|
|
|
Cost visibility is built into the infrastructure from day one through mandatory tagging (enforced by Azure Policy), subscription-level isolation, and platform-native cost tools.
|
|
|
|
|
|
**Mandatory tags (enforced via Azure Policy — deny if missing):**
|
|
|
|
|
|
| Tag Key | Purpose | Example Values |
|
|
|
|---|---|---|
|
|
|
| `Environment` | Environment segregation | `prod`, `nonprod`, `sandbox`, `dr` |
|
|
|
| `CostCenter` | Chargeback to business unit | `CC-CDO-1234`, `CC-INS-5678`, `CC-RISK-9012` |
|
|
|
| `Platform` | Technology pillar identification | `databricks`, `fabric`, `sas`, `governance`, `shared` |
|
|
|
| `Owner` | Operational contact | `team-data-eng@Greenfield.com` |
|
|
|
| `DataDomain` | Domain-level cost attribution | `customer`, `claims`, `risk`, `finance`, `shared` |
|
|
|
| `DataClassification` | Align cost with sensitivity tier | `public`, `internal`, `confidential`, `restricted` |
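The deny-if-missing behavior of the mandatory-tag policy can be sketched as a simple predicate (the tag keys come from the table above; the Azure Policy effect is simplified to a boolean, and the helper names are illustrative):

```python
REQUIRED_TAGS = {
    "Environment", "CostCenter", "Platform",
    "Owner", "DataDomain", "DataClassification",
}

def deployment_allowed(resource_tags: dict) -> bool:
    """Mimics the Azure Policy 'deny' effect: every required tag present."""
    return REQUIRED_TAGS.issubset(resource_tags)

def missing_tags(resource_tags: dict) -> set:
    """Which mandatory tags a resource is missing (for the denial message)."""
    return REQUIRED_TAGS - resource_tags.keys()
```

The real enforcement is an Azure Policy assignment at the management-group level, so the check runs before any resource is created, not after.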
|
|
|
|
|
|
**Recommended tags (enforced via Azure Policy — audit mode):**
|
|
|
|
|
|
| Tag Key | Purpose | Example Values |
|
|
|
|---|---|---|
|
|
|
| `ManagedBy` | IaC tool tracking | `terraform`, `manual`, `bicep` |
|
|
|
| `Horizon` | Implementation phase tracking | `h1`, `h2`, `h3` |
|
|
|
| `ExpiryDate` | Temporary resource cleanup | `2026-09-30` (for POCs, sandboxes) |
|
|
|
|
|
|
### 10.2 Budget Controls
|
|
|
|
|
|
**Subscription-level budgets:** Set for each subscription with alerts at 50%, 75%, 90%, and 100% of monthly budget. Breaches above 100% trigger automatic notification to CDO office and FinOps team.
|
|
|
|
|
|
**Databricks cost controls:**
|
|
|
- Cluster policies enforce max cluster sizes, auto-termination, and spot instance ratios
|
|
|
- Databricks Account Console provides DBU consumption dashboards by workspace, cluster type, job, and user
|
|
|
- Databricks budgets feature (if available) configured per workspace with alerts
|
|
|
- SQL Warehouses configured with economy scaling to minimize DBU for BI serving workload
|
|
|
|
|
|
**Fabric capacity governance:**
|
|
|
- Capacity sized strictly for BI serving only (AD-03)
|
|
|
- Auto-pause enabled (22:00–06:00 EST) — saves ~33% of capacity cost
|
|
|
- Smoothing enabled for burst absorption without over-provisioning
|
|
|
- Any request to increase capacity for non-BI workloads triggers architecture review
|
|
|
- Monthly Fabric utilization review by FinOps + Fabric admin
|
|
|
|
|
|
**Storage cost controls:**
|
|
|
- Lifecycle policies (Section 6.3) auto-tier aged data to lower-cost tiers
|
|
|
- Storage growth alerts (Section 8.3) detect anomalous growth
|
|
|
- Monthly review of staging zone sizes to prevent bloat
|
|
|
|
|
|
### 10.3 Reserved Capacity & Savings Plans
|
|
|
|
|
|
| Resource | Commitment Type | Term | Estimated Savings |
|
|
|
|---|---|---|---|
|
|
|
| Databricks DBU | Databricks commit contract (purchased as Databricks Commit Units, DBCU) | 1-year (based on 6-month usage baseline after Horizon 1 stabilization) | 20–35% vs. pay-as-you-go |
|
|
|
| Azure VMs (SAS Viya AKS nodes) | Azure Savings Plan for Compute | 1-year | 15–25% vs. pay-as-you-go |
|
|
|
| ADLS Gen2 Storage | Azure Reserved Capacity (hot tier) | 1-year (based on projected growth model) | Up to 30% on hot tier |
|
|
|
| Fabric Capacity | Fabric reservation (evaluate availability) | Evaluate after 6 months of usage data | TBD at GA pricing |
|
|
|
| ExpressRoute | ExpressRoute circuit commitment | Already committed (enterprise) | N/A — existing circuit |
|
|
|
|
|
|
### 10.4 Chargeback Model
|
|
|
|
|
|
Cost is allocated to business units via `CostCenter` tags. The CDO's FinOps practice produces monthly chargeback reports:
|
|
|
|
|
|
- **Direct costs** (Databricks DBU, Fabric CU, SAS compute) → allocated to the business unit whose workloads consumed them
|
|
|
- **Shared costs** (storage, networking, governance, monitoring) → allocated proportionally by data volume and query consumption per domain
|
|
|
- **Platform overhead** (hub networking, Key Vault, Purview, Manta) → allocated as a shared infrastructure charge across all consuming business units
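The proportional split for shared costs can be sketched as follows (the combined usage metric and the even-split fallback are illustrative assumptions, not a prescribed formula):

```python
def allocate_shared_cost(shared_cost, usage_by_domain):
    """Split a shared monthly cost across domains in proportion to a
    combined usage metric (e.g. data volume + query consumption)."""
    total = sum(usage_by_domain.values())
    if total == 0:
        # Fallback: even split when no usage was recorded this period
        n = len(usage_by_domain)
        return {d: shared_cost / n for d in usage_by_domain}
    return {d: shared_cost * u / total for d, u in usage_by_domain.items()}
```

For example, a domain responsible for three quarters of the combined usage metric absorbs three quarters of the shared storage and networking charge.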
|
|
|
|
|
|
---
|
|
|
|
|
|
## 11. DevOps & Infrastructure as Code
|
|
|
|
|
|
### 11.1 Terraform — Primary IaC Tool
|
|
|
|
|
|
All Azure infrastructure is managed through Terraform. Resources are organized as reusable modules stored in a central Git repository with a clear module hierarchy.
|
|
|
|
|
|
| Terraform Module | Scope | State Backend |
|
|
|
|---|---|---|
|
|
|
| `terraform-module-networking` | Hub-spoke VNets, subnets, NSGs, route tables, Private DNS Zones, Azure Firewall, ExpressRoute, VNet peering | Remote state in Azure Storage (`sub-data-management`); state locking via blob lease |
|
|
|
| `terraform-module-databricks` | Databricks workspaces, cluster policies, Unity Catalog metastore, secret scopes, workspace configuration, IP access lists | Remote state per environment (prod/nonprod); separate state file per workspace |
|
|
|
| `terraform-module-storage` | ADLS Gen2 accounts, containers, lifecycle policies, private endpoints, RBAC role assignments, CMK configuration | Remote state per environment |
|
|
|
| `terraform-module-governance` | Purview account, Purview collections, scan rules, Purview private endpoints, diagnostic settings | Remote state (governance) |
|
|
|
| `terraform-module-keyvault` | Key Vault instances, access policies, private endpoints, diagnostic settings, purge protection | Remote state (security) |
|
|
|
| `terraform-module-sas` | AKS cluster, node pools, AKS networking, SAS service principal, SAS storage | Remote state (SAS) |
|
|
|
| `terraform-module-monitoring` | Log Analytics workspace, diagnostic settings (applied to all resources), alert rules, action groups, Sentinel workspace | Remote state (monitoring) |
|
|
|
| `terraform-module-fabric` | Fabric capacity resource, Fabric admin configuration | Remote state (fabric) |
|
|
|
| `terraform-module-policy` | Azure Policy definitions and assignments at management group level (required tags, allowed regions, deny public endpoints) | Remote state (governance) |
|
|
|
|
|
|
**State management:** All Terraform state files are stored in a dedicated storage account in `sub-data-management` with: blob lease-based state locking (prevents concurrent applies), versioning enabled (state history for rollback), encryption at rest (MMK), access restricted to CI/CD service principal only.
|
|
|
|
|
|
### 11.2 CI/CD Pipeline Design
|
|
|
|
|
|
Pipelines run in Azure DevOps (Greenfield standard). Self-hosted agents run in `vnet-mgmt-cc` to access private endpoints. The pipeline follows a 4-stage model:
|
|
|
|
|
|
**Stage 1 — Validate (on every Pull Request):**
|
|
|
- `terraform fmt -check` (formatting compliance)
|
|
|
- `terraform validate` (syntax and configuration validation)
|
|
|
- `tflint` (Terraform linting for best practices)
|
|
|
- `tfsec` / `Checkov` security scanning: validates that all resources comply with security policies (public endpoints disabled, encryption enabled, required tags present, no unmanaged secrets, private endpoints configured)
|
|
|
- Results posted as PR comment; PR blocked if critical findings
|
|
|
|
|
|
**Stage 2 — Plan (on PR approval):**
|
|
|
- `terraform plan` generates execution plan against target environment
|
|
|
- Plan output posted to PR as a formatted comment for human review
|
|
|
- No changes applied — review gate ensures human approval of all changes
|
|
|
|
|
|
**Stage 3 — Apply Non-Production (on merge to `develop` branch):**
|
|
|
- `terraform apply` deploys changes to non-production environment
|
|
|
- Automated smoke tests: verify VNet connectivity, private endpoint DNS resolution, Databricks workspace health, Key Vault accessibility, storage account reachability
|
|
|
- If smoke tests fail, automatic rollback by re-applying the last known-good configuration (`terraform apply` against the previous Git revision)
|
|
|
|
|
|
**Stage 4 — Apply Production (on merge to `main` branch + manual approval):**
|
|
|
- Manual approval gate (requires two approvals from platform engineering leads)
|
|
|
- `terraform apply` deploys to production
|
|
|
- Post-deployment validation: resource health checks, Databricks workspace connectivity, pipeline reachability, monitoring integration verification
|
|
|
- Deployment tagged in Git with version number and timestamp
|
|
|
|
|
|
### 11.3 Data Engineering CI/CD (Separate from Infrastructure)
|
|
|
|
|
|
Data engineering artifacts follow their own CI/CD lifecycle, independent of infrastructure IaC:
|
|
|
|
|
|
| Artifact Type | Version Control | CI/CD Tool | Deployment Method |
|
|
|
|---|---|---|---|
|
|
|
| DLT pipeline definitions | Databricks Repos (Git-backed) | Azure DevOps pipeline | Databricks Asset Bundles (DABs) — `databricks bundle deploy` to prod workspace |
|
|
|
| Databricks notebooks | Databricks Repos | Azure DevOps | DABs or Repos-based deployment on merge to main |
|
|
|
| dbt models (if used) | Git repository | Azure DevOps | `dbt run` executed by Databricks Workflow job |
|
|
|
| ADF pipelines | ARM template export in Git | Azure DevOps | ARM template deployment to ADF instance |
|
|
|
| Great Expectations suites | Git repository | Azure DevOps | Deployed as part of pipeline package |
|
|
|
| SAS programs & macros | Git repository (external to SAS) | Azure DevOps | File deployment to SAS Compute Server shared filesystem; SAS Model Manager registration |
|
|
|
| Power BI semantic models | Power BI Desktop files in Git | Azure DevOps + Fabric REST API | Automated deployment via Fabric deployment pipelines (dev → test → prod) |
|
|
|
|
|
|
**Branching strategy:** Gitflow — feature branches → `develop` (integration) → `release` (staging validation) → `main` (production). Hotfixes branch from `main` and merge back to both `main` and `develop`.
|
|
|
|
|
|
**Testing strategy for data pipelines:**
|
|
|
- **Unit tests:** DLT expectations serve as row-level quality tests; Great Expectations suites validate cross-table invariants
|
|
|
- **Integration tests:** End-to-end pipeline run in non-prod workspace against sample datasets (masked production data); validates Bronze → Silver → Gold flow, quality gates, and data contract compliance
|
|
|
- **Regression tests:** Compare Gold output schema and row counts against baseline after pipeline changes
|
|
|
- **Performance tests:** Benchmark pipeline duration and DBU consumption against historical baselines; flag regressions exceeding 20% threshold
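The regression and performance checks can be sketched together (the 20% threshold comes from the list above; the run-record structure and function name are illustrative):

```python
def regression_check(baseline, current, perf_threshold=0.20):
    """Compare a pipeline run against its recorded baseline.
    Returns a list of human-readable failures (empty list = pass)."""
    failures = []
    if current["schema"] != baseline["schema"]:
        failures.append("schema drift in Gold output")
    if current["row_count"] < baseline["row_count"]:
        failures.append("row count dropped below baseline")
    if current["duration_s"] > baseline["duration_s"] * (1 + perf_threshold):
        failures.append("duration regression > 20%")
    return failures
```

In the CI/CD pipeline, a non-empty failure list blocks promotion to `main` and surfaces the findings on the pull request.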
|
|
|
|
|
|
---
|
|
|
|
|
|
## Appendix A — Resource Naming Convention
|
|
|
|
|
|
General pattern: `{type}-{workload}-{environment}-{region}`
|
|
|
|
|
|
| Resource Type | Prefix | Example |
|
|
|
|---|---|---|
|
|
|
| Resource Group | `rg-` | `rg-databricks-prod-cc` |
|
|
|
| Virtual Network | `vnet-` | `vnet-data-prod-cc` |
|
|
|
| Subnet | `snet-` | `snet-dbx-host-prod` |
|
|
|
| NSG | `nsg-` | `nsg-dbx-host-prod` |
|
|
|
| Storage Account (ADLS) | `stadls` | `stadlsgoldprod` (no hyphens — Azure restriction) |
|
|
|
| Key Vault | `kv-` | `kv-data-platform-prod` |
|
|
|
| Databricks Workspace | `dbw-` | `dbw-data-eng-prod` |
|
|
|
| SQL Warehouse | `sqlwh-` | `sqlwh-bi-serving` |
|
|
|
| Log Analytics | `law-` | `law-data-platform-prod` |
|
|
|
| Azure Data Factory | `adf-` | `adf-data-platform-prod` |
|
|
|
| AKS Cluster | `aks-` | `aks-sas-viya-prod` |
|
|
|
| Private Endpoint | `pe-` | `pe-stadlsgoldprod-dfs` |
|
|
|
| Managed Identity | `id-` | `id-databricks-prod` |
|
|
|
| Service Principal | `sp-` | `sp-sas-compute-prod` |
|
|
|
| Purview Account | `pv-` | `pv-data-governance-prod` |
|
|
|
| Fabric Capacity | `fc-` | `fc-bi-serving-prod` |
|
|
|
| Azure Firewall | `afw-` | `afw-hub-canadacentral` |
|
|
|
|
|
|
Region abbreviations: `cc` = Canada Central, `ce` = Canada East.
|
|
|
|
|
|
---
|
|
|
|
|
|
This covers the full Azure infrastructure deployment design for the Greenfield Modern Data Platform.
|