Databricks-Primary Architecture — Implementation Plan
Greenfield Modern Data Platform (MDP) v8.0
Scope: Full platform (end-to-end) | Timeline: 6 months | Starting point: Greenfield
Date: March 2026 | Classification: Internal — Draft
1. Target Architecture Summary
Technology Pillars
| Component | Technology | Role |
|---|---|---|
| Data Engineering & Warehousing | Databricks on Azure | Primary compute, medallion architecture (Bronze/Silver/Gold), Unity Catalog |
| BI & Reporting | Microsoft Fabric (Power BI) | Semantic models, Direct Lake via Mirrored Databricks Catalog |
| Actuarial Analytics | SAS Viya on Azure | Regulated actuarial workloads, consumes Gold-layer data directly |
| Storage | Azure Data Lake Storage Gen2 (ADLS) | OneLake-aligned storage, managed by Unity Catalog external locations |
| Governance | Purview + Unity Catalog + Manta | Catalog, lineage, sensitivity labels, SAS lineage |
| Orchestration | Databricks Workflows + Azure Data Factory | Pipeline orchestration, cross-platform triggers |
| Security | Azure AD (Entra ID) + Private Link + NSGs | Identity, network isolation, data exfiltration protection |
| Infrastructure as Code | Terraform + Databricks Asset Bundles (DABs) | Repeatable, auditable deployments |
Network Topology
┌─────────────────────────────────────────────────────────────────┐
│ Azure Landing Zone │
│ │
│ ┌──────────────┐ ┌──────────────────────────────────────┐ │
│ │ Transit VNet │ │ Databricks VNet (Injected) │ │
│ │ │ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ Front-end │──▶│ │ Host Subnet│ │Container Subnet│ │ │
│ │ Private │ │ └────────────┘ └────────────────┘ │ │
│ │ Endpoint │ │ │ │ │ │
│ └──────────────┘ │ ▼ ▼ │ │
│ │ │ NAT Gateway (stable egress IPs) │ │
│ │ └──────────────────────────────────────┘ │
│ │ │
│ ┌──────▼──────────────────────────────────────────────────┐ │
│ │ Private Endpoints │ │
│ │ ADLS Gen2 │ Key Vault │ Purview │ SQL (SAS) │ Fabric │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Regions: Canada Central (prod) │ Canada East (DR) │
└─────────────────────────────────────────────────────────────────┘
Medallion Architecture
Sources ──▶ [Bronze] ──▶ [Silver] ──▶ [Gold] ──▶ Consumers
               │            │           │
               Raw ingest   Cleansed,   Business-      ┌─ Power BI (Direct Lake)
               CDC / ELT    conformed   ready          ├─ Databricks Notebooks
               Delta tables Delta       Metric Views   ├─ SAS Viya (Gold tables)
               Schema-on-   tables      Aggregates     ├─ AI/ML Model Serving
               read                                    └─ APIs / Data Sharing
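The Bronze/Silver/Gold layers map onto the ADLS container layout defined in Phase 0 (bronze/, silver/, gold/). A minimal sketch of that path convention — the storage account name `greenfieldmdp`, the domain/table path structure, and the helper itself are illustrative assumptions, not fixed names:

```python
# Sketch: map medallion layers to the ADLS Gen2 container layout
# (bronze/, silver/, gold/). The account name "greenfieldmdp" and the
# domain/table path convention are assumptions for illustration.

LAYERS = {"bronze", "silver", "gold"}

def layer_path(layer: str, domain: str, table: str,
               account: str = "greenfieldmdp") -> str:
    """Build an abfss:// path for a table in a given medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return (f"abfss://{layer}@{account}.dfs.core.windows.net/"
            f"{domain}/{table}")

print(layer_path("bronze", "claims", "claim_events"))
# abfss://bronze@greenfieldmdp.dfs.core.windows.net/claims/claim_events
```

Unity Catalog external locations (Phase 0) would then be registered one per container, so grants follow the layer boundaries.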
2. Phased Implementation — 6 Months
Phase 0 — Foundation (Weeks 1–3)
Objective: Establish the Azure landing zone, network security, and Databricks workspace with Unity Catalog. Nothing runs on the platform yet — this is pure infrastructure.
| Deliverable | Details |
|---|---|
| Azure subscription & resource group design | Prod subscription (Canada Central), DR subscription (Canada East), DevTest subscription. Follow Greenfield's existing CAF landing zone patterns. |
| Networking | VNet injection for Databricks (host + container subnets, /24 each). Transit VNet for front-end Private Link. NAT Gateway for stable egress. NSGs restricting inter-subnet traffic. Private Endpoints for ADLS, Key Vault, Purview, SQL endpoints. |
| Databricks Account & Workspace | Account-level setup in Azure. Unity Catalog metastore (one per region — Canada Central). Account-level SCIM provisioning from Entra ID. Premium tier (required for Unity Catalog, Private Link, IP access lists). |
| Unity Catalog bootstrap | Create the metastore and bind to workspace. Configure root storage credential (managed identity on ADLS). Create initial catalog structure: raw, curated, analytics, sandbox. External locations for existing ADLS containers. |
| ADLS Gen2 storage | Storage account with HNS enabled, private endpoints, customer-managed keys (CMK). Container layout: bronze/, silver/, gold/, landing/, archive/. |
| Key Vault | Secrets for service connections (SAS, ADF, external sources). Databricks secret scopes backed by Azure Key Vault. |
| Terraform / IaC | All resources deployed via Terraform modules. Databricks provider for workspace config. State stored in Azure Storage with state locking. CI/CD pipeline (Azure DevOps or GitHub Actions) for plan/apply. |
| IP access lists & conditional access | Restrict workspace access to Greenfield corporate network. Conditional Access policies in Entra ID for MFA on Databricks. |
Exit criteria: Workspace accessible via Private Link from corporate network. Unity Catalog metastore operational. Terraform state clean. Security review sign-off.
Team: 2 cloud infrastructure engineers, 1 Databricks platform admin, 1 security architect.
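The Unity Catalog bootstrap above implies a three-level naming convention (catalog.schema.table) over the four initial catalogs. A small sketch of that convention — the identifier rules and the helper are assumptions for illustration, not a Databricks API:

```python
# Sketch of the three-level Unity Catalog naming convention implied by the
# bootstrap step (catalogs: raw, curated, analytics, sandbox). The
# lowercase/underscore identifier rule is an assumed convention.
import re

CATALOGS = {"raw", "curated", "analytics", "sandbox"}
_IDENT = re.compile(r"^[a-z][a-z0-9_]*$")

def uc_name(catalog: str, schema: str, table: str) -> str:
    """Return a validated catalog.schema.table name."""
    if catalog not in CATALOGS:
        raise ValueError(f"catalog must be one of {sorted(CATALOGS)}")
    for part in (schema, table):
        if not _IDENT.match(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"

print(uc_name("curated", "claims", "policy_dim"))  # curated.claims.policy_dim
```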
Phase 1 — Data Engineering Core (Weeks 4–8)
Objective: Build the Bronze and Silver layers. Ingest the first 3–5 priority data domains. Establish the medallion pattern as a repeatable template.
| Deliverable | Details |
|---|---|
| Ingestion framework | Parameterized Auto Loader (Structured Streaming) for file-based sources (CSV, JSON, Parquet). CDC ingestion via Databricks Delta Live Tables (DLT) for database sources (SQL Server, Oracle, DB2). ADF pipelines for sources requiring SSIS-style orchestration or on-prem gateway. |
| Bronze layer | Raw ingestion into Delta tables with append-only semantics. Schema evolution enabled (mergeSchema). Ingestion metadata columns: _ingest_ts, _source_file, _batch_id. Partition strategy aligned with query patterns (typically date-based). |
| Silver layer | DLT pipelines for cleansing, deduplication, type casting, null handling. Slowly Changing Dimension (SCD) Type 2 for reference data. Quality expectations defined in DLT (EXPECT, EXPECT OR DROP, EXPECT OR FAIL). |
| Unity Catalog governance | Table ownership assigned to data stewards. Column-level tags for PII, sensitivity classification. Grants: SELECT on Silver for analysts, ALL PRIVILEGES on Bronze for engineers only. |
| Databricks Workflows | Orchestration DAGs for ingestion → Bronze → Silver. Alerting on failure (email + ServiceNow integration). SLA monitoring for freshness. |
| Cluster policies | Standardized cluster configurations: job clusters (auto-termination, spot instances), interactive clusters (per-team, size-capped). Single-node clusters for development. Serverless SQL Warehouse for ad-hoc queries. |
| DABs (Databricks Asset Bundles) | All pipeline code, DLT definitions, and workflow configs packaged as DABs. CI/CD: PR → lint → unit test → deploy to dev → integration test → deploy to prod. |
| Priority data domains (3–5) | Select based on business value and regulatory priority. Suggested: Customer 360, Transactions, Products, Claims (insurance), Policies. |
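The Bronze metadata convention in the table above (_ingest_ts, _source_file, _batch_id) can be sketched in pure Python. In the real pipeline these columns are added inside the Auto Loader stream (e.g. via `withColumn`); the `stamp_batch` helper below is illustrative only:

```python
# Pure-Python sketch of the Bronze metadata convention: every ingested
# record is stamped with _ingest_ts, _source_file and _batch_id before
# landing in the append-only Bronze table. Illustrative, not the pipeline.
from datetime import datetime, timezone

def stamp_batch(records, source_file: str, batch_id: str):
    """Return copies of records with Bronze ingestion metadata columns."""
    ts = datetime.now(timezone.utc).isoformat()
    return [
        {**r, "_ingest_ts": ts, "_source_file": source_file, "_batch_id": batch_id}
        for r in records
    ]

rows = stamp_batch([{"claim_id": 1}], "landing/claims/2026-03-01.json", "b-0001")
assert rows[0]["_batch_id"] == "b-0001" and rows[0]["claim_id"] == 1
```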
Exit criteria: Bronze and Silver tables populated for priority domains. DLT pipelines running on schedule. Data quality expectations enforced. CI/CD pipeline operational for DABs.
Team: 3 data engineers, 1 Databricks platform admin, 1 data steward, 1 DevOps engineer.
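The three DLT expectation modes named in the Silver-layer row (EXPECT warns and keeps rows, EXPECT OR DROP filters violators, EXPECT OR FAIL aborts the update) can be modeled over plain dicts. This is a behavioral sketch, not the DLT API:

```python
# Illustrative model of DLT expectation semantics over plain Python rows.
def apply_expectation(rows, predicate, mode="expect"):
    """Return (kept_rows, violation_count) under a DLT-style expectation."""
    violations = [r for r in rows if not predicate(r)]
    if mode == "expect":               # record violations, keep all rows
        return rows, len(violations)
    if mode == "expect_or_drop":       # drop violating rows
        return [r for r in rows if predicate(r)], len(violations)
    if mode == "expect_or_fail":       # abort the update on any violation
        if violations:
            raise ValueError(f"{len(violations)} rows violated expectation")
        return rows, 0
    raise ValueError(f"unknown mode: {mode}")

data = [{"premium": 100}, {"premium": None}]
kept, bad = apply_expectation(data, lambda r: r["premium"] is not None,
                              mode="expect_or_drop")
assert kept == [{"premium": 100}] and bad == 1
```

In DLT itself these are declared as decorators/constraints on the pipeline tables; the choice of mode per rule is a data-steward decision made alongside the quality expectations above.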
Phase 2 — Gold Layer, Governance & SAS Viya Integration (Weeks 9–14)
Objective: Build the Gold layer with business-ready datasets. Connect SAS Viya to Gold tables. Establish full governance with Purview and Manta.
| Deliverable | Details |
|---|---|
| Gold layer | Dimensional models (star schemas) materialized as Delta tables. Aggregate tables for high-frequency BI queries. Customer 360 unified view. Reference data (MDM) as managed tables in UC. |
| Unity Catalog Metric Views | Define business metrics (revenue, loss ratio, CLV, etc.) as YAML-based Metric Views in UC. Version-control via DABs in Git. Metrics queryable via SQL by all consumers. |
| SAS Viya integration | SAS Viya configured to read Gold-layer Delta tables via ADLS connector or Databricks SQL endpoint (ODBC). SAS libraries mapped to Gold catalog schemas. Network path: SAS VNet peered to Databricks VNet via Private Endpoints. Validate actuarial models run correctly against Gold data. |
| Microsoft Purview | Register Unity Catalog as a data source in Purview. Automated scanning: metadata, schema, lineage from UC. Sensitivity labels applied to Gold-layer assets (PII, PHI, financial). Data Product definitions for governed datasets. |
| IBM Manta (lineage) | Manta configured to parse SAS programs (DATA steps, PROC SQL). Lineage stitching: SAS → Databricks → Power BI end-to-end. Manta connectors for Databricks UC and Fabric Power BI. |
| Data quality (Purview DQ) | DQ rules on Gold-layer tables: completeness, uniqueness, conformity, accuracy. Incremental scans scheduled post-refresh. Error-record publishing for remediation. Cross-dataset referential integrity checks via Databricks notebooks (Purview cannot do this natively). |
| Access control hardening | Row-level security (RLS) on sensitive Gold tables via dynamic views in UC. Column masking for PII fields (SSN, email, phone) via UC masking policies. Audit logging enabled (UC audit log → Azure Monitor → SIEM). |
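The column-masking behavior described above is enforced by Unity Catalog masking policies and dynamic views; the plain-Python sketch below only illustrates the intended output shapes for two of the listed PII fields (SSN, email). The specific masking formats are assumptions:

```python
# Sketch of the intended masking output for PII columns. Actual enforcement
# is a UC masking policy / dynamic view; formats below are assumptions.
def mask_ssn(ssn: str) -> str:
    """Keep only the last 4 digits: '123-45-6789' -> '***-**-6789'."""
    return "***-**-" + ssn[-4:]

def mask_email(email: str) -> str:
    """Keep first character and domain: 'jdoe@corp.com' -> 'j***@corp.com'."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

assert mask_ssn("123-45-6789") == "***-**-6789"
assert mask_email("jdoe@corp.com") == "j***@corp.com"
```

Defining the target output shapes up front lets the audit team verify the UC policies against a fixed specification rather than ad-hoc spot checks.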
Exit criteria: Gold layer populated for priority domains. SAS Viya actuarial models validated against Gold data. Purview scanning UC metadata. Manta producing end-to-end lineage. DQ scores visible in Purview Unified Catalog.
Team: 2 data engineers, 1 SAS administrator, 1 governance/Purview specialist, 1 Manta specialist, 1 data steward.
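Two of the Purview DQ rule types listed above (completeness, uniqueness) reduce to simple scores over rows. The real rules run inside Purview DQ; this minimal sketch just pins down what the scores mean:

```python
# Minimal sketch of completeness and uniqueness scores as fractions.
# Illustration only — the production rules run in Purview DQ.
def completeness(rows, column):
    """Fraction of rows where `column` is present and non-null."""
    if not rows:
        return 1.0
    ok = sum(1 for r in rows if r.get(column) is not None)
    return ok / len(rows)

def uniqueness(rows, column):
    """Fraction of non-null values in `column` that are distinct."""
    vals = [r[column] for r in rows if r.get(column) is not None]
    if not vals:
        return 1.0
    return len(set(vals)) / len(vals)

rows = [{"id": 1}, {"id": 1}, {"id": None}]
assert completeness(rows, "id") == 2 / 3
assert uniqueness(rows, "id") == 0.5
```

The cross-dataset referential-integrity checks noted in the table (which Purview cannot do natively) would follow the same score-per-rule pattern, implemented in Databricks notebooks.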
Phase 3 — BI, AI/ML & Semantic Layer (Weeks 15–20)
Objective: Enable Power BI consumption via Mirrored Catalog (Scenario D). Deploy initial AI/ML workloads. Establish the semantic layer.
| Deliverable | Details |
|---|---|
| Mirrored Databricks Catalog | Configure Fabric Mirrored Catalog for Gold-layer UC catalog. Auto-sync validated (tables, views mirrored to OneLake). Verify sync latency meets batch-refresh SLA. |
| Power BI semantic models (thin) | Thin semantic models on mirrored tables (Direct Lake mode). Formatting, hierarchies, display folders — no business logic in DAX. Tabular Editor Semantic Bridge evaluated for automating Metric View → PBI translation. Power BI workspace governance: endorsement, certification, RLS passthrough. |
| Fabric workspace setup | Fabric capacity provisioned (F64 or higher for production Direct Lake). Dev/Test capacity separate. Purview sensitivity labels propagating to Fabric items. |
| AI/ML platform | Mosaic AI Model Serving configured for inference endpoints. MLflow Tracking Server integrated with Unity Catalog (model registry). Feature Engineering tables in Gold layer, queryable via Metric Views. Initial ML use case deployed (e.g., churn prediction, fraud scoring). |
| LakehouseIQ / Genie | Genie configured against Metric Views for natural-language queries. Business user pilot group defined. |
| Delta Sharing | Outbound shares configured for inter-business-unit data sharing. Clean rooms evaluated for partner data collaboration (if applicable). |
Exit criteria: Power BI dashboards running on Direct Lake via mirrored catalog. Sub-second query performance validated. ML model serving endpoint live. Genie pilot running.
Team: 2 BI developers, 1 Fabric admin, 2 data scientists/ML engineers, 1 Databricks platform admin.
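The Mirrored Catalog sync-latency validation above amounts to comparing the last commit on the UC source table with the last sync of its OneLake mirror against the batch-refresh SLA. In practice the timestamps would come from Delta history / Fabric APIs; the sketch below assumes they are already available as datetimes:

```python
# Sketch of the sync-latency SLA check for a mirrored table. Timestamp
# sourcing (Delta history, Fabric APIs) is out of scope here — assumed given.
from datetime import datetime, timedelta, timezone

def sync_within_sla(source_commit_ts: datetime,
                    mirror_sync_ts: datetime,
                    sla: timedelta) -> bool:
    """True if the mirror lags the source by no more than the SLA."""
    return (source_commit_ts - mirror_sync_ts) <= sla

now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
assert sync_within_sla(now, now - timedelta(minutes=10), timedelta(minutes=15))
assert not sync_within_sla(now, now - timedelta(hours=1), timedelta(minutes=15))
```

Wiring this into the Phase 4 alerting (freshness SLA breaches) gives the fallback-to-Import-mode trigger named in the risk register.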
Phase 4 — Hardening, DR & Go-Live (Weeks 21–24)
Objective: Production readiness. Disaster recovery. Operational handover.
| Deliverable | Details |
|---|---|
| Disaster recovery | Canada East workspace (passive) with Terraform-identical config. ADLS geo-replication (GRS) to Canada East. UC metastore registered in Canada East (active-passive). DR runbook: RTO < 4h, RPO < 1h for Gold layer. DR test executed and documented. |
| Monitoring & alerting | Azure Monitor dashboards for Databricks cluster utilization, job failures, DLT pipeline health. Purview DQ score trends. Fabric capacity utilization alerts. Custom alerts: data freshness SLA breaches, Unity Catalog permission changes, failed login attempts. Integration with Greenfield's ServiceNow / PagerDuty. |
| Cost management | Databricks DBU consumption dashboards (by team, by workload type). Fabric CU utilization tracking. FinOps tagging strategy enforced (cost center, project, environment). Spot instance policies for job clusters (70% savings target). Serverless SQL Warehouse auto-scaling tuned. |
| Security hardening | Penetration test on Private Link endpoints. Data exfiltration protection validated (no public egress from compute plane). Customer-managed key rotation tested. Compliance audit: OSFI, AMF, Law 25 checklist completed. |
| Operational runbooks | Incident response playbook for data pipeline failures. Capacity scaling procedures. User onboarding process (Entra ID → SCIM → UC grants). Change management process for schema evolution. |
| Knowledge transfer | Platform admin training (Databricks, Fabric, Purview). Data engineering onboarding (DLT, DABs, Medallion patterns). BI developer onboarding (Mirrored Catalog, Direct Lake, thin semantic models). Data steward onboarding (UC governance, DQ rules, Purview). |
| Go-Live gate | Sign-off from: Security, Architecture, Operations, Business stakeholders. Hypercare period: 4 weeks post-go-live with dedicated support. |
Exit criteria: DR test passed. Monitoring operational. Security audit cleared. Operations team trained. Go-Live approved.
Team: 1 platform architect, 1 security engineer, 1 SRE/ops engineer, 1 FinOps analyst, all domain teams for validation.
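The user-onboarding runbook above (Entra ID → SCIM → UC grants) can be made deterministic by generating the GRANT statements from a role map. A sketch under stated assumptions: the role-to-catalog mapping (analysts → SELECT on the curated/Silver catalog, engineers → ALL PRIVILEGES on the raw/Bronze catalog, per the Phase 1 grants) and the group names are illustrative:

```python
# Sketch: generate UC GRANT statements for a newly SCIM-synced group.
# Role-to-catalog mapping and group names are illustrative assumptions.
GRANTS_BY_ROLE = {
    "analyst":  [("SELECT", "curated")],        # Silver-layer read access
    "engineer": [("ALL PRIVILEGES", "raw")],    # Bronze-layer engineering
}

def onboarding_grants(group: str, role: str):
    """Return the UC GRANT statements to run for a newly synced group."""
    return [
        f"GRANT {priv} ON CATALOG {catalog} TO `{group}`;"
        for priv, catalog in GRANTS_BY_ROLE[role]
    ]

stmts = onboarding_grants("gf-data-analysts", "analyst")
assert stmts == ["GRANT SELECT ON CATALOG curated TO `gf-data-analysts`;"]
```

Keeping the map in version control (alongside the DABs) makes every permission change reviewable, which also feeds the Unity Catalog permission-change alerts above.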
3. Gantt Summary
Week        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Phase 0   ├───────┤
Phase 1            ├─────────────┤
Phase 2                           ├────────────────┤
Phase 3                                             ├────────────────┤
Phase 4                                                               ├──────────┤

Phase 0 — Foundation (Infra, Network, UC)
Phase 1 — Data Engineering (Bronze/Silver, Ingestion, CI/CD)
Phase 2 — Gold, Governance, SAS Viya
Phase 3 — BI, AI/ML, Semantic Layer
Phase 4 — DR, Hardening, Go-Live
Phases overlap intentionally. Phase 1 ramps up before Phase 0 formally closes (core infrastructure is usable by week 2, UC bootstrap by week 3), and Phase 3 starts while Phase 2 Gold tables are still being built (Fabric setup proceeds in parallel).
4. Team Composition
| Role | Phase 0 | Phase 1 | Phase 2 | Phase 3 | Phase 4 | FTE Total |
|---|---|---|---|---|---|---|
| Platform Architect (Lead) | ● | ● | ● | ● | ● | 1 |
| Cloud Infrastructure Engineers | ●● | ● | ● | | | 2 |
| Databricks Platform Admin | ● | ● | ● | ● | ● | 1 |
| Data Engineers | | ●●● | ●● | | | 3 |
| Data Stewards | | ● | ● | ● | | 1 |
| Governance / Purview Specialist | | | ● | | | 1 |
| Manta Lineage Specialist | | | ● | | | 1 |
| SAS Administrator | | | ● | | | 1 |
| BI Developers | | | | ●● | | 2 |
| Data Scientists / ML Engineers | | | | ●● | | 2 |
| Fabric Admin | | | | ● | | 1 |
| Security Architect | ● | | | | ● | 1 |
| DevOps / SRE | | ● | | | ● | 1 |
| FinOps Analyst | | | | | ● | 1 |
| Peak concurrent FTEs | 5 | 7 | 7 | 7 | 6 | ~12 unique |
5. Key Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Mirrored Catalog sync latency exceeds SLA | BI dashboards show stale data | Low (GA since July 2025) | Monitor sync latency from week 15. Fallback to Import mode for critical dashboards. |
| SAS Viya ODBC performance on Gold tables | Actuarial models run slower than on-prem | Medium | Size Databricks SQL Warehouse for SAS query patterns. Pre-materialize SAS-specific aggregates in Gold. |
| Unity Catalog metastore region limitation | Cross-region governance complexity | Low | One metastore per region by design. DR metastore in Canada East with sync tooling. |
| Law 25 / OSFI compliance gaps | Regulatory non-compliance | Medium | Engage compliance team in Phase 0. Column masking + audit logging from day one. Purview sensitivity labels enforced before Gold layer is exposed. |
| Metric Views maturity (GA late 2025) | Edge cases not covered by YAML spec v1.1 | Medium | Start with simple metrics. Complex calculations stay in Gold materialized views until Metric Views spec matures. |
| Terraform state drift | Infrastructure inconsistency | Low | State locking in Azure Storage. Drift detection in CI/CD pipeline. No manual changes to production resources. |
| Team skill gaps (Databricks, DLT, DABs) | Delivery delays | High | Databricks Academy training in weeks 1–2. Pair programming with Databricks CSA during Phase 1. |
6. Dependencies & Assumptions
Dependencies
- Azure subscription provisioning and networking (Greenfield cloud team) — needed by week 1
- Entra ID group structure for RBAC — needed by week 2
- Databricks Premium contract signed — needed by week 1
- Fabric capacity reservation — needed by week 14
- SAS Viya environment available on Azure — needed by week 9
- IBM Manta license and deployment — needed by week 10
- Source system access (JDBC credentials, firewall rules, on-prem gateway) — needed by week 4
Assumptions
- Greenfield's Azure landing zone (CAF) is in place or will be established as part of Phase 0
- Databricks-primary architecture (v8.0) is the approved reference — not Fabric-primary (v1.0)
- 3–5 priority data domains are identified before Phase 1 starts
- Existing data mapping / data dictionaries are available for priority domains
- Greenfield's DevOps tooling (Azure DevOps or GitHub) is available for CI/CD
7. Cost Estimate (Order of Magnitude)
| Component | Monthly Estimate (CAD) | Notes |
|---|---|---|
| Databricks Premium (DBUs) | $40,000–$80,000 | Depends on cluster sizing, spot usage, serverless adoption |
| ADLS Gen2 Storage | $3,000–$8,000 | ~50 TB initial, growing with Bronze retention |
| Azure Networking (Private Link, NAT GW) | $2,000–$4,000 | Fixed cost, scales with endpoints |
| Fabric Capacity (F64) | $12,000–$15,000 | Reserved capacity recommended |
| Azure Key Vault + Monitor | $500–$1,000 | |
| IBM Manta License | TBD | Enterprise licensing, negotiate with IBM |
| SAS Viya on Azure | Existing contract | Assumed already budgeted |
| Total platform (excl. Manta & SAS) | $57,500–$108,000/mo | $690K–$1.3M/year |
Personnel costs (12 FTEs × 6 months) are separate and depend on Greenfield's internal vs. consulting mix.
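The totals in the table can be sanity-checked arithmetically (low/high monthly bounds in CAD, excluding Manta and SAS Viya as noted):

```python
# Arithmetic check of the monthly and annual totals in the cost table.
components = {
    "Databricks DBUs":   (40_000, 80_000),
    "ADLS Gen2":         (3_000, 8_000),
    "Networking":        (2_000, 4_000),
    "Fabric F64":        (12_000, 15_000),
    "Key Vault+Monitor": (500, 1_000),
}
low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
assert (low, high) == (57_500, 108_000)               # matches table total
assert (low * 12, high * 12) == (690_000, 1_296_000)  # ~ $690K–$1.3M/year
```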
8. Success Criteria
| Metric | Target | Measured At |
|---|---|---|
| Gold-layer tables available for priority domains | 3–5 domains | End of Phase 2 |
| Power BI Direct Lake dashboards operational | ≥ 5 dashboards | End of Phase 3 |
| Data quality scores visible in Purview | 100% of Gold tables scanned | End of Phase 2 |
| End-to-end lineage (SAS → Databricks → PBI) | Operational in Manta | End of Phase 2 |
| SAS Viya actuarial models validated on Gold data | 100% of migrated models | End of Phase 2 |
| DR failover test passed | RTO < 4h, RPO < 1h | End of Phase 4 |
| Security audit cleared (OSFI, AMF, Law 25) | No critical findings | End of Phase 4 |
| ML model serving endpoint live | ≥ 1 use case | End of Phase 3 |