Greenfield – Modern Data Platform Architecture

CONFIDENTIAL

GREENFIELD Data & AI Solutions
Modern Data Platform – High-Level Reference Architecture
Classification: Internal – Confidential
Status: Draft v8.0 – Fabric IQ Evaluation
Date: March 2026
Owner: VP, Chief Data Officer – Data & AI Solutions

# 1. Executive Summary

This document presents the high-level reference architecture for the Greenfield Modern Data Platform. It defines the strategic blueprint for a unified, secure, and scalable data ecosystem that serves the entire organization’s data and AI needs, from operational reporting to advanced machine learning and generative AI.

The architecture is designed to support Greenfield’s federated data operating model under the leadership of the Chief Data Officer, enabling both centralized governance and decentralized execution across business units. It leverages three core technology pillars—Databricks, Microsoft Fabric, and SAS Viya—running on Microsoft Azure, each assigned clear, non-overlapping responsibilities to maximize return on investment. Databricks serves as the primary data engineering, warehousing, and AI/ML platform; Microsoft Fabric is scoped strictly as a BI serving layer for Power BI Direct Lake consumption; and SAS Viya (Compute Server) is retained for regulated actuarial and risk analytics with batch processing. A rationalized three-tier governance model integrates Microsoft Purview, Databricks Unity Catalog, and Manta (IBM) for end-to-end lineage, while retiring overlapping IBM Knowledge Catalog capabilities.

Key design drivers: regulatory compliance (AMF, OSFI, Privacy Act), risk management rigor, data democratization at scale, AI-readiness, and cost optimization through workload-appropriate technology placement.

# 2. Vision & Guiding Principles

## 2.1 Vision Statement

Establish a trusted, enterprise-wide data and AI platform that treats data as a strategic asset, enables self-service analytics at all levels of the organization, and accelerates the responsible adoption of artificial intelligence to serve Greenfield’s members and clients.

## 2.2 Architecture Principles

- **Data as a Product.** Every critical dataset is managed with the rigor of a product: documented, quality-measured, discoverable, and owned by an accountable data owner.
- **Unified Governance, Federated Execution.** Governance policies, standards, and security rules are defined centrally; execution of data engineering and analytics can be delegated to business units with mature capabilities.
- **Security by Design.** Zero-trust principles, row-level and column-level security, dynamic data masking, and comprehensive lineage tracking are embedded at every layer, not bolted on.
- **Cloud-Native & Open Standards.** Prefer open formats (Delta Lake, Parquet, Iceberg), open APIs, and cloud-native services to avoid vendor lock-in and ensure long-term portability.
- **Right Tool for the Right Workload.** Each technology platform is assigned workloads that align with its strengths; duplication of capabilities across platforms is actively avoided.
- **Self-Service with Guardrails.** Enable business users and data scientists to access data autonomously within a governed perimeter that enforces compliance, quality, and cost controls.
- **AI-Ready by Default.** The platform is designed from the ground up to support the full AI lifecycle: feature engineering, model training, evaluation, deployment, monitoring, and responsible AI compliance.

# 3. Logical Architecture
The Modern Data Platform follows a layered architecture aligned with industry standards (DAMA-DMBOK) and the EDM Council’s DCAM framework. Each layer has well-defined responsibilities, interfaces, and technology assignments.

## 3.1 Architecture Layers Overview

The platform is organized into seven logical layers, from source ingestion to consumption, wrapped by cross-cutting concerns for governance, security, and operations.

| Layer | Responsibility | Key Components |
|---|---|---|
| 1. Ingestion | Acquire data from source systems with full lineage and change tracking | ADF, Databricks Auto Loader, Kafka/Event Hub, SAS Data Connectors, APIs |
| 2. Raw / Bronze | Store raw data as-is in immutable format for auditability and reprocessing | ADLS Gen2 (Delta Lake format), Unity Catalog external tables |
| 3. Curated / Silver | Cleanse, conform, validate, and integrate data; apply quality rules and business keys | Databricks Delta Live Tables, dbt, Great Expectations, SAS Data Quality |
| 4. Business / Gold | Produce business-ready data products: dimensional models, aggregates, feature stores, KPIs | Databricks SQL Warehouses, Fabric OneLake shortcuts (Direct Lake), SAS Viya Compute Server, Semantic Models |
| 5. Semantic | Provide a unified business vocabulary and metrics layer; single source of truth for KPIs | Fabric Power BI Semantic Models, Databricks AI/BI Dashboards, dbt Metrics |
| 6. Serving / Consumption | Deliver data to end users, applications, and AI models through appropriate interfaces | Power BI, SAS Visual Analytics, Databricks SQL, REST APIs, JDBC/ODBC |
| 7. AI/ML | Support the full AI lifecycle from experimentation to production model serving | Databricks MLflow, Model Serving, Feature Store, Azure OpenAI, SAS Model Manager |

## 3.2 Cross-Cutting Layers

Three transversal layers span the entire architecture and enforce enterprise standards across all data activities.

| Cross-Cutting Layer | Scope & Capabilities |
|---|---|
| Governance & Metadata | Three-tier catalog (Microsoft Purview + Databricks Unity Catalog + Manta lineage engine), business glossary, end-to-end data lineage, data quality monitoring, policy management, data classification, stewardship workflows |
| Security & Privacy | Authentication (Azure AD/Entra ID), RBAC, ABAC, row-level security (RLS), column-level security (CLS), dynamic data masking (DDM), encryption at rest/in transit, DLP policies, privacy controls (consent, anonymization, pseudonymization) |
| Observability & Operations | Pipeline monitoring (ADF, Databricks Workflows), cost management (Azure Cost Management, Databricks Account Console), SLA tracking, alerting, incident management, capacity planning |

# 4. Technology Platform Mapping

The architecture assigns clear primary responsibilities to each of the three technology platforms to avoid functional overlap, reduce cost, and simplify governance. The guiding principle is that each platform excels where it is strongest, and data flows between them through controlled interfaces.

## 4.1 Platform Roles

| Capability | Databricks | Microsoft Fabric | SAS Viya |
|---|---|---|---|
| Data Engineering | PRIMARY – Large-scale ETL, Delta Live Tables, streaming | RESTRICTED – Minimal Dataflows Gen2 for Power BI semantic model prep only; no general ETL | TARGETED – SAS Compute Server data prep for actuarial/risk models |
| Data Warehousing | PRIMARY – Databricks SQL Warehouses, Lakehouse | BI SERVING LAYER – OneLake shortcuts to Delta tables; Direct Lake mode for Power BI only | — |
| BI & Reporting | SECONDARY – Databricks AI/BI Dashboards, Genie | PRIMARY – Power BI (enterprise standard) | TARGETED – SAS Visual Analytics for actuarial/risk |
| ML / AI | PRIMARY – MLflow, Feature Store, Model Serving, GenAI | — | PRIMARY (Specialized) – Actuarial models, risk scoring, SAS Model Manager |
| Advanced Analytics | PRIMARY – Python/R/Scala notebooks, collaborative DS | — | PRIMARY (Specialized) – SAS procedures, econometrics, regulatory models |
| Real-Time / Streaming | PRIMARY – Structured Streaming, Auto Loader | SECONDARY – Eventhouse (KQL-based analytics) | — |
| Data Catalog & Governance | PRIMARY (Technical) – Unity Catalog (fine-grained ACL, lineage) | PRIMARY (Enterprise) – Microsoft Purview (classification, glossary, policies) | TARGETED – SAS Information Catalog for SAS assets |
| Cross-Platform Lineage | CONTRIBUTING – Unity Catalog column-level lineage (within lakehouse) | CONTRIBUTING – Purview lineage visualization (consumer) | PRIMARY – Manta: Automated code-level lineage across all platforms incl. SAS |
| Semantic Intelligence | CONTRIBUTING – Genie NL querying; Gold data products as ontology source | EVALUATE (Horizon 2–3) – Fabric IQ ontology, graph engine, Data/Operations Agents | — |

## 4.2 Data Flow Between Platforms

A critical design decision is how data flows across platforms while maintaining a single source of truth. The architecture enforces the following data flow principles:

**ADLS Gen2 as the shared storage substrate.** All three platforms read and write to a common Azure Data Lake Storage Gen2 layer using Delta Lake as the canonical format. This avoids data duplication and ensures consistency.

**Unity Catalog as the access control plane.** Databricks Unity Catalog provides fine-grained access control (RLS, CLS, DDM) enforced at query time for all Databricks and external workloads consuming Delta tables.

**OneLake shortcuts as a BI serving layer.** Microsoft Fabric does not hold its own copy of the warehouse. Instead, it accesses Databricks-managed Delta tables through OneLake shortcuts—zero-copy pointers that enable Power BI Direct Lake mode. Direct Lake reads Delta/Parquet files directly into Power BI’s VertiPaq in-memory engine, delivering import-like performance with DirectQuery-like freshness. This is the sole architectural justification for Fabric in the data warehousing layer: it serves as the optimized BI delivery channel for the 55,000-employee user base, not as a parallel data platform.

**SAS Viya Compute Server integration.** SAS Viya Compute Server connects to the curated/gold layers through JDBC LIBNAME to Databricks SQL Warehouses (preferred, as it enforces Unity Catalog RLS/CLS/DDM) or through ADLS LIBNAME for non-sensitive, pre-authorized datasets. Data is processed sequentially by the Compute Server engine and results are written to the staging zone for promotion to Gold. There is no intermediate in-memory layer.

**Manta as the lineage fabric.** Manta continuously scans code and metadata across all three platforms to produce a unified, end-to-end lineage graph. This lineage is published to Purview for enterprise visualization and to Unity Catalog for technical consumers, closing the lineage gap that no single platform can bridge alone—especially for SAS Viya workloads.

## 4.3 Microsoft Fabric – Scope, Guardrails & Anti-Patterns

Given the functional overlap between Databricks SQL Warehouses and Microsoft Fabric’s Lakehouse/Warehouse capabilities, a precise scoping of Fabric’s role is essential to prevent duplication and control costs. The architecture enforces the following boundary:

**Fabric’s Authorized Scope**

- Power BI semantic models connected to Databricks-managed Delta tables via OneLake shortcuts in Direct Lake mode (see the sketch following this list).
- Power BI report and dashboard development, including paginated reports, Copilot-assisted analytics, and capacity management for the enterprise BI workload.
- Minimal Dataflows Gen2 transformations strictly limited to last-mile BI semantic model preparation (e.g., calculated columns, measure groups) that are better expressed in Power Query than SQL.
- Fabric Eventhouse (KQL) for real-time operational dashboards where KQL-native analytics are the best fit.
- Fabric IQ (Horizon 2–3, pending GA): ontology modeling of business entities bootstrapped from existing Power BI semantic models, graph engine for relationship traversal, and Data/Operations Agents for AI-grounded reasoning. Subject to GA release, Purview integration validation, and confirmation that Unity Catalog RLS/CLS/DDM enforcement applies to agent data access paths. See Section 4.5 for the full evaluation framework.
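To make the shortcut mechanism concrete, the following hedged sketch registers a OneLake shortcut that points a Fabric lakehouse at a Databricks-managed Gold Delta table, using the Fabric REST “Create Shortcut” API. The workspace, lakehouse, and connection identifiers, the storage account, and the paths are illustrative placeholders; the payload shape should be verified against current Microsoft Fabric API documentation before use.

```python
# Hedged sketch: create a zero-copy OneLake shortcut to a Databricks-managed
# Gold table so Power BI can read it in Direct Lake mode. All IDs/paths are
# illustrative placeholders, not real Greenfield resources.
import requests

FABRIC_API = "https://api.fabric.microsoft.com/v1"
WORKSPACE_ID = "<workspace-guid>"       # Fabric workspace hosting the lakehouse
LAKEHOUSE_ID = "<lakehouse-item-guid>"  # target lakehouse item

def create_gold_shortcut(token: str, table_name: str) -> None:
    """Register a shortcut under Tables/ (Direct Lake eligible location)."""
    payload = {
        "path": "Tables",
        "name": table_name,
        "target": {
            "adlsGen2": {
                "connectionId": "<adls-connection-guid>",   # governed connection
                "location": "https://greenfieldlake.dfs.core.windows.net",
                "subpath": f"/gold/customer_360/{table_name}",
            }
        },
    }
    resp = requests.post(
        f"{FABRIC_API}/workspaces/{WORKSPACE_ID}/items/{LAKEHOUSE_ID}/shortcuts",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()  # surface capacity/permission errors to the pipeline
```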
**Prohibited Anti-Patterns**

The following patterns are explicitly prohibited and should be enforced through architecture review gates and Fabric capacity governance:

- **Fabric as a parallel warehouse.** Building a Fabric Warehouse or Lakehouse with its own ETL pipelines, ingestion flows, or data models that duplicate data already managed in the Databricks Lakehouse.
- **Fabric Data Factory as a primary orchestrator.** All data engineering orchestration runs through Azure Data Factory or Databricks Workflows. Fabric Data Factory pipelines should not be used to create a shadow ETL layer.
- **Fabric notebooks for data engineering.** Spark notebooks for data transformation must run in Databricks, where Unity Catalog governance, lineage, and cost controls are enforced. Fabric Spark should not be used as an alternative compute.
- **Uncontrolled Fabric capacity growth.** Fabric F-SKU capacity must be right-sized for the BI serving workload only. Requests to expand capacity for data engineering or warehousing workloads trigger an architecture review.
- **Fabric IQ as justification for data migration to OneLake.** Fabric IQ’s ontology and agents must consume Gold data products through governed access paths (OneLake shortcuts to Databricks Delta tables or JDBC to Databricks SQL Warehouses). The introduction of Fabric IQ must not become a rationale for moving ETL, data engineering, or warehousing workloads into Fabric. The data platform remains Databricks; Fabric IQ is a semantic intelligence layer, not a data platform.

**Rationale:** Serving Power BI directly from Databricks SQL Warehouses via DirectQuery is technically possible but imposes a significant cost and performance penalty at enterprise scale. Every dashboard refresh from 55,000 users would consume Databricks DBU compute, and DirectQuery latency (per-query round-trips) degrades the BI experience compared to Direct Lake’s in-memory reads. The Fabric BI serving layer exists solely to solve this cost-performance equation. It is not a second data platform.

## 4.4 Medallion Architecture – Layer-by-Layer Implementation

This section details how each layer of the Medallion architecture is implemented across the technology stack, what transformations occur at each transition, and how data physically flows from source systems to consumption endpoints. The Medallion pattern (Bronze → Silver → Gold) provides progressive refinement: each layer adds structure, quality, and business meaning while preserving full auditability back to the raw source.

Figure 1 below provides the end-to-end architectural view of the Medallion data flow, from source systems through ingestion, Bronze, Silver, and Gold layers, to the five consumption endpoints. Cross-cutting governance, security, and observability layers span the entire architecture.

Figure 1 – Medallion Architecture: End-to-End Data Flow (Source Systems → Ingestion → Bronze → Silver → Gold → Serving & Consumption)

### 4.4.1 Source Systems & Ingestion

Purpose: Acquire data from operational and external source systems and land it reliably into the Bronze layer with full change tracking and provenance metadata.
| Ingestion Pattern | Technology | Use Cases |
|---|---|---|
| Batch – File-based | Databricks Auto Loader (cloudFiles) with schema inference and evolution | Flat files (CSV, JSON, XML) from SFTP, mainframe extracts, regulatory feeds, partner data drops |
| Batch – Database | Azure Data Factory (ADF) Copy Activity with change data capture (CDC) or watermark-based incremental loads | Core banking systems (DB2, Oracle, SQL Server), insurance policy administration, CRM |
| Near Real-Time / CDC | ADF CDC connectors or Debezium → Azure Event Hub → Databricks Structured Streaming | Transaction feeds, account balance updates, claims status changes |
| Real-Time / Streaming | Azure Event Hub → Databricks Structured Streaming (with Auto Loader for exactly-once guarantees) | Fraud detection event streams, payment transactions, IoT telemetry (telematics insurance) |
| API-based | ADF Web Activity or Databricks notebooks with REST client libraries, orchestrated by Databricks Workflows | Market data feeds, credit bureau lookups, external regulatory APIs, SaaS platform extracts |
| SAS-Native Ingestion | SAS Viya Compute Server (JDBC LIBNAME to Databricks SQL Warehouse or ADLS LIBNAME for authorized paths) | Actuarial data feeds requiring SAS-native processing; reads from the governed lake via JDBC (enforces UC policies) or ADLS direct for pre-authorized datasets |

Orchestration: All ingestion pipelines are orchestrated by Databricks Workflows (preferred for Databricks-native jobs) or Azure Data Factory (for hybrid and cross-system orchestration). Each ingestion job writes provenance metadata (source system, extraction timestamp, row counts, schema version) as Delta table properties and to a centralized ingestion audit log.
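A minimal sketch of the file-based ingestion pattern above, assuming a Databricks notebook or job context (where `spark` is provided): Auto Loader with schema inference and evolution, plus the provenance columns the architecture mandates. Paths, checkpoint locations, batch-ID convention, and table names are illustrative, not the actual Greenfield conventions.

```python
# Sketch: Auto Loader (cloudFiles) ingestion into a Bronze Delta table with
# provenance metadata. Assumes a Databricks runtime where `spark` is ambient.
from pyspark.sql import functions as F

raw_path = "abfss://landing@greenfieldlake.dfs.core.windows.net/core_banking/accounts/"

bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/accounts/_schema")
    .option("cloudFiles.inferColumnTypes", "true")   # schema inference
    .load(raw_path)
    # Provenance metadata columns required on every Bronze table
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
    .withColumn("_batch_id", F.lit(spark.conf.get("pipeline.batch_id", "manual")))
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/checkpoints/bronze/accounts")
    .option("mergeSchema", "true")        # tolerate upstream schema evolution
    .trigger(availableNow=True)           # incremental batch-style run
    .toTable("bronze_core_banking.accounts"))
```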
#### 4.4.1a Pre-Bronze Quality Gate – Ingestion-Level DQ Assessment

Purpose: Detect structural and statistical anomalies at the point of ingestion, before data enters the Bronze layer. This early warning system catches source-side problems (schema breaks, missing files, data drift, encoding corruption, anomalous distributions) before they propagate through the Medallion pipeline. The assessment uses a sampling approach to avoid the cost and latency of full data scans.

Technology: Microsoft Purview Data Quality, integrated with ADLS Gen2 staging zone assets registered in the Purview catalog. Purview DQ scans are triggered by pipeline completion (via ADF/Databricks Workflows REST API call) or run on schedule for continuous feeds.

| Data Domain Criticality | Sampling Strategy | Checks Applied |
|---|---|---|
| Critical Data Elements (CDEs) | 10–20% of ingested batch; 100% for low-volume regulatory feeds. CDEs include customer identifiers, account numbers, financial amounts, regulatory fields | Completeness (null rates), format validation, referential integrity to reference data, value range checks |
| Non-CDE Columns | 1–5% sample, sufficient to detect structural anomalies on descriptive attributes, metadata, free text fields | Schema conformance, cardinality drift, encoding validation, basic type checks |
| Streaming / CDC | Sampling window: every 1,000 records or every 5-minute micro-batch. Quality profile run on the sampled window | Volume anomaly detection (count vs. historical baseline), null spike detection, schema drift |
| All Ingestions | Applied to every batch regardless of sampling rate | Row count thresholds (vs. historical baseline to detect truncated or duplicate loads), schema conformance (expected columns, types) |

Flow: Source → ADF/Auto Loader → ADLS Staging Zone → Purview DQ Sampling Assessment → Bronze Delta Table (if pass) or Quarantine + Alert (if fail). Quality scores are published to the Purview governance hub and feed the unified DQ dashboard.

Disposition logic: Critical failures (CDE null rate above threshold, schema break, volume anomaly exceeding ±50% of baseline) block promotion to Bronze and route the batch to quarantine with an automated alert to the data engineering team and the responsible data steward. Non-critical deviations (minor drift in non-CDE columns) are logged as warnings, the batch is promoted to Bronze, and the deviation is flagged for steward review within the SLA window.
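The disposition logic can be expressed as a small gate function. The sketch below is illustrative only: the helpers that move batches and raise alerts are hypothetical stand-ins, and retrieval of the Purview DQ scores (exposed through its REST API) is abstracted into the `BatchScores` input. Thresholds are examples, not the contracted values.

```python
# Sketch of the pre-Bronze gate rules from Section 4.4.1a.
from dataclasses import dataclass

@dataclass
class BatchScores:
    cde_null_rate: float      # worst null rate across the batch's CDEs
    schema_conformant: bool   # expected columns and types all present
    volume_ratio: float       # batch row count / historical baseline

# Hypothetical helpers: in practice these move files between ADLS zones and
# raise alerts through the incident-management tooling.
def move_to_quarantine(path: str) -> None:
    print(f"quarantined: {path}")

def promote_to_bronze(path: str) -> None:
    print(f"promoted: {path}")

def alert_steward(path: str, scores: BatchScores) -> None:
    print(f"alert raised for {path}: {scores}")

def disposition(scores: BatchScores, batch_path: str) -> str:
    """Block on critical failures; promote (with warning) otherwise."""
    critical_fail = (
        scores.cde_null_rate > 0.02                 # illustrative CDE threshold
        or not scores.schema_conformant             # schema break
        or abs(scores.volume_ratio - 1.0) > 0.50    # volume anomaly beyond ±50%
    )
    if critical_fail:
        move_to_quarantine(batch_path)
        alert_steward(batch_path, scores)
        return "quarantined"
    promote_to_bronze(batch_path)   # non-critical deviations are logged as warnings
    return "promoted"
```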
### 4.4.2 Bronze Layer – Raw Immutable Store

Purpose: Store ingested data in its original form with minimal transformation. The Bronze layer is the system of record for auditability and reprocessing. Data is append-only and immutable. Only batches that have passed the pre-Bronze quality gate (Section 4.4.1a) are promoted to this layer; failed batches reside in the quarantine staging area.

| Aspect | Implementation Detail |
|---|---|
| Storage | ADLS Gen2 in Delta Lake format, organized by source system and entity (e.g., /bronze/core_banking/accounts/) |
| Catalog | Registered in Unity Catalog under a dedicated bronze catalog (e.g., bronze_core_banking.accounts). Access restricted to data engineering roles only |
| Schema Management | Schema-on-read with Auto Loader schema inference; schema evolution enabled (mergeSchema) to accommodate upstream changes without pipeline failures |
| Transformations | Minimal: add ingestion metadata columns (_ingested_at, _source_file, _batch_id), cast to consistent types where necessary, preserve original column names |
| Retention | Delta Time Travel retained for 90 days minimum (regulatory requirement); VACUUM policy set accordingly. Full historical snapshots retained for audit-critical domains |
| Quality Checks | Structural validation only: row count thresholds, null-check on primary keys, schema drift detection. Failed ingestions quarantined to a _quarantine sub-table |

Transition to Silver: Bronze → Silver pipelines are implemented as Databricks Delta Live Tables (DLT) declarative pipelines. DLT provides automatic dependency management, built-in expectations (quality gates), and lineage tracking integrated with Unity Catalog. Each Bronze table maps to one or more Silver tables through a transformation specification defined in the pipeline code.

### 4.4.3 Silver Layer – Curated & Conformed

Purpose: Cleanse, conform, deduplicate, and integrate data into an enterprise-consistent shape. The Silver layer is where data becomes trustworthy and reusable across business domains. It applies business keys, enforces referential integrity, and runs the bulk of data quality validations.

| Aspect | Implementation Detail |
|---|---|
| Storage | ADLS Gen2 in Delta Lake format, organized by business domain (e.g., /silver/customer/, /silver/claims/, /silver/transactions/) |
| Catalog | Registered in Unity Catalog under domain-specific silver catalogs (e.g., silver_customer.individual, silver_claims.claim_header). Read access granted to analysts and data scientists with appropriate clearance |
| Transformations | Cleansing: standardize formats (dates, phone numbers, addresses), handle nulls, trim whitespace, normalize encodings. Conforming: apply business keys from MDM, map source codes to enterprise reference data, align to canonical data model. Deduplication: entity resolution and merge logic for multi-source entities (e.g., same customer from banking and insurance). SCD Type 2: slowly changing dimensions tracked with effective_from/effective_to columns for historical accuracy |
| Quality Gates | DLT expectations enforced at this layer: completeness (no null PKs), validity (referential integrity to reference data), accuracy (business rule validations), timeliness (freshness SLA checks). Rows failing expectations are routed to quarantine tables with diagnostic metadata |
| Security | Unity Catalog RLS and CLS enforced from this layer onwards. Sensitive columns (NAS/SIN, account numbers) masked by default via column mask functions; unmasked access requires explicit grant |
| Technology | Databricks Delta Live Tables (primary), complemented by Great Expectations for complex cross-table validations. SAS Data Quality used for SAS-specific data prep workflows feeding actuarial models |

Transition to Gold: Silver → Gold pipelines aggregate, denormalize, and reshape Silver tables into business-ready data products. These pipelines are also implemented in DLT or as scheduled Databricks SQL queries, depending on complexity. The transition is gated by data quality scores: a Silver table must meet its quality SLA (e.g., ≥99.5% completeness on CDEs) before its downstream Gold products are refreshed.
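A minimal Delta Live Tables sketch of the Bronze → Silver quality gate described above, using DLT’s declarative expectations. Table names, cleansing steps, and rules are illustrative, not the actual Greenfield pipeline code.

```python
# Sketch: DLT pipeline with expectations as the Silver quality gate.
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="silver_customer_individual",
    comment="Cleansed, conformed customer records (illustrative)",
)
@dlt.expect_or_drop("pk_not_null", "customer_id IS NOT NULL")          # completeness
@dlt.expect_or_drop("valid_sin_format", "sin IS NULL OR length(sin) = 9")  # validity
@dlt.expect("plausible_birth_date",
            "birth_date BETWEEN '1900-01-01' AND current_date()")      # accuracy (warn only)
def silver_customer_individual():
    return (
        dlt.read_stream("bronze_core_banking_customers")
        .withColumn("phone", F.regexp_replace("phone", "[^0-9]", ""))  # standardize format
        .dropDuplicates(["customer_id", "_source_file"])               # simple dedup step
    )
```

Rows failing `expect_or_drop` rules are removed and counted in the pipeline’s quality metrics; in the full design they would additionally be routed to quarantine tables with diagnostic metadata.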
### 4.4.4 Gold Layer – Business-Ready Data Products

Purpose: Deliver consumption-optimized data products shaped for specific business use cases. The Gold layer contains dimensional models (star schemas), pre-computed aggregates, KPI tables, feature tables for ML, and certified data products with formal SLAs and data contracts.

| Aspect | Implementation Detail |
|---|---|
| Storage | ADLS Gen2 in Delta Lake format, organized by data product domain (e.g., /gold/customer_360/, /gold/risk_features/, /gold/financial_aggregates/) |
| Catalog | Registered in Unity Catalog under product-specific gold catalogs (e.g., gold_customer_360.member_profile, gold_risk.credit_features). Broad read access for authorized business users, analysts, and downstream applications |
| Data Models | Dimensional models: star schemas with conformed dimensions (time, geography, product, organization) shared across fact tables. Wide denormalized tables: flattened views optimized for Direct Lake consumption in Power BI. Feature tables: registered in Databricks Feature Store for ML training and online serving. Aggregate/KPI tables: pre-computed metrics aligned with Power BI semantic model measures and regulatory reporting requirements |
| Data Contracts | Each Gold data product has a formal data contract specifying: schema (columns, types, nullability), freshness SLA (e.g., T+1 by 06:00 EST), quality thresholds (e.g., ≥99.5% completeness on CDEs), access policies (which roles/groups), lineage (upstream Silver sources), and owner (business product owner + engineering contact) |
| Serving Paths | Power BI: Gold Delta tables → OneLake shortcuts → Fabric Direct Lake semantic models. Databricks SQL: analysts and applications query Gold tables via SQL Warehouses (JDBC/ODBC). ML / AI: feature tables consumed by MLflow training jobs and Feature Serving endpoints. SAS Viya: Compute Server reads Gold tables via JDBC LIBNAME (enforces Unity Catalog policies) for actuarial and risk modeling. APIs: REST/GraphQL APIs served through Databricks Model Serving or custom Azure Functions for application integration |
| Optimization | Z-ORDER / liquid clustering on frequently filtered columns (date, business_unit, product_code); OPTIMIZE scheduled nightly; VACUUM per retention policy; table statistics maintained for query optimizer |
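The Optimization row translates into a routine maintenance job. A minimal sketch, assuming a Databricks job context (ambient `spark`) and illustrative table and clustering-column choices:

```python
# Sketch: nightly Gold maintenance — OPTIMIZE with Z-ORDER, then VACUUM per
# retention policy. Table names and column lists are illustrative.
gold_tables = {
    "gold_customer_360.member_profile": ["business_unit", "snapshot_date"],
    "gold_risk.credit_features": ["product_code", "as_of_date"],
}

for table, zorder_cols in gold_tables.items():
    spark.sql(f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})")
    # 2160 hours = 90 days, matching the minimum regulatory time-travel window;
    # actual Gold retention may differ by domain.
    spark.sql(f"VACUUM {table} RETAIN 2160 HOURS")
```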
### 4.4.5 End-to-End Data Flow Summary

The following describes the canonical data flow through the Medallion layers, from source to consumption, with the technology responsible at each stage.

| # | Stage | Technology | Action | Output |
|---|---|---|---|---|
| 1 | Extract | ADF / Auto Loader / Event Hub | Pull data from source systems (batch, CDC, streaming, API) | Raw files or streams landed in ADLS staging zone |
| 2 | Pre-Bronze DQ | Purview Data Quality (sampling) | Sample-based DQ assessment: schema conformance, completeness, volume anomaly detection, distribution drift | Pass → promote to Bronze; Fail → quarantine + alert. DQ scores published to Purview |
| 3 | Bronze Load | Databricks Auto Loader / DLT | Ingest into Delta tables with schema inference; append ingestion metadata | Bronze Delta tables in Unity Catalog (immutable, append-only) |
| 4 | Bronze → Silver | Databricks DLT pipelines | Cleanse, conform, deduplicate, apply business keys, validate quality expectations | Silver Delta tables – enterprise-conformed, quality-gated |
| 5 | Silver → Gold | Databricks DLT / SQL | Aggregate, denormalize, build star schemas, compute features and KPIs | Gold Delta tables – data products with SLAs and data contracts |
| 6a | Serve → BI | OneLake shortcuts → Fabric Direct Lake | Power BI semantic models read Gold Delta tables via zero-copy shortcuts | Power BI dashboards and reports for 55,000 users |
| 6b | Serve → Analytics | Databricks SQL Warehouses | Ad-hoc queries, JDBC/ODBC for applications, notebook exploration | Analyst queries, application data feeds, data science exploration |
| 6c | Serve → AI/ML | Databricks Feature Store / Model Serving | Feature lookup for training and inference; model endpoints for real-time scoring | ML model predictions, real-time fraud scores, recommendation outputs |
| 6d | Serve → SAS | SAS Viya Compute Server via JDBC/ADLS LIBNAME | Compute Server reads Gold tables via JDBC (governed) or ADLS (pre-authorized) for actuarial and risk modeling | SAS analytics outputs, regulatory model results (written back to Gold via Databricks) |

Write-back pattern: SAS Viya Compute Server writes model outputs using a hybrid approach. For large batch outputs (scored portfolios, reserving projections, millions of rows), SAS writes to a dedicated ADLS staging zone (/staging/sas_writeback/{model_domain}/{model_name}/{run_date}/) via the ADLS LIBNAME engine; a Databricks pipeline then validates, applies lineage tags, and promotes the results into the Gold layer as properly registered Delta tables. For small structured outputs (model parameters, regulatory summaries, validation metrics), SAS writes directly via JDBC LIBNAME to Databricks SQL Warehouses, which routes through Unity Catalog governance natively. Direct SAS writes to Gold ADLS paths outside the staging zone are explicitly prohibited.
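The Databricks side of the write-back promotion can be sketched as follows. The staging path follows the convention above; the staged file format, validation rules, column names, and target table are illustrative assumptions.

```python
# Sketch: validate a staged SAS batch, stamp lineage metadata, and promote it
# into Gold as a Unity Catalog-registered Delta table. Assumes a Databricks
# job context (ambient `spark`) and that SAS staged the output as Parquet.
from pyspark.sql import functions as F

def promote_sas_output(model_domain: str, model_name: str, run_date: str) -> None:
    staging = f"/staging/sas_writeback/{model_domain}/{model_name}/{run_date}/"
    df = spark.read.parquet(staging)

    # Contract-style validation before anything reaches Gold (illustrative rules)
    expected_cols = {"account_id", "score", "model_version"}
    missing = expected_cols - set(df.columns)
    if missing:
        raise ValueError(f"SAS batch rejected, missing columns: {missing}")
    if df.filter(F.col("account_id").isNull()).limit(1).count() > 0:
        raise ValueError("SAS batch rejected: null primary keys")

    (df.withColumn("_promoted_at", F.current_timestamp())
       .withColumn("_lineage_source", F.lit(f"sas:{model_domain}/{model_name}/{run_date}"))
       .write.mode("append")
       .saveAsTable(f"gold_{model_domain}.{model_name}_scores"))  # governed Gold table
```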
SAS read-path security: Because SAS Compute Server accesses ADLS files using the service principal’s Azure RBAC permissions (bypassing Unity Catalog’s fine-grained RLS/CLS/DDM), the architecture mandates that SAS reads sensitive datasets exclusively through JDBC LIBNAME to Databricks SQL Warehouses, where Unity Catalog enforcement applies at query time. ADLS LIBNAME is permitted only for non-sensitive datasets where the SAS service principal has been explicitly authorized through the governance process. The ADLS paths accessible to SAS are registered as Unity Catalog external locations with restricted scope.

Lineage continuity: End-to-end lineage across all stages is captured by three complementary mechanisms: Unity Catalog tracks column-level lineage within Databricks (stages 2–5); Manta parses SAS Compute Server code (DATA steps, PROC SQL, macros) to produce cross-platform lineage regardless of runtime engine; and Purview aggregates this into an enterprise lineage view accessible to stewards and compliance teams.

## 4.5 Microsoft Fabric IQ – Semantic Intelligence Layer (Evaluation)

Announced at Microsoft Ignite in November 2025 and currently in public preview, Fabric IQ is a new Fabric workload that elevates the platform from a data platform to an intelligence platform. This section evaluates how Fabric IQ would fit into the Greenfield architecture, defines the conditions for adoption, and establishes guardrails to preserve the architecture’s integrity.

### 4.5.1 What Fabric IQ Provides

Fabric IQ introduces four capabilities that address a gap the current architecture does not yet fill: giving AI agents and operational systems a structured, governed understanding of business meaning.

| Capability | Description | Relevance to Greenfield |
|---|---|---|
| Ontology | A no-code visual model of business entities (e.g., Member, Account, Policy, Claim, Branch), their relationships, properties, rules, and constraints — bound to real data in OneLake. Can be bootstrapped from existing Power BI semantic models | Maps directly to Greenfield’s Gold data products (Customer 360, Risk Feature Store, Financial Aggregates). Existing Power BI semantic models in Direct Lake provide the bootstrapping input |
| Graph Engine | Native graph storage and compute for nodes, edges, and multi-hop traversals. Integrated with the ontology for visual exploration of entity relationships, path finding, and dependency analysis | Multi-hop reasoning across member-account-product-transaction relationships; dependency analysis for regulatory impact assessments; fraud pattern detection across connected entities |
| Data Agent | Conversational Q&A agent that connects to the ontology as a source, enabling business users to ask questions grounded in governed business terminology rather than raw tables and columns | Natural-language exploration of data products for 55,000 users. Complements Power BI dashboards with ad-hoc, conversational analytics grounded in the enterprise ontology |
| Operations Agent | AI agent that monitors real-time data streams, reasons over live conditions using ontology rules and constraints, evaluates trade-offs, and triggers workflows or alerts when business constraints are violated | Real-time monitoring of payment streams, claims events, and risk indicators. The agent reasons over ontology-defined business rules (e.g., “at-risk member” as a composite of missed payments + lapsed insurance + address change) and triggers intervention workflows |

### 4.5.2 Architectural Positioning

Fabric IQ sits as a semantic intelligence layer above the Gold layer and the existing Power BI semantic models. It does not replace any component of the current architecture — it extends the platform upward toward agentic AI by bridging the gap between where data lives (the Databricks Gold layer) and how AI agents and business users reason about business meaning.

**Relationship to the current semantic layer.** Power BI semantic models define measures, hierarchies, and table relationships for BI consumption. Fabric IQ’s ontology models higher-order business entities with typed relationships, rules, and constraints — not just tables with foreign keys. Power BI semantic models become the bootstrapping input for the ontology, meaning the investment in well-designed Gold star schemas and semantic models becomes the foundation that Fabric IQ builds on. There is no rework; there is extension.

**Relationship to Microsoft Purview.** Purview owns the authoritative business glossary, classification schemas, and governance policies in the three-tier catalog model. Fabric IQ’s ontology is also a business vocabulary, but with operational semantics attached (rules, constraints, actions). The architectural boundary must be: Purview governs meaning (definitions, classification, lineage, policy); Fabric IQ operationalizes meaning (runtime rules, entity relationships, agent grounding). The ontology should consume Purview definitions and extend them, never contradict or duplicate them. Microsoft’s Purview-IQ integration must be validated before enterprise deployment.

**Relationship to Databricks.** Databricks remains the data platform. Fabric IQ consumes Gold data products through OneLake shortcuts (zero-copy) or through JDBC connections to Databricks SQL Warehouses. A critical governance question must be validated before adoption: does Unity Catalog’s RLS, CLS, and DDM enforcement apply when Fabric IQ agents access data through the ontology? If agents bypass Unity Catalog fine-grained policies, the security model is compromised. Until this is confirmed, all agent data access paths must route through Databricks SQL Warehouses via JDBC, mirroring the SAS Compute Server read-path mandate.

**Relationship to the GenAI strategy.** This is where Fabric IQ’s primary value emerges. The current GenAI architecture (Azure OpenAI + RAG via Databricks Vector Search) retrieves text chunks but lacks structured business context. Fabric IQ’s ontology gives AI agents a semantic map: an agent doesn’t search for “delinquent accounts,” it understands that a delinquent account is a specific state of a Member–Account–Payment relationship with a rule (payment overdue > 90 days) and can traverse related entities. This is materially better for accuracy and explainability than unstructured RAG alone.
### 4.5.3 Adoption Prerequisites

Fabric IQ adoption is conditioned on the following prerequisites, which align with Greenfield’s risk culture and the architecture’s governance principles:

| Prerequisite | Validation Criteria | Target Horizon |
|---|---|---|
| Fabric IQ reaches General Availability (GA) | Microsoft officially exits preview; SLA, support, and compliance documentation available for regulated financial services | Must be met before any production deployment |
| Unity Catalog enforcement validated for agent data access | Confirm that RLS, CLS, and DDM policies apply when Fabric IQ agents query data through the ontology and OneLake shortcuts. If not, mandate JDBC-only path | Horizon 2 POC validation gate |
| Purview-IQ integration confirmed | Glossary terms and classification labels from Purview propagate to ontology entities. No duplicate business vocabulary | Horizon 2 POC validation gate |
| Gold data products mature and contracted | Priority domain Gold data products (Customer 360, Financial Aggregates) have active data contracts, quality SLAs, and stable schemas before ontology modeling begins | Horizon 1 completion required |
| Responsible AI framework covers agentic decisions | Greenfield’s Responsible AI governance extends to Fabric IQ agents: human-in-the-loop for high-risk actions, audit logging of agent decisions, explainability of ontology-grounded reasoning | Must be defined before Operations Agent deployment |

### 4.5.4 Adoption Roadmap

**Horizon 2 — Evaluate & Pilot:** Build an ontology POC on a single business domain (recommended: Customer 360). Bootstrap from the existing Customer 360 Power BI semantic model. Model Member, Account, Product, Transaction, and Branch as ontology entities with relationships and rules. Validate governance prerequisites (Unity Catalog enforcement, Purview integration). Deploy a Data Agent for conversational Q&A over the Customer 360 ontology with a controlled user group.

**Horizon 3 — Scale & Operationalize:** Extend the ontology to additional domains (Risk, Claims, Financial Products). Deploy Operations Agents monitoring real-time streams through Fabric Eventhouse, grounded in the enterprise ontology. Integrate ontology-aware reasoning into the GenAI platform (replacing pure RAG with ontology + RAG for structured + unstructured retrieval). Establish ontology lifecycle governance: versioning, change management, and stewardship aligned with the federated governance operating model.

### 4.5.5 Critical Guardrail

Fabric IQ is a semantic intelligence layer, not a data platform. The introduction of Fabric IQ must not become a justification for migrating data engineering, ETL, or warehousing workloads into Fabric. The ontology consumes Gold data products from the Databricks lakehouse through governed access paths (OneLake shortcuts or JDBC). All prohibited anti-patterns defined in Section 4.3 remain in force regardless of Fabric IQ adoption. Microsoft’s positioning of Fabric IQ as a reason to consolidate data into OneLake must be evaluated critically against the architecture’s principle that the data platform is Databricks.

# 5. Data Governance Architecture

The governance layer is the backbone of the platform, ensuring trust, compliance, and operational excellence. It is aligned with the DAMA-DMBOK knowledge areas and the EDM Council’s DCAM maturity dimensions.

## 5.1 Governance Operating Model

Greenfield operates a federated governance model with a strong central function.
The CDO’s office centrally defines policies, standards, and architectures, while business units that have demonstrated maturity may execute data activities within these guardrails. The following responsibilities are always centralized:

- Data governance policies, standards, and classification schemas
- Enterprise data catalog and business glossary management
- Data quality rules definition, monitoring frameworks, and escalation procedures
- Security and privacy policies (RLS, CLS, DDM, consent management)
- Reference data and master data management (MDM)
- Platform architecture decisions and technology selection

Decentralized responsibilities (for mature business units) include data engineering pipeline development, domain-specific analytics and model building, data stewardship execution, and data product lifecycle management within their domain.

## 5.2 Metadata & Catalog Strategy – Three-Tier Model

The enterprise catalog strategy employs a three-tier architecture, each tier assigned a clear, non-overlapping responsibility. This model is the result of a deliberate rationalization of the governance technology stack to eliminate redundancy and minimize the number of systems of record for any given governance artifact.

**Tier 1 – Enterprise Governance Plane: Microsoft Purview.** Microsoft Purview is the single pane of glass for business users, data stewards, compliance officers, and executive stakeholders. It is the authoritative system of record for the business glossary, data classification schemas, sensitivity labels, governance policies, and stewardship workflows. Its selection as the enterprise governance plane is driven by native integration with the Azure ecosystem, Microsoft 365 DLP policies, Power BI/Fabric, and Azure Entra ID. Purview harvests metadata from both Unity Catalog and Manta to provide a consolidated enterprise view.

**Tier 2 – Technical Enforcement Plane: Databricks Unity Catalog.** Databricks Unity Catalog is the runtime enforcement engine for all data access within the lakehouse. It manages fine-grained access controls (RLS, CLS, DDM), provides column-level lineage within Databricks workloads, and enforces data contracts at query time. Classification labels and policies defined in Purview propagate down to Unity Catalog to ensure consistent enforcement. Unity Catalog is the technical truth for who can access what data, enforced at the compute layer.

**Tier 3 – Cross-Platform Lineage Engine: Manta (IBM).** Manta serves as the dedicated, cross-platform lineage engine. Its unique value lies in automated code-level lineage parsing across heterogeneous technologies—Databricks notebooks and Delta Live Tables pipelines, Azure Data Factory orchestrations, Fabric Dataflows, SQL stored procedures, and, critically, SAS Viya code (SAS programs, DATA steps, PROC SQL). Neither Purview nor Unity Catalog can produce this depth of lineage automatically for the SAS estate, which is essential for regulatory auditability of actuarial and risk models. Manta publishes its unified lineage graph upstream to Purview for enterprise visualization and to Unity Catalog for technical consumers.

**Rationalization: IBM Knowledge Catalog (IKC).** As part of this architecture, IBM Knowledge Catalog’s business glossary and data classification functions are retired in favor of Purview, which provides equivalent capabilities with superior integration into the Azure-native stack. IKC’s glossary content is migrated domain-by-domain to Purview during Horizon 1, aligned with the Medallion architecture rollout.
This rationalization eliminates the “two glossaries” problem, reduces Cloud Pak for Data licensing costs, and removes the operational overhead of maintaining a Kubernetes-based IBM platform alongside the Azure-native infrastructure. Manta is retained as a standalone lineage service, either within a minimal Cloud Pak footprint or as a SaaS deployment, depending on IBM’s licensing model at the time of implementation.

## 5.3 Data Quality Framework – Three-Tier Model

Data quality is managed through a three-tier control model that mirrors the Medallion architecture. Each tier has a distinct scope, technology, and enforcement philosophy. Quality scores from all three tiers are published to Microsoft Purview, creating a unified quality lineage: any defect can be traced to the layer where it was introduced or caught.

| Tier | Stage | Technology | Scope | Disposition |
|---|---|---|---|---|
| Tier 1 | Ingestion (Pre-Bronze) | Purview Data Quality (sampling-based) | Structural validation, anomaly detection, early warning. Lightweight, fast, non-blocking for non-critical deviations | Critical fail → quarantine + alert; Non-critical → log warning, promote to Bronze |
| Tier 2 | Bronze → Silver | DLT Expectations + Great Expectations | Business rule validation, referential integrity, CDE completeness, entity deduplication. Heavy quality gate where data becomes trustworthy | Fail expectations → quarantine sub-table; quality scores feed Silver table SLA metrics |
| Tier 3 | Silver → Gold | DLT Expectations + Data Contract SLA checks | Data product certification: freshness SLA compliance, aggregation accuracy, contract schema validation, business KPI consistency checks | Silver must meet SLA (≥99.5% CDE completeness) before Gold refresh proceeds |

**Tier 1 – Ingestion DQ (Purview Data Quality).** The ingestion-level assessment uses a sampling approach differentiated by data criticality (see Section 4.4.1a for detailed sampling rates). Purview DQ scans run against ADLS Gen2 staging zone assets, triggered by pipeline completion events. The assessment focuses on structural and statistical anomalies: schema conformance, completeness thresholds, volume baselines (detecting truncated or duplicate loads), format validation, and distribution drift (comparing the sample against a reference profile to detect upstream source changes). This tier is deliberately lightweight — it is an early warning system, not a comprehensive business rule engine.

**Tier 2 – Medallion Transition DQ (DLT Expectations + Great Expectations).** The Bronze-to-Silver transition is where the heavy quality enforcement occurs. Databricks Delta Live Tables expectations validate every row at transformation time: completeness on primary keys and CDEs, referential integrity against master data and reference data, business rule accuracy (e.g., transaction amounts within plausible ranges, dates in valid periods), and cross-source consistency (e.g., customer attributes matching across banking and insurance feeds). Great Expectations complements DLT for complex cross-table validations that span multiple Silver datasets. Rows failing expectations are routed to quarantine tables with diagnostic metadata enabling root-cause analysis.

**Tier 3 – Data Product Certification DQ (SLA Checks).** The Silver-to-Gold transition is gated by data product SLA compliance. Before a Gold data product is refreshed, the pipeline verifies that its upstream Silver sources meet their contracted quality thresholds (e.g., ≥99.5% completeness on CDEs, freshness within T+1 by 06:00 EST, aggregation accuracy within tolerance). If an upstream source fails its SLA, the Gold product refresh is deferred and the data product owner is notified. This prevents degraded data from reaching business users and Power BI dashboards.
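A minimal sketch of this Tier 3 gate, assuming a Databricks job context (ambient `spark`) and a hypothetical `governance.dq_metrics` table where Tier 1/2 scores are published; thresholds mirror the contract examples above, and the refresh and notification hooks are illustrative stand-ins.

```python
# Sketch: defer a Gold refresh unless all upstream Silver sources meet SLA.
from datetime import datetime, timedelta

def run_gold_refresh(product: str) -> None:
    print(f"refreshing: {product}")            # stand-in for the DLT/SQL refresh job

def notify_product_owner(product: str) -> None:
    print(f"refresh deferred, owner notified: {product}")  # stand-in alert hook

def silver_meets_sla(table: str) -> bool:
    """Check the latest published DQ metrics for one upstream Silver source."""
    row = spark.sql(f"""
        SELECT cde_completeness, last_refresh
        FROM governance.dq_metrics             -- hypothetical metrics table (Tiers 1-2)
        WHERE table_name = '{table}'
        ORDER BY measured_at DESC LIMIT 1
    """).first()
    if row is None:
        return False
    fresh = row["last_refresh"] >= datetime.utcnow() - timedelta(days=1)  # T+1 freshness
    return row["cde_completeness"] >= 0.995 and fresh                     # ≥99.5% on CDEs

product = "gold_customer_360.member_profile"
upstreams = ["silver_customer.individual", "silver_transactions.posted"]

if all(silver_meets_sla(t) for t in upstreams):
    run_gold_refresh(product)
else:
    notify_product_owner(product)
```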
**Unified DQ Observability.** Quality scores from all three tiers are published to Purview and aggregated on a centralized DQ governance dashboard. This dashboard tracks quality trends by domain, by source system, and by data product, enabling proactive identification of deteriorating sources before they impact downstream consumers. Critical Data Elements identified through the DCAM framework receive enhanced monitoring with automated alerting, configurable escalation paths (steward → data owner → CDO office), and monthly quality review cadences aligned with the governance operating model.

## 5.4 Master Data & Reference Data

The CDO’s office owns and operates the enterprise master data hub, which manages key domains such as Customer 360, product reference data, organizational hierarchy, and counterparty data. The MDM solution integrates with the platform through a publish-subscribe model: master data is published to the Gold layer as certified data products, and downstream consumers bind to these certified versions rather than source-system extracts. This ensures a consistent, authoritative view of critical entities across all business units.

# 6. Security & Privacy Architecture

Security is the non-negotiable foundation of the platform. As a regulated financial institution, Greenfield must enforce strict controls over data access while enabling the agility that modern analytics and AI require.

## 6.1 Identity & Access Management

All platform access is brokered through Azure Entra ID (formerly Azure Active Directory) with mandatory multi-factor authentication. Role-Based Access Control (RBAC) is implemented at the Azure resource level, while Attribute-Based Access Control (ABAC) is enforced through Unity Catalog for data-level permissions. Service principals and managed identities are used for automated workloads, eliminating the use of shared secrets or embedded credentials.

## 6.2 Fine-Grained Data Security

| Control | Description | Enforcement Point |
|---|---|---|
| Row-Level Security (RLS) | Restrict data rows visible to a user based on their business unit, region, or role attributes | Unity Catalog row filters, Power BI RLS, SAS via JDBC (enforced by Unity Catalog) |
| Column-Level Security (CLS) | Restrict access to sensitive columns (e.g., NAS, SIN, account numbers) based on user clearance | Unity Catalog column masks, Purview sensitivity labels propagated |
| Dynamic Data Masking (DDM) | Apply runtime masking functions (hash, partial reveal, null) so raw sensitive data is never exposed to unauthorized users | Unity Catalog column masks with masking functions |
| Data Classification | Automatically classify data by sensitivity level (Public, Internal, Confidential, Restricted) using scanning and ML classifiers | Microsoft Purview auto-classification, propagated to Unity Catalog tags |
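The Unity Catalog primitives behind the RLS and CLS/DDM rows can be illustrated with the following sketch, issued from Python for consistency with the other examples. The `governance` schema, function names, group names, and masking rule are illustrative; the row-filter and column-mask DDL is Unity Catalog’s documented syntax.

```python
# Sketch: Unity Catalog row filter (RLS) and column mask (CLS/DDM),
# applied to an illustrative Silver table. Assumes ambient `spark`.

# RLS: users see only rows for business units whose group they belong to.
spark.sql("""
CREATE OR REPLACE FUNCTION governance.bu_filter(business_unit STRING)
RETURNS BOOLEAN
RETURN is_account_group_member(concat('bu_', business_unit))
""")
spark.sql("""
ALTER TABLE silver_customer.individual
SET ROW FILTER governance.bu_filter ON (business_unit)
""")

# CLS/DDM: SIN is masked by default; only the pii_readers group sees raw values.
spark.sql("""
CREATE OR REPLACE FUNCTION governance.mask_sin(sin STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('pii_readers')
            THEN sin
            ELSE concat('***-***-', right(sin, 3)) END
""")
spark.sql("""
ALTER TABLE silver_customer.individual
ALTER COLUMN sin SET MASK governance.mask_sin
""")
```

Because these policies are enforced at query time in the compute layer, they apply equally to Databricks SQL, JDBC/ODBC consumers, and the SAS JDBC read path described in Section 4.4.5.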
## 6.3 Data Loss Prevention & Exfiltration Controls

Preventing unauthorized data exfiltration is critical. The architecture implements defense-in-depth controls: private endpoints for all platform services, network segmentation via Azure VNets and Private Link, egress restrictions on compute clusters, disabling of local file downloads from notebooks in production environments, and Microsoft Purview DLP policies integrated with Microsoft 365 to prevent sensitive data from leaving the organization through email, Teams, or SharePoint.

## 6.4 Privacy & Consent

The platform supports privacy requirements under Quebec’s Act Respecting the Protection of Personal Information in the Private Sector (Law 25) and federal PIPEDA. Capabilities include consent management integration, data subject access request (DSAR) automation, right-to-erasure workflows, and pseudonymization and anonymization services available as reusable functions in the curated layer.

# 7. AI & Machine Learning Architecture

The AI layer is designed to support the full spectrum of AI workloads: from traditional statistical models used by actuaries and risk teams, to modern deep learning and generative AI applications.

## 7.1 ML Lifecycle

The platform standardizes on Databricks as the primary ML engineering environment, using MLflow as the unified experiment tracking and model registry across the organization. The ML lifecycle follows four stages: experimentation (sandbox with governed data access), development (version-controlled notebooks and pipelines), staging (automated testing, bias detection, validation against champion models), and production (model serving with monitoring and automated drift detection).

## 7.2 Feature Store

Databricks Feature Store provides a centralized repository of curated, reusable features that are versioned and documented. Feature tables are registered in Unity Catalog, making them discoverable and governed. Online feature stores support low-latency inference, while offline feature stores feed batch training pipelines. This eliminates the pervasive problem of duplicated feature engineering across teams.

## 7.3 Generative AI & LLM Strategy

Generative AI workloads are served through a controlled architecture. Azure OpenAI Service provides access to large language models within Azure’s compliance boundary. Databricks Model Serving enables deployment of fine-tuned open-source models. All GenAI applications are subject to Greenfield’s Responsible AI framework, which mandates human-in-the-loop review for high-risk decisions, output guardrails, prompt injection protections, and comprehensive logging for audit purposes. Retrieval-Augmented Generation (RAG) patterns use Databricks Vector Search over enterprise knowledge bases.

Future evolution — ontology-grounded AI: As Fabric IQ matures toward GA (see Section 4.5), the GenAI architecture will evolve from pure RAG (unstructured vector retrieval) to ontology + RAG (structured semantic reasoning complemented by unstructured retrieval). Fabric IQ’s ontology gives AI agents a typed, relationship-aware understanding of business entities, enabling more accurate, explainable, and governed reasoning than vector search alone. Data Agents and Operations Agents grounded in the enterprise ontology become the primary interface for business users interacting with the GenAI platform, while Databricks Model Serving and Azure OpenAI remain the underlying compute and model infrastructure.
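A hedged sketch of the RAG retrieval step over Databricks Vector Search: the endpoint and index names, columns, and query are illustrative, and the downstream Azure OpenAI call with guardrails is elided.

```python
# Sketch: retrieve governed context chunks from a Databricks Vector Search
# index before prompting the LLM. Names are illustrative placeholders.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="greenfield-vs-endpoint",
    index_name="gold_knowledge.member_policies_idx",
)

hits = index.similarity_search(
    query_text="What riders apply to lapsed home insurance policies?",
    columns=["chunk_id", "source_doc", "chunk_text"],
    num_results=5,
)

# Rows come back in column order; chunk_text is the third requested column.
context = "\n\n".join(row[2] for row in hits["result"]["data_array"])
# `context` is then assembled into an Azure OpenAI prompt under the Responsible
# AI framework (human-in-the-loop, output guardrails, audit logging).
```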
## 7.4 SAS Viya Specialized AI

SAS Viya retains its strategic role for regulated analytics workloads where SAS’s built-in model governance (SAS Model Manager), regulatory-grade auditability, and actuarial modeling libraries are required. SAS Viya runs on the Compute Server engine (not CAS), meaning processing is sequential and batch-oriented rather than distributed in-memory. This makes it best suited for model development, validation, and moderate-scale batch scoring. Large-scale scoring workloads requiring parallelism should be migrated to Databricks (MLflow-served models or Spark-based scoring pipelines). SAS models can be containerized and deployed alongside Databricks-served models through a unified API gateway, ensuring consistent model monitoring regardless of the underlying technology.

# 8. Enterprise Data Products

The CDO’s office is responsible for producing and maintaining a set of strategic, transversal data products that serve the entire organization.

## 8.1 Core Data Products

| Data Product | Description | Primary Consumers |
|---|---|---|
| Customer 360 | Unified cross-entity view of members and clients across banking, insurance, wealth management, and credit; includes demographics, relationships, products held, interaction history, and risk profile | All business units, marketing, risk, compliance |
| Reference Data Hub | Authoritative repository for codes, classifications, and lookup tables (currencies, country codes, product taxonomies, regulatory codes) | All data engineering teams, reporting |
| Financial Aggregates | Pre-computed financial KPIs, balances, and regulatory metrics aligned with IFRS and AMF reporting requirements | Finance, risk, regulatory reporting |
| Risk Feature Store | Curated risk features (credit scores, fraud indicators, market exposure metrics) available for real-time and batch consumption | Risk management, fraud detection, credit underwriting |
| Enterprise Event Backbone | Standardized event streams (transactions, interactions, state changes) for real-time analytics and event-driven architectures | Real-time analytics, operational systems, fraud monitoring |

## 8.2 Data Product Lifecycle

Each data product follows a defined lifecycle: proposal and business case, design and data contract definition, build and quality certification, publish to catalog, operate and monitor SLAs, and eventually retire. Data contracts specify schema, freshness SLA, quality thresholds, access policies, and lineage. The data product owner (a business-side role) and the data product engineer (a technical role) share accountability for the product’s health.
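As an illustration only, a data contract can be carried as a typed artifact that pipelines and review gates validate against. The fields mirror the contract elements listed above; all values are hypothetical, not real Greenfield contracts.

```python
# Sketch: a data contract as a typed, machine-checkable artifact.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    product: str
    schema: dict[str, str]         # column -> declared type/nullability
    freshness_sla: str             # e.g., "T+1 by 06:00 EST"
    cde_completeness_min: float    # quality threshold on CDEs
    read_roles: list[str]          # access policy (groups/roles)
    upstream: list[str]            # lineage to Silver sources
    owner: str                     # business owner + engineering contact

customer_360 = DataContract(
    product="gold_customer_360.member_profile",
    schema={"member_id": "STRING NOT NULL", "segment": "STRING", "ltv": "DECIMAL(18,2)"},
    freshness_sla="T+1 by 06:00 EST",
    cde_completeness_min=0.995,
    read_roles=["analysts_all_bu", "marketing", "risk"],
    upstream=["silver_customer.individual", "silver_transactions.posted"],
    owner="member-experience (business) / data-eng-customer (engineering)",
)
```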
# 9. Operating Model & Organization

The platform’s operating model reflects the CDO’s federated-with-centralized-governance philosophy.

## 9.1 Practice Areas

| Practice | Mandate | Operating Mode |
|---|---|---|
| Governance & Data Quality | Policies, standards, quality frameworks, stewardship coordination, DCAM maturity tracking | Centralized (always) |
| Architecture & Data Engineering | Platform design, technology standards, ingestion patterns, pipeline frameworks, DevOps | Centralized platform; federated pipeline dev for mature BUs |
| Analytics & Valorization | BI strategy, self-service analytics, data science CoE, AI/ML model development | Hybrid: central CoE + embedded data scientists |
| MDM & Integration | Master data management, Customer 360, reference data, data integration services | Centralized (always) |
| Security & Data Privacy | Access control policies, DLP, privacy compliance, consent management, incident response | Centralized (always) |
| Data Literacy & Transformation | Training programs, change management, data culture, community of practice, adoption metrics | Centralized design; federated delivery |

## 9.2 Interaction with IT Sub-Divisions

The CDO’s organization operates as a horizontal sub-division within the 9,000-person IT division. It interfaces with vertical sub-divisions (business-aligned IT teams) through data product APIs and shared data contracts, and with horizontal sub-divisions (infrastructure, network, middleware) through platform service requests and infrastructure-as-code pipelines. Clear RACI matrices define handoffs for provisioning, networking, security, and operations.

# 10. Implementation Roadmap

The platform will be delivered in three strategic horizons, each building on the previous one and delivering incremental business value.

**Horizon 1 – Foundation (0–12 months)**

- Establish ADLS Gen2 + Delta Lake as the unified storage layer
- Deploy Databricks Unity Catalog as the technical governance plane
- Integrate Microsoft Purview with Unity Catalog and Manta for the three-tier catalog model
- Migrate the IBM IKC business glossary to Purview (domain-by-domain, starting with priority domains)
- Deploy the Manta lineage engine connected to Databricks, ADF, and SAS Viya code repositories
- Implement the Bronze/Silver/Gold Medallion architecture for priority domains
- Migrate high-priority Power BI workloads to Fabric with Direct Lake mode
- Deploy foundational security controls: RLS, CLS, DDM across all platforms
- Deploy Purview Data Quality for pre-Bronze ingestion assessment on priority domains; define sampling strategies and CDE thresholds
- Establish DataOps and CI/CD pipelines for data engineering

**Horizon 2 – Scale & Intelligence (12–24 months)**

- Extend the Medallion architecture to all business domains
- Launch the enterprise Feature Store and MLOps pipeline with MLflow
- Deploy the Generative AI platform (Azure OpenAI + Databricks Model Serving + RAG)
- Operationalize data product lifecycle management with formal data contracts
- Extend the three-tier DQ model to all domains; deploy the unified DQ governance dashboard with trend analysis, automated escalation, and monthly quality reviews
- Expand self-service analytics with governed sandboxes for all business units
- Achieve DCAM Level 3 maturity across core capabilities
- Complete IKC decommissioning; achieve full Manta lineage coverage across all data domains
- Evaluate Fabric IQ: build an ontology POC on the Customer 360 domain; validate Unity Catalog enforcement, Purview integration, and Responsible AI prerequisites (see Section 4.5)

**Horizon 3 – Optimize & Innovate (24–36 months)**

- Achieve full real-time data mesh with event-driven architecture
- Deploy AI-powered data governance (automated classification, anomaly detection, auto-healing pipelines)
- Implement advanced cost optimization through workload-aware auto-scaling and FinOps practices
- Enable federated data sharing with external partners through secure clean rooms
- Achieve DCAM Level 4+ maturity with continuous improvement loops
- If Fabric IQ prerequisites are validated: extend the enterprise ontology to the Risk, Claims, and Financial Products domains; deploy Operations Agents on real-time streams; evolve GenAI from pure RAG to ontology + RAG
- Evaluate emerging technologies (Apache Iceberg interop, Delta UniForm, Fabric Real-Time Intelligence)
# 11. Key Architecture Decisions & Rationale

The following table summarizes the most significant architecture decisions embedded in this reference architecture. These decisions should be formally ratified by the Data Architecture Review Board.

| # | Decision | Rationale | Trade-offs |
|---|---|---|---|
| AD-01 | Delta Lake as canonical storage format | ACID transactions, time travel, schema evolution; native to Databricks and readable by Fabric/SAS | Vendor proximity to Databricks; mitigated by Delta UniForm for Iceberg compatibility |
| AD-02 | Databricks as primary data engineering and ML platform | Superior performance for large-scale ETL, best-in-class ML tooling (MLflow, Model Serving), strong Unity Catalog governance | Higher compute costs for small workloads; addressed by routing BI workloads to Fabric |
| AD-03 | Microsoft Fabric scoped strictly as BI serving layer (Direct Lake for Power BI); not a parallel data platform | Direct Lake delivers import-like performance at DirectQuery freshness for 55,000 users; eliminates costly Databricks DBU consumption for dashboard serving; Fabric ETL/Spark/Warehouse explicitly prohibited to prevent duplication | Dependency on OneLake shortcut stability; requires architecture review gates to enforce anti-patterns |
| AD-04 | Retain SAS Viya (Compute Server only, no CAS) for regulated analytics; mandate JDBC read path for sensitive data | Compute Server simplifies infrastructure (no CAS cluster), reduces cost, and aligns with batch-oriented actuarial workloads. JDBC reads enforce Unity Catalog RLS/CLS/DDM. Large-scale scoring migrated to Databricks for parallelism | Sequential processing limits throughput for large scoring jobs; JDBC read latency higher than ADLS direct; mitigated by scoping SAS to model dev/validation and moderate batch scoring |
| AD-05 | Three-tier catalog: Purview (enterprise) + Unity Catalog (enforcement) + Manta (lineage) | Each tier has a non-overlapping mandate; Manta uniquely provides automated code-level lineage across SAS, Databricks, and Fabric | Three systems to integrate; mitigated by clear tier boundaries and unidirectional metadata flows |
| AD-06 | Retire IBM IKC glossary and classification; retain Manta only | Eliminates the dual-glossary problem, reduces licensing cost, removes Cloud Pak operational overhead; Purview provides equivalent glossary/classification with native Azure integration | Migration effort for existing IKC glossary; Manta licensing model to be negotiated separately from Cloud Pak |
| AD-07 | ADLS Gen2 as shared storage substrate across all platforms | Single physical storage layer avoids data duplication, reduces cost, and simplifies governance | Cross-platform locking and concurrency require careful engineering |
| AD-08 | Three-tier DQ model with Purview sampling-based assessment at ingestion (pre-Bronze), DLT Expectations at transitions, and SLA checks at Gold | Catches source defects before Bronze propagation; Purview DQ aligns with catalog rationalization (no IKC DQ dependency); sampling avoids full-scan cost; unified quality scores in Purview | Purview DQ less mature than IKC for complex rules; mitigated by scoping Tier 1 to structural/statistical checks only, with heavy rules at Tier 2 (DLT) |
| AD-09 | Fabric IQ positioned as future semantic intelligence layer (Horizon 2–3); adoption gated by GA, Unity Catalog enforcement, Purview integration, and Responsible AI prerequisites | Ontology grounds AI agents in structured business meaning; bootstraps from existing Power BI semantic models (protects investment); addresses RAG accuracy gap; no rework of current architecture | Preview maturity risk; potential vocabulary overlap with Purview glossary; Microsoft positioning may pressure toward Fabric data consolidation (explicit guardrail in Sections 4.3/4.5) |

# 12. Next Steps

To advance this reference architecture from vision to execution, the following immediate actions are recommended:

1. Architecture Review Board Ratification: Present this document to the Data Architecture Review Board and IT Architecture Council for formal endorsement of the key architecture decisions (Section 11).
2. Detailed Design Phase: Commission detailed design documents for each layer, starting with the storage and governance layers (Horizon 1 priorities).
3. DCAM Baseline Assessment: Conduct a formal DCAM capability assessment to establish the current maturity baseline and calibrate the roadmap.
4. Proof of Concept: Execute a POC with a priority business domain (e.g., credit risk or claims) that exercises the full stack from ingestion through AI/ML serving.
5. Financial Planning: Develop a 3-year TCO model covering licensing, compute, storage, professional services, and internal FTE requirements for each horizon.

Document prepared by the Office of the Chief Data Officer, Data & AI Solutions, Greenfield.