Architectural Considerations for OpenShift On-Prem vs. Microsoft Fabric



TL;DR

OpenShift on-prem means building and operating a sovereign, composable data platform yourself: storage, engines, orchestration, and governance are all your responsibility, in exchange for full control over hardware, latency, and data locality. Microsoft Fabric means buying an integrated SaaS stack that trades that control for speed to market, with hard limits around air-gapped, sub-millisecond, and arbitrary-container workloads. The most sophisticated enterprises end up hybrid: OpenShift at the edge, Fabric for corporate-facing analytics.

Introduction

For the modern Data Architect, the choice between building an on-premises data platform on Red Hat OpenShift or adopting a SaaS ecosystem like Microsoft Fabric is not merely a selection of tools; it is a selection of philosophy. It represents a fundamental decision between Platform Engineering (owning the stack) and Analytics Engineering (owning the logic).

While both platforms ultimately serve the same business goal—transforming raw data into Business Intelligence (BI) and AI insights—the operational realities, required skill sets, and total cost of ownership (TCO) models are diametrically opposed. Furthermore, while there is functional overlap—both can run Spark, manage pipelines, and handle IoT streams—there are “hard limits” regarding what a SaaS platform can physically do compared to an edge-capable container platform.

This article breaks down the decision framework, the hidden requirements of each, and the strategic implications for your enterprise.

Part 1: The OpenShift Approach (The “Sovereign Cloud”)

Choosing OpenShift is a decision to build a Private Data Cloud. You are not just a consumer of software; you are a provider of infrastructure.

The Philosophy: “Composable and Controlled”

OpenShift treats the data platform as a collection of microservices. You bring the compute to the data, which is often necessary when the data has "high gravity": it is too large, too sensitive, or too latency-critical to leave the building (e.g., factory IoT, healthcare imaging, high-frequency trading).

The Architectural “Bill of Materials”

When you buy Microsoft Fabric, the platform is ready. When you install OpenShift, you have a kernel. To replicate the functionality of a modern data platform, the architect must explicitly design and deploy the following components:

  1. The Storage Layer (The Foundation)

    • OpenShift does not store data; it manages compute. You must integrate a storage solution.
    • Requirement: You need OpenShift Data Foundation (ODF), MinIO, or Ceph to provide S3-compatible object storage (your Data Lake). You also need high-performance block storage (via CSI drivers) for databases like Postgres and for low-latency logs.
    • Architect’s Note: You are responsible for the replication, backup, and disaster recovery strategies of this storage.
  2. The Compute Engines (The Operators)

    • You do not simply “run SQL.” You deploy engines via Kubernetes Operators.
    • Requirement: You will deploy Strimzi to run Kafka for streaming. You will deploy Spark clusters (likely via the Radanalytics operator or simple pods) for processing. You might deploy Trino or Presto for federated querying.
    • Architect’s Note: You must manage the version compatibility between these tools. Does Spark 3.4 work with your version of the Kafka connector? That is now your problem to solve.
  3. The Control Plane (Orchestration & Gateway)

    • How do you trigger jobs? How do users access the data?
    • Requirement: You need Apache Airflow (or OpenShift Pipelines/Tekton) to orchestrate the ETL.
    • Requirement: You need an API Gateway (like Red Hat 3scale, Kong, or Istio) to expose your data products safely to the corporate network.
  4. The Missing Link: Governance

    • OpenShift has no native concept of a “Data Catalog.”
    • Requirement: You must deploy and maintain a tool like DataHub, Amundsen, or Atlas to track lineage and schemas.
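
The version-compatibility burden noted above can be made concrete. A minimal sketch of a pre-deployment check an operator team might run before rolling out a stack; the matrix entries are illustrative placeholders, not an authoritative support statement:

```python
# Illustrative compatibility matrix: which component versions have been
# validated together. The entries are invented for this example.
COMPAT = {
    ("spark", "3.4"): {("kafka-connector", "3.4"), ("kafka-connector", "3.3")},
    ("spark", "3.5"): {("kafka-connector", "3.5")},
}

def check_stack(stack: dict[str, str]) -> list[str]:
    """Return human-readable conflicts for a proposed component stack."""
    problems = []
    for (tool, version), allowed in COMPAT.items():
        if stack.get(tool) != version:
            continue  # this matrix row does not apply to the proposed stack
        for other_tool in {t for t, _ in allowed}:
            if other_tool in stack and (other_tool, stack[other_tool]) not in allowed:
                problems.append(
                    f"{tool} {version} is not validated with "
                    f"{other_tool} {stack[other_tool]}"
                )
    return problems
```

On Fabric, this matrix is Microsoft's problem; on OpenShift, something like `check_stack({"spark": "3.4", "kafka-connector": "3.2"})` flagging a conflict is the platform team's daily reality.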

Part 2: The Microsoft Fabric Approach (The “Unified SaaS”)

Choosing Fabric is a decision to embrace integration over isolation. It is an opinionated stack that forces you to work the “Microsoft Way,” but rewards you with immense speed to market.

The Philosophy: “The OneLake Paradigm”

Fabric fundamentally changes the architecture by abstracting storage entirely. OneLake acts as the “OneDrive for Data.” Whether you are doing Data Science (Spark), Warehousing (SQL), or Real-time Analytics (KQL), you are operating on the same copy of data in the Delta-Parquet format.
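
As a sketch, the "one copy" idea shows up in how every engine addresses the same table through a single OneLake URI. The path shape below follows Microsoft's documented OneLake addressing convention; the workspace and lakehouse names are hypothetical:

```python
def onelake_table_uri(workspace: str, lakehouse: str, table: str) -> str:
    """Build the ABFS-style URI under which Spark, SQL, and KQL all see
    the same Delta-Parquet files: one copy of the data, many engines."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )
```

Here `onelake_table_uri("Sales", "Retail", "orders")` yields `abfss://Sales@onelake.dfs.fabric.microsoft.com/Retail.Lakehouse/Tables/orders`; there is no storage class, volume mount, or replication policy for the architect to design.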

The Architectural Reality: What is actually included?

In Fabric, the “Bill of Materials” is largely virtual, but the architectural challenges shift from installation to configuration and optimization.

  1. Storage & Compute (Separated):

    • Storage is cheap (Azure Data Lake Storage Gen2). Compute is purchased in “Capacity Units” (F-SKUs).
    • The Integration: You do not need to mount volumes or configure storage classes. It just works.
  2. The “Hidden” Requirements for Fabric:

    • On-Premises Data Gateways: If your ERP or manufacturing systems are on-prem, Fabric cannot reach them by magic. You must architect a secure Gateway layer to tunnel data into the cloud.
    • Identity Architecture (Entra ID): Security in Fabric is granular (Row-Level Security). This requires a pristine Active Directory setup. If your AD groups are messy, your data security will be messy.
    • FinOps Governance: In OpenShift, a bad query slows down the server. In Fabric, a bad query costs actual money, or burns through your capacity and throttles everyone else sharing it. You need strict monitoring policies.
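
The FinOps point bears a small illustration. A toy guardrail that a platform team might run against capacity-utilization samples; the thresholds are invented for the example, though the sizing convention (an F64 SKU providing 64 Capacity Units) matches Fabric's F-SKU naming:

```python
from dataclasses import dataclass

@dataclass
class CapacityMonitor:
    """Toy FinOps guardrail: flag workloads before they throttle a
    shared Fabric capacity. The alert threshold is illustrative."""
    capacity_units: float          # e.g. 64.0 for an F64 SKU
    alert_ratio: float = 0.8       # warn at 80% sustained utilization

    def evaluate(self, cu_usage_samples: list[float]) -> str:
        avg = sum(cu_usage_samples) / len(cu_usage_samples)
        if avg >= self.capacity_units:
            return "THROTTLING: capacity exhausted, all workloads slowed"
        if avg >= self.alert_ratio * self.capacity_units:
            return "WARN: sustained usage above alert threshold"
        return "OK"
```

The design point is that the failure mode is shared: one runaway notebook degrades every dashboard on the same capacity, so the alert has to fire well before exhaustion.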

Part 3: The Capability Gap: What OpenShift Can Do That Fabric Cannot

It is true that Fabric supports IoT analysis, Notebooks, and Pipelines. However, a common misconception is that feature parity exists between a SaaS Data Platform and a Container Orchestrator.

There is a hard technical line where Fabric stops and OpenShift begins. This line is usually defined by Physicality, Latency, and Runtime Flexibility.

1. The “Air-Gapped” Requirement (The Disconnected Stack)

Fabric is a SaaS product: it lives in an Azure region and requires connectivity. OpenShift, by contrast, supports fully disconnected ("air-gapped") installation and operation, which is a hard requirement for defense sites, ships at sea, and isolated factory networks.

2. Sub-Millisecond “Closed Loop” Control

Fabric is excellent for analyzing IoT data (e.g., "The machine vibrated abnormally 5 minutes ago"). It is poor at acting on it in real time: closing the loop, meaning detecting the anomaly and actuating a response within the machine's cycle time, requires compute that runs next to the machine.
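
The gap is easy to demonstrate. A control loop that observes, decides, and actuates in-process on an edge cluster completes in microseconds; any path through a cloud region adds network round trips that alone exceed a sub-millisecond budget. A minimal sketch with an invented vibration threshold:

```python
import time

def control_step(vibration_mm_s: float, limit: float = 8.0) -> bool:
    """Trivial decision: trip the actuator if vibration exceeds the limit.
    The 8.0 mm/s limit is a made-up example value."""
    return vibration_mm_s > limit

def loop_latency_seconds(samples: list[float]) -> float:
    """Measure worst-case observe->decide latency for a local loop."""
    worst = 0.0
    for reading in samples:
        start = time.perf_counter()
        control_step(reading)          # decision happens in-process
        worst = max(worst, time.perf_counter() - start)
    return worst
```

Run locally, the worst-case step stays far below a millisecond; route the same decision through a cloud API and the network round trip dominates before any compute happens.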

3. Arbitrary Containers & Legacy Code

Fabric runs specific, curated runtimes: Spark, SQL, KQL, and managed Python environments. OpenShift, as a container orchestrator, runs any OCI image, whether that is a fifteen-year-old Java service, a proprietary C++ solver, or a vendor appliance that will never fit a curated SaaS runtime.

4. Granular Hardware Control

Fabric abstracts the hardware. You buy “Capacity,” not specifications.
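
On OpenShift, by contrast, the architect pins workloads to hardware explicitly. A sketch that builds a Kubernetes Pod manifest with exact CPU, memory, and GPU requirements; the image name and node label are hypothetical, while `nvidia.com/gpu` is the standard extended-resource name for NVIDIA GPUs:

```python
def pinned_pod_spec(name: str, image: str, cpus: str, memory: str,
                    gpus: int, node_label: dict[str, str]) -> dict:
    """Build a Pod manifest that requests exact hardware, a level of
    control Fabric's capacity abstraction does not expose."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": node_label,   # e.g. pin to labeled GPU nodes
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    "requests": {"cpu": cpus, "memory": memory},
                    "limits": {"cpu": cpus, "memory": memory,
                               "nvidia.com/gpu": str(gpus)},
                },
            }],
        },
    }
```

Whether this control is a burden or a feature is exactly the philosophical split this article describes: in Fabric you could not write this spec even if you wanted to.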

Part 4: The Decision Matrix for Architects

When standing at this crossroads, the Data Architect must weigh four critical dimensions:

1. The Talent Dimension

Do you employ Platform Engineers who can operate Kubernetes, storage, and a fleet of operators, or Analytics Engineers whose time is better spent on business logic? OpenShift demands the former; Fabric lets you staff for the latter.

2. The Data Gravity & Latency Dimension

If the data is too large, too sensitive, or too latency-critical to leave the building, compute must come to the data, and that favors OpenShift at the edge.

3. The Cost Model (CapEx vs. OpEx)

OpenShift concentrates cost up front in hardware, subscriptions, and the staff to run them (CapEx); Fabric converts the platform into a metered capacity bill (OpEx) that your FinOps practice must keep under control.

4. Integration vs. Customization

Fabric rewards working the "Microsoft Way" with immense speed to market and native integration with the corporate ecosystem; OpenShift rewards ownership with the freedom to run anything, at the price of assembling and operating it yourself.

Conclusion: The Hybrid Reality

Rarely is this a binary choice. The most sophisticated enterprises often adopt a Hybrid Architecture:

They use OpenShift at the Edge/On-Prem to handle the “heavy lifting,” closed-loop control, and sensitive aggregation. They then push the high-value, aggregated “Gold” data to Microsoft Fabric for user-facing analytics, dashboards, and integration with the corporate ecosystem.