
Franco Arda, Ph.D.

Data Engineer
Microsoft Fabric

Experience:       Deutsche Bank · Daimler · Siemens · DHL · Swisscom
Languages:       Swiss German · German · English
Education:         Ph.D. in Data Science · MBA

 

Tools:                 

Microsoft Fabric: Python · PySpark · SQL · T-SQL · KQL · Data Pipelines · Dataflow Gen2 · Medallion Architecture · Lakehouse · Warehouse · Real-Time Analytics · Semantic Models · Power BI · DAX · Deployment Pipelines · CI/CD · Data Science · Data Agents · Swiss–US Data Privacy

Python vs. PySpark Notebooks in Microsoft Fabric


A simple rule of thumb: for small to medium datasets (under ~100M rows), plain Python is typically faster and consumes fewer CUs. PySpark is the better choice once you're dealing with large datasets — think 100M+ rows or 10 GB+.
 

One thing worth keeping in mind: processing costs don't just apply to notebooks. Data Pipelines also consume CUs, and they can rack up costs quickly if you're not careful.
 

In a recent project, I needed to reduce a dataset down to 310 million rows. Given the scale, I ran the entire pipeline in PySpark — and it was the right call.

Python vs. PySpark

Power BI in the Browser with Microsoft Fabric


I grew up on desktop tools — Power BI Desktop, Tableau, RStudio. But with Microsoft Fabric, I do everything in the browser, and honestly? I love it.
 

Notebooks, semantic models, Power BI reports — all in one place. Load the data, transform it, set up a pipeline, move it through Bronze → Silver → Gold, build the semantic model, and publish the report. No context switching. No desktop installs.


Sure, the browser isn't 100% there yet — advanced modelling and Row-Level Security still need the desktop. But we're at 80%, and that 80% covers most of the work.
 

One environment. One flow. That's the win.

Power BI in Fabric

From API to Report: Building an End-to-End Data Pipeline in Microsoft Fabric

A walkthrough of a fully automated pipeline that pulls data from a REST API, transforms it through a Lakehouse and Warehouse, and surfaces insights in a Power BI report — all orchestrated inside Fabric.

flow.png

Why Fabric?

Microsoft Fabric brings together data engineering, warehousing, and BI into a single unified platform. Instead of stitching together multiple services and credentials, everything lives in one workspace — from the raw API call to the finished Power BI report.
 

Step 1 — Fetch data from the API (notebook)

The pipeline kicks off by setting a timestamp variable, then firing a Fabric notebook that calls the REST API. The raw JSON responses land in the Files section of the API Lakehouse, preserving the original payload for full auditability.
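A minimal sketch of what that notebook cell can look like, assuming a default Lakehouse is attached to the notebook. The endpoint, auth (omitted here), and folder names are placeholders, not the actual project values:

import json
import os
import requests
from datetime import datetime, timezone

# Placeholder endpoint - the real API and its auth headers are project-specific
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()

# Timestamp the raw payload and land it in the Lakehouse Files section for auditability
run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
os.makedirs("/lakehouse/default/Files/raw", exist_ok=True)
raw_path = f"/lakehouse/default/Files/raw/orders_{run_ts}.json"
with open(raw_path, "w") as f:
    json.dump(response.json(), f)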
 

Step 2 — Parse JSON into a Lakehouse table (notebook)

A second notebook reads those JSON files and flattens them into a structured Delta table inside the same Lakehouse. This is where field mapping, type casting, and any light cleansing happen — keeping the transformation logic version-controlled and reproducible.
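A rough sketch of that flattening step. The field names and target table are illustrative; the real mapping depends on the API's payload:

import pyspark.sql.functions as F

# Read the raw JSON landed in Step 1 (relative "Files/..." paths resolve against the default Lakehouse)
raw_df = spark.read.json("Files/raw/orders_*.json")

# Flatten nested fields, cast types, and keep only what the table needs
orders_df = raw_df.select(
    F.col("order_id").cast("string"),
    F.col("customer.id").alias("customer_id"),
    F.col("amount").cast("double"),
    F.to_timestamp("created_at").alias("created_at"),
)

# Persist as a Delta table in the same Lakehouse
orders_df.write.mode("overwrite").format("delta").saveAsTable("orders_parsed")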
 

Step 3 — Copy data from Lakehouse to Warehouse (copy activity)

A Copy data activity moves the cleansed rows from the Lakehouse Delta table into a staging table in the API Warehouse. The separation between Lakehouse and Warehouse keeps raw storage costs low while giving the downstream SQL layer proper query performance and access control.
 

Step 4 — Load the reporting table via stored procedure

A stored procedure merges the staging data into the final reporting table and updates the accompanying reporting view. Encapsulating this logic in a stored procedure means the pipeline step stays thin — and the merge logic can be tested and iterated independently.
 

Step 5 — Refresh the semantic model and serve the report

The final pipeline step triggers a semantic model refresh in Power BI. Once the model is updated, the report surfaces the latest data — completing the journey from raw API payload to interactive dashboard, fully automated and repeatable on any schedule.
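In the pipeline itself, the built-in semantic model refresh activity handles this step without code. For reference, the same refresh can also be triggered from a notebook via the Power BI REST API; a hedged sketch, assuming you already have an AAD access token and the workspace/dataset GUIDs:

import requests

# Assumed values - GUIDs and token acquisition are environment-specific
workspace_id = "<workspace-guid>"
dataset_id = "<semantic-model-guid>"
token = "<aad-access-token>"  # e.g. obtained via a service principal

# Power BI REST API: queue an asynchronous refresh of the semantic model
url = f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes"
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()  # HTTP 202 means the refresh request was accepted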

Data Pipeline "ForEach" = Python's for loop


Microsoft Fabric's ForEach activity works just like a Python for loop — iterate over a collection, run some logic for each item, repeat.
 

Processing files? Feed it ["file1.csv", "file2.csv"], wrap a Copy activity inside, and use @item() to reference each file dynamically. Same mental model as looping through a list in Python.
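To make the analogy concrete, here's the same pattern as a plain Python loop (the file names are just illustrative):

# The ForEach / @item() pattern, expressed as a plain Python for loop
files = ["file1.csv", "file2.csv"]

for item in files:
    # `item` plays the same role here that @item() plays inside the ForEach activity
    print(f"Copying {item} ...")  # stand-in for the inner Copy activity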

Microsoft_Fabric_Data_Pipeline_ForEach.png

Combining Medallion Architecture with CI/CD in Microsoft Fabric
Everyone agrees Medallion architecture delivers competitive advantages — faster insights, better data trust, greater scalability. But knowing why it works is only half the battle. The part that's rarely talked about is how to actually implement it.

My take: pair it with CI/CD. Map the Bronze, Silver, and Gold layers across your development, test, and production pipeline, and suddenly Medallion stops being a theoretical framework and starts being something you can actually ship.

Medallion_Architecture.png

Reading PDFs: Azure AI Document Intelligence vs. LLMs
 

Traditional PDF extraction tools may feel outdated compared to today's powerful LLMs — but the distinction matters. Traditional OCR-based tools like Azure AI Document Intelligence extract rather than infer, which is their key advantage: they return field-level confidence scores that make outputs fully auditable.
 

LLMs are harder to audit. More critically, they carry hallucination risk — under uncertainty, an LLM may confabulate a plausible-looking number to fill the gap rather than return a low confidence score or flag the ambiguity. In a high-stakes context like financial reporting, that silent failure is arguably worse than a flagged extraction failure.
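For illustration, here's a minimal sketch of the confidence-score workflow using the azure-ai-formrecognizer SDK and the prebuilt invoice model; the endpoint, key, file name, and review threshold are placeholders:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key from your Document Intelligence resource
client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Analyze a local PDF with the prebuilt invoice model
with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", f)
result = poller.result()

# Every extracted field carries a confidence score you can threshold and audit
total = result.documents[0].fields.get("InvoiceTotal")
if total is not None:
    print(total.value, total.confidence)  # e.g. route anything below 0.90 to human review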

Azure_AI_Document_Intelligence.png
omg2.png

Example Queries for Microsoft Fabric Data Agents

When you set up a Data Agent in Microsoft Fabric, you can give it a set of example queries — sometimes called "few-shot examples" — to help it understand how to respond to questions.
 

Think of them as cheat sheet entries: you provide a sample question alongside the query logic that should answer it. The agent learns from these patterns and uses them to handle similar questions on its own.
 

In the image below, you can see this in action:

  • (1) is where you define your example queries as a creator

  • (2) shows the agent's answer when that example is applied

  • (3) is where it gets interesting — here I simply asked the agent to visualize the data, and it knew exactly what to do, drawing on the patterns it had learned


That last part is the magic of few-shot examples: once the agent has good patterns to work from, it can go beyond the examples themselves and handle new requests with confidence.

FICTITIOUS DATA - NOT REAL NAMES

udf_spark.png

UDF (User Defined Functions) in PySpark
 

Standard DataFrame operations cover most needs, but UDFs fill the gaps when you need custom business logic that can't be expressed with built-in Spark functions — things like complex string parsing, conditional transformations involving multiple rules, or calling external libraries (e.g., dateutil, re, custom validators).
 

Simple Example: Classifying a Customer's Spend Tier

Imagine you have a sales DataFrame and want to tag each customer with a business-defined tier label based on their total spend — logic that's too specific for a simple when/otherwise chain.
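Here's a minimal sketch of that UDF; the sample rows, column names, and tier thresholds are invented for illustration:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Sample data - customer IDs and spend figures are invented
sales_df = spark.createDataFrame(
    [("C001", 250_000.0), ("C002", 42_000.0), ("C003", 1_500.0)],
    ["customer_id", "total_spend"],
)

# Hypothetical business rules - thresholds are illustrative only
def spend_tier(total_spend):
    if total_spend is None:
        return "Unknown"
    if total_spend >= 100_000:
        return "Platinum"
    if total_spend >= 10_000:
        return "Gold"
    return "Standard"

spend_tier_udf = F.udf(spend_tier, StringType())

# Tag every customer with their tier
tiered_df = sales_df.withColumn("tier", spend_tier_udf(F.col("total_spend")))

One caveat worth keeping in mind: Python UDFs run row by row outside Spark's optimizer, so built-in functions should stay the first choice whenever they can express the logic.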

Defining the schema explicitly
The rule of thumb: schema inference is fine for exploration, but explicit schemas are essential for production pipelines. In Fabric, where we build data pipelines that run repeatedly on large datasets, defining the schema upfront is a habit worth building early.
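A minimal sketch of what an explicit schema looks like in practice; the field names and path are illustrative:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Explicit schema: no inference pass over the files, and column types are guaranteed
sales_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])

# Illustrative path - the same pattern applies to JSON or any other source
df = spark.read.schema(sales_schema).csv("Files/bronze/sales/", header=True)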

schmea.png

Partitioning in PySpark
Partitioning is one of those topics that becomes very important as our data grows. Spark processes data in parallel chunks called partitions. Partitioning strategy directly affects:

 

  • Query speed — Spark can skip irrelevant partitions entirely (called partition pruning)

  • Shuffle cost — bad partitioning causes expensive data movement across nodes

  • Memory pressure — too few partitions = large chunks per node, too many = overhead
     

Partition by columns you frequently filter on. A good partition column has low cardinality (not too many unique values) — region with 4 values is great, customer_id with millions of values would create millions of tiny files, which is harmful.
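A small sketch of the pattern; the table, column names, and sample rows are illustrative:

# Sample data - rows and columns are invented for illustration
orders_df = spark.createDataFrame(
    [("O1", "EMEA", 120.0), ("O2", "AMER", 80.0), ("O3", "EMEA", 45.0)],
    ["order_id", "region", "amount"],
)

# Partition the Delta table by a low-cardinality column we filter on often
(
    orders_df.write
    .mode("overwrite")
    .partitionBy("region")
    .format("delta")
    .saveAsTable("orders_by_region")
)

# Filters on the partition column let Spark prune whole partitions instead of scanning everything
emea_df = spark.table("orders_by_region").filter("region = 'EMEA'")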

partition.png

Sometimes we just need Python and Pandas
PySpark is amazing for large datasets (e.g., 10 GB+). For smaller datasets, however, plain Python is faster and cheaper, and pandas' visualization capabilities are superior to PySpark's.
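A minimal sketch of that workflow, with an assumed file path and column names:

import pandas as pd

# Assumed path to a small Silver-layer extract that fits comfortably in memory
df = pd.read_csv("/lakehouse/default/Files/silver/sales_small.csv")

# One-liner aggregation and chart - no Spark session or cluster start-up needed
df.groupby("region")["revenue"].sum().plot(kind="bar", title="Revenue by region")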

pandas.png

Inside the Apache Parquet Format

Apache Parquet is a columnar storage format that offers significant advantages over traditional row-based formats like CSV or JSON. Unlike row-oriented storage, Parquet organizes data by columns, which allows analytical queries to read only the relevant columns without scanning unnecessary data. It also embeds the data schema directly in the file and enables efficient compression of repetitive values — for example, a boolean column with many repeated 0s and 1s can be compressed far more effectively than in a plain text format. Microsoft Fabric makes it easy to work with Parquet files stored in OneLake or other cloud storage solutions. source: aka.ms/fabricnotes
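A small sketch of the column-pruning benefit using pandas; the frame and file name are illustrative:

import pandas as pd

# Tiny illustrative frame written to Parquet (requires pyarrow or fastparquet)
df = pd.DataFrame({
    "order_id": ["O1", "O2", "O3"],
    "region": ["EMEA", "AMER", "EMEA"],
    "amount": [120.0, 80.0, 45.0],
})
df.to_parquet("orders.parquet", index=False)

# Columnar layout: only the requested columns are read from disk
slim = pd.read_parquet("orders.parquet", columns=["order_id", "amount"])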

fabric-notes.png

Advantages of Managing CLS & RLS in Fabric (Not in Power BI)

Power BI's Column Level Security (CLS) and Row Level Security (RLS) were always a workaround — security defined at the reporting layer because the data source couldn't handle it properly.
 

With Microsoft Fabric, that excuse is gone.
 

Define your security rules once in the Fabric Warehouse using standard SQL. Connect Power BI via DirectQuery, and those rules are automatically enforced everywhere — in Power BI, notebooks, SQL clients, every tool. No bypass possible. No duplication. One place to maintain.
 

Power BI goes back to being what it should always have been: a presentation layer.
 

Simpler. Safer. More maintainable.

RLS.png

Pipeline Expression Builder (adding utcNow time stamp)

When working with Microsoft Fabric Data Pipelines, one handy trick is using the Pipeline Expression Builder to dynamically generate timestamped filenames when copying files — for example, from one Lakehouse folder to another.
 

Instead of hardcoding a filename like movies.csv (which would get overwritten on every pipeline run), you can use an expression like this:

@concat('movies_', utcNow('yyyyMMddTHHmmssZ'), '.csv')

pipeline_expression_builder.png

What does this expression do?

It builds a filename dynamically by joining three parts together:

  • 'movies_' — a static prefix

  • utcNow('yyyyMMddTHHmmssZ') — the current UTC timestamp at the moment the pipeline runs, formatted as something like 20260511T104632Z

  • '.csv' — a static file extension
     

The result is a uniquely named file on every run, for example: movies_20260511T104632Z.csv
 

A few things worth knowing

The @ symbol at the start tells the pipeline engine that what follows is an expression to be evaluated — not a literal string. Without it, you'd end up with the text concat('movies_', ...) as your filename.
 

The format string yyyyMMddTHHmmssZ follows standard date/time tokens. You can customize it to suit your needs — for instance, yyyy-MM-dd if you only need a date, or yyyyMMdd_HHmmss if you prefer an underscore separator between date and time.
 

Why bother?

Timestamped filenames are a simple but effective pattern for:

  • Preventing data loss — no more accidentally overwriting yesterday's file

  • Auditability — you can immediately see when each file was generated

  • Debugging — if a pipeline run produces unexpected output, you know exactly which file to look at
     

It's a small touch, but it makes your pipelines significantly more robust in practice.

Data Lineage

As data travels through its lifecycle, how can you tell which systems have influenced it or what it's made up of as it moves and changes? Data lineage refers to the recording of an audit trail that follows data throughout its lifecycle, capturing both the systems that handle it and the upstream sources it relies on.

Data lineage is valuable for error tracking, accountability, and debugging — both for the data itself and the systems that work with it. It provides a clear audit trail for the data lifecycle and supports compliance efforts. For instance, if a user requests that their data be removed from your systems, having lineage for that data makes it straightforward to identify where it lives and what depends on it.

Data lineage has long been a practice in larger organizations with stringent compliance requirements. However, it is increasingly being adopted by smaller companies as data management grows into a mainstream concern.

lineage.png

Why general-purpose APIs won over custom DB connections
 

  • Security & access control — APIs let vendors expose only what they want to expose, with fine-grained scoping (OAuth, API keys, rate limits), without handing out raw database credentials.

  • Abstraction & stability — The API contract is stable even if the underlying storage engine changes. Snowflake can re-architect internals; your pipeline doesn't break.

  • Cloud-native architecture — Services like S3, Snowflake, and Salesforce are SaaS or cloud platforms. There is no "database" to connect to directly — the API is the interface.

  • Decoupling — Each hop in a typical flow (Salesforce → S3 → Snowflake → S3 → Spark) is loosely coupled. Teams can swap one component without rewiring everything else.

  • Ecosystem tooling — Tools like Airbyte, Fivetran, dbt, and Prefect are built entirely around API-to-API orchestration, reinforcing the pattern.

PS: If you're into finance, tradingeconomics.com offers 500 free API calls per month (see screenshot). It's an excellent way to pull API data into a notebook, run it through a Data Pipeline, create a semantic model, and visualize the results in Power BI.
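A hedged sketch of that first hop, pulling a REST payload into a notebook DataFrame; the URL and key below are placeholders, so check the provider's API docs for the real endpoint and parameters:

import pandas as pd
import requests

# Placeholder URL and key - substitute the provider's documented endpoint and your own key
url = "https://api.example-provider.com/indicators?c=<your-api-key>"
payload = requests.get(url, timeout=30).json()

# Most REST APIs return a list of records, which loads straight into a DataFrame
df = pd.DataFrame(payload)
df.head()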

pipe.png

No Data, No AI.
It's that simple.

Why Data Engineering in Microsoft Fabric is the quiet engine behind every AI breakthrough.

Screenshot 2026-05-12 at 09.49.29.png

Everyone wants AI. Smarter dashboards, automated insights, talking to your data instead of querying it. But here's the unglamorous truth: none of it works without solid Data Engineering underneath. Garbage in, garbage out — always.
 

Microsoft Fabric gets this. Its Data Engineering layer isn't a checkbox — it's the backbone. It handles ingestion, transformation, storage, and scale so that when AI shows up, it actually has something useful to work with. That's not a small thing.

Screenshot 2026-05-12 at 09.49.36.png

The Fabric Data Agent — launched in March 2026 — is a glimpse of where this is heading. Ask a question in plain English, get an answer. No query writing, no dashboard hunting. It's not magic; it's what happens when your data engineering is done right and AI finally has a clean surface to work on.

Data Engineering lays the road. AI is the vehicle. But without the road, you're going nowhere.

© 2026 by Franco Arda
 
