(At least) 3 calamities driving your data org to its knees...
February 3, 2025
Every morning starts the same. You wake up, check Slack for overnight pipeline alerts, check Datadog for anomalies, and brace yourself for the morning standup where you will explain why that one DAG is still flaky.
You thought you had built an OLAP system of perfect observability. You, your manager (swamped in meetings, and barely finding time to code), and their manager who hasn't coded in a decade ... were all fooled.
You cobbled together a mud-monster of observation - and it's watching you. Data visibility has become data anxiety.
This is the panopticon modern data teams make for themselves, and dwell in.
We will look at the (at least) 3 calamities that are driving your data org to its knees:
I forewarn you ... you are about to read a most distressing engineering tale. It ends almost as badly as it starts.
Close this document immediately, and read something more cheerful ... like whatever comes up when you Google "ETL pipeline best practices". For those who choose to continue - and I see you have - I beckon you forward, "Hypocrite lecteur, — mon semblable, — mon frère!".
When I joined Bit Complete as the first Data Engineer, order of business #1 was designing an interview process to hire more data engineers.
To this end, I wanted to identify the central problems faced by all Data Engineers, the "zen" of Data Engineering. These central problems would then guide us in our assessment of candidates.
As it turns out, grasping the central problems faced by all Data Engineers is a good stepping stone to start a good-natured complaining session about the problems that befall Data Orgs.
To define the DE role, you can't just flip open the Oxford to the D section. My general sentiment was that cookie-cutter definitions were omitting a key element. The linchpin that eventually ... makes us all equal ... time. Bear with me, I will elaborate further down.
"Aha!" I thought. It felt like I was holding the right tool, just not quite by the right end. I knew however, that instead of throwing algorithms and SQL questions at the wall and seeing what sticks - we would look for programmers well acquainted with disciplined persistence ... over time.
Here is the definition I came up with. It has served us well ... a lighthouse of sorts to steer discerning interviews with interesting and varied candidates.
Data Engineer: A programmer first and foremost (acquainted with computer science fundamentals), who specializes in:
- The persistence of data over prolonged periods, and the abstractions around this persistence
- The nitty gritty of persistence (Hardware, how data warehouse and databases work, etc)
- Distributed systems, since contemporary persistence at web-scale requires multiple physical machines
Time and time again, in startups and Crown corporations ... in innovation labs and medium sized companies, I had seen OLAP teams run themselves into the ground over time. I'm sure you have too.
Stinky code over time is not a problem unique to OLAP teams. Far from it. But when you don't simply deal with transactions,
small data modeling errors are a nightmare to fix. They accrue over time, like ice and snow seizing a thin branch.
You can't just run a Django migration and create a new _2
table. You can't just ask ChatGPT or Claude.
You need to untangle a web of dimensions, themselves derived from other dimensions. OLAP is a harsh mistress.
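To make that web concrete, here is a toy Scala sketch (every table and name below is hypothetical): one dimension is denormalized from another, so fixing a modeling error upstream means rebuilding everything derived from it.

// Toy sketch (hypothetical tables): a dimension derived from another dimension.
object DimensionWeb {
  case class Region(regionId: Int, name: String)
  case class Customer(customerId: Int, regionId: Int)

  val dimRegion = Seq(Region(1, "EMEA"), Region(2, "APAC"))

  // dimCustomer is *derived* from dimRegion - the region name is baked in (denormalized)
  val dimCustomer: Seq[(Int, String)] = for {
    c <- Seq(Customer(10, 1), Customer(11, 2))
    r <- dimRegion if r.regionId == c.regionId
  } yield (c.customerId, r.name)

  def main(args: Array[String]): Unit =
    // Fixing a bad region name means rebuilding dimCustomer too,
    // plus every downstream table that ever joined against it
    dimCustomer.foreach(println)
}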
On paper - super exciting stuff, right?! Yet, there is a malaise and a stigma hanging over writing programs of dataflow.
Many Scala engineers put "No Spark" in their LinkedIn profiles. Through trauma, or disdain, they have seen enough of how software is written in the analytical space.
We can change this by taking pride in our craft and delivering quality software that impresses engineers from other verticals. Clean code, hard like cast-iron in production, yet flexible as bamboo when it comes to change at the peripheries.
We can't be too optimistic. The dance between getting out the datasets you are pressured to deliver fast, and writing clean code (the lifeblood of a software org), is a wry pavane indeed.
But it's a thrilling dance, when executed well. "Fact" comes from the Latin passive participle of "to do", "factum" - having been done, or more loosely, having happened. In OLAP programming, we conjure up rainbows of dimensions (tables) that in some sense never happened, from a bedrock of transactions (fact data) ... this is enthralling and almost magical when done right.
I hope this post pushes you to write clean(er) OLAP code, however you see fit.
I can only imagine that leading a big-data org is frustrating. You are in deeper doodoo than most (say FE eng managers) when your engineers jump ship (debatable). Data quality issues become million dollar bugs when running cumulative-metrics calculations over lengthy periods (not debatable).
Pointing towards greener pastures without outlining a clear technical path to those goals will not earn you the respect of your troops. Statements like:
"We must have self-serve data-pipelines by Q3"
"We need to tighten up on data quality and backfill issues faster"
will cause you to lose the respect of your programmers if you don't have a systematic technical plan to achieve them, and the mettle to enforce it.
As we saw, the stakes in OLAP are so high because it is so easy to create a bacchanalia of poorly modelled dimension tables populated by a house of cards of scripts. You are praying the engineer who created this house of cards doesn't jump ship.
I hope this post gives you specific technical points to bring to, and expect from your devs.
Your OLAP codebase is a disgrace. If it isn't, and you have made it this far - accept my humble apology, and close this tab.
But the odds are it is a disgrace. Don't worry, we will look at specific examples to support this provocative statement. The pill is hard to swallow. No amount of Dagster, dbt, LLMs, or shiny new frameworks will make up for missing clarity of thought.
Engineers come and go through the revolving door. After adding to the growing pile of "hit-and-run" code, they get a fresh start elsewhere. Monitoring tools, SLAs, aggressive on-call schedules, and nonexistent tests are just salt in the wound.
Let's face the sobering reality: your OLAP codebase shares an unfortunate similarity with low-quality Japanese kitchen knives. It's brittle and prone to breaking, but lacks any of the redeeming qualities that make high-carbon steel blades worth their delicate nature. There's no precision cutting edge, no superior edge retention - just a fragile monolith that shatters at the first sign of change.
It could be a flexible organism, composed of functions that keep frameworks at arm's length, and have powerful barriers of abstraction between your domain core's algebra and the outside World. Changing your sink storage engine could be like swapping out legos ... but it isn't.
After the last migration, you could have learned that the things most likely to change are those that already have - but you didn't. Seriously? I am being tongue in cheek - but therein lies the rub.
Like at the start of an Alcoholics Anonymous meeting, kaizen begins with the clear identification of a set of central problems. Instead of memorizing rote best practices, it's my belief that understanding central problems, and using them as axioms, lets us as programmers deduce best practices when we need them.
My contention is that there are 3 central problems - calamities - which drive data-orgs to their knees.
In the next blog post, I will put my code where my mouth is and present
a new doctrine for building data systems at scale. But for now, let's focus on central problems.
If your pipelines are not reified in code, carefully crafted as first-class and composable abstractions, you are destined to eventually drown in the deep waters of requests for similar but slightly different dimension datasets.
The calamity is not the un-reified pipelines in themselves, it's the fact that you will eventually drown in requests to build new pipelines if you don't offer your clients a mechanism to conjure up their own datasets. The more you drown, the worse the code becomes, and the faster you drown.
This is not a matter of pedantic architect-astronauting. If you are a data manager reading this: your team will not be able to snap together and conjure up new pipelines (the step before true self-serve) if dataflows are not reified. Period.
Let's reiterate, because it's so important.
⚠️ Un-reified pipelines > High cost to making new and/or generic pipelines > No true self-serve data-pipelines > Time spent on ad-hoc requests > Wasted time
"Talk is cheap, show me the code". Couldn't agree more, here it is - this is a pipeline reified in code:
//> using dep xyz.matthieucourt:etl4s_2.13:0.0.5
import etl4s.core._

@main def run() = {
  // Define our blocks
  val extract5    = Extract[Unit, Int](_ => 5)
  val exclaim     = Transform[Int, String](x => x + "!")
  val consoleLoad = Load[String, Unit](x => println(x))
  val dbLoad      = Load[String, Unit](x => println(s"Load to DB $x"))

  // Stitch our reified pipeline
  val pipeline: Pipeline[Unit, (Unit, Unit)] =
    extract5 ~> exclaim ~> (consoleLoad & dbLoad)

  // Run it
  pipeline.unsafeRun(())
}
This is a run-of-the-mill pipeline, not reified in code.
object MyPipeline {
  def main(args: Array[String]): Unit = {
    val five: Int = 5
    val transformedString = five.toString + "!"
    println(transformedString)
    println(s"Load to DB $transformedString")
  }
}
Most data orgs (again, assuming that they have an OLAP goal in mind) make a realization as they are getting on their feet: "SQL, SQL everywhere". You must have a self-serve layer to run declarative queries across your storage engines.
Only the bravest of the brave not in your engineering department will run a bit of PromQL or dig into ElasticSearch logs ... and letting everybody run their queries gives you a botnet of skeptical QA testers for free.
Luke: Emm, so I need to reify my pipelines to unlock self-serve?
Yoda: Yes hmmmm
Luke: I mean, don't we already have self-serve? We make solid use of Trino. Anyone, from sales to marketing, can run their bits of SQL with pushed-down query federation. Babam, self-serve dinner is served.
Yoda: "Self-serve", this is. "Self-serve in depth", what we truly seek, hmmmm
Luke: Have you been reading too much First World War doctrine?
Yoda: Hmmmmm, yes ... indeed
At the risk of being pretentious, I'll define a new term:
self-serve in depth: The capability to compose pipelines that reach deep into the well of fact-data in your immutable staging area.
Reified pipelines give you rapid composition of existing pipelines (or pipeline fragments) to create new ones. This is how you arrive at the gates of "self-serve in depth" (which is probably what you mean by self-serve data products). Pushing past those gates and creating a complete "self-serve in depth" framework will take a bit more work - but that is not in scope for this post.
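To give a rough feel for that composition (using only the etl4s constructs shown above; the fragment names are made up for illustration), two "new" pipelines can be snapped together from the same reusable fragments:

//> using dep xyz.matthieucourt:etl4s_2.13:0.0.5
import etl4s.core._

@main def run() = {
  // Hypothetical reusable fragments
  val extractOrders = Extract[Unit, Int](_ => 42) // stand-in for reading fact data
  val toReport      = Transform[Int, String](n => s"daily_report: $n")
  val toAudit       = Transform[Int, String](n => s"audit_trail: $n")
  val printLoad     = Load[String, Unit](x => println(x))

  // Two "new" pipelines composed from the same fragments
  val reportPipeline = extractOrders ~> toReport ~> printLoad
  val auditPipeline  = extractOrders ~> toAudit ~> printLoad

  reportPipeline.unsafeRun(())
  auditPipeline.unsafeRun(())
}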
Another negative aspect of un-reified pipelines is that they tend to straddle a problematic no-man's land. Like that French guy who decided to tightrope walk between the Twin Towers (except not cool at all).
Half of your pipeline is a bunch of tasks in a scheduler, the other half is the business logic the scheduler calls.
Your un-reified pipeline, in the ether between Airflow and a bunch of functions
Is your pipeline the rope? The walker? Tower 1? Tower 2? All of them together? The latter is a common school of thought, and it results in solutions like this:
A crufty hack to try and reify pipelines
If your pipelines are not reified, your work is not unified, and it'll be a slow death.
To have OLAP you trust, pipelines must be reproducible to a degree that not only the engineers understand ... but the clients as well. The only way this is achieved is via versioned logic AND data models.
This brings us to our next calamity. Fake "functional-ETL".
When I was first shown Maxime Beauchemin's Functional Data Engineering article (he created Airflow) - I was like a kid in a candy store. Everything described was so logical and clean. Data engineering was the cool new kid.
It's a great, seminal read for any Data Engineer, and probably responsible for bringing terms like idempotence and immutable partitions into the mainstream.
My critique of functional ETL (if you can call it that, since I'm a fan) has two prongs:
The calamity isn't so much "fake functional ETL" - it's thinking you are safe, or not understanding the degree to which you can (or cannot) re-create the state of your warehouse at any point in time from your immutable, append-only staging area at the snap of a backfill. If reference data isn't SCD'd, you've effectively created the OLAP equivalent of a phantom read when you run what you thought was an idempotent job again.
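Here is a minimal Scala sketch of the trap (all names and shapes invented for illustration): the fact partition is immutable, but the job joins against a reference table that is overwritten in place instead of being kept as a slowly changing dimension, so re-running the "same" backfill gives a different answer.

// Illustrative only: non-SCD reference data breaks an otherwise "idempotent" backfill
object ScdSketch {
  case class Sale(customerId: Int, amount: Double, eventDate: String)

  // Reference table kept as "current state only" - the SCD sin
  var customerRegion: Map[Int, String] = Map(1 -> "EU")

  // A "functional" job: rebuild revenue-by-region for a partition from immutable facts
  def revenueByRegion(facts: Seq[Sale]): Map[String, Double] =
    facts.groupMapReduce(s => customerRegion(s.customerId))(_.amount)(_ + _)

  def main(args: Array[String]): Unit = {
    val janFacts = Seq(Sale(1, 100.0, "2024-01-15")) // immutable January partition

    println(revenueByRegion(janFacts))               // Map(EU -> 100.0)

    // Customer 1 relocates; the reference row is overwritten in place (no history kept)
    customerRegion = customerRegion.updated(1, "US")

    // The "same" backfill of the same partition now yields a different answer
    println(revenueByRegion(janFacts))               // Map(US -> 100.0)
  }
}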
Fake functional ETL, combined with skimping on unit tests, creates the maelstrom that firmly plants the seeds of fear. And fear is the mind-killer.
Everything grinds to a halt. After one production debacle where it takes you 5 days to find a root-cause and run a backfill, your manager tells you to test everything manually before each deploy ... "just in case".
And rightfully so: you can't explain to them the degree to which your dataflows are reproducible without shrugging your shoulders and starting a tirade of "well actually"s and "there are quite a few nuances".
This article on pipeline versioning by the Palantir team seems directionally correct for wielding functional ETL (although it doesn't mention it by name). It's a great read. One nugget that stood out:
This log correspondence shows which version of which code produced which dataset, and allows a user to electively and separately update code and data. Data and code can advance in independent versions. This allows isolated workflows while maintaining crucial relationships for traceability and collaboration
But in my opinion there is one key missing word, reified. The sentence should be:
... Reified data and code advance in independent versions ...
It is important for pipelines to be compatible (forwards and backwards) with a range of data-model versions that they abstract over. The messaging guys figured out you need a reified atom to version - an indivisible message - with Thrift, Avro, and Protobuf. Reified pipelines represented as indivisible abstractions are not a nicety, they are a condition for efficacious "data" versioning. My next blog post will offer more solutions, but for now - central problems.
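As a rough illustration of what "the reified pipeline as the atom of versioning" could look like (every name here is hypothetical - this is neither the Palantir approach nor an existing library API), a pipeline can be a value that carries its own code version and declares the range of data-model versions it is compatible with:

// Sketch only: a reified pipeline as a versioned value that declares
// which data-model versions it can read and which one it writes
object VersioningSketch {
  final case class VersionedPipeline[A, B](
    name: String,
    codeVersion: String,
    readsModelMajors: Range, // data-model major versions this code can consume
    writesModelMajor: Int,   // data-model major version it produces
    run: A => B
  )

  val dailyRevenue = VersionedPipeline[Seq[Double], Double](
    name = "daily_revenue",
    codeVersion = "1.4.0",
    readsModelMajors = 2 to 3, // backwards/forwards compatible with model v2 and v3
    writesModelMajor = 3,
    run = _.sum
  )

  def main(args: Array[String]): Unit = {
    val inputModelMajor = 2
    require(
      dailyRevenue.readsModelMajors.contains(inputModelMajor),
      s"${dailyRevenue.name} ${dailyRevenue.codeVersion} cannot read data-model v$inputModelMajor"
    )
    println(dailyRevenue.run(Seq(10.0, 32.0))) // 42.0
  }
}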
To recap:
⚠️ Fake-functional ETL > No confidence in reproducibility > Fear paralysis > Excessive manual testing > Wasted time
There you have it. Don't fake the funk - understand the degree to which your pipelines are functional. If you don't, you will pay a heavy time-tax and bog yourself down. Speaking of things that will bog you down ...
By framework I mean a heavy API offering not just utilities, but a World-view and/or paradigm that spans from the domain core all the way to your infra layer. By infra I mean the outer layer of your Hex/Onion. Now back to business.
Frameworks are bad. There - I said it. Don't @ me.
Letting a framework creep its tentacles into your pipelines is the stitch in time that costs you nine. Frameworks combined with tight coupling cause migrations that never end, like Penelope undoing her tapestry each night. The framework developer doesn't need you, but wants you to need him. From Uncle Bob to Martin Fowler (and more before them), this topic has been covered ad nauseam.
Tightly coupling your pipelines to your storage engine of today will cause you untold grief tomorrow. Again, this topic has been covered ad nauseam, from SICP to the Gang of Four book. It's not only OLAP teams who commit this cardinal sin. But in OLAP-land there are no ORM guardrails.
Banishing tight infra coupling (a la Hex) - not too hard! Holding frameworks (especially useful ones) at arm's length - harder. Just ask the creators of frameless (a Free-Monad over a decent subset of Spark) how it went. I say this without irony, I'm sure they have some great insights.
This includes "meta-frameworks" like Beam, which tend to nerf the very frameworks they wrap, and impose a strong and useful World view (i.e. Google's Dataflow model) ... but a World view you might want or need to break from time to time. Good luck doing that if you're tightly coupled up to your neck with "insert framework x".
Let powerful frameworks like Spark into your codebase with great caution, and without ideology. And when you do, erect barriers of abstraction between your algebra and the outside World.
//> using dep xyz.matthieucourt:etl4s_2.13:0.0.5
import etl4s.core._
import etl4s.types._

@main def run() = {
  // Domain algebra - separates storage from logic
  sealed trait StringRepo {
    def writeString(input: String): Unit
  }
  case object PostgresRepo extends StringRepo {
    def writeString(input: String) = println(s"Writing $input to PG")
  }

  case class Config(repo: StringRepo)
  val myConfig = Config(PostgresRepo)

  val exclaim = Transform[String, String](x => x + "!")
  val dbLoad: Reader[Config, Load[String, Unit]] =
    Reader(conf => Load(conf.repo.writeString))

  val p = Extract("5") ~> exclaim ~> dbLoad.run(myConfig)
  p.unsafeRun(()) /* "Writing 5! to PG" */
}
This is a pipeline that will save you time and money by bending like a bamboo stalk. You can swap out any backend whenever you want with a 1 LOC change.
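For instance (assuming the StringRepo, Config, exclaim, and dbLoad definitions from the block above are in scope; InMemoryRepo is just an invented test double), the swap could look like this:

// Hypothetical in-memory repo for tests - only the Config line changes
case object InMemoryRepo extends StringRepo {
  val written = scala.collection.mutable.ListBuffer.empty[String]
  def writeString(input: String): Unit = written += input
}

val testConfig   = Config(InMemoryRepo) // <- the 1 LOC swap
val testPipeline = Extract("5") ~> exclaim ~> dbLoad.run(testConfig)
testPipeline.unsafeRun(())              // "5!" lands in the in-memory buffer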
This is the same pipeline, without the barriers of abstraction:
object MyPipeline {
  def main(args: Array[String]): Unit = {
    val input = "5"
    val transformedString = input + "!"
    println(s"Writing $transformedString to PG") /* "Writing 5! to PG" */
  }
}
Fewer lines of code - yes. Can we effortlessly swap out Postgres for SQLite or an in-memory testing repo? No. Will this resist inevitable change? You guessed it - no. Have fun untangling it when change is forced on you.
"Scale" comes from latin "scala" - ladder. If you aren't able to nimbly hop up and down the rungs of the ladder of changing data volume, velocity and variety and the technologies your OLAP codebase usees, you will be stuck in the mud.
Buster Keaton - demonstrating the rewards, dangers, and necessity of ladder climbing
It's not uncommon for OLAP orgs to start with a tiered infra runbook and some guiding principles (like the one we saw above of "SQL, SQL Everywhere"). The runbook might tell you to start small.
Small might look like so:
Pipeline 1: Small volume, starter OLAP
Very quickly, you might still be using Spark to run your ETL algebra at its core, but dealing with a buzzword salad of other technologies, and new storage engines. Something like below.
Pipeline 2: Larger, higher volume OLAP
You don't just jump from 1 to 2! You climb a ladder there. Clean, uncoupled code is the only way you climb this ladder without pain.
To recap:
⚠️ Tight coupling to frameworks > Brittle code > Incidental complexities tied to inevitable change > Lengthy "migrations" to scale or change frameworks > Wasted time
You've made it to the end. In a devilish way, it was my extreme pleasure to clack away at the keyboard, griping, with zero LLM interference. Just organic, flawed human neurons.
As I fussed over drafting the high level sections of this post - I had zero intention of including these types of calamity recaps.
⚠️ Calamity X > Consequence n > Consequence n+1 > ...
Fittingly, almost poetically, I noticed when I was done that each calamity recap ended with "Wasted Time".
We started off by defining the Data Engineer as a programmer specialized in the persistence of data over long periods of time.
Ultimately it's up to each of us to make the most out of however much time we may be so lucky to get.