The Company Doesn't Need a Data Lake. It Needs a Data Plumber.

You Need a Data Plumber, Not a Data Lake ⊹ Blog ⊹ BN Digital

The Infrastructure Nobody Wants to Talk About

Over the past decade, the enterprise technology industry has sold a seductive vision: pour all the data into one place — a data lake, a warehouse, a lakehouse — and insights will follow. The architecture diagrams look elegant. The vendor presentations are convincing. The budgets are enormous.

The promise has been remarkably consistent across every wave of the cycle. Centralise everything. Break down silos. Democratise access. The specific product names change every few years (data warehouses gave way to data lakes, which gave way to lakehouses, which are now giving way to "data mesh" and "data fabric" architectures), but the underlying pitch remains the same: the answer to data problems is a new container to put the data in.

The outcome, across organisations of every size and sector, is depressingly similar. Most organisations that built a data lake are now sitting on what the industry politely calls a "data swamp" — a vast repository of poorly labelled, inconsistently formatted, partially duplicated information that nobody trusts and very few people actually query. The technology works. That was never the problem.

The actual problem was always plumbing: the unglamorous, tedious, detail-intensive work of moving data between systems cleanly, consistently, and reliably. That work has been systematically undervalued, underfunded, and treated as a commodity that can be bolted on after the exciting architecture decisions have been made.

It cannot.

Why Data Lakes Became Data Swamps

Three failure patterns turn data lakes into data swamps. Each is well understood in theory and routinely ignored in practice.

Pattern One: The "Dump Everything" Philosophy

The original promise was liberation. Stop worrying about schema. Stop modelling in advance. Just ingest everything and figure out the structure later. This sounded genuinely revolutionary after decades of rigid data warehousing — the painstaking process of defining schemas up front, modelling every relationship, and only then loading data into a tightly governed repository.

The data lake offered escape. It said: trust the data scientists. They will figure it out. Just get everything into one place first and let the smart people sort through it.

In practice, "figure it out later" almost always meant "nobody ever figures it out." Data arrives from dozens of sources: CRMs, ERPs, trading platforms, market data feeds, email archives, spreadsheets sent by the finance team, PDFs scanned from paper records in a regional office. Each has its own format, its own naming conventions, its own definition of what a "client" or a "transaction" or a "position" actually means. Without governance at the point of ingestion — without someone insisting on consistent definitions before the data enters the lake — the repository fills with data that is technically accessible and practically useless.

The numbers bear this out. Analysts estimate that between 55% and 80% of enterprise data is "dark" — stored but unused, unclassified, and untrusted. Despite hundreds of billions of dollars invested in cloud migration and data infrastructure over the past decade, that utilisation rate has barely moved. More storage. Same problem.

Financial institutions with data lakes containing terabytes of information routinely have analytics teams that still export to Excel because they do not trust the lake's contents. The Excel file gets emailed around, manually edited, saved as `finalFINALv3.xlsx`, and eventually somebody puts the wrong numbers in front of a client. That is not an infrastructure issue — it is a trust issue. And the root of it is that nobody did the plumbing properly at the start.

Pattern Two: The Missing Middle Layer

Most data architecture conversations jump from "source systems" to "analytics layer" as if the space between them is a solved problem. It is not. That middle layer — extraction, transformation, validation, deduplication, reconciliation, lineage tracking — is where the actual value lives. It is also where most projects chronically underinvest.

Consider what that middle layer actually involves. An extraction job that handles not just the happy path but the edge cases: what happens when a source system sends a null value where a primary key should be? What happens when a record appears in two source systems with conflicting data? What happens when the source system changes its schema without telling anyone? These are not exotic scenarios. They happen constantly in production environments. A pipeline that cannot handle them gracefully is not a pipeline — it is a liability that has been named something more reassuring.
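Two of those edge cases, a null primary key and an unannounced schema change, can be sketched in a few lines. This is a minimal illustration, not a production design: the column names, `EXPECTED_COLUMNS`, `split_batch` and the quarantine structure are all hypothetical, and records are assumed to arrive as plain dictionaries.

```python
from dataclasses import dataclass, field

# Hypothetical contract for one source feed; the column names are illustrative.
EXPECTED_COLUMNS = {"client_id", "name", "country"}
PRIMARY_KEY = "client_id"


@dataclass
class BatchResult:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)  # (record, reason) pairs


class SchemaDriftError(Exception):
    """Raised when a record does not match the agreed column set."""


def check_schema(records):
    # Fail loudly if the source added or dropped columns without notice.
    for record in records:
        columns = set(record)
        if columns != EXPECTED_COLUMNS:
            missing = sorted(EXPECTED_COLUMNS - columns)
            unexpected = sorted(columns - EXPECTED_COLUMNS)
            raise SchemaDriftError(f"missing={missing} unexpected={unexpected}")


def split_batch(records):
    # Quarantine rows with a null or blank primary key instead of loading them.
    check_schema(records)
    result = BatchResult()
    for record in records:
        key = record[PRIMARY_KEY]
        if key is None or str(key).strip() == "":
            result.quarantined.append((record, "null primary key"))
        else:
            result.accepted.append(record)
    return result
```

The design choice worth noting is that bad rows are set aside with a reason rather than silently dropped or silently loaded, so somebody can follow up with the owner of the source system.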

Then there is transformation logic: converting data from the source system's representation into something consistent with the rest of the data estate. This sounds simple until, three months in, it becomes clear that the CRM uses a different country code standard to the ERP, that two business units define "revenue" differently for historical reasons nobody can fully explain, and that there is a category of transaction that exists in only one system and does not map cleanly to anything in the target schema.
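The country code mismatch is the easiest of these to illustrate. The sketch below assumes, purely for the sake of example, that one system emits ISO 3166-1 alpha-3 codes and the target schema wants alpha-2; the article does not say which standards are involved. The mapping is a deliberately tiny subset and `normalise_country` is a hypothetical helper.

```python
# Illustrative subset only; a real mapping would cover the full ISO 3166-1 list.
ALPHA3_TO_ALPHA2 = {"GBR": "GB", "USA": "US", "DEU": "DE", "FRA": "FR"}
KNOWN_ALPHA2 = set(ALPHA3_TO_ALPHA2.values())


def normalise_country(raw):
    """Return an alpha-2 code, or raise ValueError for anything unmapped."""
    if raw is None:
        raise ValueError("country code is missing")
    code = raw.strip().upper()
    if code in KNOWN_ALPHA2:
        return code
    if code in ALPHA3_TO_ALPHA2:
        return ALPHA3_TO_ALPHA2[code]
    # Refuse to guess: an unmapped value is a finding, not something to default.
    raise ValueError(f"unmapped country code: {raw!r}")
```

Raising on an unknown value, rather than passing it through, is what turns a quiet inconsistency into something the pipeline owner actually hears about.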

This is the plumbing. Not exciting. Does not make good conference slides. But it is the difference between a system an analyst can query with genuine confidence and a system that produces numbers nobody is willing to put in front of a client.

Gartner has estimated that poor data quality costs organisations an average of $12.9 million per year. That figure tends to surprise people, but it should not. The costs are distributed and largely invisible: time spent by analysts cleaning data before they can use it, decisions made on incorrect information, regulatory penalties for inaccurate reporting, and the compounding cost of building downstream systems on a faulty foundation. Notably, Gartner also found that 59% of organisations do not even measure data quality — which means the true cost is typically unknown, and therefore never properly addressed.

Pattern Three: The Governance Afterthought

Data governance conversations usually focus on compliance: who can access what, how long data is retained, the privacy implications of storing certain categories of personal information. These are important questions. But they sidestep the more fundamental issue: is the data accurate, up to date, and consistently defined across the organisation?

Governance that only addresses access control is like putting locks on a warehouse full of mislabelled boxes. The security is excellent. The contents are still a mess.

Real data governance starts at ingestion. It means agreeing — before data enters any system, not after — on canonical definitions for the most important entities. What is a "customer"? Is it a legal entity, a relationship, or an individual contact? What is a "transaction"? Does that include internal transfers, or only external ones? When Finance and Operations define these terms differently, no amount of downstream engineering can fix the resulting inconsistencies. The only fix is to go back upstream and establish the definition at the source.

This is unglamorous work. It involves long meetings with business stakeholders who would rather be doing almost anything else. It produces documentation that nobody wants to maintain. But organisations that skip it pay for it — repeatedly, in every analytical project they attempt for years afterwards.

The Anatomy of a Real Pipeline Failure

The pattern recurs across financial services, professional services, and technology companies with uncanny consistency.

An organisation decides to modernise its data infrastructure. It invests in a cloud data platform. It migrates its data. The project is declared a success. The architecture diagram looks clean and modern. Six months later, the analytics team raises a ticket: the numbers in the new dashboard do not match the numbers in the old reporting tool.

An investigation begins. It turns out that the migration preserved the data but not the transformation logic — the rules that defined how raw source data was shaped into the metrics the business actually used. Some of that logic lived in stored procedures that were not documented. Some lived in Excel macros run by someone who has since left the company. Some existed only in the institutional memory of a single analyst who had been carrying it in her head for three years and is quietly horrified that nobody asked her about it before the migration went live.

The organisation now has technically better infrastructure and actually worse analytical capability than before. Rebuilding the lost logic takes months. Some of it cannot be reconstructed at all because nobody recorded what it was. The business has paid for a migration that moved data without moving the understanding of what that data means.

This is a data plumbing failure. Not a technology failure. Not a vendor failure. A failure to treat the movement and transformation of data as a discipline worthy of serious investment, careful documentation, and proper engineering standards.

What a "Data Plumber" Actually Does

The metaphor is deliberately unglamorous. A data plumber does the work that makes everything else possible — invisible when it functions correctly and catastrophic when it does not.

Mapping the actual data landscape. Not the architecture diagram from three years ago, not the slide deck presented to the board. The real picture: where data actually lives, how it actually moves, what transformations happen along the way, and where things break. In every organisation examined closely, the documented architecture and the actual data flow have diverged significantly — sometimes dramatically. The map is not the territory, and the gap between them is where most data quality problems originate.

Building reliable pipelines. Extraction jobs that handle edge cases. Transformation logic that accounts for inconsistencies between source systems. Validation checks that catch problems before they propagate downstream — because a data quality issue caught at ingestion costs almost nothing to fix, while the same issue discovered in a client-facing report costs an enormous amount in remediation, trust, and occasionally regulatory exposure. Monitoring that alerts when something fails at three in the morning, rather than when an analyst discovers the wrong number three weeks later.

Enforcing consistency at the source. The most impactful intervention is often the least technical: agreeing on a common definition of key entities across departments. This requires political capital as much as technical skill. Business units are territorial about their data. Finance has always called it one thing; Operations calls it something else; the client-facing teams have a third definition that made sense when the business was smaller and has never been properly reconciled. A data plumber has to navigate those organisational dynamics as well as the technical ones.

Maintaining lineage and auditability. In regulated industries, knowing where a number came from — which source system, which transformation, which version of the logic — is not optional. When a regulator asks how a particular figure was calculated, "the data warehouse produced it" is not an acceptable answer. Data plumbing includes the metadata that makes every output traceable back to its origins: a full chain of custody from raw source data to final output.

Treating pipelines as software. One of the most common mistakes is treating data pipelines as scripts — one-off pieces of code written to solve an immediate problem, deployed, and forgotten. Reliable pipelines require version control, testing, documentation, and ongoing maintenance. They need to be treated with the same engineering rigour as production software, because that is precisely what they are.
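As a rough sketch of what "the same engineering rigour" can mean in practice, here is a small transformation step with unit tests sitting next to it. The `deduplicate` function, its field names and its "latest record wins" rule are assumptions made for illustration; the point is only that the rule is written down and checked on every change.

```python
import unittest


def deduplicate(records, key="client_id", version="updated_at"):
    """Keep one record per key: the one with the greatest version value."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[version] > current[version]:
            latest[record[key]] = record
    return list(latest.values())


class DeduplicateTest(unittest.TestCase):
    def test_keeps_most_recent_record(self):
        # ISO 8601 date strings compare correctly as plain strings.
        rows = [
            {"client_id": "C1", "updated_at": "2024-01-01", "name": "Acme"},
            {"client_id": "C1", "updated_at": "2024-03-01", "name": "Acme Ltd"},
        ]
        self.assertEqual([r["name"] for r in deduplicate(rows)], ["Acme Ltd"])

    def test_distinct_keys_are_untouched(self):
        rows = [
            {"client_id": "C1", "updated_at": "2024-01-01"},
            {"client_id": "C2", "updated_at": "2024-01-01"},
        ]
        self.assertEqual(len(deduplicate(rows)), 2)
```

Saved as a module, the tests can be run with `python -m unittest`, which makes them easy to wire into whatever review and deployment process the rest of the codebase already uses.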

For organisations considering where to invest before embarking on an AI programme, an honest audit of existing data infrastructure is typically the most valuable first step. It surfaces the gaps that would otherwise derail the AI project downstream — often expensively, and always later than they should have been found. Once that picture is clear, the custom AI and ML work that follows has a foundation it can actually build on, rather than a foundation it will spend the first six months quietly working around.

The Organisational Dynamics That Perpetuate the Problem

Understanding why data plumbing gets underinvested requires understanding the incentives at play in most organisations.

Data infrastructure work is unglamorous. Nobody gets promoted for building a pipeline that runs cleanly for two years. The career-making work — in perception, if not in reality — is the AI model that generates an impressive demonstration, the dashboard that catches a senior executive's eye, the platform migration that gets announced at an all-hands. The quiet, reliable, essential plumbing beneath all of it is invisible. And in most organisations, invisible work is undervalued work.

This creates a systematic bias towards the visually impressive. Budgets flow to the parts of the data estate that produce artefacts people can see and point to. Data engineering teams are chronically understaffed relative to the scope of the problem they are trying to solve. The work gets deprioritised, delayed, or delegated to third-party tools that promise to automate the hard parts — and that deliver something which works in the happy path and falls apart as soon as the data gets complicated, which it always does.

Harvard Business Review has observed that what makes poor data so expensive is precisely this invisibility: the costs are distributed across thousands of small decisions, workarounds, and accommodations that individual workers make every day without anyone tracking them. Data scientists are estimated to spend up to 80% of their time cleaning and wrangling data before they can do any actual analysis — time that compounds into extraordinary expense when aggregated across an organisation.

There is also a vendor dynamic worth acknowledging. The enterprise technology industry has a strong financial incentive to sell platforms, not process. A cloud data platform is a recurring revenue line; data governance and pipeline engineering are unglamorous professional services engagements. Vendors have historically been much better at promoting the former than having the honest conversation about the latter. The result is that organisations often acquire excellent technology and then wonder why it is not producing the results the sales deck promised.

The answer, almost without exception, is that the plumbing was not done.

Why This Matters More in the AI Era

The data infrastructure conversation has taken on new urgency because AI has fundamentally raised the stakes — and changed the nature of the failure modes.

Traditional analytics tolerates moderate data quality issues. A dashboard with slightly inconsistent numbers is suboptimal but manageable. A human analyst can spot anomalies, apply judgement, and work around gaps. That institutional knowledge, held quietly in the heads of experienced analysts, has been compensating for poor data infrastructure in organisations for years. It is one of the most valuable and least-documented assets most organisations possess.

AI does not work that way. A language model reasoning over poorly structured data produces confidently wrong outputs. It does not flag uncertainty. It does not weigh what it reads against years of experience, as a sceptical analyst would. It delivers a polished, fluent, authoritative-sounding answer that may be built entirely on inconsistent or erroneous data. The old saying "garbage in, garbage out" has never been more literally true, except that AI makes the garbage look polished and authoritative, which is considerably more dangerous than garbage that looks like garbage.

A machine learning model trained on inconsistently labelled data learns those inconsistencies as features. If historical data contains a systematic error (a category defined differently before an acquisition, or a metric calculated differently by two regional offices before a standardisation project that was never quite finished), the model will learn that difference as signal.

The specific failure modes in AI contexts deserve naming explicitly.

Retrieval-Augmented Generation and enterprise knowledge bases. The current enthusiasm for building AI assistants that can query internal documents, policies, and records is well-founded — but only if those documents are clean, current, and consistently structured. An AI assistant querying a knowledge base full of outdated policies, inconsistently tagged documents, and duplicated entries with conflicting details will confidently synthesise answers that are wrong. The AI is not the problem. The knowledge base is. This is, once again, a plumbing problem.

Agentic AI systems. As AI moves from answering questions to taking actions — booking meetings, triggering workflows, updating records, executing processes — the tolerance for data quality failures collapses to near zero. A hallucination in a text summary is embarrassing and correctable. A hallucination in an agentic workflow that triggers a financial transaction or modifies a customer record is a different category of problem entirely.

Forrester Research has identified data quality as the primary factor limiting B2B generative AI adoption, and the IBM Institute for Business Value found that 45% of business leaders cite data accuracy and bias concerns as a leading barrier to scaling AI initiatives. These are not technology sceptics. They are organisations that wanted to implement AI, invested in the capability, and discovered — often after considerable expense — that their data infrastructure would not support it.

What Good Infrastructure Actually Looks Like

The organisations that have got this right share recognisable characteristics.

They treat data engineering as a first-class discipline, not a support function. Data engineers sit alongside software engineers, are paid comparably, and work to the same engineering standards: version control, testing, code review, documentation. Their pipelines are observable — every job has monitoring, alerting, and a clear owner. Failures are treated as incidents with post-mortems, not inconveniences to be quietly fixed and forgotten.

They invest heavily in the definitional work before the technical work. Before building a pipeline, they have a documented, agreed definition of every entity it will handle. They know what a "customer" means in their data estate. They have a canonical identifier for it. They have a process for resolving conflicts when source systems disagree.
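A conflict-resolution process of that kind can be as plain as a documented precedence table. The sketch below is one possible shape, assuming a canonical `customer_id` already exists; the source names, field names and `FIELD_PRECEDENCE` ordering are hypothetical and would come out of the definitional work, not out of the code.

```python
# Hypothetical precedence: which system wins for each field when values differ.
FIELD_PRECEDENCE = {
    "legal_name": ["erp", "crm"],
    "email": ["crm", "erp"],
}


def resolve(customer_id, by_source):
    """Merge per-source records for one canonical customer_id.

    Returns the merged record plus a list of the conflicts that were resolved.
    """
    merged = {"customer_id": customer_id}
    conflicts = []
    for field_name, order in FIELD_PRECEDENCE.items():
        # Collect non-null values in precedence order.
        candidates = [
            (source, by_source[source][field_name])
            for source in order
            if by_source.get(source, {}).get(field_name) is not None
        ]
        if not candidates:
            continue
        merged[field_name] = candidates[0][1]
        # Record the disagreement so it can be reviewed rather than forgotten.
        if len({value for _, value in candidates}) > 1:
            conflicts.append((field_name, dict(candidates)))
    return merged, conflicts
```

Returning the conflicts alongside the merged record keeps the disagreement visible, which matters more than which system happens to win.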

They treat data lineage as non-negotiable. Every number in every report is traceable back to its source — not just "it came from the data warehouse" but which source system, which pipeline version, which transformation logic, and when it was last validated. In regulated industries this is a compliance requirement; in every industry it is a precondition for trust.
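One lightweight way to picture this is a value that never travels without its provenance. The following is a sketch under stated assumptions: `Lineage`, `TracedValue` and `total_exposure` are hypothetical names, the pipeline version is assumed to be supplied by the caller (a git tag, say), and real systems usually hold this metadata in a catalogue rather than on each value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Lineage:
    source_system: str      # where the raw input came from
    pipeline_version: str   # e.g. a git tag or commit hash
    transformation: str     # name of the logic applied
    validated_at: str       # ISO 8601 timestamp, UTC


@dataclass(frozen=True)
class TracedValue:
    value: float
    lineage: Lineage


def total_exposure(positions, source_system, pipeline_version):
    """Sum position amounts and attach lineage to the result."""
    lineage = Lineage(
        source_system=source_system,
        pipeline_version=pipeline_version,
        transformation="total_exposure",
        validated_at=datetime.now(timezone.utc).isoformat(),
    )
    return TracedValue(value=sum(p["amount"] for p in positions), lineage=lineage)
```

With something like this in place, the answer to "where did this figure come from?" is a record that can be read back, not a recollection.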

And crucially: they do not skip this work in the rush to deploy AI. If anything, the AI initiative becomes the forcing function that finally gets the data infrastructure investment approved — because suddenly the stakes are high enough that the "we will sort the data out later" argument collapses under its own weight. This is, arguably, the most useful thing AI has done for enterprise data quality: it has made the cost of skipping the plumbing too visible to ignore.

The Unsexy Competitive Advantage

The organisations that will extract the most value from AI over the next five years will not be the ones with the most sophisticated models or the largest compute budgets. They will be the ones with the cleanest pipes.

This is a difficult argument to win in a budget meeting. Nobody celebrates the data engineer who built a pipeline that has run flawlessly for eighteen months. The celebration goes to the AI model that produced an impressive output — even when that output was only possible because of the quiet, careful, unglamorous work that preceded it. The model gets the credit; the infrastructure gets the maintenance schedule.

Every successful AI implementation, without exception, had one thing in common before the AI work began: serious, sustained investment in data infrastructure. Not a data lake. Not a shiny new platform with a compelling product roadmap. Just reliable, well-governed pipelines that moved the right data to the right place in the right format, with enough metadata to understand what was being looked at and enough monitoring to know when something had gone wrong.

The companies that skip this step and go straight to AI will learn the same lesson the data lake generation learned a decade ago: the technology works fine. The plumbing is what mattered.

The irony is that this has always been true — long before AI made it urgent. The organisations that built clean, reliable data infrastructure in the early 2010s did not do so because they were anticipating large language models. They did it because they were tired of analysts working from Excel files that nobody trusted, tired of board reports that required three days of manual reconciliation, tired of making consequential decisions on numbers that might or might not reflect reality. The AI capability they now have is a consequence of that discipline, not the cause of it.

It was always the pipes.
