You Don't Actually Know How Many Customers You Have

14 July 2025

Every organisation believes it knows its data. Then someone tries to build a unified view and discovers the same person exists three times across four systems under different identifiers. This is not a data engineering problem. It is a design decision that was never made - and the cost compounds every year you leave it.

A useful question: how many unique customers does your organisation actually have? Not how many records. Not how many rows in the CRM. How many distinct, real people. Most organisations can produce a number in seconds. Very few can defend it once someone asks them to reconcile it across the email platform, the ERP, the billing system, and the data warehouse. The number shifts. Sometimes by a factor of two.

The question most organisations cannot actually answer

The problem becomes visible the moment a project team tries to build a unified view - a single customer record, a consolidated employee file, a master product catalogue. They pull data from three systems and discover that the same entity lives under different identifiers in each one, with three variations of the name, two addresses, and an active/inactive flag that means something different depending on which system you are reading. The answer to "how many unique X do you have?" becomes: we are not sure. Not precisely. And the imprecision is not small.
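To make the gap between records and real people concrete, here is a minimal sketch in Python. The three extracts, their field names and the choice of a normalised email address as the matching attribute are all assumptions made for illustration. A real organisation would need to decide its own matching attributes first.

```python
# Hypothetical extracts from three systems; all names and values are invented.
crm = [
    {"id": "CRM-001", "name": "Anna Smith", "email": "Anna.Smith@example.com"},
    {"id": "CRM-002", "name": "A. Smith", "email": "anna.smith@example.com "},
    {"id": "CRM-003", "name": "Ben Jones", "email": "ben.jones@example.com"},
]
billing = [
    {"id": "B-9001", "name": "SMITH, ANNA", "email": "anna.smith@example.com"},
    {"id": "B-9002", "name": "Carla Diaz", "email": "carla.diaz@example.com"},
]
email_platform = [
    {"id": "E-77", "name": "Ben Jones", "email": "BEN.JONES@example.com"},
]


def normalise_email(email: str) -> str:
    """Trim whitespace and lower-case so trivial variations compare equal."""
    return email.strip().lower()


def count_records_and_entities(*systems):
    """Return (total record count, distinct entity count) across all systems.

    Assumption: two records are the same person if their normalised
    email addresses are equal. That is a simplification, not a recommendation.
    """
    total_records = sum(len(system) for system in systems)
    distinct = {normalise_email(r["email"]) for system in systems for r in system}
    return total_records, len(distinct)


records, entities = count_records_and_entities(crm, billing, email_platform)
print(records, entities)  # 6 3
```

Six records, three people. The interesting part is not the code but the assumption buried in it: someone had to decide that email address defines uniqueness, and that decision is exactly the one most organisations never formally made.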

Why this happens - and it is never an accident

It happens because systems are built in sequence, not in concert. The CRM came first. Then the ERP, then the data warehouse, each with its own primary key logic and its own tolerance for duplicates. No one was formally accountable for master data across all three. No one ever defined what "unique" meant at the entity level - not because the question was ignored, but because when each system was built, it seemed like someone else's problem. The technical architecture inherited a conceptual gap. And that gap widens every time a new system is added.

The real cost is not in the reports

The visible cost is duplicate marketing emails and slightly wrong dashboards. The real cost sits elsewhere. GDPR erasure requests that require manual triage across five systems instead of a single API call. Compliance audits where answering "who has access to this customer's data?" requires three weeks of investigation. AI and machine learning models trained on a population that is artificially inflated by duplicates and artificially fragmented by inconsistent identifiers. In financial services, HR, or healthcare - where a "unique entity" is a regulatory concept as much as a technical one - the liability is structural, not cosmetic.

What defining a unique identifier actually requires

A unique identifier is not a technical output. It is the result of a governance decision that precedes any engineering. Before a single pipeline is built, someone needs to answer: what attributes define this entity as unique? What are the rules for merging two records that might represent the same person? Who resolves exceptions? What happens when the rule produces an answer that contradicts what a business unit believes to be true? At scale - across a large group with dozens of subsidiaries running different HR systems and different definitions of "employee" - this governance work consistently takes longer than the data engineering that follows. It should. Getting the definition right before the pipeline is built costs a fraction of what it costs to fix it afterwards.
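Those questions can be written down as an explicit, versioned rule. The sketch below assumes a deliberately simple one: records that share a normalised email address are merged when their dates of birth agree, and routed to a human when they do not. The rule itself, the `Record` fields and the `RULE_VERSION` label are hypothetical; the point is only that the merge logic and the exception path are both stated rather than implied.

```python
from collections import defaultdict
from dataclasses import dataclass

RULE_VERSION = "2025-07-v1"  # hypothetical label; the rule set should be versioned


@dataclass(frozen=True)
class Record:
    system: str
    local_id: str
    email: str
    date_of_birth: str  # ISO 8601 date, e.g. "1985-03-02"


def normalise_email(email: str) -> str:
    return email.strip().lower()


def apply_rule(records):
    """Split records into auto-merged groups and groups needing review.

    Assumed rule: same normalised email and same date of birth means the
    same person. Same email with conflicting dates of birth is an exception
    that a named data steward has to resolve.
    """
    by_email = defaultdict(list)
    for record in records:
        by_email[normalise_email(record.email)].append(record)

    merged, exceptions = [], []
    for group in by_email.values():
        dates_of_birth = {r.date_of_birth for r in group}
        if len(dates_of_birth) == 1:
            merged.append(group)
        else:
            exceptions.append(group)
    return merged, exceptions


records = [
    Record("crm", "CRM-001", "anna.smith@example.com", "1985-03-02"),
    Record("billing", "B-9001", "Anna.Smith@example.com", "1985-03-02"),
    Record("crm", "CRM-003", "info@example.com", "1990-11-20"),
    Record("hr", "H-12", "info@example.com", "1979-01-15"),
]
merged, exceptions = apply_rule(records)
print(len(merged), len(exceptions))  # 1 1
```

The shared `info@example.com` mailbox is the kind of case a rule cannot settle on its own. Whether those two records are one person or two is a business answer, which is why the exception queue needs an owner before the pipeline exists.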

The data quality cascade that follows

The downstream benefits of a well-governed unique identifier compound. Analytics become reliable because population counts are accurate. GDPR compliance becomes systematic because data subject identification is unambiguous. Reporting is consistent because different systems reference the same entity with the same key. The quality of AI training data improves because the modelled population reflects reality rather than years of accumulated deduplication debt. None of this requires new infrastructure. It requires a governance decision that was deferred.

What a mature master data model actually looks like

Four components are required: a canonical definition of each entity type, a deduplication rule set that is documented and version-controlled, a governance process for exceptions and edge cases, and a propagation mechanism that distributes the master identifier to dependent systems over time. The rule set does not need to be perfect to deliver value. A 95% accurate unique identifier shared across all systems is far more useful than five systems each maintaining their own version of the truth. The goal is not perfection. The goal is a shared definition of reality that the whole organisation can build on.
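The propagation mechanism is often the least understood of the four, so here is one possible shape for it: a crosswalk that maps each system's local key to a shared master identifier. The `M-000001` format, the system names and the function names are assumptions for this sketch, and a production model would also need to handle merges, splits and history.

```python
from itertools import count


def build_crosswalk(clusters):
    """Assign one master identifier per cluster of matched records.

    `clusters` is a list of lists of (system, local_id) pairs that the
    deduplication rules have already judged to be the same entity.
    Returns a dict mapping (system, local_id) to a master identifier.
    """
    sequence = count(1)
    crosswalk = {}
    for cluster in clusters:
        master_id = f"M-{next(sequence):06d}"  # hypothetical identifier format
        for system, local_id in cluster:
            crosswalk[(system, local_id)] = master_id
    return crosswalk


def records_for(crosswalk, master_id):
    """List every (system, local_id) that belongs to one master identifier."""
    return sorted(key for key, value in crosswalk.items() if value == master_id)


clusters = [
    [("crm", "CRM-001"), ("billing", "B-9001"), ("email", "E-10")],
    [("crm", "CRM-003")],
]
crosswalk = build_crosswalk(clusters)

print(crosswalk[("billing", "B-9001")])  # M-000001
print(records_for(crosswalk, "M-000001"))
# [('billing', 'B-9001'), ('crm', 'CRM-001'), ('email', 'E-10')]
```

The reverse lookup is what turns an erasure request from an investigation into a query: one master identifier, one list of every place that person's data lives.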

Where to start

Pick one entity - the one that causes the most reconciliation pain in practice. Map every system it lives in. Document the primary key logic in each. Count the records that cannot be matched across two systems without manual intervention. That number is your governance debt. And like most debt, it is not going down on its own.
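A first estimate of that number does not need tooling. The sketch below counts, for two hypothetical extracts, the records on each side that find no counterpart on the other. It assumes a normalised email address as the only matching attribute and treats a missing email as unmatchable, both of which are simplifications you would replace with your own rules.

```python
def match_key(record):
    """Assumed matching attribute: the normalised email address."""
    return record.get("email", "").strip().lower()


def unmatched(source, target):
    """Return the records in `source` with no counterpart in `target`.

    Records with an empty key can never be matched automatically,
    so they always count as unmatched.
    """
    target_keys = {match_key(r) for r in target if match_key(r)}
    return [r for r in source if match_key(r) not in target_keys]


# Hypothetical extracts; identifiers and addresses are invented.
crm = [
    {"id": "CRM-001", "email": "anna.smith@example.com"},
    {"id": "CRM-002", "email": "ben.jones@example.com"},
    {"id": "CRM-003", "email": ""},
]
billing = [
    {"id": "B-1", "email": "Anna.Smith@example.com"},
    {"id": "B-2", "email": "carla.diaz@example.com"},
]

governance_debt = len(unmatched(crm, billing)) + len(unmatched(billing, crm))
print(governance_debt)  # 3
```

Run it again in six months with the same rule. If the number has grown, the debt is compounding.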