Finance · Banking · 100,000+ employees · 150 subsidiaries · 4 continents · 24 months

Building the foundations of an HR Data Lake for a global banking group

An international banking group managed its HR data across fragmented systems spanning 150 subsidiaries on 4 continents. There was no shared reference framework, employee records were duplicated at scale, and 700 users worked without unified BI tooling. The challenge: build the foundations of an HR data lake from scratch.

The Challenge

HR data scattered across 4 continents, with no common reference framework

The group operated with heterogeneous information systems across each of its geographic regions: North and South America, Africa, Asia-Pacific. HR data from its subsidiaries was neither consolidated, nor cleansed, nor governed. Duplicate employee identifiers distorted analyses at every level. No common definition of what constituted a "unique employee" had ever been established.

⚠️

HR data fragmented across incompatible systems spanning 150 entities in 4 geographic regions

⚠️

Massive employee identifier duplication making any group-level analysis impossible

⚠️

No data flow cataloguing or mapping of HR data sources

⚠️

700 users across 150 subsidiaries with no common BI tool or adequate training

⚠️

No data governance and no identified owner for HR reference datasets

The Approach

Laying the foundations before building

Before any migration or analytical tool was deployed, the project began with the steps most data projects skip: defining what needed to be unified, mapping what already existed, and establishing the governance rules that would keep the data reliable over time. This upfront rigour is what made the results possible.

🗺️

Data Flow Mapping & Cataloguing

Comprehensive inventory of HR data sources across 150 entities, mapping of inter-system flows, creation of a group-wide data catalogue enabling origin and quality tracking for every data point.

🔬

Probabilistic Deduplication Algorithm

Development of a probability-based algorithm to identify and unify duplicate employee identifiers at group scale, without relying on exact matches between source systems. 100% workforce coverage achieved.

📊

BI Deployment & User Training

Creation of HR reporting for subsidiaries, data lake architecture implementation, and rollout of a training programme for 700 group BI tool users across 150 subsidiaries.
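
To make the probabilistic deduplication step more concrete, here is a minimal Python sketch of one way such matching can work: each pair of records gets a weighted similarity score, and records scoring above a threshold are merged under a single group-wide identifier. This is an illustration only, not the algorithm used on the project. The field names (`local_id`, `name`, `birth_date`, `email`), the weights, and the threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field weights and decision threshold, chosen for illustration.
WEIGHTS = {"name": 0.5, "birth_date": 0.3, "email": 0.2}
THRESHOLD = 0.8


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(r1: dict, r2: dict) -> float:
    """Weighted score: fuzzy match on name, exact match on the other fields."""
    score = WEIGHTS["name"] * similarity(r1["name"], r2["name"])
    for field in ("birth_date", "email"):
        if r1[field] and r1[field] == r2[field]:
            score += WEIGHTS[field]
    return score


def assign_group_ids(records: list[dict]) -> dict[str, str]:
    """Map each local record id to a group-wide id using union-find clustering."""
    parent = {r["local_id"]: r["local_id"] for r in records}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair and merge the clusters of likely duplicates.
    for r1, r2 in combinations(records, 2):
        if match_score(r1, r2) >= THRESHOLD:
            parent[find(r2["local_id"])] = find(r1["local_id"])

    return {local_id: find(local_id) for local_id in parent}
```

Comparing every pair of records grows quadratically with headcount, so at the scale of 100,000+ employees a real implementation would typically compare only records that share a coarse key (for example, the same birth year) before scoring them.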

Results

Foundations that hold - and numbers that prove it

−50%

Reduction in duplication errors across group HR reference data

−30%

Analysis time reduction through optimised query performance

100%

Workforce coverage with a unique, reliable employee identifier

+40%

Group BI tool adoption rate following training programme (700 users)

Beyond the metrics, this project produced something the previous systems had never delivered: a shared, operational definition of what a group employee actually is. HR data accuracy improved by 20%, reducing cross-system mismatches by 25%. It is these foundations - not the tools - that finally made group-level analysis reliable.

"The real problem wasn't technical. It's that no one had ever defined what a unique identifier was before starting. Twelve per cent duplicates across 100,000 people, and no one saw it - because no one was looking."

Technologies & Methods

What Was Used

Data Lake Architecture · Python · Probabilistic Algorithm · Data Mapping & Cataloguing · Data Cleansing · Business Intelligence · Data Governance · ETL / Data Pipelines · User Training · IT Project Management