Data lakes and legacy systems: An introduction to the concept

Author: Thierry Julien, CEO, TJC Group

Organisations typically run hundreds of applications that are replaced over time, and the obsolete systems are then decommissioned. But how familiar are you with the concept of data lakes? In this article, we address the basics of data lakes and how they are linked to legacy systems. In part 2, we’ll assess the consequences of legacy applications for your data lake. Read on!

One definition as a starting point
- What do we mean by “data lake”?
A repository from many sources
A component-based architecture
And sometimes, data lakes as legacy architectures
Different zones as a pipeline
Data lake, data mesh or data fabric?
Final word

One definition as a starting point

What do we mean by “data lake”?

Data lakes refer to “a solid architecture, logically centralized, a highly scalable environment filled with different types of analytic data that are sourced from both inside and outside your enterprise with varying latency, and which will be the primary go-to destination for your organization’s data-driven insights”, as defined in the book Data Lakes For Dummies by Alan R. Simon. That said, data in a data lake may be seen as previously owned data that is refurbished and can be used again by a new owner.

Typical data ingestion for a data lake follows the ELT process (Extract, Load, Transform). The trick is that the transformation may happen at a later stage: no upfront data analysis is required (schema on read rather than schema on write). Historically, the typical ETL tools have come from Informatica or IBM (DataStage); newer options, such as AWS Lake Formation, are now available.
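To make the "load first, transform later" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `raw_store` list stands in for lake storage, and the record fields (`customer`, `amount`, `currency`) are invented, not taken from any real system.

```python
import json

# Hypothetical in-memory stand-in for raw lake storage.
raw_store = []

def extract():
    """Return records from two imaginary sources with different shapes."""
    return [
        '{"customer": "ACME", "amount": "120.50", "currency": "EUR"}',
        '{"customer": "Globex", "amount": 75, "note": "manual entry"}',
    ]

def load(lines):
    """Load step: store the raw text untouched, with no validation or reshaping."""
    raw_store.extend(lines)

def transform_on_read(store):
    """Transform step, run later: a schema is applied only when data is read."""
    rows = []
    for line in store:
        record = json.loads(line)
        rows.append({
            "customer": str(record["customer"]),
            "amount": float(record["amount"]),
            "currency": record.get("currency", "UNKNOWN"),
        })
    return rows

load(extract())
rows = transform_on_read(raw_store)
```

The raw records stay exactly as they arrived, so a different schema could be applied to the same data later without re-ingesting it.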

A repository from many sources

Data lakes are constructed from potentially dozens of sources which can be applications, products, services, IoT, or any data source. The flow may come from batch, streaming, or more probably from both source types. It’s a loosely coupled architecture.

A catalog, or directory, keeps track of the data contained in the data lake and of the rules that apply to the different groups of data. This information is called ‘metadata’. To get reporting (OLAP/BI) from your data lake, you’ll need to add a semantic layer. A semantic layer may, for example, let you slice and dice facts (such as revenues) over time and according to dimensions (such as customers).
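As a rough illustration of how a catalog and a semantic layer relate, the sketch below keeps a small metadata entry for one dataset and uses it to check a slice-and-dice request before aggregating. The dataset name `sales_facts`, the metadata fields, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog entry: metadata describing one dataset and its rules.
catalog = {
    "sales_facts": {
        "source": "erp_export",   # assumed source system name
        "owner": "finance",
        "dimensions": ["customer", "month"],
        "measures": ["revenue"],
    }
}

# Invented sample facts.
facts = [
    {"customer": "ACME", "month": "2024-01", "revenue": 100.0},
    {"customer": "ACME", "month": "2024-02", "revenue": 150.0},
    {"customer": "Globex", "month": "2024-01", "revenue": 80.0},
]

def slice_and_dice(rows, dataset, by, measure):
    """Sum a measure along the requested dimensions, after checking the catalog."""
    entry = catalog[dataset]
    unknown = [d for d in by if d not in entry["dimensions"]]
    if unknown or measure not in entry["measures"]:
        raise ValueError(f"not described in the catalog: {unknown or measure}")
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in by)
        totals[key] += row[measure]
    return dict(totals)

revenue_by_customer = slice_and_dice(facts, "sales_facts", ["customer"], "revenue")
revenue_by_month = slice_and_dice(facts, "sales_facts", ["month"], "revenue")
```

Here the catalog only answers "which dimensions and measures exist"; a real catalog would typically also carry ownership, access, and retention rules.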

The problem with data warehouses is that they may become data dumps, while the issue with data lakes is that they may become data swamps.

A component-based architecture

An important data lake feature is the ability to use different storage options (blob storage and SQL database storage, for instance) for different usages. Your data lake does not need to be a monolithic architecture; it can be a component-based one.

Keep in mind that semi-structured data sits in between structured and unstructured data. It includes, but is not limited to, blog posts, social media posts, Teams or Slack messages, text, and email.

And sometimes, data lakes as legacy architectures

Most corporations will have to contend with legacy data and legacy analytical data, most likely from data warehouses and data marts. For example, in SAP environments, SAP BI systems will be phased out of maintenance by 2027/2030 and replaced with SAP Datasphere.

Different zones as a pipeline

Let’s consider three parts of a data lake in legacy systems: the bronze zone, the silver zone, and the gold zone.

A. Bronze zone

The bronze zone (also named the raw zone or landing zone) includes the following items: data ingestion, data storage and management, and data cataloging. Some years ago, we would have considered HDFS (Hadoop Distributed File System) here, but this technology now seems outdated.

The bronze zone may include database storage, so you may ingest a full database table structure, primary and foreign key relationships, and any range-of-value and list-of-value constraints.

Raw data may still be used, so the bronze zone also supports analytics in different forms.

B. Silver zone

The silver zone, also known as the processed zone, allows for data cleansing and transformation, data refinement, and data enrichment.

C. Gold zone

The gold zone (also known as the published zone) is where we find the most relevant data. It is sometimes referred to as “the golden source” or “the source of truth”.
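The three zones can be read as successive steps of a pipeline. The following is a minimal sketch under stated assumptions: the sample rows, the cleansing rules, and the function names `to_silver` and `to_gold` are invented for illustration and do not describe any particular product.

```python
# Bronze: raw rows kept exactly as ingested (invented sample data).
bronze = [
    {"id": "1", "customer": " acme ", "revenue": "100.0"},
    {"id": "2", "customer": "Globex", "revenue": "not available"},
    {"id": "3", "customer": "ACME", "revenue": "50"},
]

def to_silver(raw_rows):
    """Cleanse and standardise; rows that cannot be parsed are set aside."""
    clean, rejected = [], []
    for row in raw_rows:
        try:
            clean.append({
                "id": int(row["id"]),
                "customer": row["customer"].strip().upper(),
                "revenue": float(row["revenue"]),
            })
        except ValueError:
            rejected.append(row)
    return clean, rejected

def to_gold(silver_rows):
    """Publish an aggregated view: total revenue per customer."""
    totals = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["revenue"]
    return totals

silver, rejected = to_silver(bronze)
gold = to_gold(silver)
```

The bronze rows are never modified, so the raw data remains available for other analyses, while the rejected rows can be inspected separately.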

Some other important data lake features

We must not forget data lineage, which refers to the process of tracking the flow of data over time, providing a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline.

And how long should data be kept in your data lake? Well, it may be forever, or just for a few hours (the default retention period of Amazon Kinesis Data Streams, a common large-scale real-time streaming service, is 24 hours, to give one example). You may also classify data as Hot, Cool, and Archive, as Hot, Cold, and Frozen, or as Hot, Warm, and Cold (different terminologies apply).

A data lake has different types of users: passive users will only have access to static PDFs (the stories from SAP Analytics Cloud, for example), whereas a light analytics user can also access the actual data.
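The retention and tiering ideas mentioned above could be sketched as follows. The thresholds (30 days, 365 days) and the Hot/Warm/Cold naming are assumptions chosen for illustration; real services define their own tiers and limits.

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds, for illustration only.
HOT_MAX_AGE = timedelta(days=30)
WARM_MAX_AGE = timedelta(days=365)

def storage_tier(last_accessed, now):
    """Classify data as hot, warm, or cold based on time since last access."""
    age = now - last_accessed
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"

def is_expired(created, retention, now):
    """Return True once data is older than its retention; None means keep forever."""
    return retention is not None and now - created > retention

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
tier_recent = storage_tier(now - timedelta(days=2), now)
tier_old = storage_tier(now - timedelta(days=800), now)
stream_record_expired = is_expired(now - timedelta(hours=25), timedelta(hours=24), now)
archive_expired = is_expired(now - timedelta(days=5000), None, now)
```

Passing `now` explicitly keeps the functions deterministic, which makes retention rules easier to test.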

Data lake, data mesh or data fabric?

Please note that we are discussing data lakes, and not data warehouses, data meshes, or data fabrics. These terms do have a lot in common. For instance, the term data mesh was coined in 2018 (Forrester Research) for an approach to data that decentralizes ownership and democratizes access. Data mesh has been used in recent years, starting with Mark Russinovich and his “data gravity”, where businesses collect larger quantities of information and then struggle to manage them.

Personally, I am more aligned with the data fabric concept and the data pipeline concept. As defined by Padmaraj Nidagundi, an experienced software engineer, “Data fabrics bridge legacy environments with new cloud-native implementations providing target systems with specific data they need while maintaining security concerns. Data fabrics is a framework, not a technology”.

Final word

In conclusion, the concept of data lakes represents a pivotal evolution in managing and harnessing vast amounts of diverse data from both internal and external sources. This article has explored the fundamental definition of data lakes, emphasizing their role as a centralized, scalable repository for a variety of analytic data. As organizations grapple with legacy applications, the link between data lakes and these outdated systems becomes crucial, with the forthcoming Part 2 addressing the consequences on data lakes.
