Data lakes and legacy systems: part 1

19-02-2024 | 4 min read | Legacy system decommissioning, IT trends

Organisations typically deal with hundreds of applications that are replaced over time, and these sunset systems are then decommissioned. In this article, we will cover the basic concept of data lakes and how it relates to legacy systems. In part 2, we'll assess the consequences of legacy applications for your data lake.

What do we mean by “data lake”? A data lake is “a solid architecture, logically centralized, a highly scalable environment filled with different types of analytic data that are sourced from both inside and outside your enterprise with varying latency, and which will be the primary go-to destination for your organization’s data-driven insights”, as defined in the book Data Lakes For Dummies by Alan R. Simon.

Data in a data lake may be seen as previously owned data that is refurbished and could be used again by a new owner.

Typical data ingestion for a data lake follows the ELT process (Extract, Load, and Transform): the transformation may happen at a later stage, and no upfront data analysis is required (schema on read rather than schema on write). The historical ETL tools come from vendors such as Informatica or IBM DataStage; newer options are now available, such as AWS Lake Formation.
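To make the schema-on-read idea concrete, here is a minimal ELT sketch using PySpark. The bucket path and the column names (order_ts, amount) are hypothetical, and a real pipeline would add incremental loads and error handling.

```python
# Minimal ELT sketch with PySpark (paths and columns are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-schema-on-read").getOrCreate()

# Extract + Load: raw JSON files land in the lake as-is, with no upfront
# modelling; the schema is only inferred when the data is read.
raw_orders = spark.read.json("s3://corp-data-lake/raw/orders/")

# Transform later, when an analytical question arises
# (schema on read rather than schema on write).
daily_revenue = (
    raw_orders
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("s3://corp-data-lake/silver/daily_revenue/")
```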

Data lakes are constructed from potentially dozens of sources, which can be applications, products, services, IoT devices, or any other data source. The flow may come from batch sources, streaming sources, or, more likely, both. It’s a loosely coupled architecture.

A catalog, or directory, keeps track of the data contained in the data lake and the rules that apply to the different groups of data; this is the metadata. In order to get reporting (OLAP/BI) from your data lake, you’ll need to add a semantic layer. A semantic layer typically lets you slice and dice facts (such as revenue) across dimensions (such as customer or time).
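As an illustration of what a semantic layer exposes, here is a toy slice-and-dice of a revenue fact across customer and month dimensions using pandas; the data and column names are made up for the example.

```python
# Toy fact table: revenue by customer and month (illustrative data).
import pandas as pd

facts = pd.DataFrame({
    "customer": ["Acme", "Acme", "Globex", "Globex"],
    "month":    ["2024-01", "2024-02", "2024-01", "2024-02"],
    "revenue":  [1200.0, 900.0, 450.0, 700.0],
})

# Revenue sliced by customer (rows) and month (columns) -- the kind of
# view a BI/OLAP semantic layer exposes on top of the lake.
report = facts.pivot_table(index="customer", columns="month",
                           values="revenue", aggfunc="sum")
print(report)
```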

The problem with data warehouses is that they may become data dumps, while the issue with data lakes is that they may become data swamps.

An important data lake feature is that different storage options (e.g. blob storage and SQL database storage) can be used for different purposes. Your data lake does not need to be a monolithic architecture; it becomes a component-based architecture.
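A rough sketch of what such component-based routing could look like is below; the dataset names and backends are assumptions for illustration, not a prescribed design.

```python
# Hypothetical mapping of datasets to storage components: cheap object
# storage for raw files, a relational store for curated, queryable data.
STORAGE_BACKENDS = {
    "raw_events":        "blob://corp-lake/raw/",
    "curated_customers": "postgresql://warehouse/dim",
}

def target_for(dataset: str) -> str:
    """Return the storage component a dataset should land in."""
    return STORAGE_BACKENDS.get(dataset, "blob://corp-lake/raw/")

print(target_for("curated_customers"))
```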

Semi-structured data sits in between structured and unstructured data. It includes, but is not limited to, blog posts, social media posts, Teams or Slack messages, text, and email.

Most corporations will have to contend with legacy data and legacy analytical data, most likely from data warehouses and data marts. For example, in SAP environments, SAP BI systems will be phased out of maintenance by 2027/2030 and replaced with SAP Datasphere.

Let’s consider three parts of a data lake: the bronze zone, the silver zone, and the gold zone.

The bronze zone (also named the raw zone or landing zone) covers data ingestion, data storage and management, and data cataloging. Some years ago, we would have considered HDFS (Hadoop Distributed File System) here, but this technology now seems outdated.

The bronze zone may include database storage, so you may ingest a full database table structure, primary and foreign key relationships, and any range-of-value and list-of-value constraints.

Raw data may still be used, so the bronze zone also supports analytics in different forms.

The silver zone, also known as the processed zone, allows for data cleansing and transformation, data refinement, and data enrichment.

The gold zone (also known as the published zone) is where we find the most relevant data. It is sometimes referred to as “the golden source” or “the source of truth”.
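Putting the three zones together, here is a minimal, hypothetical bronze-to-silver-to-gold flow in pandas; the paths, columns, and cleansing steps are purely illustrative, and a production lake would more likely rely on Spark, Delta Lake, or a similar engine.

```python
# Hypothetical bronze -> silver -> gold flow (paths and columns invented).
import pandas as pd

# Bronze: ingest the source extract as-is, with no cleansing.
bronze = pd.read_csv("lake/bronze/customers_raw.csv")

# Silver: cleanse and enrich (deduplicate, normalise values).
silver = (
    bronze.drop_duplicates(subset="customer_id")
          .assign(country=lambda df: df["country"].str.upper())
)

# Gold: publish only the curated, business-ready view ("source of truth").
gold = silver[["customer_id", "name", "country"]]
gold.to_parquet("lake/gold/customers.parquet", index=False)
```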

  • We must not forget data lineage, which refers to tracking the flow of data over time: where the data originated, how it has changed, and its ultimate destination within the data pipeline.
  • How long should data be kept in your data lake? Well, it may be forever, or just for a few hours (Amazon Kinesis Data Streams, to take a common large-scale real-time streaming service as an example, retains records for 24 hours by default). You may also classify data between Hot, Cool, and Archive, between Hot, Cold, and Frozen, or between Hot, Warm, and Cold (different terminologies apply); a small tiering sketch follows this list.
  • A data lake serves different types of users: passive users will only have access to static PDFs (for instance, stories from SAP SAC), while light analytics users can also access the actual data.
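As a sketch of the age-based tiering mentioned in the retention bullet above, the following rule classifies a dataset as hot, cool, or archive. The thresholds are arbitrary examples; real platforms typically implement this through storage lifecycle policies rather than application code.

```python
# Illustrative age-based tiering rule (thresholds are arbitrary examples).
from datetime import datetime, timedelta
from typing import Optional

def storage_tier(last_accessed: datetime, now: Optional[datetime] = None) -> str:
    """Classify a dataset as hot, cool, or archive based on last access."""
    now = now or datetime.utcnow()
    age = now - last_accessed
    if age <= timedelta(days=30):
        return "hot"
    if age <= timedelta(days=180):
        return "cool"
    return "archive"

print(storage_tier(datetime(2024, 1, 1)))
```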

Please note we are discussing data lakes, not data warehouses, data mesh, or data fabric. These terms do have a lot in common. For instance, data mesh was first coined in 2018 (Forrester Research) as an approach to data that decentralizes ownership and democratizes access. The term has gained traction in recent years, building on Mark Russinovich and his “data gravity”: businesses collect ever larger quantities of information and then struggle to manage them.

Personally, I am more aligned with the data fabric concept and the data pipeline concept. As defined by Padmaraj Nidagundi, an experienced software engineer, “Data fabrics bridge legacy environments with new cloud-native implementations providing target systems with specific data they need while maintaining security concerns. Data fabrics is a framework, not a technology”.

In conclusion, data lakes represent a pivotal evolution in managing and harnessing vast amounts of diverse data from both internal and external sources. This article has explored the fundamental definition of data lakes, emphasizing their role as a centralized, scalable repository for a variety of analytic data. As organizations grapple with legacy applications, the link between data lakes and these outdated systems becomes crucial; the forthcoming part 2 will address the consequences for your data lake.