Why and how to put in place a data lake
ReadyForIT is based on three pillars: data, the cloud and cybersecurity. This round table focused on the data pillar and examined the question of data lakes: why and how should a company put one in place? Speakers from INA, Comexposium and Kynapse shared their experiences and opinions on the relevance of data lakes.
Why put in place a data lake?
Companies have not all answered the “why” question in the same way when it comes to data lakes. Most have taken one of three approaches:
- “Techno-centric”: when discussions on data are led by the information systems department (ISD). In this case, the data lake is wanted as a way to store, organise, retrieve and harmonise the company’s data.
- “Usage-centric”: for some advanced uses, data have to be matched, external data need to be accessed, and quality data are needed in modular spaces of varying sizes. In this case, the data lake is built around how the data are used.
- “Partner-centric”: when the company has no clearly defined resources and decides to work with a software publisher as a partner, so that it does not have to deal with a number of problems itself. This third approach is less common, because it is often temporary.
It is not viable for a company to treat a data lake simply as an IT gadget. It is important to first understand the value it can offer the business. The answer depends on the company, and not all of them actually need a data lake.
That said, there are many advantages to adopting a data lake structure, because it offers much greater flexibility than a traditional data warehouse. Still, the prospect of spending tens of millions of euros to switch from a data warehouse infrastructure to a data lake should be weighed against the company’s actual needs.
INA and Comexposium’s experiences
To understand how to put in place a data lake, this round table examined the experiences of INA and the Comexposium group.
The approach undertaken by INA
Data are INA’s core business, which is to maintain and use France’s audiovisual heritage. The Institute has 18 million hours of digitised content in its collections, as well as the documentation for all these files.
Four years ago, INA had to re-examine its information system in order to unify two parallel systems that had coexisted for two decades without ever really communicating. The Institute therefore had to figure out how to merge the two into a single system. At the same time, the organisation decided to deploy Big Data, starting with its information system’s internal data, before adding other data and combining them.
In transforming the information system, INA needed to avoid the usual sequence of describing the processes and features first and only then carrying out complex migrations. Given the large volume of data the organisation already held, that sequence would inevitably have led to failure.
So, INA turned this vision of transformation on its head by structuring the data first and only then addressing the uses (how to interact with and exploit the data). In short, it separated data from uses.
Comexposium’s data lake
A data lake’s relevance depends on the industry in which it is built. This is because some industries are naturally R&D oriented and have “purist” approaches. Others, such as event organisers, are less concerned with R&D. This difference determines whether the need is scientific or business-driven.
The Comexposium group organises over a hundred events every year, ranging from cybersecurity to farming equipment. The common denominator of all these events is not obvious, but every one of them involves two types of customers: visitors and exhibitors. Dividing customers into these two distinct types makes it possible to deal with a certain volume of personal and behavioural data.
Comexposium didn’t undertake the creation of its data lake in the traditional way, because its ultimate goal is to improve the customer experience. So, it focussed the construction of its infrastructure on this aspect. As the group completes the data lake implementation process, it is only now beginning to address governance issues; a more purist approach would have started with that.
Indeed, from a data standpoint, a data division first has to be created, with data analysts put in charge of the data cycle and of identifying and translating the data for each business line. The purpose is to guarantee governance of the data, then to prepare the data and analyse them.
Why is data control so important?
Companies have been taking care of their data for decades, and so far, haven’t necessarily needed a data lake to do so. A data lake has to have value and not simply be a place to dump raw data of different sizes and formats.
A company wanting to implement this solution has to determine what justifies such an expense, when a traditional database might be good enough if the need were simply to improve customer knowledge.
Stéphane Messika explains: “INA is a box of data, unlike Comexposium, so it seems obvious that it has to build as clean an environment as possible to address its main concern. However, all companies are not data boxes and can’t think like that (sic).” In other words, beyond a certain level of interest and relevance, it is important to rein in IT and adopt a design-oriented approach.
Furthermore, power is nothing without control. Gautier Poupeau believes that “whatever happens, the essential question is: how do you control your data?” There is no one-size-fits-all data control solution, because each company has its own history, culture, information system (if it wants to control it), and technical and organisational capacities.
In choosing the data lake, INA considered how it wanted to address the transformation of its information system: from a usage angle or a data angle? INA knew there was a significant risk the project would fail if it didn’t adopt uses and if the ISD alone managed the transformation of the business units (BUs) and the merger of the two systems. This is because the BUs, and not the ISD, need to be responsible for managing their own transformation.
Speakers: Gautier Poupeau, INA; Romain Chassinat, Comexposium; and Stéphane Messika, Kynapse Open