Article Ready For It - Five recommendations for starting your data lake

Data Lake: Five tips for starting yours

A data lake is often considered a more economical means of storing internal and external, structured and unstructured data. 

In reality, it’s much more than that. To get real benefit from it, your data lake needs both a strategy and hard work.

Here are some tips on how to make your data lake a success.

Companies are facing an explosion in data volumes, formats and locations. Creating a data lake, based on the simple principle of gathering useful data in a single repository, is one answer to this problem. But first you need to ask a few questions: What is its purpose? Does it have to be hosted on a new architecture or in the cloud? What are the regulatory and business requirements? Most importantly, you need to look beyond simply reducing storage costs. Here are five recommended best practices for data lakes.

1 - Know the difference between a data lake and a database

Some people think that a data lake is just a cheaper way of creating a database. However, if you start out with this mindset, you’ll realise after a few months that the result is less than satisfactory. A data lake doesn’t behave like a database, so you shouldn’t think of it as one. A data lake is no magic bullet. It requires resources and skills that the company has to provide, and you have to match the means you deploy to your company’s expectations. So, really think about whether you need a data lake before starting a project.

2 - Ensure you have the right resources

A data lake project usually comes with big ambitions, but you need the resources to deliver it, and everything you build has a cost. Companies don’t always realise how many resources and how much knowledge and experience they will need. Very quickly, this raises the question: can we build our data lake in-house, or should we buy one? It’s an important question, because the team in charge of building the data lake will soon find that data entry and imports (everyday tasks for infrastructure managers) are only the beginning of the work, not its goal.

These resources are not just financial; they are also investments in time and efficiency. And, as mentioned above, the shortfall is often one of skills within the organisation. Whether you train your own staff or recruit people with the skills you need, knowledge is essential because it has a real impact on both the lifetime and the success of your data lake.

3 - Start with an actual business problem

Data lakes are usually the stuff of dreams for a company’s teams. The risk is that those teams drift off track and turn the data lake into a pseudo-scientific project to play and experiment with, creating repositories that end up being of only limited use. To avoid this, and to build momentum that you can then extend, start by deploying your data lake with the goal of solving an actual business problem. This type of project is more likely to deliver positive results quickly and to provide information that satisfies everyone, from the business units to general management.

The psychological effect matters too. Teams engage more quickly and easily, and are more willing to work on data, when the project concerns them directly. Above all, they stay focused, which reduces the risk of drifting off track or of imagining that the data lake will solve every case submitted to it. A first business problem successfully solved then becomes a demonstration that contributes to the data lake’s wider success.

4 - Make security your top priority

There’s more to a data lake than just storage. There’s also data management, and both the organisation deploying the data lake and the one operating it (even if they’re one and the same) have to guarantee the security of the data entrusted to them. A data lake project is also inherently an IT project, and is subject to the same threats as any other: intrusion, theft, data destruction, and the ever-present risk of human error.

More and more technologies are emerging that contribute to data security and governance, and that help the data deliver value to the company. But the threat of cyber-crime is evolving at the same pace, and cyber criminals are even quicker to react than companies and their partners. The danger is real and can strike at the very heart of the company. That’s why you need to focus on securing your data lake, its data and its flows, and on keeping the wrong people out.

At the very least, pay special attention to user authentication and authorisation, and to encrypting data both at rest and in motion.
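To make the authorisation point concrete, here is a minimal sketch of a deny-by-default access check. It is an illustration only: the role names, the zone names ("raw", "curated") and the is_allowed helper are all assumptions made for this example, not part of any particular product. Authentication and encryption are deliberately left out, since in practice they are handled by the platform hosting the data lake.

```python
# Hypothetical roles and data lake zones, invented for illustration.
# A real deployment would rely on the access-control system of its platform.
PERMISSIONS = {
    "data_engineer": {"raw": {"read", "write"}, "curated": {"read", "write"}},
    "analyst": {"curated": {"read"}},
}


def is_allowed(role, zone, action):
    """Deny by default: allow only what has been explicitly granted."""
    return action in PERMISSIONS.get(role, {}).get(zone, set())


if __name__ == "__main__":
    print(is_allowed("analyst", "curated", "read"))  # granted explicitly
    print(is_allowed("analyst", "raw", "read"))      # never granted, so refused
```

The design choice worth noting is the default: an unknown role, zone or action is refused rather than allowed, so a gap in the configuration never silently opens access.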

5 - Consider the data management life cycle

A data lake isn’t some magical place, nor is it a workspace reserved for data scientists. You need to consider the entire data management life cycle, including data collection and storage, loading data into intermediate storage, quality controls, data cleaning and enrichment, management, and report generation. A data lake can be treated as a project in its own right, but that doesn’t exempt it from the data management life cycle.
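The life cycle stages listed above can be sketched as a small chain of functions. This is a simplified illustration under stated assumptions, not a reference implementation: the sample rows, the field names (customer_id, country, amount), the region lookup and every function name are hypothetical, and an in-memory list stands in for real storage.

```python
def collect():
    """Collection: stand-in for reading a known, modest-sized structured source."""
    return [
        {"customer_id": "C1", "country": "fr", "amount": "120.50"},
        {"customer_id": "C2", "country": "DE", "amount": "80"},
        {"customer_id": "", "country": "FR", "amount": "15"},      # missing key
        {"customer_id": "C3", "country": "FR", "amount": "oops"},  # invalid amount
    ]


def quality_check(rows):
    """Quality control: separate usable rows from rejected ones, with a reason."""
    accepted, rejected = [], []
    for row in rows:
        if not row.get("customer_id"):
            rejected.append((row, "missing customer_id"))
            continue
        try:
            float(row["amount"])
        except (KeyError, ValueError):
            rejected.append((row, "invalid amount"))
            continue
        accepted.append(row)
    return accepted, rejected


def clean_and_enrich(rows):
    """Cleaning and enrichment: normalise values, then add a derived field."""
    regions = {"FR": "EU", "DE": "EU"}  # hypothetical lookup table
    curated = []
    for row in rows:
        country = row["country"].strip().upper()
        curated.append({
            "customer_id": row["customer_id"],
            "country": country,
            "region": regions.get(country, "UNKNOWN"),
            "amount": float(row["amount"]),
        })
    return curated


def report(rows):
    """Report generation: total amount per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


def run_pipeline():
    raw = collect()
    staged = list(raw)  # intermediate storage: a copy of the rows as collected
    accepted, rejected = quality_check(staged)
    return report(clean_and_enrich(accepted)), rejected


if __name__ == "__main__":
    totals, rejected = run_pipeline()
    print(totals)
    print([reason for _, reason in rejected])
```

Keeping the rejected rows together with the reason for rejection, rather than dropping them silently, is what makes the quality-control stage auditable later on.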

And again, to create an information management pipeline, start with something known and of a reasonable size before you tackle unstructured sources, sensor data, continuously streaming data, and so on. In this way, you’ll build a solid foundation that you won’t have to rethink in the event of failure.

Here’s probably the most important piece of advice we can give anyone considering a data lake project: concentrate on the quality of decision-making, which is the ultimate goal of any project involving data.