Building a Data-Lake

Dec 13, 2020

Data, data, and data everywhere. Everyone has their own data, and every organization has its huge cloud storage known as a data-lake, with every sort of data sitting in it. But now it's time to look at the structure of your data lake: it no longer holds only cold data; more and more real-time and ad-hoc analyses want HOT data from the lake served on their plate.

Why is it necessary to build the data lake the right way?

Building the data-lake the right way is very important because it is the first milestone for all your big data analytics, ML, and AI solutions. Once a data-lake is created, there is a very big cost involved in changing and migrating the data in it, and every new project or job ends up processing the same data again and again. If the governance policy is not applied correctly, things will only get messier with time.

A data-lake is not cold storage meant only for ingesting data and keeping it for future ML jobs and other analytics engines; nowadays it has become the source for many real-time data processing jobs. So, data is best served HOT.

How to build the data-lake the right way?

We generate a tremendous amount of data daily, and if it is not kept in a manageable way we will burn a huge amount of money storing unworthy data or processing non-useful data every time. So I will emphasize only 3 points (Quality, Cost, and Governance) for building a good data-lake.

The output of building a data-lake the right way:

1- Enhanced data quality with near-instant data discovery.
2- Data is processed once, instead of every ML and ETL job doing more or less the same thing in its preprocessing steps.
3- Data is secure and made available in the correct way.

Steps for creating data-lake

I will define the whole process for creating the data-lake in 5 major steps. We have to work on these points one after another to build a data-lake that handles all the modern use cases.

  1. Define Catalog
  2. Governance Policy
  3. Retention Policy
  4. Format of the data
  5. Preprocessing of the required data

Data Catalog

The data catalog plays a crucial role and is the first step for any data lake. If the data is not searchable, and a job has pre-processed some data without other developers knowing about it, they will probably do the pre-processing again, and the cycle will go on. At some later point in time, all jobs are consuming only raw data and duplicating the effort of pre-processing and filtering data again and again. Also, if data is not categorized, the data-lake will only be used for raw data and everyone will be writing their own custom ETLs to process it.

So creating the data catalog makes data easily available and searchable. The catalog can be defined as:

raw/<level-1>/<level-2>/
processed/<level-1>/<level-2>/<format>
transformed/<level-1>/<level-2>/<format>

level-1 = {what kind of data} => transactional, security
level-2 = {origin of the data} => order_history, firewall
level-3 = {endpoint or the software which is generating the data} => malware endpoint

If we consider AWS S3 for our data lake, these can be the buckets and prefixes. You can define as many levels as you need.

The catalog must contain => Location + Format + Accessibility + Retention + Sources of the data + Sinks
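To make that list concrete, here is a minimal sketch of a catalog entry kept in a flat JSON file. CatalogEntry, register_dataset, the bucket name, and the example values are hypothetical illustrations, not a specific AWS Glue or vendor API.

import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class CatalogEntry:
    location: str          # e.g. s3://my-data-lake/processed/transactional/order_history/parquet/
    data_format: str       # parquet / orc / avro / csv / json
    accessibility: str     # role or team allowed to read this prefix
    retention_days: int    # how long this data is kept
    sources: List[str]     # upstream systems that produce the data
    sinks: List[str]       # downstream jobs or tables that consume it

def register_dataset(entry: CatalogEntry, catalog_file: str = "catalog.json") -> None:
    """Append the entry to a flat JSON catalog so every team can search it."""
    try:
        with open(catalog_file) as f:
            catalog = json.load(f)
    except FileNotFoundError:
        catalog = []
    catalog.append(asdict(entry))
    with open(catalog_file, "w") as f:
        json.dump(catalog, f, indent=2)

register_dataset(CatalogEntry(
    location="s3://my-data-lake/processed/transactional/order_history/parquet/",
    data_format="parquet",
    accessibility="role/order-analytics",
    retention_days=365,
    sources=["order-service"],
    sinks=["daily-revenue-report"],
))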

Governance Policy

The catalog should be defined at such a granular level that we can apply restrictions at each level or folder. Data should always be served by the principle of least privilege (a process can access only the folders that are legitimate for it). In terms of S3, bucket policies should be written in a manner that lets them provide folder-level access.
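As an illustration of folder-level least privilege, here is a sketch of a prefix-scoped S3 bucket policy applied with boto3. The bucket name, account id, and role name are placeholders; only object reads and a prefix-restricted listing are granted.

import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # The analytics role may only read objects under processed/transactional/
            "Sid": "ReadProcessedTransactionalOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/order-analytics"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-data-lake/processed/transactional/*",
        },
        {
            # Listing is restricted to the same prefix via a condition
            "Sid": "ListProcessedTransactionalOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/order-analytics"},
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["processed/transactional/*"]}},
        },
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="my-data-lake", Policy=json.dumps(policy))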

Retention

Retention is another important factor for the data lake. Why pay extra for data that is of no use? (From the money saved, please buy_me_a_coffee.) We cannot keep the data forever.

Always keep the most extensive retention for the raw data. All forms can be generated from the raw data. For processed or transformed data the retention should be defined keeping the computation and storage cost in mind.

Drop the processed data when: cost of regenerating the processed data < cost of keeping the processed data
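A retention policy like this can be enforced directly on the bucket. Below is a sketch using S3 lifecycle rules with boto3: raw data gets the longest retention (eventually moved to cheaper storage), while processed data expires sooner because it can be regenerated. The day counts, rule names, and bucket name are only illustrative.

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                # Raw data keeps the longest retention: move it to cheap storage and keep it
                "ID": "raw-long-retention",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            },
            {
                # Processed data can be regenerated from raw, so expire it sooner
                "ID": "processed-short-retention",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Expiration": {"Days": 180},
            },
        ]
    },
)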

Format of the data

Data in the lake should always be compressed. Encryption can be applied based on the sensitivity of the data, and on modern hardware encryption and decryption add negligible performance overhead, so it is a good idea to keep the data encrypted.

Different compressed formats can be explored for the data lake:
ORC/Avro/Parquet — store the schema with the data and can significantly improve query performance in Athena and Spark SQL.
CSV/JSON — supported in all programming languages.
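As an example of moving data into a compressed, schema-carrying format, here is a sketch that converts raw JSON into snappy-compressed, partitioned Parquet with PySpark. The bucket, paths, and the order_date partition column are assumptions that follow the catalog layout above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

# Read raw JSON dropped by the upstream system
raw = spark.read.json("s3://my-data-lake/raw/transactional/order_history/")

# Write compressed, columnar, partitioned output into the processed area
(raw.write
    .mode("overwrite")
    .option("compression", "snappy")   # compressed columnar storage
    .partitionBy("order_date")         # lets Athena/Spark prune partitions at query time
    .parquet("s3://my-data-lake/processed/transactional/order_history/parquet/"))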

Preprocessing of the required data

Data preprocessing should be done to make data readily available for jobs to consume. Developers and analysts should be made aware to check the availability of processed data in a particular format before writing a job to process or transform data.
The catalog must be updated whenever new preprocessed data is made available in the data lake.
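The "check the catalog first" habit can be as simple as a lookup before launching a job. The sketch below reuses the hypothetical catalog.json from the Data Catalog section; find_dataset is an illustrative helper, not a standard API.

import json

def find_dataset(prefix: str, data_format: str, catalog_file: str = "catalog.json"):
    """Return a matching catalog entry if the processed data already exists."""
    try:
        with open(catalog_file) as f:
            catalog = json.load(f)
    except FileNotFoundError:
        return None
    for entry in catalog:
        if entry["location"].startswith(prefix) and entry["data_format"] == data_format:
            return entry
    return None

existing = find_dataset("s3://my-data-lake/processed/transactional/order_history/", "parquet")
if existing:
    print("Reuse:", existing["location"])   # consume the already-processed data
else:
    print("Not in catalog yet: run the preprocessing job, then register_dataset(...)")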

What’s next? Introducing features like ACID, schema evolution, upsert, time travel, incremental consumption, etc. in the data lake.

For these features, we can introduce table-format technologies such as Delta Lake, Apache Hudi, or Apache Iceberg.


AS

Software engineer at Expedia Group. Passionate about data-science and Big Data technologies.