Building a Data Lake

Why is it necessary to build the data lake the right way?

How to build the data lake the right way?

The outcome of building a data lake the right way

Steps for creating a data lake

  1. Define the catalog
  2. Define a governance policy
  3. Define a retention policy
  4. Choose the format of the data
  5. Preprocess the required data
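The five decisions above can be captured as a small planning structure before any infrastructure is built. A minimal sketch in Python; the field names and example values are illustrative assumptions, not part of any standard API:

```python
from dataclasses import dataclass, field

@dataclass
class DataLakePlan:
    """Illustrative checklist for the five data-lake design decisions."""
    catalog_layout: str            # e.g. "raw/<level-1>/<level-2>/"
    governance_owner: str          # who approves access to each zone
    retention_days: int            # how long raw data is kept
    storage_format: str            # e.g. "parquet"
    preprocessing_steps: list = field(default_factory=list)

# Hypothetical example values for a security data lake
plan = DataLakePlan(
    catalog_layout="raw/security/firewall/",
    governance_owner="data-platform-team",
    retention_days=365,
    storage_format="parquet",
    preprocessing_steps=["deduplicate", "normalize timestamps"],
)
print(plan.storage_format)  # parquet
```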

Data Catalog

raw/<level-1>/<level-2>/
processed/<level-1>/<level-2>/<format>
transformed/<level-1>/<level-2>/<format>

level-1 = {what kind of data} => transactional, security
level-2 = {origin of the data} => order_history, firewall
level-3 = {endpoint or the software generating the data} => malware endpoint
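One way to make this naming convention concrete is a small helper that assembles the prefixes. A sketch in Python; the function name and the example zone/format values are assumptions for illustration, not a prescribed API:

```python
def catalog_path(zone, level1, level2, level3=None, fmt=None):
    """Build an object-store prefix following the raw/processed/transformed layout.

    zone   -- "raw", "processed", or "transformed"
    level1 -- what kind of data (e.g. "transactional", "security")
    level2 -- origin of the data (e.g. "order_history", "firewall")
    level3 -- optional endpoint or software generating the data
    fmt    -- storage format, used in the processed/transformed zones
    """
    parts = [zone, level1, level2]
    if level3:
        parts.append(level3)
    if zone in ("processed", "transformed") and fmt:
        parts.append(fmt)
    return "/".join(parts) + "/"

print(catalog_path("raw", "security", "firewall"))
# raw/security/firewall/
print(catalog_path("processed", "transactional", "order_history", fmt="parquet"))
# processed/transactional/order_history/parquet/
```

Keeping path construction in one place means every producer writes to the same layout, which is what makes the catalog queryable later.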

Governance Policy

Retention
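A retention policy can be as simple as a per-zone age limit that a cleanup job checks before deleting objects. A hedged sketch; the zone limits below are made-up example values, not recommendations:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-zone retention limits (assumed values)
RETENTION = {
    "raw": timedelta(days=90),
    "processed": timedelta(days=365),
    "transformed": timedelta(days=730),
}

def is_expired(zone, created_at, now=None):
    """Return True if an object in `zone` is older than the zone's retention limit."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > RETENTION[zone]

now = datetime(2022, 1, 1, tzinfo=timezone.utc)
old = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(is_expired("raw", old, now))          # True  (365 days old > 90-day limit)
print(is_expired("transformed", old, now))  # False (365 days old < 730-day limit)
```

In practice object stores offer this natively (for example, lifecycle rules on storage prefixes), so the logic above usually lives in configuration rather than code.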

Format of the data
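In practice the processed and transformed zones usually hold a columnar format such as Parquet or ORC (written with libraries like pyarrow or Spark, not shown here). To keep this sketch runnable with only the standard library, it converts raw CSV into newline-delimited JSON, a common intermediate format; the column names are hypothetical:

```python
import csv
import io
import json

def csv_to_jsonl(csv_text):
    """Convert raw CSV text into newline-delimited JSON records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)

raw = "order_id,amount\n1,10.5\n2,7.0"
print(csv_to_jsonl(raw))
```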

Preprocessing of the required data
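Preprocessing typically means deduplicating records and normalizing fields before data lands in the processed zone. A minimal standard-library sketch; the record shape (`id`, epoch-seconds `ts`) is a hypothetical example:

```python
from datetime import datetime, timezone

def preprocess(records):
    """Deduplicate by 'id' and normalize 'ts' (epoch seconds) to ISO-8601 UTC."""
    seen, out = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue  # drop duplicate records
        seen.add(rec["id"])
        ts = datetime.fromtimestamp(rec["ts"], tz=timezone.utc)
        out.append({"id": rec["id"], "ts": ts.isoformat()})
    return out

rows = [{"id": 1, "ts": 0}, {"id": 1, "ts": 0}, {"id": 2, "ts": 3600}]
print(preprocess(rows))
```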

What’s next? Introducing features such as ACID transactions, schema evolution, upserts, time travel, and incremental consumption in the data lake.


Software engineer at Expedia Group. Passionate about data-science and Big Data technologies.
