SQL for your Big Data Pipelines

Sep 21, 2020

Code your data pipeline the smart way with the right tools: move away from one big, unmanageable chunk of SQL to config-driven, manageable SQL.

ETL pipelines have been built with SQL for a very long time, and lately I have seen many articles and arguments telling you NOT to use SQL for your big data pipelines. That is not entirely correct. Discouraging the use of SQL is not the right answer; instead, you should migrate to the right way of using SQL for your big data pipelines.

Nowadays most big companies use frameworks like Spark, Beam, Flink, etc. for their big data pipelines. These pipelines are mostly designed around extracting, loading, and transforming data, the so-called ELT or ETL pipelines. We tend to reach for the advanced, API-driven way of solving these problems while totally ignoring the powerful SQL way.

Are you moving away from SQL for your ETL/ELT pipelines?

The main driving factor for migrating away from SQL is to write maintainable, readable, and testable code that can be automatically tested, built, and deployed, instead of one huge chunk of a single SQL file.

If that is your driving factor, please take a step back: you can write more readable, more reusable, more testable code in SQL that can still be easily built and deployed. Here I will take the example of spark-sql.

For simplicity and clarity, I will take a very small piece of a configuration file:

steps = [
  {
    description = "step-1"
    output-view-label = "output_data_frame"
    sql-to-perform = """
      SELECT * FROM input_data_frame
    """
  },
  {
    description = "step-2"
    output-view-label = "final_data_frame"
    sql-to-perform = """
      SELECT * FROM output_data_frame
    """
  }
]

This configuration is very easy to understand at any level, and it also depicts the true picture of an ETL flow.

You then have to write a small amount of Spark code that reads these configs and executes the config-driven SQL; a minimal sketch follows.
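Below is a minimal sketch, in Scala, of what that driver could look like. It assumes the config above is saved as pipeline.conf with the steps under a "steps" key, and that the raw input is a Parquet file registered as input_data_frame; the object name, paths, and source format are illustrative only.

import java.io.File

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConverters._

object ConfigDrivenSqlPipeline {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("config-driven-sql-pipeline")
      .getOrCreate()

    // Parse the HOCON file shown above (path and key name are assumptions)
    val config = ConfigFactory.parseFile(new File("pipeline.conf")).resolve()

    // Register the raw input under the view name the first step expects
    spark.read.parquet("/path/to/input")
      .createOrReplaceTempView("input_data_frame")

    // Run every step in order: execute its SQL and expose the result under
    // its output-view-label so the next step can reference it by name
    config.getConfigList("steps").asScala.foreach { step =>
      val name  = step.getString("description")
      val label = step.getString("output-view-label")
      val sql   = step.getString("sql-to-perform")

      println(s"Running $name -> $label")
      spark.sql(sql).createOrReplaceTempView(label)
    }

    // The label of the last step is the final output of the pipeline
    spark.table("final_data_frame")
      .write.mode("overwrite").parquet("/path/to/output")

    spark.stop()
  }
}

Each step only registers a lazy temporary view, so nothing is computed until the final write and Spark can still optimize across steps. With this in place, a few things fall out naturally: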

  1. These configs are easy to read and self-explanatory.
  2. Big SQL chunks can be broken down into small, testable steps in the form of configuration entries.
  3. Your ETL pipelines can easily follow a TDD approach: you write a small test case for each of your SQL steps (see the test sketch after this list).
  4. You can easily persist and unpersist your DataFrame at any stage in the program.
  5. These configs are easy to maintain and can be understood at any level in the hierarchy.
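As a sketch of point 3, here is roughly what a test for a single step could look like; the ScalaTest style, class name, and sample data are all assumptions, not part of the original pipeline.

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class Step1SqlSpec extends AnyFunSuite {

  test("step-1 keeps every input row") {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("step-1-test")
      .getOrCreate()
    import spark.implicits._

    // Arrange: a tiny in-memory frame under the view name step-1 expects
    Seq(("a", 1), ("b", 2)).toDF("id", "value")
      .createOrReplaceTempView("input_data_frame")

    // Act: run exactly the SQL string that the config carries for step-1
    val result = spark.sql("SELECT * FROM input_data_frame")

    // Assert: the pass-through step must not drop or add rows
    assert(result.count() == 2)

    spark.stop()
  }
}

Because each step is just a SQL string plus an input view, every step can be tested this way in isolation.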

Why do we want to stick to SQL?

Questions come up like: are we failing to keep pace with the changes happening around us by sticking to old-school SQL? Well, that is not the case. SQL is highly optimized, and in newer versions of Spark and Flink a tremendous amount of work has gone into making SQL execution even more optimized.

If you ever get the chance to look at a really big SQL script, you can only conclude that SQL becomes heavy because of joins and multi-stage SELECT statements, and that is exactly what you solve by moving the SQL from a .sql file to a .config file.

Now your SQL is more manageable, more understandable, and in line with the latest industry standards.

This is not the only problem we are trying to solve here: you can also build a more robust streaming solution that is config-driven and totally based on SQL queries.
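Here is a rough sketch of that idea: the same config-driven loop works unchanged on a streaming source, because spark.sql does not care whether the view behind it is batch or streaming. The rate source and console sink below are placeholders for your real source and sink.

import java.io.File

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConverters._

object ConfigDrivenStreamingSql {

  def main(args: Array[String]): Unit = {
    val spark  = SparkSession.builder().appName("config-driven-streaming-sql").getOrCreate()
    val config = ConfigFactory.parseFile(new File("pipeline.conf")).resolve()

    // A built-in rate source stands in for the real stream (Kafka, files, ...)
    spark.readStream.format("rate").load()
      .createOrReplaceTempView("input_data_frame")

    // Exactly the same config-driven loop as the batch driver
    config.getConfigList("steps").asScala.foreach { step =>
      spark.sql(step.getString("sql-to-perform"))
        .createOrReplaceTempView(step.getString("output-view-label"))
    }

    // The console sink is just for demonstration
    spark.table("final_data_frame")
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}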

Can all transformation and join needs be solved by SQL?

Obviously not. I am not stopping or discouraging you from writing UDFs or from using the DataFrame and Dataset APIs that Spark provides directly. For problems where SQL does not help directly, or where you find the methods provided by the spark-core API more suitable, you can simply create your next config step that works directly on your DataFrame.
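For example, a UDF only has to be registered once in the driver, and after that any SQL step in the config can call it by name; normalize_country below is a made-up example, not part of the original configs.

import org.apache.spark.sql.SparkSession

object UdfRegistration {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-with-config-sql").getOrCreate()

    // Register once; every config-driven SQL step can now call normalize_country(...)
    spark.udf.register("normalize_country", (c: String) =>
      Option(c).map(_.trim.toUpperCase).orNull)

    // A step in the config could then read:
    //   sql-to-perform = """ SELECT id, normalize_country(country) AS country
    //                        FROM output_data_frame """
  }
}

So, concluding with: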

Choose the right tool with the right approach; your approach matters, not the tool or language capability.
