Spark with Redshift


The integration of Apache Spark with Amazon Redshift brings together the power of two leading data processing technologies, enabling organizations to efficiently process and analyze large volumes of data at scale. Spark, an open-source distributed computing framework, and Redshift, a fully managed data warehousing solution, provide a robust and flexible platform for handling diverse data workloads.

By combining the strengths of Spark’s distributed processing capabilities with Redshift’s columnar storage and query optimization, businesses can unlock the potential of their data assets, enabling faster insights, improved decision-making, and enhanced data-driven workflows.

Installing and configuring Python 3

# Installing jq
sudo yum install jq

# Creating a python symlink so that "python" resolves to python3
sudo ln -s /usr/bin/python3 /usr/bin/python
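
To confirm the symlink resolves to a Python 3 interpreter, a quick hedged check (assumes the steps above succeeded; the file name is hypothetical):

# check_python.py - run as: python check_python.py
import sys
print(sys.executable, sys.version)  # should report a 3.x interpreter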

Installing PySpark and other packages

# Installing OpenJDK 11 (required by Spark)
sudo amazon-linux-extras install java-openjdk11

# Installing the latest pyspark
pip3 install pyspark

# PySpark optional extras (quote the brackets so the shell does not expand them)
pip3 install 'pyspark[sql]'
pip3 install 'pyspark[connect]'

# Configuring the PySpark Python path and launching the shell
PYSPARK_PYTHON=python3 SPARK_HOME=~/.local/lib/python3.7/site-packages/pyspark pyspark
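
Once the shell launches, a quick sanity check that the session works (the pyspark shell pre-creates a SparkSession named spark):

# Inside the pyspark shell
spark.range(5).show()  # prints ids 0 through 4 if Spark is healthy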

Running the spark-submit command

The job depends on one local jar and three Maven packages:

1. redshift-jdbc42-2.1.0.14.jar
2. org.apache.spark:spark-hadoop-cloud_2.12:3.2.1
3. io.github.spark-redshift-community:spark-redshift_2.12:5.1.0
4. org.apache.spark:spark-avro_2.12:3.3.0

With the --packages flag, Spark resolves the Maven dependencies at runtime, so only the Redshift JDBC driver jar needs to be downloaded locally. Check the spark-redshift-community library for the versions supported by your Spark release.

spark-submit --jars redshift-jdbc42-2.1.0.14.jar --packages org.apache.spark:spark-hadoop-cloud_2.12:3.2.1,io.github.spark-redshift-community:spark-redshift_2.12:5.1.0,org.apache.spark:spark-avro_2.12:3.3.0 red.py

Reading and writing to Redshift

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col, lit, lpad, upper
from pyspark.sql import functions as F

# Create the SparkSession used throughout red.py
spark = SparkSession.builder.appName("spark-redshift-demo").getOrCreate()

# Read the source JSON files from S3
df = spark.read.json("s3a://ankur/files")

# Keep only the columns of interest
df2 = df.select(col("created"), col("fixed"))

print(df2.count())
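
The string functions imported above (concat, lpad, upper) are handy for column-level cleanup before the write. A hypothetical example, assuming created and fixed are string columns:

# Hypothetical cleanup: normalise case and derive a zero-padded key column
df3 = df2.withColumn("fixed", upper(col("fixed"))) \
    .withColumn("row_key", concat(lit("ROW-"), lpad(col("created"), 10, "0")))
df3.show(5)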

Storing to Redshift

username = 'username'  # Redshift DB username
password = 'password()$!!'  # Redshift DB password
url = "jdbc:redshift://my-redshift-cluster-us-west-1.redshift.amazonaws.com:5439/prod?user=" + username + "&password=" + password

df2.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "schema.table_name") \
    .option("tempdir", "s3a://koi_bhi_temp_dir/1") \
    .option("aws_iam_role", "arn:aws:iam::1234567890:role/role-have-access-to-s3-and-redshift") \
    .mode("append") \
    .save()
print("processed")
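
The connector also accepts preactions and postactions options, semicolon-separated SQL statements run before and after the load; a sketch, assuming stale rows should be cleared before appending:

# Sketch: clear stale rows before appending (preactions is a connector option)
df2.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "schema.table_name") \
    .option("tempdir", "s3a://koi_bhi_temp_dir/1") \
    .option("aws_iam_role", "arn:aws:iam::1234567890:role/role-have-access-to-s3-and-redshift") \
    .option("preactions", "delete from schema.table_name where fixed = 'false'") \
    .mode("append") \
    .save()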

Reading from Redshift

df = spark.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "schema.table_name") \
    .option("tempdir", "s3a://koi_bhi_temp_dir/1") \
    .option("aws_iam_role", "arn:aws:iam::1234567890:role/role-have-access-to-s3-and-redshift") \
    .load()
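
Instead of loading a whole table, the connector's query option (mutually exclusive with dbtable) pushes a SQL query down to Redshift; the filter below is hypothetical:

# Sketch: read only the rows and columns needed
df_fixed = spark.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("query", "select created, fixed from schema.table_name where fixed = 'true'") \
    .option("tempdir", "s3a://koi_bhi_temp_dir/1") \
    .option("aws_iam_role", "arn:aws:iam::1234567890:role/role-have-access-to-s3-and-redshift") \
    .load()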

Key Benefits of Spark and Redshift Integration:

  1. Enhanced Data Processing Power: Spark’s in-memory processing engine allows for lightning-fast data transformations and analytics on large datasets, while Redshift’s massively parallel processing (MPP) architecture enables high-performance querying of structured and semi-structured data. The integration empowers organizations to handle complex data pipelines and perform real-time analytics efficiently.
  2. Seamless Data Pipeline: Spark and Redshift integration provides a seamless data pipeline from ingestion to analysis. Spark’s connectors and libraries enable easy data extraction, transformation, and loading (ETL) operations, allowing users to prepare and cleanse data before loading it into Redshift. This streamlined workflow ensures data consistency and accuracy for downstream analytics; a minimal sketch of such a cleansing step follows this list.
  3. Scalability and Flexibility: Spark’s distributed computing model enables horizontal scalability, allowing organizations to scale their data processing infrastructure per their needs. Redshift, on the other hand, offers on-demand scalability and elasticity, automatically managing cluster size and optimizing query performance. The integration leverages these capabilities to handle varying data workloads efficiently.
  4. Advanced Analytics Capabilities: Spark’s rich ecosystem of libraries and machine learning algorithms can be seamlessly integrated with Redshift, enabling advanced analytics and predictive modelling on large datasets. Organizations can derive valuable insights from their data using Spark’s machine learning, graph processing, and natural language processing capabilities, complemented by Redshift’s robust SQL-based analytics.
  5. Cost Optimization: Spark and Redshift integration can help optimize data processing and storage costs. Spark’s ability to perform data transformations and aggregations in memory reduces the need for expensive disk I/O operations. Redshift’s columnar storage and compression techniques minimize storage costs, while its pay-as-you-go pricing model ensures cost efficiency for data warehousing.
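
As a minimal illustration of point 2 above, a hedged sketch of a cleanse-then-load step (the duplicate key and null checks are hypothetical):

# Hypothetical ETL: deduplicate and drop incomplete rows before the Redshift load
raw = spark.read.json("s3a://ankur/files")
clean = raw.dropDuplicates(["created"]).na.drop(subset=["created", "fixed"])
clean.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "schema.table_name") \
    .option("tempdir", "s3a://koi_bhi_temp_dir/1") \
    .option("aws_iam_role", "arn:aws:iam::1234567890:role/role-have-access-to-s3-and-redshift") \
    .mode("append") \
    .save()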
