Spark Monitoring with Graphite & Telegraf

AS
4 min readJun 7, 2020

Monitoring spark application with time-series databases like influx, with help of spark performance metrics we can diagnose various issues in our spark application whether streaming or batch application

Spark exposes metrics to various endpoint, we will be using graphite endpoints for exposing metrics

ConsoleSink: Logs metrics information to the console.
CSVSink: Exports metrics data to CSV files at regular intervals.
JmxSink: Registers metrics for viewing in a JMX console.
MetricsServlet: Adds a servlet within the existing Spark UI to serve metrics data as JSON data.
GraphiteSink: Sends metrics to a Graphite node.
Slf4jSink: Sends metrics to slf4j as log entries.
StatsdSink: Sends metrics to a StatsD node.

Spark by default exposes its metrics to `graphite` endpoint and telegraf has the capability to listen from graphite endpoint and forward these metrics to influx databases.

https://spark.apache.org/docs/latest/monitoring.html#metrics

For configuring metrics in spark edit spark metrics.conf file on the node of the cluster. Properties which need to be added in spark-metrics.conf

#spark.metrics.conf# Enable JvmSource for instance master, worker, driver and executormaster.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host={{ master_ip_address }}
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=my.spark.metrics.computation

If you are using EMR cluster spark-metrics properties can be set from EMR console, Classificatication ‘spark-metrics’ and the property are graphite properties

Configuration which needs to be provided to telegraf in ‘telegraf.conf’

https://docs.influxdata.com/influxdb/v1.8/supported_protocols/graphite/

[[inputs.socket_listener]]
## Address and port to host HTTP listener on
service_address = "tcp://:2003"

data_format = "graphite"

## This string will be used to join the matched values.
separator = "_"

templates = [
"my.*.*.*.* .prefix.prefix.prefix.app_name.node.measurement*",
"my.*.*.*.*.driver.* .prefix.prefix.prefix.app_name.node.measurement*",
"my.*.*.*.*.driver.*.*.* .prefix.prefix.prefix.app_name.node.measurement.measurement.type_instance",
"my.*.*.*.*.driver.*.*.*.* .prefix.prefix.prefix.app_name.node.measurement.measurement.measurement.type_instance",
"my.*.*.*.*.driver.*.StreamingMetrics.streaming.* .prefix.prefix.prefix.app_name.node..measurement.measurement.type_instance",
"my.*.*.*.*.*.*.*.* .prefix.prefix.prefix.app_name.node.measurement.measurement.type_instance",
"my.*.*.*.*.*.*.*.*.* .prefix.prefix.prefix.app_name.node.measurement.measurement.measurement.type_instance"
]

Type of measurement which you will receive in influx time-series databases

----
BlockManager_disk
BlockManager_memory
CodeGenerator_compilationTime
CodeGenerator_compilationTime_count
CodeGenerator_compilationTime_max
CodeGenerator_compilationTime_mean
CodeGenerator_compilationTime_min
CodeGenerator_compilationTime_p50
CodeGenerator_compilationTime_p75
CodeGenerator_compilationTime_p95
CodeGenerator_compilationTime_p98
CodeGenerator_compilationTime_p99
CodeGenerator_compilationTime_p999
CodeGenerator_compilationTime_stddev
CodeGenerator_generatedClassSize
CodeGenerator_generatedClassSize_count
CodeGenerator_generatedClassSize_max
CodeGenerator_generatedClassSize_mean
CodeGenerator_generatedClassSize_min
CodeGenerator_generatedClassSize_p50
CodeGenerator_generatedClassSize_p75
CodeGenerator_generatedClassSize_p95
CodeGenerator_generatedClassSize_p98
CodeGenerator_generatedClassSize_p99
CodeGenerator_generatedClassSize_p999
CodeGenerator_generatedClassSize_stddev
CodeGenerator_generatedMethodSize
CodeGenerator_generatedMethodSize_count
CodeGenerator_generatedMethodSize_max
CodeGenerator_generatedMethodSize_mean
CodeGenerator_generatedMethodSize_min
CodeGenerator_generatedMethodSize_p50
CodeGenerator_generatedMethodSize_p75
CodeGenerator_generatedMethodSize_p95
CodeGenerator_generatedMethodSize_p98
CodeGenerator_generatedMethodSize_p99
CodeGenerator_generatedMethodSize_p999
CodeGenerator_generatedMethodSize_stddev
CodeGenerator_sourceCodeSize
CodeGenerator_sourceCodeSize_count
CodeGenerator_sourceCodeSize_max
CodeGenerator_sourceCodeSize_mean
CodeGenerator_sourceCodeSize_min
CodeGenerator_sourceCodeSize_p50
CodeGenerator_sourceCodeSize_p75
CodeGenerator_sourceCodeSize_p95
CodeGenerator_sourceCodeSize_p98
CodeGenerator_sourceCodeSize_p99
CodeGenerator_sourceCodeSize_p999
CodeGenerator_sourceCodeSize_stddev
DAGScheduler_job
DAGScheduler_messageProcessingTime
DAGScheduler_stage
ExecutorAllocationManager_executors
ExternalShuffle_shuffle-client
ExternalShuffle_shuffle-client_usedDirectMemory
ExternalShuffle_shuffle-client_usedHeapMemory
HiveExternalCatalog_fileCacheHits
HiveExternalCatalog_fileCacheHits_count
HiveExternalCatalog_filesDiscovered
HiveExternalCatalog_filesDiscovered_count
HiveExternalCatalog_hiveClientCalls
HiveExternalCatalog_hiveClientCalls_count
HiveExternalCatalog_parallelListingJobCount
HiveExternalCatalog_parallelListingJobCount_count
HiveExternalCatalog_partitionsFetched
HiveExternalCatalog_partitionsFetched_count
LiveListenerBus_listenerProcessingTime_org
LiveListenerBus_numEventsPosted
LiveListenerBus_queue_appStatus
LiveListenerBus_queue_eventLog
LiveListenerBus_queue_executorManagement
count
cpu
disk
diskio
executor_bytesRead
executor_bytesRead_count
executor_bytesWritten
executor_bytesWritten_count
executor_cpuTime
executor_cpuTime_count
executor_deserializeCpuTime
executor_deserializeCpuTime_count
executor_deserializeTime
executor_deserializeTime_count
executor_diskBytesSpilled
executor_diskBytesSpilled_count
executor_filesystem_file
executor_filesystem_file_largeRead_ops
executor_filesystem_file_read_bytes
executor_filesystem_file_read_ops
executor_filesystem_file_write_bytes
executor_filesystem_file_write_ops
executor_filesystem_hdfs
executor_filesystem_hdfs_largeRead_ops
executor_filesystem_hdfs_read_bytes
executor_filesystem_hdfs_read_ops
executor_filesystem_hdfs_write_bytes
executor_filesystem_hdfs_write_ops
executor_jvmCpuTime
executor_jvmGCTime
executor_jvmGCTime_count
executor_memoryBytesSpilled
executor_memoryBytesSpilled_count
executor_recordsRead
executor_recordsRead_count
executor_recordsWritten
executor_recordsWritten_count
executor_resultSerializationTime
executor_resultSerializationTime_count
executor_resultSize
executor_resultSize_count
executor_runTime
executor_runTime_count
executor_shuffleBytesWritten
executor_shuffleBytesWritten_count
executor_shuffleFetchWaitTime
executor_shuffleFetchWaitTime_count
executor_shuffleLocalBlocksFetched
executor_shuffleLocalBlocksFetched_count
executor_shuffleLocalBytesRead
executor_shuffleLocalBytesRead_count
executor_shuffleRecordsRead
executor_shuffleRecordsRead_count
executor_shuffleRecordsWritten
executor_shuffleRecordsWritten_count
executor_shuffleRemoteBlocksFetched
executor_shuffleRemoteBlocksFetched_count
executor_shuffleRemoteBytesRead
executor_shuffleRemoteBytesReadToDisk
executor_shuffleRemoteBytesReadToDisk_count
executor_shuffleRemoteBytesRead_count
executor_shuffleTotalBytesRead
executor_shuffleTotalBytesRead_count
executor_shuffleWriteTime
executor_shuffleWriteTime_count
executor_threadpool
executor_threadpool_activeTasks
executor_threadpool_completeTasks
executor_threadpool_currentPool_size
executor_threadpool_maxPool_size
jvm_ConcurrentMarkSweep
jvm_ConcurrentMarkSweep_count
jvm_ConcurrentMarkSweep_time
jvm_ParNew
jvm_ParNew_count
jvm_ParNew_time
jvm_direct
jvm_direct_capacity
jvm_direct_count
jvm_direct_used
jvm_heap
jvm_heap_committed
jvm_heap_init
jvm_heap_max
jvm_heap_usage
jvm_heap_used
jvm_mapped
jvm_mapped_capacity
jvm_mapped_count
jvm_mapped_used
jvm_non-heap
jvm_non-heap_committed
jvm_non-heap_init
jvm_non-heap_max
jvm_non-heap_usage
jvm_non-heap_used
jvm_pools_CMS-Old-Gen
jvm_pools_CMS-Old-Gen_committed
jvm_pools_CMS-Old-Gen_init
jvm_pools_CMS-Old-Gen_max
jvm_pools_CMS-Old-Gen_usage
jvm_pools_CMS-Old-Gen_used
jvm_pools_Code-Cache
jvm_pools_Code-Cache_committed
jvm_pools_Code-Cache_init
jvm_pools_Code-Cache_max
jvm_pools_Code-Cache_usage
jvm_pools_Code-Cache_used
jvm_pools_Compressed-Class-Space
jvm_pools_Compressed-Class-Space_committed
jvm_pools_Compressed-Class-Space_init
jvm_pools_Compressed-Class-Space_max
jvm_pools_Compressed-Class-Space_usage
jvm_pools_Compressed-Class-Space_used
jvm_pools_Metaspace
jvm_pools_Metaspace_committed
jvm_pools_Metaspace_init
jvm_pools_Metaspace_max
jvm_pools_Metaspace_usage
jvm_pools_Metaspace_used
jvm_pools_Par-Eden-Space
jvm_pools_Par-Eden-Space_committed
jvm_pools_Par-Eden-Space_init
jvm_pools_Par-Eden-Space_max
jvm_pools_Par-Eden-Space_usage
jvm_pools_Par-Eden-Space_used
jvm_pools_Par-Survivor-Space
jvm_pools_Par-Survivor-Space_committed
jvm_pools_Par-Survivor-Space_init
jvm_pools_Par-Survivor-Space_max
jvm_pools_Par-Survivor-Space_usage
jvm_pools_Par-Survivor-Space_used
jvm_total
jvm_total_committed
jvm_total_init
jvm_total_max
jvm_total_used
kernel
max
mean
mem
min
numContainersPendingAllocate
numExecutorsFailed
numExecutorsRunning
numLocalityAwareTasks
numReleasedContainers
p50
p75
p95
p98
p99
p999
processes
stddev
swap
system
my_spark_metrics_computation_CodeGenerator_compilationTime_count
my_spark_metrics_computation_CodeGenerator_compilationTime_max
my_spark_metrics_computation_CodeGenerator_compilationTime_mean
my_spark_metrics_computation_CodeGenerator_compilationTime_min
my_spark_metrics_computation_CodeGenerator_compilationTime_p50
my_spark_metrics_computation_CodeGenerator_compilationTime_p75
my_spark_metrics_computation_CodeGenerator_compilationTime_p95
my_spark_metrics_computation_CodeGenerator_compilationTime_p98
my_spark_metrics_computation_CodeGenerator_compilationTime_p99
my_spark_metrics_computation_CodeGenerator_compilationTime_p999
my_spark_metrics_computation_CodeGenerator_compilationTime_stddev
my_spark_metrics_computation_CodeGenerator_generatedClassSize_count
my_spark_metrics_computation_CodeGenerator_generatedClassSize_max
my_spark_metrics_computation_CodeGenerator_generatedClassSize_mean
my_spark_metrics_computation_CodeGenerator_generatedClassSize_min
my_spark_metrics_computation_CodeGenerator_generatedClassSize_p50
my_spark_metrics_computation_CodeGenerator_generatedClassSize_p75
my_spark_metrics_computation_CodeGenerator_generatedClassSize_p95
my_spark_metrics_computation_CodeGenerator_generatedClassSize_p98
my_spark_metrics_computation_CodeGenerator_generatedClassSize_p99
my_spark_metrics_computation_CodeGenerator_generatedClassSize_p999
my_spark_metrics_computation_CodeGenerator_generatedClassSize_stddev
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_count
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_max
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_mean
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_min
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_p50
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_p75
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_p95
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_p98
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_p99
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_p999
my_spark_metrics_computation_CodeGenerator_generatedMethodSize_stddev
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_count
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_max
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_mean
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_min
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_p50
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_p75
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_p95
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_p98
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_p99
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_p999
my_spark_metrics_computation_CodeGenerator_sourceCodeSize_stddev
my_spark_metrics_computation_HiveExternalCatalog_fileCacheHits_count
my_spark_metrics_computation_HiveExternalCatalog_filesDiscovered_count
my_spark_metrics_computation_HiveExternalCatalog_hiveClientCalls_count
my_spark_metrics_computation_HiveExternalCatalog_parallelListingJobCount_count
my_spark_metrics_computation_HiveExternalCatalog_partitionsFetched_count
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_1_executor_jvmCpuTime
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_2_executor_jvmCpuTime
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_3_executor_jvmCpuTime
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_4_executor_jvmCpuTime
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_5_executor_jvmCpuTime
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_6_executor_jvmCpuTime
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_applicationMaster_numContainersPendingAllocate
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_applicationMaster_numExecutorsFailed
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_applicationMaster_numExecutorsRunning
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_applicationMaster_numLocalityAwareTasks
my_spark_metrics_computation_application_XXXXXXXXXXXXX_12478_applicationMaster_numReleasedContainers

if you want to remove the metrics like _executor_jvmCpuTime and code generator because these are not so very useful and also it will create so many measurements in your time series database So you can disabled and enabled these per need basis. You can remove these two lines from the template in telegraf.conf file

"my.*.*.*.*.*.*.*.* .prefix.prefix.prefix.app_name.node.measurement.measurement.type_instance",
"my.*.*.*.*.*.*.*.*.* .prefix.prefix.prefix.app_name.node.measurement.measurement.measurement.type_instance"

Grafana template — https://grafana.com/grafana/dashboards/7890

--

--

AS

Software engineer at Expedia Group. Passionate about data-science and Big Data technologies.