PySpark - Spark-Submit Important Configs
spark-submit is the command used to submit applications to a Spark cluster. It lets you configure memory and CPU allocation, cluster and deploy modes, and application-specific parameters for each job. Properly configuring spark-submit is essential for optimizing Spark jobs for performance and resource usage.
Below are the most important configurations you can use with spark-submit, along with their purposes and examples:
--master: Specifies the cluster manager to connect to. It can be local for local mode, yarn for Hadoop YARN, a mesos:// URL for Apache Mesos, or a k8s:// URL for Kubernetes.
Example: --master yarn
--deploy-mode: Defines whether to launch the driver on the worker nodes (cluster) or locally on the machine submitting the application (client).
Example: --deploy-mode cluster
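Putting these first two flags together, a minimal submission might look like the sketch below; the application file my_app.py and its input path are placeholder names assumed for illustration:
# Minimal sketch: submit a PySpark script to YARN with the driver running inside the cluster.
# my_app.py and the input path are hypothetical, not from this guide.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py --input hdfs:///data/input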
--num-executors: Sets the number of executors to use for the job. This is applicable with cluster managers such as YARN.
Example: --num-executors 5
--executor-cores: Specifies the number of CPU cores per executor. Higher values increase parallelism.
Example: --executor-cores 4
--executor-memory: Allocates memory for each executor process. Proper sizing can prevent out-of-memory errors.
Example: --executor-memory 8G
--driver-memory: Sets the amount of memory allocated for the driver process.
Example: --driver-memory 4G
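Taken together, the resource flags above translate into an invocation along these lines. The sizing numbers are illustrative assumptions, not recommendations for any particular cluster, and my_app.py is again a placeholder:
# Illustrative resource sizing: 5 executors, each with 4 cores and 8 GiB of memory,
# plus 4 GiB for the driver. Adjust to the capacity of your own cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 8G \
  --driver-memory 4G \
  my_app.py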
--conf spark.eventLog.enabled=true: Enables Spark event logging. This helps in monitoring and debugging by storing event information.
Example: --conf spark.eventLog.enabled=true
--conf spark.eventLog.dir: Specifies the directory where the event logs should be stored.
Example: --conf spark.eventLog.dir=hdfs:///logs/
--conf spark.executor.logs.rolling.strategy=time: Sets the rolling strategy for executor logs. Useful for managing log file sizes and retention.
Example: --conf spark.executor.logs.rolling.strategy=time
--conf spark.executor.logs.rolling.time.interval=daily: Defines the interval for rolling executor logs.
Example: --conf spark.executor.logs.rolling.time.interval=daily
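The logging-related settings above can be combined as sketched below, assuming an HDFS directory such as hdfs:///logs/ already exists and is writable by the Spark user:
# Event logging plus time-based rolling of executor logs (daily interval).
# The hdfs:///logs/ path and my_app.py are assumed example names.
spark-submit \
  --master yarn \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///logs/ \
  --conf spark.executor.logs.rolling.strategy=time \
  --conf spark.executor.logs.rolling.time.interval=daily \
  my_app.py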
--conf spark.sql.shuffle.partitions: Configures the number of partitions used when shuffling data during Spark SQL operations such as joins and aggregations. Tuning this value (the default is 200) can reduce shuffle overhead.
Example: --conf spark.sql.shuffle.partitions=200
--conf spark.serializer: Specifies the serializer used when shuffling and caching data. The default is Java serialization, but Kryo serialization is often faster and more compact.
Example: --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.extraJavaOptions: Passes additional JVM options to the executors. Useful for setting system properties or tuning garbage collection.
Example: --conf spark.executor.extraJavaOptions="-XX:+UseG1GC"
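Finally, the performance-oriented settings can be combined into a single submission. One possible tuning baseline is sketched below; the specific values are assumptions to adapt to your workload, and my_app.py remains a placeholder:
# Possible tuning baseline: 200 shuffle partitions, Kryo serialization,
# and the G1 garbage collector on executors.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  my_app.py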
6. Security Configurations