Why is my AWS Glue ETL job running for a long time?

My AWS Glue job runs for a long time. Or, my AWS Glue straggler task takes a long time to complete.

Short description

Your AWS Glue job might take a long time to complete for the following reasons:

  • Large datasets
  • Non-uniform distribution of data in the datasets
  • Uneven distribution of tasks across the executors
  • Under-provisioned resources

Resolution

Turn on metrics

AWS Glue uses Amazon CloudWatch metrics to provide information about the executors, such as the amount of work that each executor performs. When you use AWS Glue version 3.0 or later, you can also use the AWS Glue serverless Spark UI and Observability metrics. For more information, see Turning on the Apache Spark web UI for AWS Glue jobs and Monitoring with AWS Glue Observability metrics.

To turn on CloudWatch metrics for your AWS Glue job, you can use either a special parameter, the AWS Glue console, or an API.

Use a special parameter

Add the following argument to your AWS Glue job:

Key: --enable-metrics

Note: The --enable-metrics parameter allows you to collect job profiling metrics for your job run. The metrics are available on the AWS Glue console and the CloudWatch console.
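
You can also pass the parameter for a single run when you start the job. The following is a minimal boto3 sketch; the job name is a placeholder.

import boto3

glue = boto3.client("glue")

# Pass --enable-metrics for this run only. The presence of the key turns
# metrics collection on.
response = glue.start_job_run(
    JobName="example-job",  # placeholder job name
    Arguments={"--enable-metrics": "true"},
)
print(response["JobRunId"])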

Use the AWS Glue console

Complete the following steps:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job that you want to turn on metrics for.
  4. Choose Action, and then choose Edit job.
  5. Under Monitoring options, choose Job metrics.
  6. Choose Save.

Use the API

Use the AWS Glue UpdateJob API with --enable-metrics in the DefaultArguments parameter.
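
Because UpdateJob replaces the job definition, copy the existing definition first and merge the new argument into it. The following is a minimal boto3 sketch; the job name is a placeholder, and a production version should carry over the job's other fields as well.

import boto3

glue = boto3.client("glue")

job = glue.get_job(JobName="example-job")["Job"]  # placeholder job name
args = dict(job.get("DefaultArguments", {}))
args["--enable-metrics"] = "true"

# Write the merged arguments back. Include any other fields from the existing
# definition that you want to keep, because UpdateJob replaces the definition.
glue.update_job(
    JobName="example-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": args,
    },
)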

Note: AWS Glue 2.0 doesn't use YARN to report metrics, so you can't get some of the executor metrics, such as numberMaxNeededExecutors and numberAllExecutors.

Turn on continuous logging

If you turn on continuous logging for your AWS Glue job, then the real-time driver and executor logs are pushed to CloudWatch every 5 seconds. When you use real-time logging information, you get more details on the job. For more information, see Turning on continuous logging for AWS Glue jobs.
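
The job parameter that controls this behavior is --enable-continuous-cloudwatch-log. The following is a minimal sketch of the entries to merge into the job's DefaultArguments, for example with the update pattern shown earlier; the optional filter parameter suppresses non-useful Apache Spark heartbeat messages.

# Parameters that turn on continuous logging for an AWS Glue job.
continuous_logging_args = {
    "--enable-continuous-cloudwatch-log": "true",
    # Optional: apply the standard filter to drop noisy heartbeat messages.
    "--enable-continuous-log-filter": "true",
}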

Check the driver and executor logs

In the driver logs, check for tasks that run for a long time before they are completed. In the following example, one task took 77 minutes to complete:

2021-04-15 10:53:54,484 ERROR executionlogs:128 - g-7dd5eec38ff57a273fcaa35f289a99ecc1be6901:2021-04-15 10:53:54,484 INFO [task-result-getter-1] scheduler.TaskSetManager (Logging.scala:logInfo(54)): Finished task 0.0 in stage 7.0 (TID 139) in 4538 ms on 10.117.101.76 (executor 10) (13/14)
...
2021-04-15 12:11:30,692 ERROR executionlogs:128 - g-7dd5eec38ff57a273fcaa35f289a99ecc1be6901:2021-04-15 12:11:30,692 INFO [task-result-getter-3] scheduler.TaskSetManager (Logging.scala:logInfo(54)): Finished task 13.0 in stage 7.0 (TID 152) in 4660742 ms on 10.117.97.97 (executor 11) (14/14)
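
To surface stragglers without scrolling through the log, you can scan a downloaded driver log for the task durations that TaskSetManager reports. The following is a minimal Python sketch that matches the log format shown above; the file path is a placeholder.

import re

# Matches lines such as:
#   ... Finished task 13.0 in stage 7.0 (TID 152) in 4660742 ms on ...
PATTERN = re.compile(r"Finished task (\S+) in stage (\S+) \(TID (\d+)\) in (\d+) ms")

durations = []
with open("driver.log") as log:  # placeholder path to a downloaded driver log
    for line in log:
        match = PATTERN.search(line)
        if match:
            task, stage, tid, ms = match.groups()
            durations.append((int(ms), stage, task, tid))

# Print the five slowest tasks.
for ms, stage, task, tid in sorted(durations, reverse=True)[:5]:
    print(f"stage {stage} task {task} (TID {tid}): {ms / 60000:.1f} minutes")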

To review why the task took a long time to complete, use the Apache Spark web UI.

Turn on the Spark UI

When you launch the Spark history server and turn on the Spark UI logs, you can access information about the job's stages and tasks. Use the logs to learn how the workers ran the tasks. For more information, see Turning on the Apache Spark web UI for AWS Glue jobs.

Note: If you use the AWS Command Line Interface (AWS CLI) to turn on the Spark UI, then make sure that you're using the most recent AWS CLI version. If you receive errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors.

After the job is complete, you might see driver logs similar to the following example:

ERROR executionlogs:128 - example-task-id:example-timeframe INFO [pool-2-thread-1] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(414)): close closed:false s3://dox-example-bucket/spark-application-1626828545941.inprogress

Use an Amazon Elastic Compute Cloud (Amazon EC2) instance or Docker to launch the Spark history server. Open the UI, and navigate to the Executors tab to check whether a particular executor runs for a long time. If an executor runs for a long time, then skew in the dataset might cause unevenly distributed work and under-utilized resources. On the Stages tab, review information and statistics about the stages that took a long time.

Capacity planning for DPUs

If all executors contribute equally and the job still takes a long time, then add more workers to your job to improve speed. Data processing unit (DPU) capacity planning can help you avoid the following issues:

  • Under-provisioning that might result in slower execution time
  • Over-provisioning that incurs higher costs, but provides results in the same amount of time

When you use CloudWatch metrics, you can get information on the number of executors that are currently used and the maximum number of executors that are required. The number of DPUs that are required depends on the number of input partitions and the worker type that's requested.
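
For example, the following is a minimal boto3 sketch that reads the maximum number of needed executors for a job over the last day. The job name is a placeholder, and the dimension values ("ALL" for JobRunId and "gauge" for Type) are assumptions based on how AWS Glue publishes its job metrics.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors",
    Dimensions=[
        {"Name": "JobName", "Value": "example-job"},  # placeholder job name
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "gauge"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])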

The type of Amazon Simple Storage Service (Amazon S3) file that you use and the type of data determine the number of partitions that you define (a worked sketch follows this list):

  • For Amazon S3 files that you can't split, the number of partitions equals the number of input files.
  • For Amazon S3 files that you can split and that contain unstructured or semi-structured data, the number of partitions equals the total file size divided by 64 MB. If each file is smaller than 64 MB, then the number of partitions equals the number of files.
  • For Amazon S3 files that you can split and that contain structured data, the number of partitions equals the total file size divided by 128 MB.
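
As a worked sketch of these rules, the following Python function estimates the partition count from a list of file sizes, applying the split size per file. Whether files can be split depends on the format and compression (for example, gzip files can't be split); the example values are illustrative only.

import math

def estimate_partitions(file_sizes_bytes, splittable, structured):
    # Unsplittable files: one partition per input file.
    if not splittable:
        return len(file_sizes_bytes)
    # Splittable files: 64 MB splits for unstructured or semi-structured
    # data, 128 MB splits for structured data; small files still produce
    # at least one partition each.
    split_size = (128 if structured else 64) * 1024 * 1024
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes_bytes)

# Example: 240 files of 32 MB that can't be split produce 240 partitions.
sizes = [32 * 1024 * 1024] * 240
print(estimate_partitions(sizes, splittable=False, structured=False))  # 240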

The following is an example of how to calculate the optimal number of DPUs. In this example, the number of input partitions is 240. Use the following formula to calculate the optimal number of DPUs:

Maximum number of required executors = Number of input partitions / Number of tasks per executor

Note: For AWS Glue 2.0 and earlier with the G.1X worker type, the maximum number of required executors is 240 divided by eight, which is 30. For AWS Glue 3.0 and later with the G.1X worker type, the maximum number of required executors is 240 divided by four, which is 60.

When you plan DPU capacity for AWS Glue version 3.0 or later, use a worker type that supports the required number of tasks for each executor. Each worker type supports a different number of tasks per executor:

  • The Standard worker type supports four tasks per executor.
  • G.1X supports four tasks per executor.
  • G.2X supports eight tasks per executor.
  • G.4X supports 16 tasks per executor.
  • G.8X supports 32 tasks per executor.

Note: For AWS Glue 2.0 and earlier, the G.1X and G.2X defaults are doubled (eight and 16 tasks per executor, respectively) because those versions run two tasks for each vCPU. You can adjust the value with the --executor-cores job parameter, but you typically don't need to change the default.

Glue 2.0 and earlier example

The G.1X worker type has one executor per worker and one driver node for each Glue job. Based on the previous calculation, you need 30 executors:

Number of required DPUs = (Number of executors / Number of executors per node) + 1 DPU for the driver = (30 / 1) + 1 = 31

Glue 3.0 and later example

The G.1X worker type has one executor per worker and one driver node for each Glue job. Based on the previous calculation, you need 60 executors:

Number of required DPUs = (Number of executors / Number of executors per node) + 1 DPU for the driver = (60 / 1) + 1 = 61
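
To make the arithmetic concrete, the following is a minimal Python sketch that walks through the same calculation, from input partitions to executors to DPUs. The tasks-per-executor values come from the list above; the version labels are only illustrative keys.

# Sketch: estimate the DPUs for an AWS Glue job from the number of input partitions.
# Tasks per executor by worker type. Glue 2.0 and earlier doubles the G.1X and
# G.2X values because it runs two tasks for each vCPU.
TASKS_PER_EXECUTOR = {
    ("3.0+", "G.1X"): 4,
    ("3.0+", "G.2X"): 8,
    ("3.0+", "G.4X"): 16,
    ("3.0+", "G.8X"): 32,
    ("2.0", "G.1X"): 8,
    ("2.0", "G.2X"): 16,
}

def required_dpus(input_partitions, glue_version, worker_type, executors_per_node=1):
    # Enough executors so that every partition's task has a slot ...
    tasks = TASKS_PER_EXECUTOR[(glue_version, worker_type)]
    executors = -(-input_partitions // tasks)  # ceiling division
    # ... plus one DPU for the driver.
    return executors // executors_per_node + 1

# 240 input partitions, as in the example above:
print(required_dpus(240, "2.0", "G.1X"))   # 31 (30 executors + 1 driver)
print(required_dpus(240, "3.0+", "G.1X"))  # 61 (60 executors + 1 driver)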

Related information

AWS Glue job parameters

Monitoring jobs using the Apache Spark web UI

Monitoring for DPU capacity planning
