Why is my Amazon Aurora DB cluster clone, snapshot restore, or point in time restore taking so long?


I'm performing a cluster clone, snapshot restore, or point-in-time restore on my Amazon Aurora cluster, and the operation is taking a long time.

Short description

Amazon Aurora’s continuous backup and restore techniques are optimized to avoid variation in restore times. They also help the cluster’s storage volume to reach full performance as soon as the cluster becomes available. Long restore times are generally caused by long-running transactions in the source database at the time that the backup is taken.

Resolution

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent AWS CLI version.

Amazon Aurora backs up your cluster volume's changes automatically and continuously. The backups are retained for the length of your backup retention period. With this continuous backup, you can restore your data to a new cluster at any point in time within the retention period. This avoids the need for a lengthy binlog roll-forward process. Because you create a new cluster, there is no performance impact on or interruption to your original database.

When you initiate a clone, snapshot restore, or point-in-time restore, Amazon Relational Database Service (Amazon RDS) calls two APIs on your behalf: a cluster restore API (such as RestoreDBClusterFromSnapshot or RestoreDBClusterToPointInTime), followed by CreateDBInstance.

When the cluster restore step completes, the cluster changes into the Available state. You can check your cluster state by refreshing the console or checking with the AWS CLI.
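As an illustration of checking the cluster state programmatically, the following is a minimal sketch that polls the DescribeDBClusters API until the status is "available". It assumes that `rds` is a boto3 RDS client (for example, `boto3.client("rds")`). The function name, polling interval, and timeout are hypothetical choices for this example.

```python
import time


def wait_for_cluster_available(rds, cluster_id, poll_seconds=30,
                               timeout_seconds=3600, sleep=time.sleep):
    """Poll DescribeDBClusters until the cluster status is 'available'.

    `rds` is assumed to be a boto3 RDS client. Returns the number of
    seconds spent waiting between polls. Raises TimeoutError if the
    cluster is still not available after `timeout_seconds`.
    """
    waited = 0
    while True:
        response = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
        status = response["DBClusters"][0]["Status"]
        if status == "available":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Cluster {cluster_id} is still '{status}' after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists only so that the wait can be replaced in tests. In normal use, you can leave it at its default.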

The instance creation process starts only when the cluster is Available. Instance creation happens in two stages: setting up the instance configuration and database crash recovery.

You can check if the API has finished setting up the instance by looking for the MySQL error log file. You can do this even if the instance is in the Creating status. If the error log file is available to download, then the instance is set up and the engine is now performing crash recovery. The error log file is also the best resource to check on the progress of your database crash recovery, along with Amazon CloudWatch metrics.
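The error log check described above might be scripted as follows. This sketch assumes that `rds` is a boto3 RDS client and uses the DescribeDBLogFiles API. The function name is hypothetical, and the default filter string "error" is an assumption about how the error log file is named on your instance.

```python
def find_error_logs(rds, instance_id, name_filter="error"):
    """Return the names of log files whose name contains `name_filter`.

    `rds` is assumed to be a boto3 RDS client. A non-empty result
    suggests that the instance is set up and that the engine has started
    writing its error log. This sketch reads only the first page of results.
    """
    response = rds.describe_db_log_files(
        DBInstanceIdentifier=instance_id,
        FilenameContains=name_filter,
    )
    return [entry["LogFileName"] for entry in response.get("DescribeDBLogFiles", [])]
```

If the returned list is empty, then the instance configuration is probably still in progress. If the list contains a file, then you can download that file to follow the crash recovery messages.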

Note: If you're using the AWS CLI or API to perform a restore operation, then you must invoke the CreateDBInstance call because it's not automatic.
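To illustrate the note above, the following sketch creates a clone through the API and then explicitly creates a DB instance in the new cluster. It assumes that `rds` is a boto3 RDS client. The function name and all identifiers are hypothetical, and the default engine value "aurora-mysql" is an assumption that must match the engine of your source cluster.

```python
def clone_cluster_with_instance(rds, source_cluster_id, new_cluster_id,
                                new_instance_id, instance_class,
                                engine="aurora-mysql"):
    """Clone an Aurora cluster, then create one DB instance in the clone.

    `rds` is assumed to be a boto3 RDS client. The restore call creates
    only the cluster, so the instance must be requested separately.
    """
    # Step 1: create the clone. RestoreType "copy-on-write" requests a clone
    # rather than a full copy of the source cluster volume.
    rds.restore_db_cluster_to_point_in_time(
        DBClusterIdentifier=new_cluster_id,
        SourceDBClusterIdentifier=source_cluster_id,
        RestoreType="copy-on-write",
        UseLatestRestorableTime=True,
    )
    # Step 2: add an instance to the new cluster. Without this call, the
    # cluster has no instance to connect to.
    rds.create_db_instance(
        DBInstanceIdentifier=new_instance_id,
        DBClusterIdentifier=new_cluster_id,
        DBInstanceClass=instance_class,
        Engine=engine,
    )
```

Depending on your workflow, you might prefer to wait until the cluster is in the Available state before you request the instance.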

Check for long-running write operations on the source database

It's a best practice to confirm that there aren't long-running write operations on the source database at the time of the snapshot, point-in-time restore, or clone. Any long-running DCL, DDL, or DML statement that holds an open write transaction might lengthen the time it takes for the restored database to become available.
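One way to look for long-running transactions on an Aurora MySQL source is to query the information_schema.innodb_trx table. The following is a minimal sketch that assumes `cursor` is a Python DB-API cursor from a MySQL driver that uses the `%s` parameter style (for example, PyMySQL). The function name and the 300-second threshold are hypothetical.

```python
LONG_TRX_SQL = """
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_modified, trx_query
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL %s SECOND
ORDER BY trx_started
"""


def find_long_running_transactions(cursor, min_age_seconds=300):
    """Return InnoDB transactions that have been open longer than the threshold.

    `cursor` is assumed to be a DB-API cursor connected to the source
    database. Rows with a nonzero trx_rows_modified value are the ones
    most likely to add recovery work on the restored cluster.
    """
    cursor.execute(LONG_TRX_SQL, (min_age_seconds,))
    return cursor.fetchall()
```

If the query returns rows, consider waiting for those transactions to commit or roll back before you take the snapshot or start the clone.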

For example, if you activate the binary log for an Aurora cluster, then recovery takes longer. This is because InnoDB automatically checks the logs and performs a roll-forward of the database to the present. It then rolls back any uncommitted transactions that are present at the time of the recovery. For more information on InnoDB crash recovery, see InnoDB recovery.

When the instance finishes the creation and recovery processes, the cluster and the instance are then ready to accept incoming connections.

Note: Aurora doesn't require the binary log. It's a best practice to deactivate the binary log unless you need it. For cross-Region replication, you can evaluate Aurora global databases instead. Aurora global databases also don't require binary logs.
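To check whether the binary log is active, you can inspect the binlog_format parameter in the DB cluster parameter group. In Aurora MySQL, a value of OFF means that binary logging is deactivated. The following sketch assumes that `rds` is a boto3 RDS client. The function name is hypothetical, and the parameter group name is the one attached to your cluster.

```python
def get_binlog_format(rds, parameter_group_name):
    """Return the binlog_format value from a DB cluster parameter group.

    `rds` is assumed to be a boto3 RDS client. Returns None if the
    parameter isn't found. Results are paginated, so the function follows
    the Marker value until the parameter is found or the pages run out.
    """
    marker = None
    while True:
        kwargs = {"DBClusterParameterGroupName": parameter_group_name}
        if marker:
            kwargs["Marker"] = marker
        page = rds.describe_db_cluster_parameters(**kwargs)
        for parameter in page.get("Parameters", []):
            if parameter.get("ParameterName") == "binlog_format":
                return parameter.get("ParameterValue")
        marker = page.get("Marker")
        if not marker:
            return None
```

A returned value other than OFF (for example, ROW or MIXED) indicates that the binary log is active and might add to recovery time.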


Related information

Amazon Aurora storage and reliability

Restoring from a DB cluster snapshot

AWS OFFICIAL, updated a year ago