Testing AWS Direct Connect Resiliency with Resiliency Toolkit – Failover Testing

When deploying workloads in AWS, having highly resilient and fault-tolerant hybrid network connectivity is key to a well-architected system. Frequently testing this resiliency with simulated failure scenarios is important to ensure business continuity. The new Resiliency Toolkit – Failover Testing feature enables you to easily test the resiliency of your Direct Connect connections.

In this blog post, we will dive deep into how you can leverage Resiliency Toolkit – Failover Testing with the help of multiple failover testing scenarios. We will also cover key considerations and best practices.

AWS Direct Connect Resiliency

Within an AWS Region, you deploy resources in multiple Availability Zones. These are highly available data centers located miles apart, with redundant power, networking and connectivity. As you set up hybrid connectivity to connect workloads in AWS to resources in your on-premises data centers, the resiliency of this connectivity should match the resiliency you have in the cloud. To achieve this, you can use AWS Direct Connect to set up redundant private connectivity between AWS and your data center. There are two levels of resiliency we recommend for critical workloads:

  1. High resiliency – You can achieve high resiliency for critical workloads by using two single connections to multiple locations. This topology provides resiliency against connectivity failures caused by a fiber cut or a device failure. It also helps protect against a complete location failure.
  2. Maximum resiliency – You can achieve maximum resiliency for critical workloads by using separate connections that terminate on separate devices in more than one location. This topology provides resiliency against device, connectivity, and complete location failures.

These resiliency models are covered under the Direct Connect SLA as described here. You should also consider using redundant hardware and telecommunications providers at your end when setting up Direct Connect connections.

While designing redundant hybrid connectivity, you have to decide on the appropriate number of dedicated connections to provision, verify that you are connecting to different AWS Direct Connect locations and devices, ensure that the redundant connections you provision have the same speed, configure link aggregation groups (LAGs), and so on. To make this process easier, in late 2019 we announced the Direct Connect Resiliency Toolkit, which helps customers order resilient connectivity to AWS. It provides a simplified wizard-based experience that guides you through the connectivity ordering process and offers several built-in resiliency models to choose from.

Once your redundant hybrid connectivity is set up, it’s important to regularly test it for failure scenarios as part of your broader Disaster Recovery (DR) test. This involves bringing individual circuits down and observing whether and how your applications are affected. To make failover testing easier for our customers, on June 3rd, 2020, AWS added support for Failover Testing to the Direct Connect Resiliency Toolkit. You can now use the failover test to bring down the Border Gateway Protocol (BGP) peering session(s) of a virtual interface in order to verify that traffic routes to one of your redundant virtual interfaces and meets your resiliency requirements. This post shows how to get started with failover testing, how to plan test scenarios, and the key considerations you should keep in mind throughout this process.

Overview of solution – Step by step approach

You start by defining a testing plan. We will cover how to approach test scenarios in more detail later in this blog. Once you are ready to begin, you bring down the BGP session on a virtual interface and check whether traffic successfully routes over redundant virtual interfaces.

To begin the test, select a virtual interface and initiate a ‘Bring down BGP’ action.

Next, you specify how long you want the test to run and the BGP peer you wish to bring down (if you have multiple peers). Once you do that, the virtual interface moves from the ‘available’ state to the ‘testing’ state and the BGP peering status changes to ‘down’.

At this stage, you want to check the impact on your applications and make sure traffic is successfully routing over redundant virtual interfaces. Your test will automatically end after the configured ‘Test maximum time’, which is 10 minutes in our case, or you can click the ‘Cancel test’ option to cancel it at any time before that. Once the test is complete or cancelled, the BGP status returns to its initial ‘up’ state.

You can check the status of your on-going test or previous tests in the ‘Test history’ tab.

Here are the same steps using the AWS CLI:

  • Start a test by bringing down BGP

CLI command – ‘aws directconnect start-bgp-failover-test --virtual-interface-id dxvif-fg0hsm3g --test-duration-in-minutes 10’

  • Check status of the VIF

CLI command – ‘aws directconnect describe-virtual-interfaces --virtual-interface-id dxvif-fg0hsm3g’. Notice that the value of “virtualInterfaceState” changes to “testing” in the output.

  • View the test history

CLI command – ‘aws directconnect list-virtual-interface-test-history’
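The same three steps can also be scripted. The following is a minimal sketch, assuming you pass in a boto3 Direct Connect client (for example, boto3.client("directconnect")). The function name run_failover_test and the parameters poll_seconds, max_polls and sleep are illustrative and are not part of any AWS API.

```python
import time


def run_failover_test(dx, vif_id, minutes=10, poll_seconds=15,
                      max_polls=60, sleep=time.sleep):
    """Start a BGP failover test, poll the VIF state, then fetch test history.

    `dx` is assumed to be a boto3 Direct Connect client.
    """
    # Equivalent of: aws directconnect start-bgp-failover-test
    started = dx.start_bgp_failover_test(
        virtualInterfaceId=vif_id,
        testDurationInMinutes=minutes,
    )
    test_id = started["virtualInterfaceTest"]["testId"]

    # Equivalent of: aws directconnect describe-virtual-interfaces
    states = []
    seen_testing = False
    for _ in range(max_polls):
        response = dx.describe_virtual_interfaces(virtualInterfaceId=vif_id)
        state = response["virtualInterfaces"][0]["virtualInterfaceState"]
        states.append(state)
        if state == "testing":
            seen_testing = True
        elif seen_testing:
            # The VIF has left the testing state, so the test is over.
            break
        sleep(poll_seconds)

    # Equivalent of: aws directconnect list-virtual-interface-test-history
    history = dx.list_virtual_interface_test_history(virtualInterfaceId=vif_id)
    return test_id, states, history["virtualInterfaceTestHistory"]
```

The client is injected rather than created inside the function so that the polling logic can be exercised against a stub before it is pointed at a real virtual interface.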

Failover testing scenarios

Active/Standby High Resiliency Direct Connect setup:

In this scenario, hybrid connectivity to AWS is designed with two Direct Connect connections set up in an active/standby configuration for high resiliency. For best resiliency, we recommend that each Direct Connect connection reside in a separate Direct Connect location. One connection is active and the other is passive. If the primary link goes down, the secondary link becomes active.

In this architecture, the BGP local preference community on routes advertised from the on-premises customer gateway to Direct Connect location two is set to a lower preference, so that AWS prefers the path through location one.

Failover Test steps:

Step 1 – Execute the test by calling the ‘StartBgpFailoverTest’ API to shut down BGP on the Primary (Active in picture above) Direct Connect circuit.

Step 2 – Verify with ping, traceroute, show and debug commands that the second link has become active.

Step 3 – Once the BGP test has completed, verify that the primary link returns to active.
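To quantify what you observe in Step 2, you can record the result of each connectivity probe (for example, each reply or timeout of a continuous ping) and compute how long traffic was interrupted. The sketch below assumes the samples are (timestamp in seconds, success flag) pairs in time order. The helper names outage_windows and longest_outage are our own and purely illustrative.

```python
def outage_windows(samples):
    """Return (start, end) pairs for each run of failed probes.

    `samples` is an iterable of (timestamp_seconds, ok) tuples in time order.
    An outage that is still ongoing at the last sample has end=None.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp              # first failed probe of a run
        elif ok and start is not None:
            windows.append((start, timestamp))  # first success after a run
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def longest_outage(samples):
    """Duration in seconds of the longest outage that has already recovered."""
    durations = [end - start
                 for start, end in outage_windows(samples)
                 if end is not None]
    return max(durations, default=0)
```

Comparing the longest outage against your recovery objective gives a simple pass/fail criterion for the failover test.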

Active/Active Maximum Resiliency Direct Connect setup:

In this scenario, we have four Direct Connect connections provisioned, with two located in a Direct Connect location in the west and two in the east, to achieve maximum resiliency. In this case, the Direct Connect connections are configured with active/active equal-cost multi-path (ECMP) routing in each region.

In this architecture BGP attributes are left as default and ECMP is automatically achieved.

Failover Test steps:

Step 1 – Shut down BGP on one Direct Connect connection in the east. All traffic will automatically traverse the second connection in the east.

Step 2 – Verify with ping, traceroute, show and debug commands that traffic flows over the second link in the east.

Step 3 – Shut down BGP on the second link in the east.

Step 4 – Verify with ping, traceroute, show and debug commands that the first and second links are down and traffic is using the west connections.

Step 5 – Test the reverse scenario in the west after the east BGP sessions are brought back up.

Step 6 – After all tests are complete, ensure that the primary regions are the preferred path.
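The sequence above can be written down as an ordered plan before the test window, which makes it easier to review with stakeholders. The following sketch generates such a plan from a mapping of locations to virtual interfaces. The function name, the step labels and the VIF identifiers in the usage note are hypothetical; the sketch only orders the steps and does not call any AWS API.

```python
def build_test_plan(vifs_by_location):
    """Build an ordered list of (action, target) steps for an active/active test.

    `vifs_by_location` maps a location name to the list of VIF IDs there.
    Within a round, the VIFs of one location are brought down one by one while
    the other locations stay up, then the location is restored.
    """
    if len(vifs_by_location) < 2:
        raise ValueError("need at least two locations so one stays up in each round")

    plan = []
    for location, vifs in vifs_by_location.items():
        other_vifs = [vif
                      for other, members in vifs_by_location.items()
                      if other != location
                      for vif in members]
        down = []
        for vif in vifs:
            down.append(vif)
            plan.append(("bring_down_bgp", vif))
            remaining = [v for v in vifs if v not in down]
            # Traffic should stay local while a local link is up,
            # otherwise it should move to the other locations.
            expected = remaining if remaining else other_vifs
            plan.append(("verify_traffic_on", tuple(expected)))
        plan.append(("restore_and_verify", tuple(vifs)))
    return plan
```

For example, build_test_plan({"east": ["dxvif-east1", "dxvif-east2"], "west": ["dxvif-west1", "dxvif-west2"]}) yields the east round first (Steps 1 to 4), followed by the same round for the west (Step 5).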

Key considerations

  • Schedule maintenance windows prior to testing. Communicate with stakeholders that business applications may be impacted as part of your testing.
  • Configure notifications so that operations teams are alerted if Direct Connect encounters issues. Direct Connect notifications can be configured with Amazon CloudWatch alarms at the physical port level and CloudWatch Events at the virtual interface level. For notifications during testing, a CloudWatch Events rule can be provisioned to notify network operations through Amazon Simple Notification Service (SNS). An example event pattern for a CloudWatch Events rule is shown here.
{
  "source": [
    "aws.directconnect"
  ],
  "detail-type": [
    "AWS API Call via CloudTrail"
  ],
  "detail": {
    "eventSource": [
      "directconnect.amazonaws.com"
    ],
    "eventName": [
      "StartBgpFailoverTest"
    ]
  }
}
  • During testing, use debug and show commands on the customer gateway to verify the status of BGP and the route table. Use a continuous ping to monitor Direct Connect connectivity and use traceroute to verify the network paths. Once testing is completed, verify that BGP is up on all links and that the network is in its desired state.
  • Test routinely. Depending on your requirements, schedule testing yearly, quarterly or monthly to ensure your intended network resiliency is functioning properly.
  • Use game days as an approach to injecting failures into your environment. Game days should start in a development environment and, as network operations teams mature, can be run in production.
  • Develop playbooks for use during testing. These can then serve as procedures to follow during a real event in production.
  • Limit access to the ‘StartBgpFailoverTest’ API call with strict IAM policies so it is only accessible during a testing window.
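For the last consideration above, one possible approach is an IAM policy that allows ‘StartBgpFailoverTest’ only between two timestamps, using the aws:CurrentTime global condition key. The sketch below builds such a policy document. The function name, the Sid and the example window are assumptions for illustration, and you would still need to attach the policy to the appropriate role and make sure no broader policy grants the same action.

```python
import json


def failover_test_policy(window_start, window_end):
    """Return an IAM policy document allowing the failover test only in a window.

    `window_start` and `window_end` are ISO 8601 UTC timestamps,
    for example "2020-07-01T02:00:00Z".
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowFailoverTestInWindow",  # illustrative name
                "Effect": "Allow",
                "Action": ["directconnect:StartBgpFailoverTest"],
                "Resource": "*",
                "Condition": {
                    "DateGreaterThan": {"aws:CurrentTime": window_start},
                    "DateLessThan": {"aws:CurrentTime": window_end},
                },
            }
        ],
    }


# Example: a two-hour maintenance window (hypothetical dates).
policy_json = json.dumps(
    failover_test_policy("2020-07-01T02:00:00Z", "2020-07-01T04:00:00Z"),
    indent=2,
)
```

Generating the document per test window keeps the permission time-boxed without anyone having to remember to revoke it afterwards.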

What next?

Once you successfully complete failover testing, it is important to put the right monitoring in place so that you are notified in case of a real DR scenario. For this, we recommend that you use Amazon CloudWatch to monitor and act on the usage, state, and health of your physical AWS Direct Connect connections as well as your AWS Direct Connect virtual interfaces (VIFs). Monitoring can be done through CloudWatch metrics at the port level, to see all traffic that ingresses and egresses, or more granularly at the virtual interface level. See here for a detailed list of the metrics you can track and alert on.

For an AWS Direct Connect physical connection, you can configure alarms on metrics like ‘ConnectionState’, ‘ConnectionLightLevelTx’ or ‘ConnectionErrorCount’ and configure automated notifications using SNS when the alarm is activated. See the figure below for an example alarm. For the conditions, set the threshold type to static and alarm whenever the connection state is lower than 1. For the alarm state, configure a notification to an SNS topic.
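The same alarm can be defined programmatically. The sketch below builds the keyword arguments for the CloudWatch PutMetricAlarm call, assuming a one-minute period and a single evaluation period. The helper name, the alarm name format, and the choice of the Minimum statistic are our own assumptions, and the connection ID and topic ARN are supplied by you.

```python
def connection_state_alarm(connection_id, sns_topic_arn):
    """Build PutMetricAlarm arguments that fire when ConnectionState drops below 1."""
    return {
        "AlarmName": f"dx-{connection_id}-connection-down",  # illustrative naming
        "Namespace": "AWS/DX",
        "MetricName": "ConnectionState",
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": "Minimum",   # any sample at 0 within the period counts as down
        "Period": 60,             # seconds
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With a boto3 CloudWatch client, the result can be passed as cloudwatch.put_metric_alarm(**connection_state_alarm(connection_id, topic_arn)).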

For health tracking and alerting at the virtual interface (VIF) level, you can configure a CloudWatch Events rule which sends notifications to an SNS topic. The following event rule uses the AWS Health Dashboard as the source of information.

{
  "source": [
    "aws.health"
  ],
  "detail-type": [
    "AWS Health Event"
  ],
  "detail": {
    "service": [
      "DIRECTCONNECT"
    ]
  },
  "resources": [
    "dxvif-fgbjrfaa"
  ]
}
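Before deploying a rule, it can help to check locally that the pattern selects the events you expect. The following is a deliberately simplified matcher that supports only the exact-match lists and nested objects used in the pattern above; it does not implement the full event pattern syntax. The sample event contains only the fields needed for the check and is not a complete AWS Health event.

```python
def matches(pattern, event):
    """Return True if `event` satisfies an exact-match event pattern.

    Simplified: each pattern value is either a nested dict or a list of
    allowed values. List-valued event fields match if any element is allowed.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in values):
                return False
    return True


PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"service": ["DIRECTCONNECT"]},
    "resources": ["dxvif-fgbjrfaa"],
}

# Trimmed, hypothetical sample event for the VIF in the pattern.
SAMPLE_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "resources": ["dxvif-fgbjrfaa"],
    "detail": {"service": "DIRECTCONNECT"},
}
```

Here matches(PATTERN, SAMPLE_EVENT) returns True, while an event for a different virtual interface or a different service is rejected.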

We recommend that you automate operations by integrating CloudWatch alarms and events with ticketing systems and conferencing applications, so that tickets and conference bridges are created automatically during an outage event. An AWS Lambda function can be triggered in addition to SNS to make API calls into the operations applications.
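As a rough illustration, the sketch below shows a Lambda handler subscribed to the SNS topic that receives CloudWatch alarm notifications. It parses each notification and prepares a ticket payload. The ticket fields and the build_ticket helper are hypothetical, and the call into your ticketing or conferencing system is intentionally left out because it depends on the tool you use.

```python
import json


def build_ticket(alarm):
    """Map a CloudWatch alarm notification to a hypothetical ticket payload."""
    return {
        "title": f"[{alarm['NewStateValue']}] {alarm['AlarmName']}",
        "description": alarm.get("NewStateReason", ""),
    }


def handler(event, context):
    """Lambda entry point for SNS-delivered CloudWatch alarm notifications."""
    tickets = []
    for record in event.get("Records", []):
        # SNS delivers the alarm as a JSON string in the Message field.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            tickets.append(build_ticket(message))
    # A real function would send each ticket to the operations tool here.
    return tickets
```

Returning the payloads keeps the parsing logic easy to test on its own before you wire in the external API call.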

Conclusion

Failover testing is an essential step in ensuring a fault-tolerant and highly available hybrid IT environment. When using AWS Direct Connect, you can easily test redundancy with the Resiliency Toolkit – Failover Testing feature. We encourage you to test various failure scenarios using Resiliency Toolkit – Failover Testing and to make this part of your regular DR testing process.