Implementing Warm Standby on AWS

A warm standby maintains a minimal deployment that can handle requests, but at reduced capacity; it cannot handle production-level traffic. In this blog post we will use the AWS CLI along with a CloudFormation template to set up the warm standby disaster recovery infrastructure.

I am in no way affiliated with AWS; this has just been documented for my own learning purposes.

Prerequisites

  1. An AWS Account
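  2. Sufficient permissions to provision VPCs, subnets, NAT gateways, and internet gateways, to modify route tables, and to manage EC2 and Route 53 resources.
  3. EC2 key pairs in the us-east-1 and us-west-1 regions (the commands below assume they are named us-east-1-keypair and us-west-1-keypair)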

Steps

The Disaster Recovery Architecture

[Image: disaster recovery architecture diagram]

Creating Resources in Multiple Regions

  1. Create two VPC environments in two distinct regions with non-overlapping CIDR blocks (e.g. 10.0.0.0/16 and 10.1.0.0/16) using the CloudFormation template below. The two VPCs are connected through VPC peering in the next section:
    #Create VPC and Subnets in N. Virginia
    aws cloudformation create-stack \
    --template-url https://drinfra.s3.ap-south-1.amazonaws.com/vpc-cfn-template.yaml \
    --stack-name drinfra --parameters \
    ParameterKey=EnvironmentName,ParameterValue=DR-Test \
    ParameterKey=VpcCIDR,ParameterValue=10.0.0.0/16 \
    ParameterKey=PublicSubnet1CIDR,ParameterValue=10.0.1.0/24 \
    ParameterKey=KeyName,ParameterValue=us-east-1-keypair \
    --region us-east-1

    #Create VPC and Subnets in N. California
    aws cloudformation create-stack \
    --template-url https://drinfra.s3.ap-south-1.amazonaws.com/vpc-cfn-template.yaml \
    --stack-name drinfra --parameters \
    ParameterKey=EnvironmentName,ParameterValue=DR-Test \
    ParameterKey=VpcCIDR,ParameterValue=10.1.0.0/16 \
    ParameterKey=PublicSubnet1CIDR,ParameterValue=10.1.1.0/24 \
    ParameterKey=KeyName,ParameterValue=us-west-1-keypair \
    --region us-west-1

    #Wait for the CloudFormation stacks to complete
    aws cloudformation wait stack-create-complete \
    --stack-name drinfra --region us-east-1

    aws cloudformation describe-stacks --stack-name drinfra \
    --region us-east-1 | jq -r ".Stacks[].StackStatus"

    aws cloudformation wait stack-create-complete --stack-name drinfra \
    --region us-west-1

    aws cloudformation describe-stacks --stack-name drinfra \
    --region us-west-1 | jq -r ".Stacks[].StackStatus"
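VPC peering only works when the two CIDR blocks do not overlap, so it can be worth checking the values before creating the stacks. The following is a minimal sketch in plain bash; the helper names `ip_to_int` and `cidrs_overlap` are hypothetical and not part of the AWS CLI or the template.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the AWS CLI): check two IPv4 CIDR blocks
# for overlap before using them as VpcCIDR parameters.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local -a o
  IFS=. read -r -a o <<< "$1"
  echo $(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))
}

# Return 0 if the two CIDR blocks overlap, 1 otherwise.
cidrs_overlap() {
  local a_ip=${1%/*} a_len=${1#*/}
  local b_ip=${2%/*} b_len=${2#*/}
  # Two blocks overlap when they share the same network under the
  # shorter (less specific) of the two prefixes.
  local len=$(( a_len < b_len ? a_len : b_len ))
  local mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$a_ip") & mask) == ($(ip_to_int "$b_ip") & mask) ))
}

# The two VPC ranges used in this post.
if cidrs_overlap 10.0.0.0/16 10.1.0.0/16; then
  echo "CIDR blocks overlap; pick different ranges"
else
  echo "CIDR blocks do not overlap"
fi
```

The check only handles well-formed IPv4 CIDR notation; it does no input validation of its own.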

Create a VPC Peering Connection Between Both Regions

  1. Get the IDs of the VPCs created in the previous step, along with their custom route table IDs.
        #Get VPC IDs (assumes the VPC ID is the first output of each stack)
        export VPC_ID_PRIMARY=$(aws cloudformation describe-stacks \
        --stack-name drinfra \
        --region us-east-1 | jq -r '.Stacks[0].Outputs[0].OutputValue')

        export VPC_ID_SECONDARY=$(aws cloudformation describe-stacks \
        --stack-name drinfra \
        --region us-west-1 | jq -r '.Stacks[0].Outputs[0].OutputValue')

        echo PRIMARY VPC = $VPC_ID_PRIMARY
        echo SECONDARY VPC = $VPC_ID_SECONDARY

        #Get Custom Route Table IDs
        export ROUTE_TABLE_PRIMARY=$(aws ec2 describe-route-tables \
        --filters \
        "Name=vpc-id,Values=$VPC_ID_PRIMARY" \
        "Name=association.main,Values=false" \
        --query 'RouteTables[*].RouteTableId' \
        --output text \
        --region us-east-1)

        export ROUTE_TABLE_SECONDARY=$(aws ec2 describe-route-tables \
        --filters \
        "Name=vpc-id,Values=$VPC_ID_SECONDARY" \
        "Name=association.main,Values=false" \
        --query 'RouteTables[*].RouteTableId' \
        --output text \
        --region us-west-1)  
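Every later command depends on these variables, and a failed lookup does not stop the shell: `jq -r` prints `null` for a missing field and the AWS CLI prints `None` in text output. A small guard like the one below can catch that early. This is only a sketch; `require_id` is a hypothetical helper name, and the `vpc-` and `rtb-` prefixes are the usual ID prefixes for VPCs and route tables.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the variable named by $1 holds exactly one
# ID starting with the prefix in $2.
require_id() {
  local name=$1 prefix=$2
  local value=${!name:-}   # indirect lookup; empty if the variable is unset
  case "$value" in
    ""|null|None)
      echo "ERROR: $name is empty or null" >&2
      return 1 ;;
    *[[:space:]]*)
      echo "ERROR: $name holds more than one value: '$value'" >&2
      return 1 ;;
    "$prefix"*)
      return 0 ;;
    *)
      echo "ERROR: $name='$value' does not start with '$prefix'" >&2
      return 1 ;;
  esac
}

# Example usage after the lookups above:
#   require_id VPC_ID_PRIMARY vpc-        || exit 1
#   require_id VPC_ID_SECONDARY vpc-      || exit 1
#   require_id ROUTE_TABLE_PRIMARY rtb-   || exit 1
#   require_id ROUTE_TABLE_SECONDARY rtb- || exit 1
```

The whitespace check matters for the route table lookups in particular, because the query returns every custom route table in the VPC and the create-route commands expect a single ID.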
   

  2. Create a VPC peering connection between the two VPCs in the two AWS Regions.
        #Request VPC Peering
        export PEERING_ID=$(aws ec2 create-vpc-peering-connection \
        --vpc-id $VPC_ID_PRIMARY --region us-east-1 \
        --peer-vpc-id $VPC_ID_SECONDARY  --peer-region us-west-1 | \
        jq -r ".VpcPeeringConnection.VpcPeeringConnectionId")

        #Wait a few seconds
        sleep 10

        #Accept VPC Peering
        aws ec2 accept-vpc-peering-connection \
        --vpc-peering-connection-id $PEERING_ID \
        --region us-west-1

        #Add routes to VPC Peering
        aws ec2 create-route \
        --route-table-id $ROUTE_TABLE_PRIMARY \
        --destination-cidr-block 10.1.0.0/16 \
        --vpc-peering-connection-id $PEERING_ID \
        --region us-east-1

        aws ec2 create-route \
        --route-table-id $ROUTE_TABLE_SECONDARY \
        --destination-cidr-block 10.0.0.0/16 \
        --vpc-peering-connection-id $PEERING_ID \
        --region us-west-1    
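The fixed `sleep 10` before accepting the peering request is a guess at how long the request takes to become visible in the peer region. One alternative is to retry the accept call a few times. The sketch below uses a hypothetical `retry` helper written in plain bash; the attempt count and delay in the usage comment are arbitrary values, not recommendations from AWS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  echo "ERROR: command failed after $attempts attempts: $*" >&2
  return 1
}

# Example usage in place of "sleep 10" followed by a single accept call:
#   retry 6 5 aws ec2 accept-vpc-peering-connection \
#     --vpc-peering-connection-id "$PEERING_ID" \
#     --region us-west-1
```

Because the helper runs whatever command it is given, the same function can wrap the create-route calls if they fail while the peering connection is still becoming active.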

Create Private DNS Entries

  1. Create a Route 53 private hosted zone for the anycompany.internal DNS entries and associate it with the PRIMARY VPC in us-east-1.
        export HOSTED_ZONE_ID=$(aws route53 create-hosted-zone --name anycompany.internal \
        --caller-reference $(date "+%Y%m%d%H%M%S") \
        --hosted-zone-config PrivateZone=true \
        --vpc VPCRegion=us-east-1,VPCId=$VPC_ID_PRIMARY |\
        jq -r ".HostedZone.Id")

  2. Associate the N. California VPC with the private hosted zone to share DNS records
        aws route53 associate-vpc-with-hosted-zone \
        --hosted-zone-id $HOSTED_ZONE_ID \
        --vpc VPCRegion=us-west-1,VPCId=$VPC_ID_SECONDARY

Create Route53 Health Check

  1. Get the public IPs of both web servers. Route 53 health checkers run on the public internet, so they can only monitor hosts with publicly routable IP addresses.
        export WEBSERVER_PRIMARY_PUBLIC_IP=$(aws ec2 describe-instances \
        --region us-east-1 \
        --filters \
        "Name=instance-state-name,Values=running" \
        "Name=tag-value,Values=WebServerInstance" \
        --query 'Reservations[*].Instances[*].[PublicIpAddress]' \
        --output text)

        export WEBSERVER_SECONDARY_PUBLIC_IP=$(aws ec2 describe-instances \
        --region us-west-1 \
        --filters \
        "Name=instance-state-name,Values=running" \
        "Name=tag-value,Values=WebServerInstance" \
        --query 'Reservations[*].Instances[*].[PublicIpAddress]' \
        --output text)      

        echo PRIMARY WEB SERVER = $WEBSERVER_PRIMARY_PUBLIC_IP
        echo SECONDARY WEB SERVER = $WEBSERVER_SECONDARY_PUBLIC_IP  
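The health check configuration in the next step embeds `$WEBSERVER_PRIMARY_PUBLIC_IP` directly, so it helps to confirm that each variable holds exactly one IPv4 address. The lookup can return `None` when the instance has no public IP, or several lines when more than one instance matches the tag. A minimal sketch, using a hypothetical `is_ipv4` helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 only if $1 is a single dotted-quad IPv4 address.
is_ipv4() {
  local ip=$1 octet
  local -a octets
  # Four groups of 1-3 digits separated by dots, and nothing else.
  [[ $ip =~ ^[0-9]{1,3}(\.[0-9]{1,3}){3}$ ]] || return 1
  IFS=. read -r -a octets <<< "$ip"
  for octet in "${octets[@]}"; do
    # 10# forces base 10 so values like "08" are not parsed as octal.
    (( 10#$octet <= 255 )) || return 1
  done
  return 0
}

# Example usage after the lookups above:
#   is_ipv4 "$WEBSERVER_PRIMARY_PUBLIC_IP"   || echo "unexpected primary IP" >&2
#   is_ipv4 "$WEBSERVER_SECONDARY_PUBLIC_IP" || echo "unexpected secondary IP" >&2
```

The same check applies to the private IP variables used later in the failover policy.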

  2. Create the health check configuration and save it to a file
        #Health check policy
        cat > health-check-config.json << EOF
        {
        "Type": "HTTP",
        "Port": 80,
        "ResourcePath": "/index.html",
        "IPAddress": "$WEBSERVER_PRIMARY_PUBLIC_IP",
        "RequestInterval": 30,
        "FailureThreshold": 3
        } 
        EOF

  3. Create the health check for our primary endpoint in us-east-1
    #Create Health check for Route 53
    export HEALTH_ID=$(aws route53 create-health-check \
    --caller-reference $(date "+%Y%m%d%H%M%S") \
    --health-check-config file://health-check-config.json |\
    jq -r ".HealthCheck.Id")

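The health check probes the endpoint every 30 seconds (the RequestInterval set above), so allow a short while after creation for it to report a status.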

Create a Route53 Failover Policy

  1. Get the private IP of both Web Servers.
    export WEBSERVER_PRIMARY_PRIVATE_IP=$(aws ec2 describe-instances \
    --region us-east-1 \
    --filters \
    "Name=instance-state-name,Values=running" \
    "Name=tag-value,Values=WebServerInstance" \
    --query 'Reservations[*].Instances[*].[PrivateIpAddress]' \
    --output text)

    export WEBSERVER_SECONDARY_PRIVATE_IP=$(aws ec2 describe-instances \
    --region us-west-1 \
    --filters \
    "Name=instance-state-name,Values=running" \
    "Name=tag-value,Values=WebServerInstance" \
    --query 'Reservations[*].Instances[*].[PrivateIpAddress]' \
    --output text)      

    echo PRIMARY WEB SERVER = $WEBSERVER_PRIMARY_PRIVATE_IP
    echo SECONDARY WEB SERVER = $WEBSERVER_SECONDARY_PRIVATE_IP  

  2. Create the failover routing policy and save it to a file
        # Failover policy
        cat > failover-policy.json << EOF
        {
            "AWSPolicyFormatVersion":"2015-10-01",
            "RecordType":"A",
            "StartRule":"site_switch",
            "Endpoints":{
                "WEBSERVER_PRIMARY":{
                    "Type":"value",
                    "Value":"$WEBSERVER_PRIMARY_PRIVATE_IP"
                },
                "WEBSERVER_SECONDARY":{
                    "Type":"value",
                    "Value":"$WEBSERVER_SECONDARY_PRIVATE_IP"
                }
            },
            "Rules":{
                "site_switch":{
                    "RuleType":"failover",
                    "Primary":{
                        "EndpointReference":"WEBSERVER_PRIMARY",
                        "HealthCheck": "$HEALTH_ID"
                    },
                    "Secondary":{
                        "EndpointReference":"WEBSERVER_SECONDARY"
                    }
                }
            }
        }
        EOF

  3. Create the traffic policy and associate it with the Route 53 private hosted zone
        #Create traffic policy
        export TRAFFIC_ID=$(aws route53 create-traffic-policy --name failover-policy \
        --document file://failover-policy.json | jq -r ".TrafficPolicy.Id")
        
        #Associate traffic policy to Private Hosted Zone
        aws route53 create-traffic-policy-instance \
        --hosted-zone-id $HOSTED_ZONE_ID --name service.anycompany.internal \
        --ttl 60 --traffic-policy-id $TRAFFIC_ID \
        --traffic-policy-version 1

Test the Failover Policy

  1. In a second terminal, connect to the EC2 client instance over SSH
  2. Execute the following commands:
        #GET EC2 IP in N. Virginia
        export EC2_CLIENT_IP=$(aws ec2 describe-instances \
        --region us-east-1 \
        --filters \
        "Name=instance-state-name,Values=running" \
        "Name=tag-value,Values=EC2Client" \
        --query 'Reservations[*].Instances[*].[PublicIpAddress]' \
        --output text)
        
        #Access EC2 instance by SSH
        chmod 400 us-east-1-keypair.pem
        ssh -i us-east-1-keypair.pem ec2-user@$EC2_CLIENT_IP   

  3. Try to access the website using “service.anycompany.internal”
        #Check the DNS answer
        dig +short service.anycompany.internal

        #Access the Web Server
        curl service.anycompany.internal

The expected response is HTML from the PRIMARY WEB SERVER, like below:

    <html><h1>Welcome to Example Portal !! </h1><h2>Hosted in us-east-1</h2</html>

  4. Now let’s provoke an error! Using the first terminal, remove the inbound rule from the security group of the primary web server
        #Get security group id of Web Server Primary
        export SG_ID_PRIMARY=$(aws ec2 describe-instances \
        --region us-east-1 \
        --filters \
        "Name=instance-state-name,Values=running" \
        "Name=tag-value,Values=WebServerInstance" \
        --query 'Reservations[*].Instances[*].NetworkInterfaces[*].Groups[*].GroupId' \
        --output text)

        #Revoke the inbound rule for port 80
        aws ec2 revoke-security-group-ingress \
        --group-id $SG_ID_PRIMARY \
        --ip-permissions 'FromPort=80,IpProtocol=tcp,ToPort=80,IpRanges=[{CidrIp=0.0.0.0/0}]' \
        --region us-east-1

After about 2 minutes, the health check marks the PRIMARY WEB SERVER as unhealthy and triggers the failover mechanism.
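The wait can be roughly estimated from the values configured earlier: the health check's RequestInterval and FailureThreshold, plus the TTL on the DNS record. The arithmetic below is a simplified model, not Route 53's exact behavior; Route 53 aggregates results from many checkers, so real timings vary. The variable names are made up for this example.

```shell
#!/usr/bin/env bash
# Values taken from earlier steps in this post.
REQUEST_INTERVAL=30   # seconds, "RequestInterval" in health-check-config.json
FAILURE_THRESHOLD=3   # consecutive failures, "FailureThreshold" in the same file
RECORD_TTL=60         # seconds, --ttl on the traffic policy instance

# Simplified model: a checker needs FAILURE_THRESHOLD failed probes in a row.
DETECTION_SECONDS=$(( REQUEST_INTERVAL * FAILURE_THRESHOLD ))

# Clients may keep a cached answer for up to one TTL after the record flips.
WORST_CASE_SECONDS=$(( DETECTION_SECONDS + RECORD_TTL ))

echo "Estimated detection time: ${DETECTION_SECONDS}s"
echo "Estimated worst case including DNS caching: ${WORST_CASE_SECONDS}s"
```

Under these assumptions the estimate lands in the same range as the two-minute wait, and lowering the TTL or the failure threshold would shorten it.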

  5. In the second terminal, try to access the website again
        # Check the DNS answer
        dig +short service.anycompany.internal

        # Access the Web Server
        curl service.anycompany.internal

The failover policy on Route 53 now answers with the SECONDARY WEB SERVER IP address, so the expected response is HTML from the SECONDARY WEB SERVER, like below:

    <html><h1>Welcome to Example Portal !! </h1><h2>Hosted in us-west-1</h2</html>