How do I use Amazon Athena to analyze my Application Load Balancer access logs?


I want to use Amazon Athena to analyze my Application Load Balancer access logs.

Short description

Elastic Load Balancing doesn't activate access logs by default. When you activate access logs, you specify an Amazon Simple Storage Service (Amazon S3) bucket, and Elastic Load Balancing stores the access logs in that bucket. You can then use Athena to analyze the logs directly from the S3 bucket.

Note: This resolution applies only to Application Load Balancers.

Resolution

Create a database and table for Application Load Balancer logs

To analyze access logs in Athena, create a database and table:

  1. Open the Athena console.

  2. To create a database, run the following command in the Athena query editor. It's a best practice to create the database in the same AWS Region as the Amazon S3 bucket:

    CREATE DATABASE alb_db
  3. In the database, create an alb_logs table for the Application Load Balancer logs:
    Note: For better query performance, create a table with partition projection.

    CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code int,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        target_port_list string,
        target_status_code_list string,
        classification string,
        classification_reason string
    )
    PARTITIONED BY (day STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'serialization.format' = '1',
        'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'
    )
    LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'
    TBLPROPERTIES (
        "projection.enabled" = "true",
        "projection.day.type" = "date",
        "projection.day.range" = "2022/01/01,NOW",
        "projection.day.format" = "yyyy/MM/dd",
        "projection.day.interval" = "1",
        "projection.day.interval.unit" = "DAYS",
        "storage.location.template" = "s3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${day}"
    )

    Note:

    • In the preceding query, replace the table name and S3 locations with your table name and S3 locations.
    • For projection.day.range, replace 2022/01/01 with the start date that you want to use.

    Or, use the following query to create a table with partitions stored in the AWS Glue Data Catalog:

    CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs_partitioned (
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code int,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        target_port_list string,
        target_status_code_list string,
        classification string,
        classification_reason string
    )
    PARTITIONED BY (day string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'serialization.format' = '1',
        'input.regex' =
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'
    )
    LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'

    Note: In the preceding query, replace the table name and S3 locations with your table name and S3 locations.

  4. (Optional) Use the ALTER TABLE ADD PARTITION DDL statement to load the partitions:

    ALTER TABLE alb_logs_partitioned ADD PARTITION (day = '2022/05/21') LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/2022/05/21/'

    Note:

    • If your table uses partition projection, then skip step 4.
    • It's not a best practice to use an AWS Glue crawler on the Application Load Balancer logs.
  5. In the navigation pane, under Tables, choose Preview table.

  6. In the Results window, view the data from the Application Load Balancer access logs.

  7. Use the query editor to run SQL statements on the table. A quick sanity check is shown after these steps.
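
To confirm that the table reads your logs, run a quick sanity check. The following sketch assumes the partition projection table alb_logs and a day for which logs exist in your bucket; replace the date as needed:

SELECT COUNT(*) AS entries
FROM alb_logs
WHERE day = '2022/05/21';

If the count is 0, check the LOCATION path (including the Region segment) and confirm that logs exist for that day.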

Example queries

In the following examples, modify the table name, column values, and other variables to fit your query.

View the first 100 access log entries in chronological order

Use this query for analysis and troubleshooting:

SELECT * FROM alb_logs
ORDER BY time ASC
LIMIT 100;
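
To reduce the amount of data that Athena scans, you can run the same query against a single day's partition. This sketch assumes the partition projection table and its day column; replace the date with one that exists in your logs:

SELECT * FROM alb_logs
WHERE day = '2022/05/21'
ORDER BY time ASC
LIMIT 100;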

List all client IP addresses that accessed the Application Load Balancer and how many times they accessed it

Use this query for analysis and troubleshooting:

SELECT client_ip, COUNT(*) AS count FROM alb_logs GROUP BY client_ip
ORDER BY count DESC;
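
To list only the heaviest clients, you can add a HAVING clause to the same aggregation. The threshold of 100 requests is an arbitrary example value:

SELECT client_ip, COUNT(*) AS count FROM alb_logs
GROUP BY client_ip
HAVING COUNT(*) > 100
ORDER BY count DESC;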

List the average amount of data in kilobytes that passes through the Application Load Balancer in request/response pairs

Use this query for analysis and troubleshooting:

SELECT (AVG(sent_bytes)/1000.0 + AVG(received_bytes)/1000.0) AS prewarm_kilobytes FROM alb_logs;
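
To see how this average changes over time, you can group the same calculation by the day partition column. This sketch assumes the partition projection table:

SELECT day, (AVG(sent_bytes)/1000.0 + AVG(received_bytes)/1000.0) AS avg_kilobytes
FROM alb_logs
GROUP BY day
ORDER BY day;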

List all targets that the Application Load Balancer routes traffic to and the number of routed requests per target, by percentage distribution

Use this query to identify potential target traffic imbalances:

SELECT target_ip, (COUNT(target_ip)*100.0 / (SELECT COUNT(*) FROM alb_logs)) AS backend_traffic_percentage
FROM alb_logs
GROUP BY target_ip
ORDER BY COUNT(*) DESC;
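
If you also want the absolute request count per target alongside the percentage, you can return both in one query. This is a sketch over the same columns:

SELECT target_ip,
    COUNT(*) AS request_count,
    (COUNT(*)*100.0 / (SELECT COUNT(*) FROM alb_logs)) AS backend_traffic_percentage
FROM alb_logs
GROUP BY target_ip
ORDER BY request_count DESC;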

List the times that a client sent a request and closed the connection before the idle timeout elapsed (HTTP 460 error)

Use this query to troubleshoot HTTP 460 errors:

SELECT * FROM alb_logs WHERE elb_status_code = 460;
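
To identify which clients close connections early most often, you can aggregate the same filter by client IP address. This is a sketch over the same table:

SELECT client_ip, COUNT(*) AS count
FROM alb_logs
WHERE elb_status_code = 460
GROUP BY client_ip
ORDER BY count DESC;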

List the times that a client request wasn't routed because the listener rule forwarded the request to an empty target group (HTTP 503 error)

Use this query to troubleshoot HTTP 503 errors:

SELECT * FROM alb_logs WHERE elb_status_code = 503;
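
To see when the HTTP 503 errors cluster, you can group them by hour. The following sketch uses substr to truncate the time column, which holds ISO 8601 timestamps in UTC, to hour precision:

SELECT substr(time, 1, 13) AS hour_utc, COUNT(*) AS errors
FROM alb_logs
WHERE elb_status_code = 503
GROUP BY substr(time, 1, 13)
ORDER BY hour_utc;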

List clients in descending order by the number of times that each client visited a specified URL

Use this query to analyze traffic patterns:

SELECT client_ip, elb, request_url, COUNT(*) AS count FROM alb_logs GROUP BY client_ip, elb, request_url
ORDER BY count DESC;
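
To narrow the same analysis to a single URL, add a filter on request_url. The /login path is a placeholder; replace it with the URL that you're investigating:

SELECT client_ip, COUNT(*) AS count FROM alb_logs
WHERE request_url LIKE '%/login%'
GROUP BY client_ip
ORDER BY count DESC;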

List the 10 URLs that Firefox users accessed most frequently, in descending order

Use this query to analyze traffic distribution and patterns:

SELECT request_url, user_agent, COUNT(*) AS count FROM alb_logs
WHERE user_agent LIKE '%Firefox%'
GROUP BY request_url, user_agent
ORDER BY count DESC
LIMIT 10;
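
A related sketch ranks the most common user agents across all requests, which helps you decide which browsers or bots to investigate:

SELECT user_agent, COUNT(*) AS count FROM alb_logs
GROUP BY user_agent
ORDER BY count DESC
LIMIT 10;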

List clients in descending order by the amount of data in megabytes that each client sent in their requests to the Application Load Balancer

Use this query to analyze traffic distribution and patterns:

SELECT client_ip, SUM(received_bytes/1000000.0) as client_datareceived_megabytes FROM alb_logs
GROUP BY client_ip
ORDER BY client_datareceived_megabytes DESC;
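
To compare the data that each client sent with the data that the load balancer returned to it, aggregate both byte columns in one query. This is a sketch over the same table:

SELECT client_ip,
    SUM(received_bytes)/1000000.0 AS received_megabytes,
    SUM(sent_bytes)/1000000.0 AS sent_megabytes
FROM alb_logs
GROUP BY client_ip
ORDER BY received_megabytes DESC;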

List each request in a specified date range for which the target processing time was more than 5 seconds

Use this query to troubleshoot latency in a specified time frame:

SELECT * FROM alb_logs WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
    BETWEEN parse_datetime('2018-08-08-00:00:00','yyyy-MM-dd-HH:mm:ss')
    AND parse_datetime('2018-08-08-02:00:00','yyyy-MM-dd-HH:mm:ss'))
AND (target_processing_time >= 5.0);
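
To quantify latency instead of listing individual slow requests, you can use the approx_percentile aggregate function that Athena provides. This sketch computes the 95th and 99th percentiles of target processing time and excludes negative values, which indicate requests that never reached a target:

SELECT approx_percentile(target_processing_time, 0.95) AS p95_seconds,
    approx_percentile(target_processing_time, 0.99) AS p99_seconds
FROM alb_logs
WHERE target_processing_time >= 0;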

Count the number of HTTP GET requests that the load balancer received on a specified day, grouped by client IP address

Use this query to analyze ingress traffic distribution:

SELECT COUNT(request_verb) AS count, request_verb, client_ip
FROM alb_logs_partitioned
WHERE day = '2022/05/21'
AND request_verb = 'GET'
GROUP BY request_verb, client_ip;
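
As a first troubleshooting pass for the same day, a related sketch summarizes the distribution of ELB status codes over the partitioned table:

SELECT elb_status_code, COUNT(*) AS count
FROM alb_logs_partitioned
WHERE day = '2022/05/21'
GROUP BY elb_status_code
ORDER BY count DESC;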

For more information, see Access log entries and Querying Application Load Balancer logs.
