I want to use Amazon Athena to analyze my Application Load Balancer access logs.
Short description
Elastic Load Balancing doesn't activate access logs by default. When you activate access logs, you specify an Amazon Simple Storage Service (Amazon S3) bucket, and Elastic Load Balancing stores the logs in that bucket. You can then use Athena to analyze the Application Load Balancer or Classic Load Balancer access logs stored in the bucket.
Note: This resolution applies only to Application Load Balancers.
Resolution
Create a database and table for Application Load Balancer logs
To analyze access logs in Athena, create a database and table:
1.    Open the Athena console.
2.    To create a database, run the following command in the Query Editor. It's a best practice to create the database in the same AWS Region as the Amazon S3 bucket:
CREATE DATABASE alb_db
3.    In the database, create an alb_logs table for the Application Load Balancer logs:
Note: For better query performance, create a table with partition projection.
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
type string,
time string,
elb string,
client_ip string,
client_port int,
target_ip string,
target_port int,
request_processing_time double,
target_processing_time double,
response_processing_time double,
elb_status_code int,
target_status_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
request_url string,
request_proto string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string,
domain_name string,
chosen_cert_arn string,
matched_rule_priority string,
request_creation_time string,
actions_executed string,
redirect_url string,
lambda_error_reason string,
target_port_list string,
target_status_code_list string,
classification string,
classification_reason string
)
PARTITIONED BY (day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'
)
LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.day.type" = "date",
"projection.day.range" = "2022/01/01,NOW",
"projection.day.format" = "yyyy/MM/dd",
"projection.day.interval" = "1",
"projection.day.interval.unit" = "DAYS",
"storage.location.template" = "s3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${day}"
)
Note:
- In the preceding query, replace alb_logs with your table name, and replace the S3 locations with the locations of your bucket.
- For projection.day.range, replace 2022/01/01 with the start date that you want to use.
Or, use the following query to create a table with partitions stored in the AWS Glue Data Catalog:
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs_partitioned (
type string,
time string,
elb string,
client_ip string,
client_port int,
target_ip string,
target_port int,
request_processing_time double,
target_processing_time double,
response_processing_time double,
elb_status_code int,
target_status_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
request_url string,
request_proto string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string,
domain_name string,
chosen_cert_arn string,
matched_rule_priority string,
request_creation_time string,
actions_executed string,
redirect_url string,
lambda_error_reason string,
target_port_list string,
target_status_code_list string,
classification string,
classification_reason string
)
PARTITIONED BY (day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' =
'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'
)
LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'
Note: In the preceding query, replace alb_logs_partitioned with your table name, and replace the S3 locations with the locations of your bucket.
4.    (Optional) Use the ALTER TABLE ADD PARTITION DDL statement to load the partitions:
ALTER TABLE alb_logs_partitioned ADD PARTITION (day = '2022/05/21') LOCATION 's3://<your-alb-logs-directory>/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/2022/05/21/'
Note:
- If your table uses partition projection, then skip step 4.
- It's not a best practice to use an AWS Glue crawler on the Application Load Balancer logs.
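If you loaded partitions into the Data Catalog table, you can confirm which partitions are registered before you query. The following check is an illustrative sketch that assumes the alb_logs_partitioned table from the preceding step:

```sql
-- List the partitions currently registered for the table.
-- Applies only to the Data Catalog table; tables that use partition
-- projection don't store partitions in the catalog.
SHOW PARTITIONS alb_logs_partitioned;
```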
5.    In the navigation pane, under Tables, choose Preview table.
6.    In the Results window, view the data from the Application Load Balancer access logs.
7.    Use the Query editor to run SQL statements on the table.
Example queries
In the following examples, modify the table name, column values, and other variables to fit your query.
View the first 100 access log entries in chronological order
Use this query for analysis and troubleshooting:
SELECT * FROM alb_logs
ORDER BY time ASC
LIMIT 100;
List all client IP addresses that accessed the Application Load Balancer and how many times they accessed the Application Load Balancer
Use this query for analysis and troubleshooting:
SELECT client_ip, COUNT(*) AS count FROM alb_logs
GROUP BY client_ip
ORDER BY count DESC;
List the average amount of data in kilobytes that passes through the Application Load Balancer in request or response pairs
Use this query for analysis and troubleshooting:
SELECT (AVG(sent_bytes)/1000.0 + AVG(received_bytes)/1000.0) AS prewarm_kilobytes FROM alb_logs;
List all targets that the Application Load Balancer routes traffic to and the number of routed requests per target, by percentage distribution
Use this query to identify potential target traffic imbalances:
SELECT target_ip, (COUNT(target_ip)*100.0 / (SELECT COUNT(*) FROM alb_logs)) AS backend_traffic_percentage
FROM alb_logs
GROUP BY target_ip
ORDER BY backend_traffic_percentage DESC;
List the times that a client sent a request and closed the connection before the idle timeout elapsed (HTTP 460 error)
Use this query to troubleshoot HTTP 460 errors:
SELECT * FROM alb_logs WHERE elb_status_code = 460;
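To narrow down where the disconnects occur, a variation such as the following can help. This is an illustrative sketch that assumes the alb_logs table defined earlier in this article:

```sql
-- Group HTTP 460 responses by target to see whether client
-- disconnects cluster on a specific backend.
SELECT target_ip, COUNT(*) AS disconnect_count
FROM alb_logs
WHERE elb_status_code = 460
GROUP BY target_ip
ORDER BY disconnect_count DESC;
```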
List the times that a client request wasn't routed because the listener rule forwarded the request to an empty target group (HTTP 503 error)
Use this query to troubleshoot HTTP 503 errors:
SELECT * FROM alb_logs WHERE elb_status_code = 503;
List clients in descending order by the number of times that each client visited a specified URL
Use this query to analyze traffic patterns:
SELECT client_ip, elb, request_url, COUNT(*) AS count FROM alb_logs GROUP BY client_ip, elb, request_url
ORDER BY count DESC;
List the 10 URLs in descending order that Firefox users accessed most frequently
Use this query to analyze traffic distribution and patterns:
SELECT request_url, user_agent, COUNT(*) AS count FROM alb_logs
WHERE user_agent LIKE '%Firefox%'
GROUP BY request_url, user_agent
ORDER BY count DESC
LIMIT 10;
List clients in descending order by the amount of data in megabytes that each client sent in their requests to the Application Load Balancer
Use this query to analyze traffic distribution and patterns:
SELECT client_ip, SUM(received_bytes/1000000.0) as client_datareceived_megabytes FROM alb_logs
GROUP BY client_ip
ORDER BY client_datareceived_megabytes DESC;
List each time in a specified date range when the target processing time was more than 5 seconds
Use this query to troubleshoot latency in a specified time frame:
SELECT * FROM alb_logs WHERE (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
BETWEEN parse_datetime('2018-08-08-00:00:00','yyyy-MM-dd-HH:mm:ss')
AND parse_datetime('2018-08-08-02:00:00','yyyy-MM-dd-HH:mm:ss'))
AND (target_processing_time >= 5.0);
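If the table uses partition projection, filtering on the day partition column instead of parsing the full timestamp limits the amount of data that Athena scans. The following sketch assumes the partition-projected alb_logs table from this article and an example date:

```sql
-- Restrict the scan to a single day's partition, then filter on
-- target_processing_time within that partition.
SELECT *
FROM alb_logs
WHERE day = '2022/05/21'
AND target_processing_time >= 5.0;
```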
Count the number of HTTP GET requests grouped by the IP address that the load balancer received
Use this query to analyze ingress traffic distribution:
SELECT COUNT(request_verb) AS count, request_verb, client_ip
FROM alb_logs_partitioned
WHERE day = '2022/05/21'
AND request_verb = 'GET'
GROUP BY request_verb, client_ip;
For more information, see Access log entries and Querying Application Load Balancer logs.