How can I create and use partitioned tables in Amazon Athena?

5 minute read
0

I want to create partitioned tables in Amazon Athena and use them to improve my queries.

Short description

By partitioning your Athena tables, you can restrict the amount of data scanned by each query, thus improving performance and reducing costs. Partitioning divides your table into parts and keeps related data together based on column values. Partitions act as virtual columns and help reduce the amount of data scanned per query.

Consider the following when you create a table and partition the data:

  • You must store your data on Amazon Simple Storage Service (Amazon S3) buckets as a partition.
  • Include the partitioning columns and the root location of partitioned data when you create the table.
  • Choose the appropriate approach to load the partitions into the AWS Glue Data Catalog. The table refers to the Data Catalog when you run your queries.
  • Use partition projection for highly partitioned data in Amazon S3.

Resolution

Here are a few things to keep in mind when you create a table with partitions.

Store on Amazon S3

The data must be partitioned and stored on Amazon S3. The partitioned data might be in either of the following formats:

  • Hive style format (Example: s3://doc-example-bucket/example-folder/year=2021/month=01/day=01/myfile.csv)
    Note: The path includes the names of the partition keys and their values (Example: year=2021)
  • Non-Hive style format (Example: s3://doc-example-bucket/example-folder/2021/01/01/myfile.csv)

Include partitioning information when creating the table

The CREATE TABLE statement must include the partitioning details. Use PARTITIONED BY to define the partition columns and LOCATION to specify the root location of the partitioned data. Run a query similar to the following:

CREATE EXTERNAL TABLE doc-example-table (
first string,
last string,
username string
)
PARTITIONED BY (year string, month string, day string)
STORED AS parquet
LOCATION 's3://doc-example-bucket/example-folder'

Replace the following in the query:

  • doc-example-table with the name of the table that you are creating
  • doc-example-bucket with the name of the S3 bucket where you store your table
  • example-folder with the name of your S3 folder
  • first, last, and username with the names of the columns
  • year, month, and day with the names of the partition columns

Load partitions into the Data Catalog with an approach that's appropriate for your use case

After creating the table, add the partitions to the Data Catalog. You can do so using one of the following approaches:

  • Use the MSCK REPAIR TABLE query for Hive style format data: The MSCK REPAIR TABLE command scans a file system, such as Amazon S3, for Hive-compatible partitions. The command compares them to the partitions that are already present in the table and then adds the new partitions to the Data Catalog. Run a command similar to the following:
    Note: This approach isn't a best practice if you have more than a few thousand partitions. Your DDL queries might face time out issues. For more information, see Why doesn't my MSCK REPAIR TABLE query add partitions to the AWS Glue Data Catalog?
MSCK REPAIR TABLE doc-example-table
  • Use the ALTER TABLE ADD PARTITION query for both Hive style and non-Hive style format data: The ALTER TABLE ADD PARTITION command adds one or more partitions to the Data Catalog. In the command, specify the partition column name and value pairs along with the Amazon S3 path where the data files for that partition are stored. You can add one partition or a batch of partitions per query by running commands similar to the following:
ALTER TABLE doc-example-table ADD PARTITION (year='2021', month='01', day='01') LOCATION 's3://doc-example-bucket/example-folder/2021/01/01/'
ALTER TABLE doc-example-table ADD
PARTITION (year='2021', month='01', day='01') LOCATION 's3://doc-example-bucket/example-folder/2021/01/01/'
PARTITION (year='2020', month='06', day='01') LOCATION 's3://doc-example-bucket/example-folder/2020/06/01/'
  • Use the AWS Glue crawler for both Hive and non-Hive style format data: You can use the Glue crawler to automatically infer table schema from your dataset, create the table, and then add the partitions to the Data Catalog. Or, you can use the crawler to only add partitions to a table that's created manually with the CREATE TABLE statement. To use the crawler to add partitions, when you define the crawler, specify one or more existing Data Catalog tables as the source of the crawl instead of specifying data stores. The crawler crawls the data stores specified by the Data Catalog tables. No new tables are created. Instead, the manually created tables are updated, and new partitions are added. For more information, see Setting crawler configuration options.
  • Use partition projection for highly partitioned data in Amazon S3: When you have highly partitioned data in Amazon S3, adding partitions to the Data Catalog can be impractical and time consuming. Queries against a highly partitioned table don't complete quickly. In such cases, you can use the partition projection feature to speed up query processing of highly partitioned tables and automate partition management. In partition projection, partition values and locations are calculated from configuration rather than read from a repository, such as the Data Catalog. This means that there's no need to add partitions to the Data Catalog with partition projection. Because in-memory operations are usually faster than remote operations, partition projection can reduce the runtime of queries against highly partitioned tables. The partition projection feature currently works with enumerated values, integers, dates, or injected partition column types. For more information, see Partition Projection with Amazon Athena.

Related information

Why do I get zero records when I query my Amazon Athena table?

AWS OFFICIAL
AWS OFFICIALUpdated 3 years ago