Apache Hudi on Amazon EMR

Why Apache Hudi on EMR?

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. It handles common requirements such as data lifecycle management more efficiently and improves data quality. Hudi enables you to manage data at the record level in Amazon S3 data lakes, simplifying Change Data Capture (CDC) and streaming data ingestion, and helps handle data privacy use cases that require record-level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and the AWS Glue Data Catalog give you near real-time access to updated data using familiar tools.

Hudi is supported in Amazon EMR and is installed automatically when you choose Spark, Hive, or Presto while deploying your EMR cluster. Using Hudi, you can handle either read-heavy or write-heavy use cases, and Hudi will manage the underlying data stored on S3 using Apache Parquet and Apache Avro. Data sets managed by Hudi are accessible not only from Spark (and PySpark) but also from other engines such as Hive and Presto. Native integration with AWS Database Migration Service (AWS DMS) also lets you ingest data as it changes in source systems.
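As a minimal sketch of what this looks like from Spark, the PySpark snippet below writes a DataFrame to S3 as a Hudi table. The bucket, table name, and column names are hypothetical placeholders; the hoodie.* options shown are standard Hudi write configurations, though exact names and defaults can vary by Hudi release.

```python
# Minimal PySpark sketch: write a DataFrame to S3 as a Hudi table.
# Bucket, table, and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-write-sketch")
    # Recommended by the Hudi docs for Spark workloads; on EMR the Hudi
    # jars themselves are already on the classpath.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("id-1", "alice", "2024-01-01"), ("id-2", "bob", "2024-01-01")],
    ["record_id", "name", "event_date"],
)

hudi_options = {
    "hoodie.table.name": "customers",                             # target table
    "hoodie.datasource.write.recordkey.field": "record_id",       # unique key
    "hoodie.datasource.write.partitionpath.field": "event_date",  # partitioning
    "hoodie.datasource.write.precombine.field": "event_date",     # dedupe ordering
    "hoodie.datasource.write.operation": "upsert",                # insert-or-update
}

(df.write.format("hudi")  # older releases use .format("org.apache.hudi")
    .options(**hudi_options)
    .mode("overwrite")    # "overwrite" for the initial load; "append" afterwards
    .save("s3://example-bucket/hudi/customers/"))
```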

Use cases

Due to privacy regulations like GDPR and CCPA, companies across many industries need to perform record-level updates and deletions to honor people's "right to be forgotten" or changes in consent for how their data can be used. Previously, you had to create custom data management and ingestion solutions to track individual changes and rewrite large data sets for just a few changes. With Apache Hudi on EMR, you can use familiar insert, update, upsert, and delete operations, and Hudi will track transactions and make granular changes on S3, which simplifies your data pipelines.
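To make the delete path concrete, here is a hedged sketch of a record-level hard delete in PySpark, continuing the hypothetical customers table from the sketch above: you write the keys to be erased back to the table with the delete operation, and Hudi commits the removal as a new instant on its timeline.

```python
# Hypothetical sketch: hard-delete specific records (e.g. for a GDPR
# erasure request) by writing their keys back with the "delete" operation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base_path = "s3://example-bucket/hudi/customers/"  # placeholder path

# Select the records to erase from the current snapshot of the table.
to_delete = (spark.read.format("hudi")
    .load(base_path)
    .filter("record_id = 'id-2'"))

delete_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "record_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "event_date",
    "hoodie.datasource.write.operation": "delete",  # remove matching keys
}

(to_delete.write.format("hudi")
    .options(**delete_options)
    .mode("append")  # the delete is committed as a new instant, not a full rewrite
    .save(base_path))
```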

Streaming IoT and ingestion pipelines need to handle data insertion and update events without creating many small files that can cause performance issues for analytics. Data engineers need tools that enable them to use upserts to efficiently handle streaming data ingestion, automate and optimize storage, and enable analysts to query new data immediately. Previously, you had to build custom solutions that monitor and re-write many small files into fewer large files, and manage orchestration and monitoring. Apache Hudi will automatically track changes and merge files so they remain optimally sized.
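The sketch below shows one way this pattern might look with Spark Structured Streaming: each micro-batch is upserted into a Hudi table via foreachBatch, and Hudi's small-file handling (controlled by options such as hoodie.parquet.small.file.limit) folds new records into under-sized files instead of creating new ones. The rate source and all names are stand-ins for a real stream such as Kinesis or Kafka.

```python
# Hedged sketch: upsert a streaming source into a Hudi table, letting
# Hudi keep file sizes healthy. Names and the rate source are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

hudi_options = {
    "hoodie.table.name": "iot_events",
    "hoodie.datasource.write.recordkey.field": "device_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.operation": "upsert",
    # Hudi routes new records into existing files below this size (bytes)
    # instead of creating additional small files.
    "hoodie.parquet.small.file.limit": "104857600",  # ~100 MB
}

def upsert_batch(batch_df, batch_id):
    # Each micro-batch is upserted; Hudi reconciles keys and file sizes.
    (batch_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://example-bucket/hudi/iot_events/"))

# Stand-in stream; in practice this would come from Kinesis, Kafka, etc.
events = (spark.readStream.format("rate").option("rowsPerSecond", "10").load()
    .select(
        F.col("value").cast("string").alias("device_id"),
        F.col("timestamp").alias("event_ts"),
        F.to_date("timestamp").alias("event_date"),
    ))

events.writeStream.foreachBatch(upsert_batch).start().awaitTermination()
```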

A common use case is making data from Enterprise Data Warehouses (EDW) and Operational Data Stores (ODS) available to SQL query engines like Apache Hive and Presto for processing and analytics. With Hudi, individual changes can be processed at a much more granular level, reducing overhead. You can query data sets on S3 directly, giving your users a near real-time view of your data.
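While Hive and Presto expose Hudi tables through their own catalogs, the same near real-time snapshot view can be illustrated from PySpark against the hypothetical customers table used earlier. Note that some older Hudi releases required glob-style paths (e.g. the base path plus /*/*) when loading.

```python
# Sketch of a snapshot query: read the latest committed state of the
# table directly from S3 and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

snapshot = spark.read.format("hudi").load("s3://example-bucket/hudi/customers/")
snapshot.createOrReplaceTempView("customers")

spark.sql("""
    SELECT record_id, name
    FROM customers
    WHERE event_date = '2024-01-01'
""").show()
```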

A common challenge when creating data pipelines is dealing with CDC. Late-arriving or incorrect data requires records to be rewritten to reflect the updates. Finding the right files to update, applying the changes, and then viewing the data is challenging, and previously required you to create your own frameworks or conventions. With Hudi, late-arriving data can be “upserted” into an existing data set. When changes are made, Hudi will find the appropriate files in S3 and rewrite them to incorporate the changes. Hudi also allows you to view your data set at specific points in time. Every change to a data set is tracked and can easily be rolled back should you need to “undo” it. Integration with AWS DMS can also simplify loading data.
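The two query styles this paragraph describes can be sketched as follows, again against the hypothetical table above: an incremental query that pulls only records changed after a given commit instant on the Hudi timeline, and a point-in-time (“time travel”) read supported by newer Hudi releases. Timeline instants follow a yyyyMMddHHmmss format; the values below are placeholders.

```python
# Hedged sketch: incremental and point-in-time reads against a Hudi table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base_path = "s3://example-bucket/hudi/customers/"  # placeholder path

# Incremental query: only records that changed after the given instant.
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(base_path))

# Point-in-time query (newer Hudi releases): the table as of a past instant,
# which is also the view you would inspect before rolling a change back.
as_of = (spark.read.format("hudi")
    .option("as.of.instant", "20240102000000")
    .load(base_path))
```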