Apache Hudi on Amazon EMR
Why Apache Hudi on EMR?
Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. This framework more efficiently manages business requirements like data lifecycle and improves data quality. Hudi enables you to manage data at the record-level in Amazon S3 data lakes to simplify Change Data Capture (CDC) and streaming data ingestion and helps to handle data privacy use cases requiring record level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and AWS Glue Data Catalog give you near real-time access to updated data using familiar tools.
Hudi is supported in Amazon EMR and is automatically installed when you choose Spark, Hive, or Presto when deploying your EMR cluster. Using Hudi, you can handle either read-heavy or write-heavy use cases, and Hudi will manage the underlying data stored on S3 using Apache Parquet and Apache Avro. Data sets managed by Hudi are accessible not only from Spark (and PySpark) but also other engines such as Hive and Presto. Native integration with AWS Database Migration Service also provides another source for data as it changes.