Apache Hive on Amazon EMR
Why Apache Hive on EMR?
Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface. Learn more about Apache Hive here.
Apache Hive is natively supported in Amazon EMR, and you can quickly and easily create managed Apache Hive clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API. Additionally, you can leverage additional Amazon EMR features, including direct connectivity to Amazon DynamoDB or Amazon S3 for storage, integration with the AWS Glue Data Catalog, AWS Lake Formation, Amazon RDS, or Amazon Aurora to configure an external metastore, and EMR Managed Scaling to add or remove instances from your cluster.
Features and benefits
Customer success
Airbnb customer success
Airbnb connects people with places to stay and things to do around the world with 2.9 million hosts listed, supporting 800k nightly stays. Airbnb uses Amazon EMR to run Apache Hive on a S3 data lake. Running Hive on the EMR clusters enables Airbnb analysts to perform ad hoc SQL queries on data stored in the S3 data lake. By migrating to a S3 data lake, Airbnb reduced expenses, can now do cost attribution, and increased the speed of Apache Spark jobs by three times their original speed.
Guardian customer success
Guardian gives 27 million members the security they deserve through insurance and wealth management products and services. Guardian uses Amazon EMR to run Apache Hive on a S3 data lake. Apache Hive is used for batch processing to enable fast queries on large datasets. The S3 data lake fuels Guardian Direct, a digital platform that allows consumers to research and purchase both Guardian products and third party products in the insurance sector.
FINRA customer success
FINRA – the Financial Industry Regulatory Authority – is the largest independent securities regulator in the United States, and monitors and regulates financial trading practices. FINRA uses Amazon EMR to run Apache Hive on a S3 data lake. Running Hive on the EMR clusters enables FINRA to process and analyze trade data of up to 90 billion events using SQL. The cloud data lake resulted in cost savings of up to $20 million compared to FINRA’s on-premises solution, and drastically reduced the time needed for recovery and upgrades.
Vanguard customer success
Vanguard, an American registered investment advisor, is the largest provider of mutual funds and the second largest provider of exchange traded funds. Vanguard uses Amazon EMR to run Apache Hive on a S3 data lake. Data is stored in S3 and EMR builds a Hive metastore on top of that data. The Hive metastore contains all the metadata about the data and tables in the EMR cluster, which allows for easy data analysis. Hive also enables analysts to perform ad hoc SQL queries on data stored in the S3 data lake. Migrating to a S3 data lake with Amazon EMR has enabled 150+ data analysts to realize operational efficiency and has reduced EC2 and EMR costs by $600k.