Apache Hive is a top-level project of the Apache Software Foundation. Hive is a data warehouse and ETL tool for large datasets held in distributed storage. It supports storage formats such as CSV, TSV, Parquet, and ORC (Optimized Row Columnar), among others. It is used for analytical processing of structured data through an SQL-like interface, and it is built on top of Hadoop.
Apache Hive: https://hive.apache.org/
Hive is a software project that provides data querying and analysis. It facilitates reading, writing, and managing large datasets that are stored in distributed storage and queried with an SQL-like syntax, HiveQL.
Hive provides the necessary abstraction over the Hadoop environment by projecting structure onto data in HDFS storage, so that SQL-like queries can be run without writing them against the low-level Java API.
Hive also provides a command-line tool and Java Database Connectivity (JDBC) driver that can be used to connect to Hive.
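As a rough illustration of the JDBC route, the sketch below builds a HiveServer2 connection URL and lists the tables in a database. The host, port, database, and credentials are assumptions chosen for the example, and the class name `HiveJdbcSketch` is hypothetical. A real connection would also need the Hive JDBC driver on the classpath and a running HiveServer2 instance.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Hive itself.
class HiveJdbcSketch {

    // Builds a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database.
    static String buildUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Runs SHOW TABLES over an open connection and collects the table names.
    static List<String> listTables(Connection conn) throws SQLException {
        List<String> tables = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                tables.add(rs.getString(1));
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        // Assumed values: a local HiveServer2 on port 10000 and the "default" database.
        String url = buildUrl("localhost", 10000, "default");
        System.out.println("Connecting to " + url);
        // Placeholder credentials; real deployments may use Kerberos or LDAP instead.
        try (Connection conn = DriverManager.getConnection(url, "hive", "")) {
            for (String table : listTables(conn)) {
                System.out.println(table);
            }
        } catch (SQLException e) {
            // Reached when no driver is on the classpath or no server is running.
            System.out.println("Could not connect: " + e.getMessage());
        }
    }
}
```

Only `java.sql` types are used here, so the same code should work with any JDBC-compliant driver. The Hive-specific part is the `jdbc:hive2://` URL scheme.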
Hive was co-created by Joydeep Sen Sarma and Ashish Thusoo at Facebook. It was later used and developed by other companies such as Netflix and FINRA. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Characteristics of Hive
- Databases and tables are created before data is loaded into them.
- Hive is a data warehouse for managing and querying only structured data residing in tables.
- Hive is compatible with multiple file formats such as TextFile, SequenceFile, ORC, and RCFile.
- Hive uses an embedded Derby database to store metadata in single-user setups and an external RDBMS such as MySQL for multi-user or shared metadata.
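The schema-first workflow and the choice of file format can be sketched as an ordered list of HiveQL statements executed over JDBC. This is a minimal illustration rather than a recommended layout. The database `demo_db`, the tables, the columns, and the HDFS path `/tmp/demo/employees.csv` are all hypothetical, as is the class name `HiveSchemaFirstSketch`.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Hypothetical helper class, not part of Hive itself.
class HiveSchemaFirstSketch {

    // HiveQL statements in the order they are meant to run:
    // the database and table are defined first, and data is loaded afterwards.
    static List<String> statements() {
        return List.of(
            "CREATE DATABASE IF NOT EXISTS demo_db",
            // A text-backed table matching a comma-separated input file.
            "CREATE TABLE IF NOT EXISTS demo_db.employees ("
                + "id INT, name STRING, salary DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                + "STORED AS TEXTFILE",
            // Moves an existing HDFS file (assumed path) into the table's storage.
            "LOAD DATA INPATH '/tmp/demo/employees.csv' "
                + "INTO TABLE demo_db.employees",
            // Rewrites the same rows into ORC using CREATE TABLE ... AS SELECT.
            "CREATE TABLE demo_db.employees_orc "
                + "STORED AS ORC AS SELECT * FROM demo_db.employees"
        );
    }

    // Executes each statement in order on an already open Hive connection.
    static void run(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            for (String hql : statements()) {
                stmt.execute(hql);
            }
        }
    }

    public static void main(String[] args) {
        // Prints the statements only; no Hive server is contacted here.
        statements().forEach(System.out::println);
    }
}
```

The text table is loaded first because `LOAD DATA` moves files into place without converting them. The ORC copy is therefore produced by a query rather than by loading the CSV file directly.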