Hive Starter: A Comprehensive Guide to Hadoop’s Data Warehousing Tool
Introduction
Hive is a powerful data warehousing tool built on top of Hadoop, enabling users to process and analyze large datasets with ease. In this article, we will delve into the world of Hive, exploring its installation, configuration, and usage. By the end of this tutorial, you will be well-versed in using Hive to extract insights from your data.
Hive: A SQL-like Query Language
Hive is a data warehousing framework built on Hadoop. It maps structured data files onto tables and lets users run SQL-like queries against them. These queries, written in HQL (Hive Query Language), are translated into MapReduce jobs that carry out the requested operations. Hive stores its table data in HDFS (the Hadoop Distributed File System) and runs on YARN, which makes it well suited to offline analysis such as batch processing and other scenarios without strict latency requirements.
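To give a feel for HQL, here is a small illustrative query against a hypothetical page_views table (not part of this tutorial); prefixing a statement with EXPLAIN in the Hive shell prints the plan of stages Hive compiles it into.

```sql
-- Illustrative only: page_views is a hypothetical table, not created in this tutorial.
-- EXPLAIN shows the stages (MapReduce jobs) Hive generates for the query.
EXPLAIN
SELECT user_id, COUNT(*) AS visits
FROM page_views
GROUP BY user_id;
```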
Installation
Before we begin, we need to prepare a Hadoop environment. Since Hive is built on top of Hadoop, we will need to install Hadoop first. Please refer to our previous article for detailed instructions on installing Hadoop.
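Before installing Hive, it is worth confirming that the Hadoop daemons are actually running; the commands below are a quick sanity check, assuming the Hadoop binaries are already on the PATH.

```bash
# Sanity check that the Hadoop environment is ready (assumes Hadoop is on the PATH)
$ hadoop version        # prints the installed Hadoop version
$ jps                   # should list NameNode, DataNode, ResourceManager and NodeManager
$ hdfs dfs -ls /        # confirms HDFS is reachable
```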
Installation of Hive
To install Hive, download a release from the Apache website (this tutorial uses 2.3.2) and follow these steps:
1. Download the installation package from the following address: https://dist.apache.org/repos/dist/release/hive/hive-2.3.2/
2. Unpack the archive and change into its directory:
```bash
$ wget -c https://dist.apache.org/repos/dist/release/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
$ tar zxvf apache-hive-2.3.2-bin.tar.gz
$ cd apache-hive-2.3.2-bin
```
3. Set the environment variables:
```bash
$ export JAVA_HOME=/opt/jdk8
$ export HADOOP_HOME=/apps/hadoop-3.0.0
$ export HIVE_HOME=/apps/apache-hive-2.3.2-bin
```
4. Initialize the Derby database, which Hive uses to store metadata by default. For production environments, we recommend using MySQL (a configuration sketch follows these steps):
```bash
$ bin/schematool -dbType derby -initSchema
```
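For reference, switching the metastore from Derby to MySQL is mostly a configuration change. The sketch below assumes a reachable MySQL instance and the MySQL JDBC driver; the exact property values depend on your environment.

```bash
# Sketch only: point the metastore at MySQL instead of Derby.
# 1. Copy the MySQL JDBC driver jar into $HIVE_HOME/lib.
# 2. In conf/hive-site.xml set javax.jdo.option.ConnectionURL,
#    javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionUserName
#    and javax.jdo.option.ConnectionPassword for your MySQL instance.
# 3. Initialize the schema against MySQL:
$ bin/schematool -dbType mysql -initSchema
```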
Running Hive
To start the Hive command-line terminal, use the following command:
```bash
$ bin/hive
```
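Besides the interactive shell, Hive can execute statements non-interactively, which is handy for scripting; the script path below is just an example.

```bash
# Run a single statement and exit
$ bin/hive -e "SHOW DATABASES;"
# Run a file of HQL statements (the path is illustrative)
$ bin/hive -f /tmp/queries.hql
```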
Creating a Table and Loading Data
Let’s create a table called “users” with the following schema:
hive> CREATE TABLE users (id int, username string, password string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
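If you want to confirm the table was created, a couple of optional checks in the Hive shell:

```sql
SHOW TABLES;       -- should list the new "users" table
DESCRIBE users;    -- prints the column names and types
```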
Now, let’s load some sample data into the table:
hive> LOAD DATA LOCAL INPATH '/tmp/users.dat' INTO TABLE users;
The “users.dat” file contains the following data:
1,user1,password1
2,user2,password2
3,user3,password3
4,user4,password4
5,user5,password5
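If you are following along, this file can be created locally before running the LOAD DATA statement above, for example:

```bash
# Create the local sample file referenced by LOAD DATA LOCAL INPATH
$ cat > /tmp/users.dat <<'EOF'
1,user1,password1
2,user2,password2
3,user3,password3
4,user4,password4
5,user5,password5
EOF
```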
After loading the data, we can execute a simple query to retrieve all records from the table:
hive> select * from users;
The output will be:
OK
1 user1 password1
2 user2 password2
3 user3 password3
4 user4 password4
5 user5 password5
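Note that with default settings a plain SELECT * is answered by a simple fetch task and does not start a MapReduce job; an aggregation such as the one below does.

```sql
-- Aggregations are compiled into MapReduce jobs, unlike a plain SELECT *
SELECT COUNT(*) FROM users;
```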
To verify that the data has been stored in HDFS, we can use the following commands:
hive> dfs -ls /user/hive/warehouse/users
This will list the contents of the “users” directory in HDFS, which should contain a single file called “users.dat”. We can also use the following command to view the contents of the file:
hive> dfs -cat /user/hive/warehouse/users/users.dat
The output will be the same as the contents of the “users.dat” file.
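The dfs command inside the Hive shell simply wraps the regular HDFS client, so the same checks can be run from the operating-system shell:

```bash
# Equivalent checks from the OS shell
$ hdfs dfs -ls /user/hive/warehouse/users
$ hdfs dfs -cat /user/hive/warehouse/users/users.dat
```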
Inserting New Records
Let’s insert a new record into the “users” table:
hive> INSERT INTO TABLE users (id, username, password) VALUES (6, 'user6', 'password6');
This will trigger a MapReduce job to process the insertion operation. We can verify that the new record has been inserted by executing another query:
hive> select * from users;
The output will be:
OK
1 user1 password1
2 user2 password2
3 user3 password3
4 user4 password4
5 user5 password5
6 user6 password6
We can also use the following commands to verify that the new record has been stored in HDFS:
hive> dfs -ls /user/hive/warehouse/users
This will list the contents of the “users” directory in HDFS, which should now contain two files: “users.dat” and “000000_0”. We can use the following command to view the contents of the new file:
hive> dfs -cat /user/hive/warehouse/users/000000_0
The output will be the new record:
6,user6,password6
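Single-row inserts are convenient for testing, but in practice Hive tables are usually populated from other tables. As a small illustration (the active_users table name is made up for this example), a query result can be materialized with CREATE TABLE ... AS SELECT:

```sql
-- Illustrative CTAS: materialize a query result into a new table (also runs as a MapReduce job)
CREATE TABLE active_users AS
SELECT id, username FROM users WHERE id <= 3;
```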
Conclusion
In this article, we have covered the basics of Hive, including its installation, configuration, and usage. We have created a table, inserted data into it, and executed queries to retrieve records. We have also verified that the data has been stored in HDFS and that new records have been inserted correctly. With Hive, you can now extract insights from your data with ease, making it an ideal choice for off-line data analysis and batch processing.