Basics of Hive and Impala Tutorial

Welcome to the fourth lesson ‘Basics of Hive and Impala’ which is a part of ‘Big Data Hadoop and Spark Developer Certification course’ offered by Simplilearn.

In this lesson, you will learn the basics of Hive and Impala, two components of the Hadoop ecosystem. We will compare Hive and Impala and learn how to execute queries using them.

Let’s look at the objectives of this Impala and Hive Tutorial.

Objectives

After completing this lesson, you will be able to:

  • Explain Hive and Impala along with their needs and features

  • Compare Impala vs. Hive

  • Differentiate between relational databases, Hive, and Impala

  • Execute queries using Hive and Impala

Needs and Features of Hive and Impala

In the next section of the Impala Hadoop tutorial, we will discuss the needs and features of Hive and Impala.

Introduction to Hive and Impala

Hive and Impala are tools that provide a SQL-like interface for users to extract data from the Hadoop system. Because SQL is widely known among programmers, anyone familiar with it can use Hive and Impala.

Broadly, the diagram shows how an SQL query in Hive or Impala is processed on the Hadoop cluster, with the data stored in or fetched from the storage components HDFS or HBase.

[Figure: SQL query in Hive and Impala]

Let’s take a look at the similarities between the two Hadoop components in the next section of this Apache Impala tutorial.

Hive and Impala: Similarities

Hive, which helps in data analysis, is an abstraction layer on Hadoop. It is very similar to Impala; however, Hive is preferred for data processing and Extract, Transform, Load (ETL) operations.

Both Hive and Impala bring large-scale data analysis to a larger audience, as users with no software development experience can easily adopt them. Users can leverage their existing knowledge of SQL to work with Hive and Impala.

Writing a query as a MapReduce program is often complex and can require as many as 200 lines of Java code for the mapping and reducing tasks. In contrast, the same query in Hive or Impala can be written in a few lines of code.
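As a rough illustration of that contrast, the Python sketch below computes the same per-city count twice: once with hand-written map, shuffle, and reduce phases, and once as a single SQL statement. It is only a sketch under stated assumptions: SQLite (from the Python standard library) stands in for Hive or Impala, and the users table, its columns, and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (name, city) pairs.
ROWS = [("ana", "Toronto"), ("bo", "Ottawa"), ("cy", "Toronto")]


def count_by_city_mapreduce(rows):
    """Hand-written map, shuffle, and reduce phases for a per-city count."""
    # Map: emit a (key, 1) pair for each record.
    mapped = [(city, 1) for _name, city in rows]
    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for city, one in mapped:
        grouped[city].append(one)
    # Reduce: sum the values collected for each key.
    return {city: sum(ones) for city, ones in grouped.items()}


def count_by_city_sql(rows):
    """The same aggregation as one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Hive or Impala
    conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT city, COUNT(*) FROM users GROUP BY city"))
    conn.close()
    return result


if __name__ == "__main__":
    print(count_by_city_mapreduce(ROWS))
    print(count_by_city_sql(ROWS))
```

Both functions return the same mapping of city to count. In the SQL version the grouping logic is one line, and the engine decides how to carry it out.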

Hive and Impala also offer interoperability with other systems. They are extensible through Java and external scripts. Many Business Intelligence tools support Hive, Impala, or both.

In the following section, we’ll discuss Impala vs. Hive.

Impala vs. Hive

Given below are some differences between Impala and Hive.

Impala

  • Impala was inspired by Google’s Dremel project and was developed by Cloudera around 2012.

  • It is an open-source Apache project that graduated from the Apache Incubator in 2017.

  • It has a high-performance dedicated SQL engine.

  • It uses Impala SQL for ad hoc queries and has a low query latency, measured in milliseconds.

  • It is designed for high concurrency and ad hoc queries, such as Business Intelligence and analytics.

Hive

  • Hive was developed by Facebook around 2007.

  • It is an open-source Apache project.

  • It is a high-level abstraction layer on top of MapReduce and Apache Spark.

  • It uses HiveQL to query structured data described in a metastore and generates MapReduce or Spark jobs that run on the Hadoop cluster.

  • It is suited to batch work on structured data, such as periodic reviews and analysis of historic data.

Which one to choose - Hive or Impala?

Hive has more features than Impala:

  • It supports non-scalar data types, extensibility mechanisms, sampling, and lateral views.

  • It is highly extensible and commonly used for batch processing.

Impala is a specialized SQL engine that offers:

  • Performance five to fifty times faster than Hive.

  • A good fit for interactive queries and data analysis.

  • Support for many concurrent users, unlike Hive.

Over the last few years, more features have been added to Impala. Hive remains widely used, and each tool has its own strengths.

In the following section of the Apache Hive tutorial, we will compare Relational Database Management Systems, or RDBMS, with Hive and Impala.

Relational Databases vs. Hive vs. Impala

The table given below compares relational databases, Hive, and Impala.

[Table: Relational databases vs. Hive vs. Impala]

The key differences are as follows.

  • RDBMS has full SQL support, whereas Hive and Impala have limited SQL support.

  • You can update and delete individual records or rows from RDBMS, whereas these functionalities are not supported in Hive and Impala.

  • Transactions are possible only in RDBMS and not in Hive and Impala.

  • RDBMS has extensive index support, whereas Hive has limited index support and Impala has no index support.

  • The latency of SQL queries is very low in RDBMS, low in Impala, and high in Hive.

  • You can process terabytes of data in RDBMS, whereas you can process petabytes of data in Impala and Hive.

In the next topic, we will learn the steps to execute queries and to run Hive and Impala queries using various interfaces.

Interacting with Hive and Impala

Given below are the steps involved in executing a query on Hive and Impala. We will first discuss the steps in Hive.

Steps for executing a Query in Hive and Impala

Hive receives a SQL query and performs the following steps to get the result:

  1. Parses HiveQL

  2. Makes optimizations

  3. Plans for execution

  4. Submits job to the cluster

  5. Monitors the progress

  6. Processes the data: The job is sent to the data processing engine where it is either converted to MapReduce or processed by Spark.

  7. Stores the data in HDFS

Let’s now discuss Impala.

Impala performs the following steps after receiving a SQL query:

  1. Parses Impala SQL

  2. Makes optimizations

  3. Plans for execution

  4. Executes a query on the cluster

  5. Stores data in a cluster

Note that Impala does not process the data using MapReduce or Spark; instead, it executes the query directly on the cluster. This design makes Impala faster than Hive.
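The planning and execution steps listed above can be observed in miniature with any SQL engine. The sketch below is an analogy only: it uses SQLite from the Python standard library as a stand-in, asks it for a query plan, and then runs the same statement. Hive and Impala each provide their own EXPLAIN statement, whose output looks different from SQLite's. The users table, the sample rows, and the helper names plan_then_run and demo are hypothetical.

```python
import sqlite3


def plan_then_run(conn, sql):
    """Return (plan_lines, rows): the engine's plan, then the query result."""
    # Planning: ask the engine how it intends to execute the statement.
    # The last column of each plan row holds the human-readable detail.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    # Execution: run the same statement and collect its rows.
    rows = conn.execute(sql).fetchall()
    return plan, rows


def demo():
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Hive or Impala
    conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("ana", "Toronto"), ("bo", "Ottawa"), ("cy", "Toronto")],
    )
    sql = "SELECT city, COUNT(*) FROM users GROUP BY city ORDER BY city"
    plan, rows = plan_then_run(conn, sql)
    conn.close()
    return plan, rows


if __name__ == "__main__":
    plan_lines, result_rows = demo()
    for line in plan_lines:
        print("plan:", line)
    print("rows:", result_rows)
```

The point of the sketch is the ordering: the statement is parsed and planned before any data is read, which mirrors the parse, optimize, plan, and execute sequence described for both tools.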

Interfaces to Run Hive and Impala Queries

Hive and Impala offer many interfaces to run queries. The screenshot shows impala-shell.

[Figure: The command-line shell for Impala, impala-shell]

The command-line shell for Impala is impala-shell, and for Hive it is Beeline.

Similarly, in the Hue Web UI you can access Hive through the Hive Query Editor and Impala through the Impala Query Editor.

[Figure: Hive Query Editor to access Hive]

[Figure: Impala Query Editor to access Impala]

You can also connect to the Metastore Manager through the Open Database Connectivity/Java Database Connectivity drivers, popularly known as ODBC/JDBC.

Running Hive Queries using Beeline

In Beeline, “!” is used to execute Beeline commands. Here are a few of them:

  • !exit - to exit the shell

  • !help - to show the list of all commands

  • !verbose - to show added details of queries

The screenshot shows the use of “!” in Beeline.

[Figure: The use of “!” in Beeline]

Running Beeline from the Command Line

You can execute a HiveQL script file using the “-f” option (the “-u” option supplies the JDBC connection URL) with the following command:

beeline -u … -f simplilearn.hql

You can run HiveQL directly from the command line using the “-e” option with the following command:

beeline -u ... -e 'SELECT * FROM users'

To continue running a script even after an error, add the “--force” option:

beeline -u … --force=true
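The three invocations above differ only in their options, so they can be assembled by one small helper. The Python sketch below builds the argument list for a non-interactive Beeline run. It assumes Beeline's documented “-u”, “-f”, “-e”, and “--force=true” options; the function name beeline_command and the JDBC URL in the usage section are hypothetical placeholders, not values from this lesson.

```python
import shlex


def beeline_command(jdbc_url, script=None, query=None, force=False):
    """Build the argument list for one non-interactive Beeline run.

    Exactly one of `script` (a HiveQL file, passed with -f) or `query`
    (an inline statement, passed with -e) must be given.
    """
    if (script is None) == (query is None):
        raise ValueError("pass exactly one of script or query")
    args = ["beeline", "-u", jdbc_url]
    if script is not None:
        args += ["-f", script]
    else:
        args += ["-e", query]
    if force:
        # Keep running the remaining statements after an error.
        args.append("--force=true")
    return args


if __name__ == "__main__":
    url = "jdbc:hive2://localhost:10000"  # hypothetical connection URL
    commands = [
        beeline_command(url, script="simplilearn.hql"),
        beeline_command(url, query="SELECT * FROM users"),
        beeline_command(url, script="simplilearn.hql", force=True),
    ]
    for cmd in commands:
        # shlex.join quotes each argument for safe display in a shell.
        print(shlex.join(cmd))
```

On a machine where Beeline is installed, the returned list could be passed to subprocess.run. Building a list rather than a single string avoids shell-quoting problems with queries that contain spaces or quotes.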

You can execute any Hive SQL query from the Hive terminal or Beeline. Each SQL command is terminated with a semicolon. You can run Impala queries from impala-shell and Hive queries from Beeline, as the screenshot shows.

[Figure: Running a Hive query from the Hive terminal or Beeline]


Summary

Let’s summarize what we learned in this lesson.

  • Hive and Impala are tools to perform SQL queries on data residing on HDFS/HBase.

  • Hive and Impala are easy to learn for experienced SQL developers.

  • Hive and Impala solve Big Data challenges but do not replace traditional RDBMS.

  • Hive uses HiveQL and converts queries into MapReduce or Spark jobs that run on the Hadoop cluster.

  • Impala uses a specialized SQL engine that is faster than MapReduce-based execution.

Conclusion

This concludes the lesson “Basics of Hive and Impala.” In the next lesson, we will discuss how to work with Hive and Impala.
