February 24, 2015

5 Ways to Make Your Hive Queries Run Faster

As a data scientist working with Hadoop, I often use Apache Hive to explore data, make ad-hoc queries or build data pipelines.

Until recently, optimizing Hive queries focused mostly on data layout techniques such as partitioning and bucketing or using custom file formats.

In the last couple of years, driven largely by the innovation of the Hive community around the Stinger initiative, Hive query time has improved dramatically, enabling Hive to support both batch and interactive workloads at speed and at scale.

However, many data scientists remain unfamiliar with the basic techniques and best practices for running their Hive queries at maximum speed. In this blog post, I will highlight a few simple techniques I use most often to improve the performance of my Hive queries.

Technique #1: Use Tez

Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. I won’t go into details about the many benefits of using Tez, which are mentioned here; instead, I want to make a simple recommendation: if it’s not turned on by default in your environment, enable Tez with the following setting at the beginning of your Hive query:

set hive.execution.engine=tez;

With the above setting, every Hive query you execute will take advantage of Tez.
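If you want to double-check which engine is active before tuning anything else, the Hive CLI prints a property's current value when you issue `set` with the property name and no value (a small CLI convenience, not from the original post):

```sql
-- Prints the current value, e.g. hive.execution.engine=tez
set hive.execution.engine;
```

This is handy for confirming whether your environment already defaults to Tez.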

Technique #2: Use ORCFile

Hive supports ORCFile, a new table storage format that sports fantastic speed improvements through techniques like predicate push-down, compression and more.

Using ORCFile for every Hive table should really be a no-brainer, and it is extremely beneficial for getting fast response times on your Hive queries.

As an example, consider two large tables A and B (stored as text files, with only some of their columns shown here), and a simple query like:

SELECT A.customerID, A.name, A.age, A.address,
B.role, B.department, B.salary
FROM A JOIN B
ON A.customerID = B.customerID;

This query may take a long time to execute since tables A and B are both stored as TEXT. Converting these tables to ORCFile format will usually reduce query time significantly:

CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");


CREATE TABLE B_ORC (
customerID int, role string, salary float, department string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");


SELECT A_ORC.customerID, A_ORC.name,
A_ORC.age, A_ORC.address,
B_ORC.role, B_ORC.department, B_ORC.salary
FROM A_ORC JOIN B_ORC
ON A_ORC.customerID = B_ORC.customerID;

ORC supports compressed storage (with ZLIB or, as shown above, SNAPPY) as well as uncompressed storage.

Converting base tables to ORC is often the responsibility of your ingest team, and it may take them some time to change the complete ingestion process due to other priorities. The benefits of ORCFile are so tangible that I often recommend a do-it-yourself approach as demonstrated above – convert A into A_ORC and B into B_ORC and do the join that way, so that you benefit from faster queries immediately, with no dependencies on other teams.
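The do-it-yourself conversion needs one more step that the snippets above don't show: populating the ORC tables from the original text tables. A minimal sketch, assuming the text tables A and B and the ORC tables A_ORC and B_ORC from the example above:

```sql
-- Copy the existing text tables into their ORC counterparts
INSERT OVERWRITE TABLE A_ORC SELECT * FROM A;
INSERT OVERWRITE TABLE B_ORC SELECT * FROM B;
```

After this one-time copy, all subsequent queries can run against the ORC tables.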

Technique #3: Use Vectorization

Vectorized query execution improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at a time instead of one row at a time.

Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameter settings:

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

Technique #4: Use Cost-Based Query Optimization

Hive optimizes each query’s logical and physical execution plan before submitting it for final execution. Until recently, these optimizations were not based on the cost of the query.

A recent addition to Hive, cost-based optimization performs further optimizations based on query cost, which can result in different decisions: how to order joins, which type of join to perform, the degree of parallelism, and more.

To use cost-based optimization (also known as CBO), set the following parameters at the beginning of your query:

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO.

For example, for a table tweets, we want to collect statistics about the table and about two of its columns, “sender” and “topic”:

analyze table tweets compute statistics;
analyze table tweets compute statistics for columns sender, topic;

With Hive 0.14 (on HDP 2.2) the analyze command works much faster, and you don’t need to specify each column, so you can simply issue:

analyze table tweets compute statistics for columns;

That’s it. Queries against this table should now get a different, faster execution plan, thanks to the cost calculations Hive performs.

Technique #5: Write good SQL

SQL is a powerful declarative language. Like other declarative languages, there is more than one way to write a SQL statement. Statements with identical functionality can have strikingly different performance characteristics.

Let’s look at an example. Consider a click-stream event table:

CREATE TABLE clicks (
timestamp date, sessionID string, url string, source_ip string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Each record represents a click event, and we would like to find the latest URL for each sessionID.

One might consider the following approach:

SELECT clicks.* FROM clicks inner join
(select sessionID, max(timestamp) as max_ts from clicks
group by sessionID) latest
ON clicks.sessionID = latest.sessionID and
clicks.timestamp = latest.max_ts;

In the above query, we build a sub-query to collect the timestamp of the latest event in each session, and then use an inner join to filter out the rest.

While this query is a reasonable solution from a functional point of view, it turns out there’s a better way to rewrite it:

SELECT * FROM
(SELECT *, RANK() over (partition by sessionID
order by timestamp desc) as rank
FROM clicks) ranked_clicks
WHERE ranked_clicks.rank=1;

Here we use Hive’s OLAP windowing functionality (OVER and RANK) to achieve the same thing without a join.
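One subtlety worth noting (my addition, not from the original post): RANK() assigns the same rank to ties, so if two clicks in a session share the latest timestamp, both rows come back. If you want exactly one row per session, a variant using ROW_NUMBER() (available since Hive 0.11 along with the other windowing functions) would look like:

```sql
-- Returns exactly one row per sessionID, even when timestamps tie
SELECT * FROM
(SELECT *, ROW_NUMBER() over (partition by sessionID
order by timestamp desc) as rn
FROM clicks) ranked_clicks
WHERE ranked_clicks.rn = 1;
```

Which behavior you want depends on whether tied "latest" clicks should all survive.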

Clearly, removing an unnecessary join will almost always result in better performance, and with big data this matters more than ever. I find many cases where queries are not optimal, so look carefully at every query and consider whether a rewrite can make it better and faster.


Apache Hive is a powerful tool in the hands of data analysts and data scientists, and supports a variety of batch and interactive workloads.

In this blog post, I’ve discussed some useful techniques—the ones I use most often and find most useful for my day-to-day work as a data scientist—to make Hive queries run faster.

Thankfully, the Hive community is not finished yet. Even between Hive 0.13 and Hive 0.14, there are dramatic improvements in ORCFile, vectorization and CBO, and in how they positively impact query performance.

I’m really excited about Stinger.next, which aims to bring performance improvements into the sub-second range.

I can’t wait.



Bhaskar says:

great know this. Good article

anand says:

precise & straight forward explanation, its helpful to setup & monitor the performance

Dipak kolhe says:

its awesome…. Very Very useful …. Thanks You

Jack says:

Great post Ofer!

You’re forgetting the 6th way though: Use http://www.atscale.com

AtScale lets regular users to get screaming performance and multi-dimensional analysis capabilities on Hadoop, straight from the tools they already know (Tableau, Qlik, Excel…). Sometimes 200X faster.

Business users love this and so does IT. With AtScale, they don’t have to move data, install new drivers on end-user machines or deploy invasive bits on their clusters.

Eric Tschetter says:

Jack you troll, you’re forgetting the 7th way!

Use Druid! Druid is an ultra-fast time-series OLAP database. Ready for web-scale.

Check out my great paper online or just:

npm install druid.js

Naaraayanan says:

Excellent. Explained to understand easily and can do practice.


rakesh says:

extraordinary understanding

Ramesh Sreera says:

Excellent article to the point for performance tuning hive queries

Balachandar says:

Very good points mentioned in this article. It will help in the coming days. Thanks

Thiru says:

Very good information to optimize Hive queries

Hitender Razora says:

Typo Error in the blog
Sports should be changed to Supports.

Thinklayer says:

Data warehousing Optimization Delivering data warehousing solutions with industry certified and latest data warehousing technologies to deliver value for your enterprise.

Manoj Ramaiyan says:

Thank you very much Ofer. I appreciate it very much. Clear cut and simple.

Sharad Choudhary says:

You helped me with this blog on the self join on Tables where I was stuck for 1 week. Thanks so much! You rock!

Jay says:

Will vectorization work on tables saved in non ORC format?

Abbas says:

Good explainations & clear

Sakthi says:

Nice blog sir, As a beginners like me it will helpful for lot.. Thanks..

Mohana says:

Very useful.. Made my day.. Thank you…!

Ashfaque says:

excellent work

Madhavan says:

looks neat and helpful, thanks!

ccc says:

Great Post !!

Small correction.
No need of comma(,) after “partition by sessionID”

(SELECT *, RANK() over (partition by sessionID

order by timestamp desc) as rank

FROM clicks) ranked_clicks

WHERE ranked_clicks.rank=1;
