September 03, 2014
Enterprise SQL at Hadoop Scale with Apache Hive

In April of this year, Hortonworks, along with the broad Hadoop community, delivered the final phase of the Stinger Initiative on schedule, completing the work to bring interactive SQL query to Apache Hive. The original directive of Stinger was to advance SQL capabilities at petabyte scale in pure open source. And over 13 months, 145 developers from 44 companies delivered exactly that, contributing over 390,000 lines of code to the Hive project alone.

While this community collaboration has had a tremendously positive impact for data workers, business analysts and the many data center tools around Hadoop that rely on Hive for SQL in Hadoop, it was just the beginning.

Apache Hive, and Enterprise SQL at Hadoop Scale

The Stinger Initiative enabled Hive to support an even broader range of use cases at truly Big Data scale: bringing it beyond its batch roots to support interactive queries – all with a common SQL access layer. Stinger.next is a continuation of this initiative, focused on further enhancing the speed, scale and breadth of SQL support to enable truly real-time access in Hive, while also bringing support for transactional capabilities. And just as the original Stinger Initiative did, this will be addressed through a familiar three-phase delivery schedule and developed completely in the open Apache Hive community.

Stinger.next Project Goals

  • Deliver sub-second query response times.
  • Provide the only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes.
  • Enable transactions and SQL:2011 Analytics for Hive.

Hive has always been the de facto standard for SQL in Hadoop, and these advances will surely accelerate the production deployment of Hive across a much wider array of scenarios. Explicitly, some of the key deliverables that will enable these new business applications of Hive include:

  • Transactions with ACID semantics allow users to easily modify data with inserts, updates and deletes. They extend Hive from the traditional write-once, read-often system to support analytics over changing data. This enables reporting with occasional corrections and modifications, and allows operational reporting with periodic bulk updates from an operational database.
  • Sub-second queries will allow users to deploy Hive for interactive dashboards and exploratory analytics that have more demanding response-time requirements.
  • SQL:2011 Analytics allows rich reporting to be deployed on Hive faster, more simply and more reliably using standard SQL. A powerful cost-based optimizer ensures complex queries and tool-generated queries run fast. Hive will provide the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.

Transactions with ACID semantics in Hive

Hive has been used as a write-once, read-often system, where users add partitions of data and query this data often. ACID is a major shift in this paradigm, adding SQL transactions that allow users to insert, update and delete existing data. This enables a much wider set of use cases that require periodic modifications to existing data. ACID will include BEGIN, COMMIT and ROLLBACK for multi-statement transactions in future releases.
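As a rough sketch of what this looks like in HiveQL (the table and column names here are hypothetical; ACID tables require bucketing, the ORC format, and the `transactional` table property):

```sql
-- Hypothetical example: an ACID table must be bucketed, stored as ORC,
-- and marked transactional in its table properties.
CREATE TABLE page_views (
  user_id   INT,
  page      STRING,
  view_time TIMESTAMP
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- With ACID enabled, rows can be modified in place rather than
-- rewritten a whole partition at a time.
INSERT INTO TABLE page_views VALUES (1, '/home', '2014-09-03 10:00:00');
UPDATE page_views SET page = '/index' WHERE user_id = 1;
DELETE FROM page_views WHERE user_id = 1;
```

Each statement above is a single-statement transaction; the multi-statement BEGIN/COMMIT/ROLLBACK form is planned for future releases, as noted.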


Sub-Second Queries with Hive LLAP

Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up the scale and flexibility that users depend on. This requires a new approach: a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online).

LLAP is an optional daemon process running on multiple nodes that provides the following:

  • Caching and data reuse across queries with compressed columnar data in-memory (off-heap)
  • Multi-threaded execution including reads with predicate pushdown and hash joins
  • High throughput IO using Async IO Elevator with dedicated thread and core per disk
  • Granular column level security across applications

YARN will provide workload management in LLAP by using delegation. Queries will bring information from YARN to LLAP about their authorized resource allocation. LLAP processes will then allocate additional resources to serve the query as instructed by YARN.

The hybrid engine approach provides fast response times through efficient in-memory data caching and low-latency processing, provided by node-resident processes. However, by limiting LLAP use to the initial phases of query processing, Hive sidesteps the limitations around coordination, workload management and failure isolation that are introduced by running the entire query within this process, as other databases do.


Comprehensive SQL:2011 Analytics

Hive will support a subset of SQL:2011 Analytics, with new features added over multiple iterations, driven by customer demand. Hive is already much further along than other SQL options for Hadoop, with strong SQL support including:

  • Window Functions
  • Common Table Expressions
  • Common sub-queries – correlated and uncorrelated
  • Advanced UDFs
  • Rollup, Cube, and Standard Aggregates
  • Inner, outer, semi and cross joins

This next phase will extend that lead to cover most of the frequently used SQL constructs:

  • Non Equi-Joins
  • Set Functions – Union, Except and Intersect
  • Interval types
  • Most sub-queries, nested and otherwise
  • Fixes for syntactic differences from the SQL:2011 spec, such as ROLLUP
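To illustrate the kind of analytic SQL already supported, here is a sketch combining a common table expression and a window function (the `sales` table and its columns are hypothetical):

```sql
-- Hypothetical sales table: rank each region's months by total revenue.
WITH monthly AS (
  SELECT region,
         month,
         SUM(amount) AS total
  FROM   sales
  GROUP  BY region, month
)
SELECT region,
       month,
       total,
       RANK() OVER (PARTITION BY region ORDER BY total DESC) AS month_rank
FROM   monthly;
```

Because this is standard SQL, the same query that tools generate for a traditional warehouse can run unmodified at Hadoop scale.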

Integration with Machine Learning Frameworks

Hive-Spark machine learning integration will also allow Hive users to run machine learning models via Hive. Users want to run both predictive and descriptive analytics in Hive, on the same dataset.

Hive on Spark?

There is a lot of talk about Spark as a powerful engine running on YARN, and we at Hortonworks share that excitement and are actively working to make it enterprise-ready for Spark users. In fact, to integrate with Spark, the broad Hive community is reusing several of the infrastructure components already added to Hive as part of the Tez integration delivered in Hive 0.13.

Some Additional Advances

In addition to these primary use cases, some additional enhancements include:

  • Hive Streaming Ingest helps Hive users expand operational reporting onto the latest data.
  • Hive Cross-Geo Query allows users to query and report on datasets distributed across geographies due to legal or efficiency constraints. Currently, users cannot do this without writing their own application code to stitch together multiple results.
  • Materialized views allow storing multiple views of the same data, enabling faster analyses. The views can be held speculatively in-memory and discarded when memory is needed.
  • Usability improvements will help users work more simply with Hive.
  • Simplified deployment will focus on providing near plug-and-play deployment solutions for the most common use cases.
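Hive had not yet finalized materialized view syntax at the time of writing; as a sketch of the concept, later Hive releases adopted DDL along these lines (the view and table names are hypothetical):

```sql
-- Sketch only: precompute an aggregate once, so repeated analyses
-- can read the stored result instead of rescanning the base table.
CREATE MATERIALIZED VIEW sales_by_region AS
SELECT region,
       SUM(amount) AS total
FROM   sales
GROUP  BY region;
```

The trade-off is the usual one for materialized views: faster reads in exchange for storage and the cost of keeping the view fresh as base data changes.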

These features will be delivered at a rapid pace over the next 18 months. Transactions will ship in late 2014; sub-second queries are coming in the first half of 2015, with a preview in the next few months. An initial outline of the delivery is below. We expect this work to be completed as the initial work was: in scope and on schedule.


Enthusiasm abounds

It is not just Hortonworks that is enthusiastic about this next phase in the delivery of Enterprise SQL at Hadoop Scale.  Some of our key partners have weighed in on their excitement as well.  Watch this space over the next few days as Microsoft, Informatica, Microstrategy and Tableau all weigh in on this important initiative.

And as always, we are excited to continue our work within the Hive community to extend Hive, the leading SQL on Hadoop solution, further in terms of speed, scale, and SQL semantics.

Hive delivers a message of simplicity. It already provides a single tool for all SQL across batch and interactive workloads, and with this work it is extended to near real-time. We’re enthusiastic about the upcoming journey as Hive adds exciting new features toward this goal. Watch this blog for future posts from Apache Hive committers and contributors from around the world, as they share enhancement ideas with the community.



Raviteja Chirala says:

The success of Hive in enterprise clusters is phenomenal. It won’t be long before it replaces traditional warehouses, if it hasn’t already.

Herman Yu says:

exciting! materialized view, hive streaming, ACID support…

Haifeng Li says:

Enabling insert, update, delete, and transactions will make Hive much more complicated because of complex concurrency control. I don’t feel that they are desperately necessary for data warehouses. Why not keep improving performance instead of adding the kitchen sink? It would be great to devote more resources to things like the cost-based query optimizer. There are a lot of things Hive could improve. For example:

1. Reduce the cold-start latency. It is not really a problem of Hive itself, but if the Hadoop community can improve it, it is a big win. Spark already did a good job on it.

2. Many clusters have a high-memory setup, but a large heap is a big challenge for the garbage collector of the reused JVM instances in Hadoop. The stop-the-world GC pauses may add high latency to queries. Hive could do something to reduce these effects. Shark spent a lot of time solving this problem; Hive may borrow something from there.

3. Improve the single node parallelism, e.g. multithreading and SSE/AVX.

4. Make Tez support pipelined execution and use it in Hive.

5. If possible, the physical query plan could use TCP in Tez rather than files in some cases.

6. Dynamic query expression generation.

7. Better straggler handling using the complete process control of Tez.


Sriram says:

The marketing on “sub-second” from Hortonworks is really dubious at best. I don’t expect this from a reputed vendor. You have to have the right amount of hardware with the right specs and the correct partitioning to get sub-second response for only a specific set of queries working on limited data. This is true of Impala or any other execution engine on Hadoop platforms.
