Apache Spark




Spark adds in-memory compute for ETL, machine learning, and data science workloads to Hadoop

What Apache Spark Does

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.

The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP.

Arun Murthy: Hadoop & Spark: Perfect Together (Spark Summit 2015)

Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.


Additional libraries, built atop the core, allow diverse workloads for streaming, SQL, and machine learning.

Spark is designed for data science, and its abstractions make data science easier. Data scientists commonly use machine learning – a set of techniques and algorithms that can learn from data. These algorithms are often iterative, and Spark’s ability to cache the dataset in memory greatly speeds up such iterative data processing, making Spark an ideal engine for implementing them.
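The effect of caching on iterative workloads can be sketched outside Spark with a small gradient-descent loop in plain Python (an illustrative analogy, not the Spark API): one run re-reads its input on every pass, the other keeps it in memory, as `rdd.cache()` would.

```python
def load_dataset():
    # Stand-in for an expensive read: re-parse points from text on every call.
    raw = ["1,2", "2,4", "3,6", "4,8"]
    return [tuple(map(float, line.split(","))) for line in raw]

def fit_slope(get_data, iterations=100, lr=0.01):
    # Iterative algorithm: every pass scans the full dataset once.
    w = 0.0
    for _ in range(iterations):
        data = get_data()
        grad = sum((w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

uncached = fit_slope(load_dataset)        # re-parses the data on all 100 passes
cached_data = load_dataset()              # parse once...
cached = fit_slope(lambda: cached_data)   # ...and reuse it every pass
assert abs(cached - 2.0) < 0.01           # data follows y = 2x
assert uncached == cached                 # same answer, far less repeated work
```

Both runs converge to the same slope; the cached run simply avoids re-doing the load on every iteration, which is exactly the repeated work Spark's in-memory caching eliminates.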

Spark also includes MLlib, a library that provides a growing set of machine learning algorithms for common data science techniques: classification, regression, collaborative filtering, clustering, and dimensionality reduction.

Spark’s ML Pipeline API is a high-level abstraction for modeling an entire data science workflow. The ML Pipeline package in Spark models a typical machine learning workflow and provides abstractions such as Transformer, Estimator, Pipeline, and Parameters. This abstraction layer makes data scientists more productive.
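The division of labor between those abstractions can be sketched in plain Python (illustrative only; the class names mirror the concepts, not Spark's actual classes): an Estimator's fit() learns from data and produces a Transformer, and a Pipeline chains stages into one workflow.

```python
class ScalerModel:
    """A Transformer: transform() maps a dataset to a new dataset."""
    def __init__(self, max_abs):
        self.max_abs = max_abs
    def transform(self, data):
        return [x / self.max_abs for x in data]

class Scaler:
    """An Estimator: fit() learns from data and returns a Transformer."""
    def fit(self, data):
        return ScalerModel(max(abs(x) for x in data))

class Pipeline:
    """Chains Estimators: fits each stage on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        models = []
        for stage in self.stages:
            model = stage.fit(data)      # Estimator -> Transformer
            data = model.transform(data) # feed transformed data to next stage
            models.append(model)
        return models

models = Pipeline([Scaler()]).fit([2.0, 4.0, 8.0])
scaled = models[0].transform([2.0, 4.0, 8.0])
assert scaled == [0.25, 0.5, 1.0]
```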

Spark Use Cases

As Apache Spark’s momentum continues to grow, we see customers across all industries getting real value from using it with the Hortonworks Data Platform (HDP). Customers use Spark to improve their businesses by detecting patterns and producing actionable insight that drives organizational change. The following table provides a few examples, from insurance to internet companies, of how Spark is being used:

Insurance: Optimize the claims reimbursement process by using Spark’s machine learning capabilities to process and analyze all claims.
Healthcare: Build a patient care system using Spark Core, Streaming, and SQL.
Retail: Use Spark to analyze point-of-sale data and coupon usage.
Internet: Use Spark’s ML capability to identify fake profiles and enhance the product matches shown to customers.
Banking: Use a machine learning model to predict the profile of retail banking customers for certain financial products.
Government: Analyze spending across geography, time, and category.
Scientific Research: Analyze earthquake events by time, depth, and geography to predict future events.
Investment Banking: Analyze intra-day stock prices to predict future price movements.
Geospatial Analysis: Analyze Uber trips by time and geography to predict future demand and pricing.
Twitter Sentiment Analysis: Analyze large volumes of tweets to determine positive, negative, or neutral sentiment for specific organizations and products.
Airlines: Build a model for predicting airline travel delays.
Devices: Predict the likelihood of a building exceeding threshold temperatures.

Many customers are using Cloudbreak and Ambari to spin up clusters in the cloud for ad-hoc, self-service data science.

Spark & HDP

Spark is certified as YARN Ready and is a part of HDP. Memory and CPU-intensive Spark-based applications can coexist with other workloads deployed in a YARN-enabled cluster. This approach avoids the need to create and manage dedicated Spark clusters and allows for more efficient resource use within a single cluster.

Spark is integrated within Hortonworks Data Platform

HDP also provides consistent governance, security and management policies for Spark applications, just as it does for the other data processing engines within HDP.

Hortonworks Focus on Spark: Vertical and Horizontal Integration with HDP


Hortonworks approached Spark in the same way we approached other data access engines such as Storm, Hive, and HBase: outline a strategy, rally the community, and contribute key features within the Apache Software Foundation’s process.

Below is a summary of the various integration points that make Spark and HDP enterprise-ready.

  • Support for the ORC File Format
    As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression; it is rapidly becoming the de facto storage format for Hive. Hortonworks worked with the community to bring ORC support to Spark (SPARK-2883, SPARK-10623), and this work is GA as of Spark 1.4.1.
  • Security
    Many of our customers’ initial use cases for Spark run on Hadoop clusters that either do not contain sensitive data or are dedicated to a single application, and so are not subject to broad security requirements. But as users plan to deploy Spark-based applications alongside other applications in a single cluster, we have worked to integrate Spark with the security constructs of the broader Hadoop platform. A common request is that Spark run effectively on a secure Hadoop cluster and leverage the authorization offered by HDFS. To improve security, we have also worked within the community to ensure that Spark runs on a Kerberos-enabled cluster, which means that only authenticated users can submit Spark jobs.
  • Operations
    Hortonworks continues to focus on streamlining operations for Spark through the 100% open source Apache Ambari. Our customers use Ambari to provision, manage, and monitor their HDP clusters, and many Hortonworks partners, such as Microsoft, Teradata, Pivotal, and HP, have backed this foundational Hadoop project. Partners leverage Ambari Stacks to rapidly define new components and services and add them to a Hadoop cluster. With Stacks, Spark components and services can be managed by Ambari, so you can install, start, stop, and fine-tune a Spark deployment via the single interface used for all engines in your Hadoop cluster. The Quick Links feature of Ambari lets the cluster operator access the native Spark user interface. To simplify the operational experience, HDP 2.2.4 also allows Spark to be installed and managed by Apache Ambari 2.0, which lets the cluster administrator manage Spark configuration and the Spark daemons’ life cycles.
  • Seamless Data Access
    Most of our customers who use Spark also use Hive. Last year we worked with the community to upgrade the version of Hive in Spark to 0.13.1 and this year we have upgraded Spark to use Hive 1.2.1. We have also streamlined the integration between Hive & Spark. This integration enables customers who have embraced the data lake concept to access their data from Hive or from Spark without running into Hive version incompatibility issues. In addition, this work lays the foundation for both Hive and Spark  to evolve independently while also allowing Spark to leverage Hive.
  • Improved Reliability and Scale of Spark-on-YARN
    The Spark API allows developers to create both iterative and in-memory applications on Apache Hadoop YARN. With the community interest behind it, Spark is making great strides in efficient cluster resource usage. With Dynamic Executor Allocation on YARN, Spark only uses executors within a bound. We continue to believe Spark can use cluster resources more efficiently and are working with the community to promote better resource usage.
  • YARN ATS Integration
    From an operations perspective, Hortonworks has integrated Spark with the YARN Application Timeline Server (ATS). ATS provides generic storage and retrieval of applications’ current and historic information. This permits a common integration point for certain classes of operational information and metrics. With this integration, the cluster operator can take advantage of information already available from YARN to gain additional visibility into the health and execution status of the Spark jobs.

Fundamentally, our strategy continues to focus on innovating at the core of Hadoop. We look forward to continuing to support our customers and partners by contributing to a vibrant Hadoop ecosystem that includes Apache Spark as yet another data access application running on YARN.


What's New in HDP 2.6


Access to Latest Innovation

  • Spark 2.1
  • Spark 1.6.3
  • Zeppelin 0.7

Enhanced Security

  • Spark Thrift Server doAs
  • Spark Streaming + Kafka over SSL

Improved Usability

  • Livy REST API
  • Package support in PySpark (Spark Python API) and SparkR: data scientists using Spark with the R language can now deploy their favorite R package with their Spark job
  • Multi-cluster HBase support for SHC
  • SparkSQL: fine-grained security with Ranger integration for row/column security

Recent Progress in Spark

For additional details about these releases, review the following:

Spark Version | Notable Enhancements

2.1
  • Extensive support for ML algorithms in SparkR, including LDA, Gaussian Mixture Models, ALS, Random Forest, and Gradient Boosted Trees, among others
  • Structured Streaming: API enhancements and stability
  • MLlib: API enhancements, performance, and stability
  • Focus on Structured Streaming, ML, and SparkR
  • Kafka 0.10 support with Structured Streaming
  • Metrics with Structured Streaming (latency, resource use, delay)
  • Structured Streaming still alpha
  • SparkR now has LDA, ALS, RF, GMM, GBT, and more
  • SparkR now supports package distribution with addFile

1.6
  • API improvements
  • Performance improvements
  • New machine learning API and distributed R algorithms
  • Improved SparkSQL with more SQL support
  • Accelerated Spark Streaming
  • Introduced the Dataset API
  • Automatic memory tuning
  • New version of Zeppelin technical preview

1.5
  • Adds several new machine learning algorithms and utilities, and extends Spark’s new R API
  • Adds web visualization of SQL and DataFrame query plans
  • Delivers first major pieces of Project Tungsten
  • Adds backpressure support to Spark Streaming

Hortonworks Focus for Spark

Hortonworks continues to invest in Spark for Open Enterprise Hadoop so users can deploy Spark-based applications alongside other Hadoop workloads in a consistent, predictable and robust way.  At Hortonworks we believe that Spark & Hadoop are perfect together and our focus is on:

  Data Science Acceleration

  • Improve data science productivity by improving Apache Zeppelin and by contributing additional Spark algorithms and packages to ease the development of key solutions.

  Seamless Data Access

  • There are additional opportunities for Hortonworks to contribute to and maximize the value of technologies that interact with Spark.
  • We are focused on improving Spark’s integration with YARN, HDFS, Hive, HBase and ORC.
  •  Specifically, we believe that we can further optimize data access via the new DataSources API. For example, this should allow SparkSQL users to take full advantage of the following capabilities:
    • ORC File instantiation as a table
    • Column pruning
    • Language integrated queries
    • Predicate pushdown

  Innovate at the Core

  • Contribute additional machine learning algorithms and enhance Spark’s enterprise security, governance, operations, and readiness.

You can read further details of our focus from the Spark and Hadoop Perfect Together Blog.

What's New in Spark 2.0

Apache Spark 2.0 marks a big milestone for the project. This release focuses on feature improvements based on community feedback, in four main areas of Spark’s development.


SQL

One of the most popular interfaces for Apache Spark-based applications is SQL. Spark 2.0 offers support for all 99 TPC-DS queries, which are largely based on the SQL:2003 specification. This alone can help port existing data loads onto a Spark backend with minimal rewriting of the application stack.

Machine Learning

Machine learning has a big emphasis in this new release. The new package based on DataFrames replaces the existing Spark MLlib. Machine learning pipelines and models can now be persisted across all languages supported by Spark. K-Means, Generalized Linear Models (GLM), Naive Bayes, and Survival Regression are now supported in R.


Unified DataFrames and Datasets

DataFrames and Datasets are now unified for the Scala and Java programming languages under the new Dataset class, which also serves as an abstraction for Structured Streaming. SQLContext and HiveContext are now replaced by the unified SparkSession. Old APIs have been deprecated but remain for backwards compatibility.

Structured Streaming API

The new Structured Streaming API aims to let developers manage streaming data sets without added complexity, in the same way that programs and existing machine learning algorithms deal with batch-loaded data sets. Performance has also improved with the second-generation Tungsten engine, allowing for up to 10 times faster execution.
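The core idea — a streaming query is the same logical computation as a batch query, maintained incrementally as micro-batches arrive — can be sketched in plain Python (illustrative only, not the Spark API):

```python
from collections import Counter

def batch_word_count(all_lines):
    # The batch query: one aggregation over the complete dataset.
    counts = Counter()
    for line in all_lines:
        counts.update(line.split())
    return counts

def streaming_word_count(micro_batches):
    # The "streaming" version: the same aggregation, updated incrementally.
    state = Counter()                 # the continuously-updated result table
    for batch in micro_batches:       # each trigger processes one micro-batch
        for line in batch:
            state.update(line.split())
        # after each batch, `state` is the result over all data seen so far
    return state

batches = [["spark streams data"], ["spark scales", "data flows"]]
all_lines = [line for batch in batches for line in batch]
assert streaming_word_count(batches) == batch_word_count(all_lines)
```

The assertion captures the Structured Streaming contract: processing the data incrementally yields the same answer as running the equivalent batch query over everything at once.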

Spark 1.6 Technical Overview

This technical overview allows you to evaluate Apache Spark 1.6 on YARN with HDP 2.4.

With YARN, Hadoop supports various types of workloads. Spark on YARN becomes yet another workload running against the same set of hardware resources.

This technical overview describes how to:

  • Run Spark on YARN and run the canonical Spark examples: SparkPi and Wordcount.
  • Run Spark 1.6 on HDP 2.4.
  • Use the Spark DataFrame API.
  • Read/write data from Hive.
  • Use SparkSQL Thrift Server for JDBC/ODBC access.
  • Use ORC files with Spark, with examples.
  • Use SparkR.
  • Use the DataSet API.

When you are ready to go beyond these tasks, try the machine learning examples in the Apache Spark documentation.

HDP Cluster Requirements:

This technical overview can be installed on any HDP 2.3.x or 2.4 cluster, whether it is a multi-node cluster or a single-node HDP Sandbox.


The Spark 1.6 Technical Overview is provided in RPM and DEB package formats. The following instructions assume RPM packaging:

  1. Download the Spark 1.6 RPM repository:
    wget -nv -O /etc/yum.repos.d/HDP-TP.repo
    (For installation on Ubuntu, use the corresponding DEB repository instead.)
  2. Install the Spark Package:
    Download the Spark 1.6 RPM (and pySpark, if desired) and set it up on your HDP 2.3 cluster:
    yum install spark_2_3_4_1_10-master -y

    If you want to use pySpark, install it as follows and make sure that Python is installed on all nodes.

    yum install spark_2_3_4_1_10-python -y

    The RPM installer will also download core Hadoop dependencies. It will create “spark” as an OS user, and it will create the /user/spark directory in HDFS.
  3. Set environment variables:
    Make sure that you set JAVA_HOME before you launch the Spark Shell or Thrift Server:

    export JAVA_HOME=<path to JDK 1.8>

    The Spark install creates the directory where the Spark binaries are unpacked (/usr/hdp/ Set the SPARK_HOME variable to this directory:

    export SPARK_HOME=/usr/hdp/
  4. Create hive-site.xml in the Spark conf directory:
    As user root, create the file SPARK_HOME/conf/hive-site.xml. Edit the file to contain only the following configuration setting:

        <configuration>
          <property>
            <name>hive.metastore.uris</name>
            <!--Make sure that <value> points to the Hive Metastore URI in your cluster -->
            <value>thrift://<metastore-host>:9083</value>
            <description>URI for client to contact metastore server</description>
          </property>
        </configuration>

Run the Spark Pi Example

To test compute-intensive tasks in Spark, the Pi example calculates pi by “throwing darts” at a circle — it generates random points in the unit square ((0,0) to (1,1)) and counts how many fall within the unit circle. The fraction of points inside the circle approximates pi/4, which is used to estimate Pi.
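A single-machine sketch of the same estimator in plain Python may help; the Spark job distributes exactly this loop across executors:

```python
import random

def estimate_pi(samples, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()  # a random point in the unit square
        if x * x + y * y <= 1.0:           # inside the quarter circle
            inside += 1
    # fraction inside approximates pi/4, so scale by 4
    return 4.0 * inside / samples

pi_estimate = estimate_pi(100_000)
assert 3.10 < pi_estimate < 3.18  # close to pi with 100k samples
```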

  1. Change to your Spark directory and switch to the spark OS user:
    cd $SPARK_HOME
    su spark
  2. Run the Spark Pi example in yarn-client mode:
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

    Note: The Pi job should complete without any failure messages. It should produce output similar to the following. Note the value of pi near the end of the output.

    15/12/16 13:21:05 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 4.313782 s
    Pi is roughly 3.139492
    15/12/16 13:21:05 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}

Using WordCount with Spark

First, copy the input file for the Spark WordCount Example.

Upload the input file you want to use in WordCount to HDFS. You can use any text file as input; the following example copies a file from /etc/hadoop/conf to /tmp/data.

As user spark:

hadoop fs -copyFromLocal /etc/hadoop/conf/ /tmp/data

To run WordCount:

  1. Run the Spark shell:
    ./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

    Output similar to the following will be displayed, followed by the “scala>” REPL prompt:

    Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
    Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
    Type in expressions to have them evaluated.
    Type :help for more information.
    15/12/16 13:21:57 INFO SparkContext: Running Spark version 1.6.0
  2. At the Scala REPL prompt, enter:
    val file = sc.textFile("/tmp/data")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("/tmp/wordcount")
  3. Review output
    To view WordCount output in the scala shell:
    scala > counts.count()

    To print the full output of the WordCount job:

    scala > counts.toArray().foreach(println)

    To view WordCount output using HDFS:

    1. Exit the scala shell.
      scala > exit
    2. View WordCount Results:
      hadoop fs -ls /tmp/wordcount


    3. Use the HDFS cat command to see the WordCount output. For example:
      hadoop fs -cat /tmp/wordcount/part-00000
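For readers newer to the RDD operations used above, the same computation can be mirrored step by step in plain Python (illustrative only): flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums the 1s per word.

```python
def word_count(lines):
    words = [w for line in lines for w in line.split(" ") if w]  # flatMap
    pairs = [(w, 1) for w in words]                              # map
    counts = {}
    for word, one in pairs:                                      # reduceByKey(_ + _)
        counts[word] = counts.get(word, 0) + one
    return counts

result = word_count(["to be or", "not to be"])
assert result == {"to": 2, "be": 2, "or": 1, "not": 1}
```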

Using the Spark DataFrame API

The DataFrame API provides easier access to data because it looks conceptually like a table. Developers who know the Python pandas library or R data frames will find Spark DataFrames a natural fit.

  1. As user spark, upload people.txt and people.json files to HDFS:
    cd $SPARK_HOME
    su spark
    hdfs dfs -copyFromLocal examples/src/main/resources/people.txt people.txt
    hdfs dfs -copyFromLocal examples/src/main/resources/people.json people.json
  2. As user spark, launch the Spark Shell:
    cd $SPARK_HOME
    su spark
    ./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client
  3. At the Spark Shell, type the following:
    scala> val df = sqlContext.read.format("json").load("people.json")
  4. Using, display the contents of the DataFrame:
    scala>
    15/12/16 13:28:15 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+

Additional DataFrame API examples

scala> import org.apache.spark.sql.functions._
// Select all, and increment age by 1
scala>"name"), df("age") + 1).show()
// Select people older than 21
scala> df.filter(df("age") > 21).show()
// Count people by age
scala> df.groupBy("age").count().show()
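To make the expected results concrete, here is what those three DataFrame operations compute, mirrored on plain Python records (using the standard people.json sample data — Michael with no age, Andy 30, Justin 19; illustrative, not the Spark API):

```python
people = [{"name": "Michael", "age": None},
          {"name": "Andy", "age": 30},
          {"name": "Justin", "age": 19}]

#"name"), df("age") + 1): project name and age + 1 (null stays null)
incremented = [(p["name"], p["age"] + 1 if p["age"] is not None else None)
               for p in people]

# df.filter(df("age") > 21): keep rows whose age is known and above 21
older = [p for p in people if p["age"] is not None and p["age"] > 21]

# df.groupBy("age").count(): number of rows per distinct age value
by_age = {}
for p in people:
    by_age[p["age"]] = by_age.get(p["age"], 0) + 1

assert incremented[1] == ("Andy", 31)
assert older == [{"name": "Andy", "age": 30}]
assert by_age == {None: 1, 30: 1, 19: 1}
```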

Programmatically Specifying Schema

import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sc.textFile("people.txt")
val schemaString = "name age"
import org.apache.spark.sql.types.{StructType,StructField,StringType}
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people") => "Name: " + t(0)).collect().foreach(println)

This will produce output similar to the following:

15/12/16 13:29:19 INFO DAGScheduler: Job 9 finished: collect at :39, took 0.251161 s
15/12/16 13:29:19 INFO YarnHistoryService: About to POST entity application_1450213405513_0012 with 10 events to timeline service http://green3:8188/ws/v1/timeline/
Name: Michael
Name: Andy
Name: Justin
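The schema-string technique above can be mirrored in plain Python to show the mechanics (illustrative sketch; people.txt contains lines like "Michael, 29"): build field names from the schema string, then zip each split row against that schema.

```python
schema_string = "name age"
fields = schema_string.split(" ")               # ["name", "age"]

raw = ["Michael, 29", "Andy, 30", "Justin, 19"]  # the people.txt sample rows
# Split each line on commas, trim whitespace, and pair values with field names.
rows = [dict(zip(fields, (v.strip() for v in line.split(","))))
        for line in raw]

# Equivalent of SELECT name FROM people, then the "Name: ..." mapping.
names = ["Name: " + r["name"] for r in rows]
assert names == ["Name: Michael", "Name: Andy", "Name: Justin"]
```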

Running Hive Examples

The following example reads and writes to HDFS under Hive directories.

  1. Launch the Spark Shell on YARN cluster:
    su hdfs
    ./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client
  2. Create Hive Context:
    scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    You should see output similar to the following:

    hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@7d9b2e8d
  3. Create a Hive Table:
    scala> hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (key INT, value STRING)")

    You should see output similar to the following:

    15/12/16 13:36:12 INFO PerfLogger: </PERFLOG 
    start=1450290971011 end=1450290972561 duration=1550 
    res8: org.apache.spark.sql.DataFrame = [result: string]
  4. Load example KV value data into Table:
    scala> hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE TestTable")
  5. Invoke Hive collect_list UDF:
    scala> hiveContext.sql("from TestTable SELECT key, collect_list(value) group by key order by key").collect.foreach(println)

Reading & Writing ORC Files

Hortonworks worked in the community to bring full ORC support to Spark. Recently we blogged about using ORC with Spark. See the blog post for all ORC examples, with advances such as partition pruning and predicate pushdown.

Accessing the Spark SQL Thrift Server for JDBC/ODBC Access

With this technical overview, Spark SQL’s Thrift Server provides JDBC access to Spark SQL.

  1. As root user, create a logs directory and make the spark user its owner:
    mkdir logs
    chown spark:hadoop logs
  2. Start the Thrift Server:
    From SPARK_HOME, start the Spark SQL Thrift Server as the spark user. Specify the port for the Thrift JDBC server (the default is 10015):
    su spark
    ./sbin/ --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10015
  3. Connect to the Thrift Server over Beeline:
    Launch Beeline from SPARK_HOME:
    su spark
    cd $SPARK_HOME
    ./bin/beeline
  4. Connect to the Thrift Server and issue SQL commands
    On the Beeline prompt specify the following command, replacing <hostname> with the hostname where you launched the Spark Thrift Server:
    beeline>!connect jdbc:hive2://<hostname>:10015

    – This example does not have security enabled, so any username-password combination should work.
    – The connection might take a few seconds to be available in a Sandbox environment. Try the “show tables” command after waiting 10-15 seconds in a Sandbox environment.

    15/12/16 13:43:02 INFO HiveConnection: Will try to open client 
    transport with JDBC Uri: jdbc:hive2://green4:10015
    Connected to: Spark SQL (version 1.6.0)
    Driver: Spark Project Core (version
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    1: jdbc:hive2://green4:10015> show tables;
    |  tableName   | isTemporary  |
    | testtable    | false        |
    | testtable16  | false        |
    2 rows selected (0.893 seconds)
    1: jdbc:hive2://green4:10015>
  5. Stop the Thrift Server:
    ./sbin/

Running SparkR

Before you run SparkR, make sure R is installed on all nodes, and be sure to set JAVA_HOME.

  1. Launch SparkR:
    su spark
    cd $SPARK_HOME
    ./bin/sparkR

    This will show output similar to the following:

    Welcome to
        ____              __ 
       / __/__  ___ _____/ /__ 
      _\ \/ _ \/ _ `/ __/  '_/ 
     /___/ .__/\_,_/_/ /_/\_\   version  1.6.0 
    Spark context is available as sc, SQL context is available as sqlContext
  2. At your R prompt, create a DataFrame and list the first few lines:
    sqlContext <- sparkRSQL.init(sc)
    df <- createDataFrame(sqlContext, faithful)
    head(df)

    This will show output similar to the following:

     eruptions waiting
    1     3.600      79
    2     1.800      54
    3     3.333      74
    4     2.283      62
    5     4.533      85
    6     2.883      55
  3. Read the “people” DataFrame:
    people <- read.df(sqlContext, "people.json", "json")
    head(people)

    This will produce output similar to the following:

      age    name
    1  NA   Michael
    2  30   Andy
    3  19   Justin

For additional SparkR examples, see the SparkR documentation.

To exit R:

> quit()

Using the DataSet API

The Spark Dataset API brings the best of RDDs and DataFrames together, offering type safety and user functions that run directly on existing JVM types.

  1. As user spark, launch the Spark Shell:
    cd $SPARK_HOME
    su spark 
    ./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client
  2. At the Spark Shell, type the following:
    val ds = Seq(1, 2, 3).toDS() + 1).collect() // Returns: Array(2, 3, 4)
    // Encoders are also created for case classes.
    case class Person(name: String, age: Long)
    val ds = Seq(Person("Andy", 32)).toDS()
    // DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name.
    val path = "people.json"
    val people =[Person]
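As a rough analogy in plain Python (not the Spark API), typed records plus ordinary functions over them capture what the Dataset example above does: computation runs directly on native objects rather than on untyped rows.

```python
from dataclasses import dataclass

@dataclass
class Person:           # ~ case class Person(name: String, age: Long)
    name: str
    age: int

ds = [1, 2, 3]                      # ~ Seq(1, 2, 3).toDS()
mapped = [x + 1 for x in ds]        # ~ + 1).collect()
assert mapped == [2, 3, 4]

people_ds = [Person("Andy", 32)]    # ~ Seq(Person("Andy", 32)).toDS()
assert people_ds[0].name == "Andy"  # fields are typed, checked attributes
```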

Running the Machine Learning Spark Application

To optimize MLlib performance, install the netlib-java native library. If this native library is not available at runtime, you will see a warning message and a pure JVM implementation will be used instead.

To use MLlib in Python, you will need NumPy version 1.4 or later.

See the Apache Spark documentation for Spark ML examples.

If you have questions about Spark and data science, visit the Hortonworks Community Connection.

Getting Started with Spark

For developers new to Spark, our conversations typically revolve around two stages in their journey building Spark-based applications:

Stage 1 – Explore and Develop in Spark Local Mode

The first stage starts with Spark’s local mode, where Spark runs on a single node. The developer uses this setup to learn Spark and to build a prototype of their application against the Spark API. Using the Spark shells (the Scala REPL and PySpark), a developer rapidly prototypes and packages a Spark application with tools such as Maven or the Scala Build Tool (SBT). Even though the dataset is typically small (so that it fits on a developer machine), the developer can easily debug the application on a single node.

Stage 2 – Deploy Production Spark Applications

The second stage involves running the prototype application against a much larger dataset to fine tune it and get it ready for a production deployment. Typically, this means running Spark on YARN as another workload in the enterprise data lake and allowing it to read data from HDFS. The developer takes the custom application created against a local mode of Spark and submits the application as a Spark job to a staging or production cluster.

Data Science with Spark

For data scientists, Spark is a highly effective data processing tool. It offers first class support for machine learning algorithms and provides an expressive and higher-level API abstraction for transforming or iterating over datasets. Put simply, Apache Spark makes it easier to build machine learning pipelines compared to other approaches.
Data scientists often use tools such as notebooks (e.g., iPython) to quickly create prototypes and share their work. Many data scientists love R, and the Spark community is hard at work delivering R integration through SparkR. We are excited about this emerging capability.

For ease of use, Apache Zeppelin is an emerging tool that provides Notebook features for Spark. We have been exploring Zeppelin and discovered that it makes Spark more accessible and useful.

Zeppelin provides a compelling notebook user interface for Spark.

Spark 1.3 brought new features such as the DataFrames API and direct Kafka support in Spark Streaming, and Spark 1.4 added R support via SparkR. Given the pace at which these capabilities continue to appear, we plan to keep providing updates via tech previews between our major releases, so customers can keep up with the speed of innovation in Spark.

Learn, Try, and Do

The next step is to work through the tutorials, which demonstrate how to load data into HDFS, create Hive tables, process data in memory, and query data, all using the Spark APIs and Spark shell.

To get started, download Spark and Zeppelin in the Hortonworks Sandbox, and then take a look at the following Spark tutorials:


Spark Tutorials

Spark in the Press

Webinars and Presentations