newsletter

Obtenha atualizações recentes da Hortonworks por e-mail

Uma vez por mês, receba os mais recentes insights, tendências, informações analíticas e conhecimentos sobre Big Data.

AVAILABLE NEWSLETTERS:

Sign up for the Developers Newsletter

Uma vez por mês, receba os mais recentes insights, tendências, informações analíticas e conhecimentos sobre Big Data.

cta

Comece a Usar

nuvem

Pronto para começar?

Baixar sandbox

Como podemos ajudá-lo?

* Eu entendo que posso cancelar a inscrição a qualquer momento. Eu também reconheço as informações adicionais encontradas na Política de Privacidade da Hortonworks.
fecharBotão Fechar
HDP > Desenvolva com o Hadoop > Apache Spark

Hands-On Tour of Apache Spark in 5 Minutes

nuvem Pronto para começar?

BAIXAR SANDBOX

Introduction

In this tutorial, we will provide an overview of Apache Spark, it’s relationship with Scala, Zeppelin notebooks, Interpreters, Datasets and DataFrames. Finally, we will showcase Apache Zeppelin notebook for our development environment to keep things simple and elegant.

Zeppelin will allow us to run in a pre-configured environment and execute code written for Spark in Scala and SQL, a few basic Shell commands, pre-written Markdown directions, and an HTML formatted table.

To make things fun and interesting, we will introduce a film series dataset from the Silicon Valley Comedy TV show and perform some basic operations with Spark in Zeppelin.

Prerequisites

Outline

Concepts

Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow developers to execute a variety of data intensive workloads.

Spark Logo

Spark Datasets are strongly typed distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python.

New to Scala?

Throughout this tutorial we will use basic Scala syntax.

Learn more about Scala, here’s an excellent introductory tutorial.

New to Zeppelin?

If you haven’t already, checkout the Hortonworks Apache Zeppelin page as well as the Getting Started with Apache Zeppelin tutorial. You will find the official Apache Zeppelin page here.

New to Spark?

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

If you would like to learn more about Apache Spark visit:

What are Interpreters?

Zeppelin Notebooks supports various interpreters which allow you to perform many operations on your data. Below are just a few of operations you can do with Zeppelin interpreters:

  • Ingestion
  • Munging
  • Wrangling
  • Visualization
  • Analysis
  • Processing

These are some of the interpreters that will be utilized throughout our various Spark tutorials.

Interpreter Descrição
%spark2 Spark interpreter to run Spark 2.x code written in Scala
%spark2.sql Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)
%sh Shell interpreter to run shell commands like move files
%angular Angular interpreter to run Angular and HTML code
%md Markdown for displaying formatted text, links, and images

Note the % at the beginning of each interpreter. Each paragraph needs to start with % followed by the interpreter name.

Learn more about Zeppelin interpreters.

What are Datasets and DataFrames?

Datasets and DataFrames are distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python. Key difference between the Dataset and the DataFrame is that Datasets are strongly typed.

There are complex manipulations possible on Datasets and DataFrames, however they are beyond this quick guide.

Learn more about Datasets and DataFrames.

Apache Spark in 5 Minutes Notebook Overview

Silicon Valley Image

We will download and ingest an external dataset about the Silicon Valley Show episodes into a Spark Dataset and perform basic analysis, filtering, and word count.

After a series of transformations, applied to the Datasets, we will define a temporary view (table) such as the one below.

DataFrame Contents Table

You will be able to explore those tables via SQL queries likes the ones below.

Complex SQL Query Graph

Once you have a handle on the data and perform a basic word count, we will add a few more steps for a more sophisticated word count analysis like the one below.

Improved Word Count Sample

By the end of this tutorial, you should have a basic understanding of Spark and an appreciation for its powerful and expressive APIs with the added bonus of a developer friendly Zeppelin notebook environment.

Import the Notebook

Import the Apache Spark in 5 Minutes notebook into your Zeppelin environment. (If at any point you have any issues, make sure to checkout the Getting Started with Apache Zeppelin tutorial).

To import the notebook, go to the Zeppelin home screen.

1. Click Import note

2. Select Add from URL

3. Copy and paste the following URL into the Note URL

# Getting Started ApacheSpark in 5 Minutes Notebook

https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/hands-on-tour-of-apache-spark-in-5-minutes/assets/Getting%20Started%20_%20Apache%20Spark%20in%205%20Minutes.json

4. Click on Import Note

Once your notebook is imported, you can open it from the Zeppelin home screen by:

5. Clicking Getting Started

6. Select Apache Spark in 5 Minutes

Once the Apache Spark in 5 Minutes notebook is up, follow all the directions within the notebook to complete the tutorial.

Summary

We hope that you’ve been able to successfully run this short introductory notebook and we’ve got you interested and excited enough to further explore Spark with Zeppelin.

Further Reading

User Reviews

User Rating
0 No Reviews
5 Star 0%
4 Star 0%
3 Star 0%
2 Star 0%
1 Star 0%
Tutorial Name
Hands-On Tour of Apache Spark in 5 Minutes

To ask a question, or find an answer, please visit the Hortonworks Community Connection.

No Reviews
Write Review

Registrar

Please register to write a review

Share Your Experience

Example: Best Tutorial Ever

You must write at least 50 characters for this field.

Success

Thank you for sharing your review!