Apache Spark

(Duration: 3 Days)

In this Apache Spark training course, participants will build complete, unified big data applications that combine batch, streaming, and interactive analytics on all data. They will learn to use Spark to write sophisticated parallel applications that enable faster decisions, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.

By attending this Apache Spark workshop, participants will learn to:

  • Use the Spark shell for interactive data analysis
  • Describe the features of Spark's Resilient Distributed Datasets
  • Understand the fundamentals of running Spark on a cluster
  • Apply parallel programming techniques with Spark
  • Write Spark applications
  • Process streaming data with Spark

Prerequisites

  • Some programming experience (Python and Scala suggested)
  • Basic knowledge of Linux
  • Knowledge of Hadoop is not required



Why Spark?

  • Problems with Traditional Large-Scale Systems
  • Introducing Spark

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • Resilient Distributed Datasets (RDDs)
  • Functional Programming with Spark

Working with RDDs

  • RDD Operations
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations

The Hadoop Distributed File System

  • Why HDFS?
  • HDFS Architecture
  • Using HDFS

Running Spark on a Cluster

  • A Spark Standalone Cluster
  • The Spark Standalone Web UI

Parallel Programming with Spark

  • RDD Partitions and HDFS Data Locality
  • Working with Partitions
  • Executing Parallel Operations

Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

Writing Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Configuring Spark Properties
  • Building and Running a Spark Application
  • Logging

Spark, Hadoop, and the Enterprise Data Center

  • Spark and the Hadoop Ecosystem
  • Spark and MapReduce

Spark Streaming

  • Example: Streaming Word Count
  • Other Streaming Operations
  • Sliding Window Operations
  • Developing Spark Streaming Applications

Common Spark Algorithms

  • Iterative Algorithms
  • Graph Analysis
  • Machine Learning

Improving Spark Performance

  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators
  • Common Performance Issues

Encarta Labs Advantage

  • One-stop corporate training solution provider offering over 3,500 modules on a variety of subjects
  • All courses are delivered by industry veterans
  • Go from newbie to production-ready in a matter of a few days
  • More than 20,000 corporate candidates trained across India and abroad
  • All our trainings are conducted in workshop mode, with a strong focus on hands-on practice

View our other course offerings by visiting www.encartalabs.com/course-catalogue

Contact us to have this course delivered as a public/open-house workshop for a group of 10+ candidates at our venue