Call : (+91) 968636 4243
Mail : info@EncartaLabs.com

PySpark

(Duration: 3 Days)

The PySpark training course gives developers and data analysts a gentle, hands-on introduction to the Python programming language and to PySpark, the Python API for Apache Spark. In this course, you will learn to write Spark applications in Python.

Prerequisites: programming or scripting experience in a language other than Python.

This PySpark workshop is recommended for developers and data analysts.

COURSE AGENDA

1

Introduction to Python

Data Types and Variables
Python Collections
Control Statements and Looping
Functions in Python
Working With Data in Python
Reading and Writing Text Files
Functional Programming Primer

2

Introduction to Apache Spark

What is Apache Spark
A Short History of Spark
Where to Get Spark?
The Spark Platform
Spark Logo
Common Spark Use Cases
Languages Supported by Spark
Running Spark on a Cluster
The Driver Process
Spark Applications
Spark Shell
The spark-submit Tool
The spark-submit Tool Configuration
The Executor and Worker Processes
The Spark Application Architecture
Interfaces with Data Storage Systems
Limitations of Hadoop's MapReduce
Spark vs MapReduce
Spark as an Alternative to Apache Tez
The Resilient Distributed Dataset (RDD)
Datasets and DataFrames
Spark Streaming (Micro-batching)
Spark SQL
Example of Spark SQL
Spark Machine Learning Library
GraphX
Spark vs R

3

The Spark Shell

The Spark Shell
The Spark v2+ Command-Line Shells
The Spark Shell UI
Spark Shell Options
Getting Help
Jupyter Notebook Shell Environment
Example of a Jupyter Notebook Web UI (Databricks Cloud)
The Spark Context (sc) and Spark Session (spark)
Creating a Spark Session Object in Spark Applications
The Shell Spark Context Object (sc)
The Shell Spark Session Object (spark)
Loading Files
Saving Files

4

Spark RDDs

The Resilient Distributed Dataset (RDD)
Ways to Create an RDD
Supported Data Types
RDD Operations
RDDs are Immutable
Spark Actions
RDD Transformations
Other RDD Operations
Chaining RDD Operations
RDD Lineage
The Big Picture
What May Go Wrong
Checkpointing RDDs
Local Checkpointing
Parallelized Collections
More on parallelize() Method
The Pair RDD
Where do I use Pair RDDs?
Example of Creating a Pair RDD with Map
Example of Creating a Pair RDD with keyBy
Miscellaneous Pair RDD Operations
RDD Caching
RDD Persistence

5

Parallel Data Processing with Spark

Running Spark on a Cluster
Data Partitioning
Data Partitioning Diagram
Single Local File System RDD Partitioning
Multiple File RDD Partitioning
Special Cases for Small-sized Files
Parallel Data Processing of Partitions
Spark Application, Jobs, and Tasks
Stages and Shuffles
The "Big Picture"

6

Shared Variables in Spark

Shared Variables in Spark
Broadcast Variables
Creating and Using Broadcast Variables
Example of Using Broadcast Variables
Problems with Global Variables
Example of the Closure Problem
Accumulators
Creating and Using Accumulators
Example of Using Accumulators (Scala Example)
Example of Using Accumulators (Python Example)
Custom Accumulators

7

Introduction to Spark SQL

What is Spark SQL?
Uniform Data Access with Spark SQL
Hive Integration
Hive Interface
Integration with BI Tools
What is a DataFrame?
Creating a DataFrame in PySpark
Commonly Used DataFrame Methods and Properties in PySpark
Grouping and Aggregation in PySpark
The "DataFrame to RDD" Bridge in PySpark
The SQLContext Object
Examples of Spark SQL / DataFrame (PySpark Example)
Converting an RDD to a DataFrame Example
Example of Reading / Writing a JSON File
Using JDBC Sources
JDBC Connection Example
Performance, Scalability, and Fault-tolerance of Spark SQL

8

Repairing and Normalizing Data

Repairing and Normalizing Data
Dealing with Missing Data
Sample Data Set
Getting Info on Null Data
Dropping a Column
Interpolating Missing Data in pandas
Replacing the Missing Values with the Mean Value
Scaling (Normalizing) the Data
Data Preprocessing with scikit-learn
Scaling with the scale() Function
The MinMaxScaler Object

9

Data Grouping and Aggregation in Python

Data Aggregation and Grouping
Sample Data Set
The pandas.core.groupby.SeriesGroupBy Object
Grouping by Two or More Columns
Emulating SQL's WHERE Clause
Pivot Tables
Cross-Tabulation

Encarta Labs Advantage

One-stop corporate training solution provider offering over 6,000 courses on a variety of subjects
All courses are delivered by industry veterans
Go from newbie to production-ready in a matter of a few days

More than 50,000 corporate executives trained across the globe
All our training is conducted in workshop mode, with a strong focus on hands-on sessions

View our other course offerings by visiting https://www.encartalabs.com/course-catalogue-all.php

Contact us to have this course delivered as a public or open-house workshop, or as online training, for a group of 10 or more candidates.
