EncartaLabs

Apache Spark Programming with Databricks

(Duration: 2 Days)

This Apache Spark Programming with Databricks training course uses a case-study-driven approach to explore the fundamentals of Spark programming with Databricks, including Spark architecture, the DataFrame API, query optimization, and Structured Streaming. First, you will become familiar with Databricks and Spark, recognize their major components, and explore datasets for the case study using the Databricks environment. After ingesting data from various file formats, you will process and analyze datasets by applying a variety of DataFrame transformations, Column expressions, and built-in functions. Lastly, you will execute streaming queries to process streaming data and explore the advantages of using Delta Lake.

By attending the Apache Spark Programming with Databricks workshop, delegates will learn to:

  • Define the major components of Spark architecture and execution hierarchy
  • Describe how DataFrames are built, transformed, and evaluated in Spark
  • Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark
  • Apply the Structured Streaming API to perform analytics on streaming data
  • Navigate the Spark UI and describe how the Catalyst optimizer, partitioning, and caching affect Spark's execution performance

Prerequisites:

  • Familiarity with basic SQL concepts (select, filter, group by, join, etc.)
  • Beginner programming experience with Python or Scala (syntax, conditions, loops, functions)

The Apache Spark Programming with Databricks class is ideal for:

  • Data engineers
  • Data scientists
  • Machine learning engineers
  • Data architects

COURSE AGENDA

1

DataFrames

  • Introduction: Databricks Ecosystem, Spark Overview, Case Study
  • Databricks Platform: Databricks Concepts, Databricks Platform
  • Spark SQL: Spark SQL, DataFrames, SparkSession
  • Reader and Writer: Data Sources, DataFrameReader/Writer
2

DataFrames and Transformations

  • DataFrame and Column: Columns and Expressions, Transformations, Actions, Rows
  • Aggregation: Groupby, Grouped Data Methods, Aggregate Functions, Math Functions
  • Datetimes: Dates and Timestamps, Datetime Patterns, Date Functions
  • Complex Types: String Functions, Collection Functions
  • Additional Functions: Non-aggregate Functions, Na Functions
3

Transformations and Spark Internals

  • UDFs: UDFs, Vectorized UDFs, Performance
  • Spark Architecture: Spark Cluster, Spark Execution, Shuffling
  • Query Optimization: Query Optimization, Catalyst Optimizer, Adaptive Query Execution
  • Partitioning: Partitions vs. Cores, Default Shuffle Partitions, Repartition
4

Structured Streaming and Delta

  • Streaming Query: Streaming Concepts, Streaming Query, Transformations, Monitoring
  • Processing Streams
  • Delta Lake: Delta Lake Concepts, Batch and Streaming

Encarta Labs Advantage

  • One Stop Corporate Training Solution Providers for over 4,000 Modules on a variety of subjects
  • All courses are delivered by Industry Veterans
  • Get jumpstarted from newbie to production-ready in a matter of a few days
  • Trained more than 50,000 Corporate executives across the Globe
  • All our trainings are conducted in workshop mode, with a strong focus on hands-on sessions

View our other course offerings by visiting http://encartalabs.com/course-catalogue-all.php

Contact us to have this course delivered as a public/open-house workshop or as online training for a group of 10+ candidates.
