EncartaLabs

Cloudera Developer

( Duration: 5 Days )

This Cloudera Developer training course delivers the key concepts and expertise you need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques. You will learn to identify which tool is the right one to use in a given situation, and will gain hands-on experience in developing using those tools

This course is an excellent place to start for people working towards the CCA Spark & Hadoop Developer certification. Although further study is required before passing the exam, this course covers many of the subjects tested in the CCA Spark & Hadoop Developer exam.

By attending Cloudera Developer workshop, delegates will learn:

  • How data is distributed, stored, and processed in a Hadoop cluster
  • How to use Sqoop and Flume to ingest data
  • How to process distributed data with Apache Spark
  • How to model structured data as tables in Impala and Hive
  • How to choose the best data storage format for different data usage patterns
  • Best practices for data storage

  • Experience with programming. Apache Spark examples and hands-on exercises are presented in Scala and Python, so the ability to program in one of those languages is required.
  • Basic familiarity with the Linux command line is assumed.
  • Basic knowledge of SQL is helpful.

The Cloudera Developer class is ideal for:

  • Developers and engineers who have programming experience.

COURSE AGENDA

1

Introduction to Hadoop and the Hadoop Ecosystem

  • Problems with Traditional Large-Scale Systems
  • Hadoop!
  • Data Storage and Ingest
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
2

Hadoop Architecture and HDFS

  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS
  • Resource Management: YARN Architecture
  • Resource Management: Working with YARN
3

Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2
4

Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Querying Data With Impala and Hive
  • Comparing Hive and Impala to Traditional Databases
5

Modeling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching
6

Data Formats

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Impala, Hive, and Sqoop
  • Avro Schema Evolution
  • Compression
7

Data File Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive
8

Capturing Data with Apache Flume

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration
9

Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark
10

Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations
11

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging
12

Parallel Processing in Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-Based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks
13

Spark RDD Persistence

  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence
14

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
15

DataFrames and Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala, and Hive-on-Spark

Encarta Labs Advantage

  • One Stop Corporate Training Solution Providers for over 4,000 Modules on a variety of subjects
  • All courses are delivered by Industry Veterans
  • Get jumpstarted from newbie to production ready in a matter of few days
  • Trained more than 50,000 Corporate executives across the Globe
  • All our trainings are conducted in workshop mode with more focus on hands-on sessions

View our other course offerings by visiting http://encartalabs.com/course-catalogue-all.php

Contact us for delivering this course as a public/open-house workshop/online training for a group of 10+ candidates.

Top