EncartaLabs

Hadoop Internals

(Duration: 3 Days)

In the Hadoop Internals training course, you will gain a comprehensive understanding of the steps necessary to operate and maintain a Hadoop cluster. Covering topics from installation and configuration through load balancing and tuning, the course prepares you for the real-world challenges faced by Hadoop administrators.

By attending the Hadoop Internals workshop, delegates will learn:

  • The internals of MapReduce and HDFS and how to build a Hadoop architecture
  • Proper cluster configuration and deployment to integrate with systems and hardware in the data center
  • How to load data into the cluster from dynamically generated files using Flume and from an RDBMS using Sqoop
  • Configuring the Fair Scheduler to provide service-level agreements for multiple users of a cluster
  • Installing and implementing Kerberos-based security for your cluster
  • Best practices for preparing and maintaining Apache Hadoop in production
  • Troubleshooting, diagnosing, tuning, and solving Hadoop issues

The Hadoop Internals class is designed for system administrators and IT managers who have basic Linux systems administration experience. Prior knowledge of Hadoop is not required.

The course is intended for system administrators and others responsible for managing Apache Hadoop clusters in production or development environments.

COURSE AGENDA

1

Hadoop Introduction

  • Move computation not data
  • Hadoop performance and data scale facts
  • Hadoop in the context of other data stores
  • The Apache Hadoop Project
  • Hadoop - an inside view: MapReduce and HDFS
  • The Hadoop Ecosystem
  • What about NoSQL?
  • Comparison with Other Systems
    • RDBMS
    • Grid Computing
    • Volunteer Computing
  • A Brief History of Hadoop
  • Apache Hadoop and the Hadoop Ecosystem
  • Hadoop Releases
2

MapReduce

  • Analyzing the Data with Hadoop
    • Map and Reduce
    • Java MapReduce
  • Scaling Out
    • Data Flow
    • Combiner Functions
    • Running a Distributed MapReduce Job
  • Hadoop Streaming
    • Ruby
    • Python
  • Hadoop Pipes
  • Constructing the basic template of a MapReduce program
  • Counting things
  • Adapting for Hadoop’s API changes
  • Streaming in Hadoop
    • Streaming with Unix commands
    • Streaming with scripts
    • Streaming with key/value pairs
    • Streaming with the Aggregate package
  • Improving performance with combiners
3

Distributing Data with HDFS

  • The Design of HDFS
  • HDFS Concepts
    • Blocks
    • Namenodes and Datanodes
    • HDFS Federation
    • HDFS High-Availability
  • The Command-Line Interface
    • Basic Filesystem Operations
  • Hadoop Filesystems
  • Interfaces
  • The Java Interface
    • Reading Data from a Hadoop URL
    • Reading Data Using the FileSystem API
    • Writing Data
    • Directories
    • Querying the Filesystem
    • Deleting Data
  • Data Flow
    • Anatomy of a File Read
    • Anatomy of a File Write
    • Coherency Model
  • Parallel Copying with distcp
    • Keeping an HDFS Cluster Balanced
  • Hadoop Archives
    • Using Hadoop Archives
    • Limitations
4

Understanding Hadoop I/O

  • Data Integrity
    • Data Integrity in HDFS
    • Local FileSystem
    • Checksum FileSystem
  • Compression
    • Codecs
    • Compression and Input Splits
    • Using Compression in MapReduce
  • Serialization
    • The Writable Interface
    • Writable Classes
    • Implementing a Custom Writable
    • Serialization Frameworks
    • Avro
  • File-Based Data Structures
    • SequenceFile
    • MapFile
5

Advanced MapReduce

  • Chaining MapReduce jobs
    • Chaining MapReduce jobs in a sequence
    • Chaining MapReduce jobs with complex dependency
    • Chaining preprocessing and postprocessing steps
  • Joining data from different sources
    • Reduce-side joining
    • Replicated joins using DistributedCache
    • Semijoin: reduce-side join with map-side filtering
  • Creating a Bloom filter
    • What does a Bloom filter do?
    • Implementing a Bloom filter
    • Bloom filter in Hadoop version 0.20+
6

Writing Map-Reduce Applications

  • The Configuration API
  • Configuring the Development Environment
  • Running Locally on Test Data
  • Cluster Specs
  • Cluster Setup and Installation
  • Hadoop Configuration
  • YARN Configuration
  • Benchmarking a Hadoop Cluster
  • Hadoop in the Cloud
  • Tuning
  • MapReduce Workflows
  • Monitoring and debugging on a production cluster
  • Tuning for performance
7

Map-Reduce Internals

  • Anatomy of a MapReduce Job Run
    • Classic MapReduce (MapReduce 1)
    • YARN (MapReduce 2)
  • Failures
    • Failures in Classic MapReduce
    • Failures in YARN
  • Job Scheduling
    • The Fair Scheduler
    • The Capacity Scheduler
  • Shuffle and Sort
    • The Map Side
    • The Reduce Side
    • Configuration Tuning
  • Task Execution
    • The Task Execution Environment
    • Speculative Execution
    • Output Committers
    • Task JVM Reuse
    • Skipping Bad Records
8

Managing Hadoop

  • Setting up parameter values for practical use
  • Checking the system’s health
  • Setting permissions
  • Managing quotas
  • Enabling trash
  • Removing DataNodes
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Recovering from a failed NameNode
  • Designing network layout and rack awareness
  • Map-Reduce Features
    • Counters
    • Sorting
    • Joins
    • Side Data Distribution
    • Map-Reduce Library
9

Map-Reduce Ecosystem

  • Pig
    • Thinking like a Pig
      • Data flow language
      • Data types
      • User-defined functions
    • Installing Pig
      • Managing the Grunt shell
      • Learning Pig Latin through Grunt
    • Speaking Pig Latin
      • Data types and schemas
      • Expressions and functions
      • Relational operators
      • Execution optimization
  • Hive
    • Installing and configuring Hive
    • Example queries
    • HiveQL in detail
    • Hive summary
  • HBase
    • Introduction
    • Concepts
    • Clients
    • HBase vs. RDBMS

Encarta Labs Advantage

  • One-stop corporate training solution provider with over 4,000 modules on a variety of subjects
  • All courses are delivered by industry veterans
  • Go from newbie to production-ready in a matter of a few days
  • Trained more than 50,000 corporate executives across the globe
  • All our training is conducted in workshop mode, with a focus on hands-on sessions

View our other course offerings by visiting http://encartalabs.com/course-catalogue-all.php

Contact us to have this course delivered as a public or open-house workshop, or as online training, for a group of 10 or more candidates.
