EncartaLabs

Hadoop Ecosystem

(Duration: 4 Days)

This Hadoop Ecosystem training course is a vendor-agnostic, technical overview of the Hadoop landscape. No prior knowledge of databases or programming is assumed. The course is aimed at both technical and non-technical personnel who want to understand the emerging world of Big Data, with a specific focus on Hadoop.

By attending the Hadoop Ecosystem workshop, participants will:

  • Learn the core concepts of the Hadoop ecosystem
  • Deep dive into the critical architecture paths of HDFS, MapReduce and HBase
  • Learn the basics of how to effectively write Pig and Hive scripts
  • Explain how to choose the correct use cases for Hadoop

Intended audience:

  • Engineers, programmers, networking specialists, managers, executives

COURSE AGENDA

1

Intro to Hadoop

  • Parallel Computing vs. Distributed Computing
  • Brief history of Hadoop
  • RDBMS/SQL vs. Hadoop
  • Hadoop vs SETI
  • Structured vs Unstructured data
  • Scaling with Hadoop
  • Google white papers: GFS, MapReduce, BigTable, Chubby
  • Intro to the Hadoop ecosystem: HDFS, MapReduce, Pig, Hive, HBase
  • HDFS overview: NameNode vs DataNode
  • MapReduce overview: JobTracker vs TaskTracker
  • Hadoop XML files for configuration: core-site.xml, hdfs-site.xml, mapred-site.xml
  • Hardware recommendations
  • Hadoop ecosystem: Hive, Pig, HBase, ZooKeeper, Mahout, Hue, Talend, Sqoop, Flume, Oozie
  • Book recommendations for Hadoop
  • Vendor Comparison (Cloudera, Hortonworks, MapR, Intel, Amazon EMR)
  • Use cases
2

HDFS Deep Dive

  • Linux File system options (ext3, ext4, XFS)
  • NameNode architecture (EditLog, FsImage, location of replicas)
  • Secondary NameNode architecture
  • DataNode architecture
  • Write Pipeline
  • Read Pipeline
  • Heartbeats, DataNode commissioning/decommissioning, Rack Awareness, Block Scanner, Balancer, Trash, Health Check
  • HDFS disk space quotas and number of files quotas
  • Benchmarking HDFS
  • Settings in the hdfs-site.xml file
  • Exploring the HDFS Web UI
  • Next-gen HDFS: NameNode high availability, snapshots, federation
  • Quick Intro to the Java API
  • HDFS Benchmarking with TestDFSIO
3

Beginning MapReduce

  • MapReduce Architecture
  • JobTracker/TaskTracker
  • Combiner
  • Shuffle and Sort
  • Partitioner
  • Speculative Execution
  • Exploring the MapReduce Web UI
  • Walkthrough of a simple MapReduce example: WordCount
  • Walkthrough of an unstructured-file MapReduce example: Facial recognition against video files
  • Walkthrough of a structured-file MapReduce example: web log files
4

Advanced MapReduce

  • Partitioner
  • Distributed Cache
  • Job Scheduling: FIFO, Fair Scheduler, Capacity Scheduler
  • Thinking in the MapReduce way
  • Serialization and File-Based Data Structures
  • Mapper and Reducer predefined implementations (IdentityMapper, InverseMapper, SumReducer, etc.)
  • Default datatypes for k/v pairs: BooleanWritable, ByteWritable, Text, IntWritable, etc.
  • Input/output formats
  • Blacklisted TaskTrackers
  • Counters
  • MapReduce configuration files: mapred-site.xml
  • Intro to Monitoring and Debugging on a production cluster
  • Next-gen MapReduce: YARN architecture details
5

Pigs Eat Anything

  • Pig philosophy and architecture
  • Pig Latin and the Grunt shell
  • Loading data
  • Data types and schemas
  • Pig Latin details: structure, functions, expressions, relational operators
  • Intro to User Defined Functions and Scripts
6

Hive for Structured Data

  • Hive philosophy and architecture
  • Hive vs. RDBMS
  • HiveQL and Hive Shell
  • Managing tables
  • Data types and schemas
  • Querying data
  • Partitions and Buckets
  • Intro to User Defined Functions
7

Real-time I/O with HBase

  • NoSQL architectures overview: Key-value, Key-document, Column Family, Graph, Real Time
  • HBase architecture
  • HBase vs Cassandra
  • HBase versions and origins
  • HBase vs. RDBMS
  • HBase Master and Region Servers
  • Intro to ZooKeeper
  • Data Modeling
  • Column Families and Regions
  • Bloom Filters and Block Indexes
  • Block Cache
  • Write Pipeline / Read Pipeline
  • Deletes and Tombstones
  • Compactions: Minor vs. Major
  • Table Scans and Filters
  • Increment columns
  • Hardware trends for HBase, Sizing
  • HBase Operations and Troubleshooting: HTrace, Hannibal, Ganglia

Encarta Labs Advantage

  • One-stop corporate training solution provider for over 3,500 modules on a variety of subjects
  • All courses are delivered by industry veterans
  • Get jumpstarted from newbie to production-ready in a matter of a few days
  • Trained more than 20,000 corporate candidates across India and abroad
  • All our training is conducted in workshop mode, with more focus on hands-on practice

View our other course offerings by visiting www.encartalabs.com/course-catalogue

Contact us to have this course delivered as a public/open-house workshop for a group of 10+ candidates at our venue.
