EncartaLabs

Apache Apex

( Duration: 5 Days )

Apache Apex is a next-generation stream processing framework designed to operate on data at large scale, with minimum latency, maximum reliability, and strict correctness guarantees. This Apache Apex training course introduces Apache Apex's unified stream processing architecture, and walks through the creation of a distributed application using Apex on Hadoop.

By attending the Apache Apex workshop, delegates will learn to:

  • Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
  • Build, scale and optimize an Apex application
  • Process real-time data streams reliably and with minimum latency
  • Use Apex Core and the Apex Malhar library to enable rapid application development
  • Use the Apex API to write new Java code and reuse existing Java code
  • Integrate Apex into other applications as a processing engine
  • Tune, test and scale Apex applications

Prerequisites:

  • An understanding of big data concepts
  • An understanding of Java
  • Experience with Hadoop

The Apache Apex class is ideal for:

  • Developers
  • Enterprise architects

COURSE AGENDA

1

Introduction to Apex

  • Unbounded data and continuous processing
    • Stream processing
    • Stream processing systems
    • What is Apex and why is it important?
  • Application Model and API
    • Directed Acyclic Graph (DAG)
    • Apex DAG Java API
    • SQL
    • JSON
    • Windowing and time
  • Value proposition of Apex
    • Low latency and stateful processing
    • Native streaming versus micro-batch
    • Performance
    • Where Apex excels
    • Where Apex is not suitable
2

Getting Started with Application Development

  • Development process and methodology
  • Setting up the development environment
  • Creating a new Maven project
  • Application specifications
  • Custom operator development
  • Application configuration
  • Testing in the IDE
  • Running the application on YARN
  • Working on the cluster
3

The Apex Library

  • An overview of the library
  • Integrations
    • Apache Kafka
    • Kafka input
    • Kafka output
    • Other streaming integrations
  • Files
    • File input
    • File splitter and block reader
    • File writer
  • Databases
    • JDBC input
    • JDBC output
    • Other databases
  • Transformations
    • Parser
    • Filter
    • Enrichment
    • Map transform
    • Custom functions
    • Windowed transformations
      • Windowing
      • State
      • Watermarks
      • Triggering
      • Merging of streams
      • The windowing example
    • Dedup
    • Join
    • State Management
4

Scalability, Low Latency, and Performance

  • Scalability, Low Latency, and Performance
  • Partitioning and how it works
  • Elasticity
  • Partitioning toolkit
  • Custom dynamic partitioning
  • Performance optimizations
  • Low-latency versus throughput
  • Sample application for dynamic partitioning
  • Performance - other aspects for custom operators
5

Fault Tolerance and Reliability

  • Distributed systems need to be resilient
  • Fault-tolerance components and mechanisms in Apex
  • Checkpointing
  • Processing guarantees
6

Aggregation and Visualization

  • Streaming ETL and beyond
  • The application pattern in a real-world use case
  • Analyzing a Twitter feed
  • Running the application
  • The Pub/Sub server
  • Grafana visualization
7

Data Processing

  • Datasource
  • The pipeline
  • Simulation of a real-time feed using historical data
  • Parsing the data
  • Looking up the zip code and preparing for the windowing operation
  • Windowed operator configuration
  • Serving the data with WebSocket
  • Running the application
  • Running the application on GCP Dataproc
8

ETL Using SQL

  • The application pipeline
  • Building and running the application
  • Application configuration
  • The application code
  • Partitioning
  • Application testing
  • Understanding application logs
  • Calcite integration
9

Introduction to Apache Beam

  • Beam concepts
    • Pipelines, PTransforms, and PCollections
    • Windowing, watermarks, and triggering in Beam
    • Advanced topic - stateful ParDo
  • WordCount in Apache Beam
    • Setting up your pipeline
    • Testing the pipeline at small scale with DirectRunner
  • Running Apache Beam WordCount on Apache Apex
10

The Future of Stream Processing

  • Lower barrier for building streaming pipelines
    • Visual development tools
    • Streaming SQL
    • Better programming API
    • Bridging the gap between data science and engineering
    • Machine learning integration
    • State management
    • State query and data consistency
    • Containerized infrastructure
    • Management tools

Encarta Labs Advantage

  • One-stop corporate training solution provider, with over 4,000 modules on a variety of subjects
  • All courses are delivered by industry veterans
  • Go from newbie to production-ready in a matter of days
  • More than 50,000 corporate executives trained across the globe
  • All our training is conducted in workshop mode, with a strong focus on hands-on sessions

View our other course offerings by visiting http://encartalabs.com/course-catalogue-all.php

Contact us to have this course delivered as a public/open-house workshop or as online training for a group of 10+ candidates.
