Apache Spark Big Data Boot Camp

Learn to use Apache Spark for your own applications

Course Code : 1726


$2750

Overview

Apache Spark is a popular toolset for powering Big Data solutions with distributed cluster computing, owing to its speed, versatility, and powerful APIs and libraries. Spark lets applications support data science workloads with R-style DataFrames, while its streaming capabilities help overcome the time constraints of batch-only Big Data solutions. This fast-paced three-day course provides a thorough, hands-on overview of the Apache Spark platform and the technologies and paradigms that form a part of it. The course will help participants master the skills necessary to use Apache Spark in their own applications.
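
As a taste of the hands-on work, here is a minimal sketch of a Spark application in Scala, the style of code written throughout the course. The object name, data values, and local master setting are illustrative assumptions, not course material:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: a tiny self-contained Spark 2.x application
object HelloSpark {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point to DataFrames, SQL, and streaming
    val spark = SparkSession.builder()
      .appName("HelloSpark")
      .master("local[*]") // run locally on all cores; a real cluster sets this at launch
      .getOrCreate()

    import spark.implicits._

    // An R-style DataFrame built from an in-memory sequence
    val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    people.filter($"age" > 30).show()

    spark.stop()
  }
}
```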

Schedule Classes

Delivery Format | Starting Date | Starting Time | Duration | Location
Live Classroom | Monday, 17 June 2019 | 8:30 AM - 4:30 PM EST | 3 Days | San Diego, CA
Live Classroom | Monday, 17 June 2019 | 11:30 AM - 7:30 PM EST | 3 Days |
Live Classroom | Monday, 22 July 2019 | 8:30 AM - 4:30 PM EST | 3 Days |

Looking for more sessions of this class?

Course Delivery

This course is available in the following formats:

Live Classroom
Duration: 3 days

Live Virtual Classroom
Duration: 3 days

What You'll Learn

  • The origin of Apache Spark
  • Apache Spark vs. Apache Hadoop
  • Apache Spark use cases
  • Streaming architecture of Spark
  • SQL architecture in Spark
  • Apache Spark and Machine Learning
  • Machine Learning libraries
  • Apache Spark GraphX

Outline

  • Introduction to data analysis
  • Introduction to Big Data
  • Definition of Big Data
  • Techniques and challenges in Big Data
  • Techniques and challenges in distributed computing
  • How the functional programming approach helps in tackling these challenges
  • Short overview of previous solutions – Google’s MapReduce and Apache Hadoop
  • Introduction to Apache Spark
  • Exercise: Exposure to Admin and setup
  • Spark architecture in a cluster
  • Spark ecosystem and cluster management
  • Deploying Spark on a cluster
  • Deploying Spark on a standalone cluster
  • Deploying Spark on Mesos cluster
  • Deploying Spark on YARN cluster
  • Cloud-based deployment
  • Exercise: Learn to deploy and begin using Spark
  • Dig deeper into Apache Spark
  • Introduce Resilient Distributed Datasets (RDD)
  • Apache Spark installation
  • Introduce the Spark Shell
  • Actions and Transformations (Laziness)
  • Caching
  • Loading and saving data files from the file system
  • Exercise: Get hands-on with Spark code and RDDs (see the RDD sketch following this outline)
  • Tailored RDD
  • Pair RDD
  • NewHadoopRDD
  • Aggregations
  • Partitioning
  • Broadcast variables
  • Accumulators
  • Exercise: Work with expanded RDD capabilities
  • SparkSQL and DataFrames
  • DataFrame and SQL API
  • DataFrame Schema
  • Datasets and Encoders
  • Loading and saving data
  • Aggregations
  • Joins
  • Exercise: Learn to use one of Spark’s most powerful features: DataFrames, which bring R-style modelling to data distributed across a cluster (see the DataFrame sketch following this outline)
  • A brief introduction to streaming
  • Spark streaming
  • Discretized streams
  • Structured streaming
  • Stateful/stateless transformations
  • Checkpointing
  • Interoperability with streaming platforms (Apache Kafka)
  • Exercise: Work with streaming, one of Spark 2.1’s most exciting features, which lets applications beat the time constraints of earlier batch-only Big Data solutions (see the streaming sketch following this outline)
  • Introduction to Machine Learning
  • Spark Machine Learning APIs
  • Feature extractor and transformation
  • Classification using logistic regression
  • Best practices in machine learning for practitioners
  • Exercise: Use Spark to make production-friendly calls to powerful machine learning and predictive analytics capabilities (see the logistic regression sketch following this outline)
  • Brief introduction to Graph theory
  • GraphX
  • Vertex and Edge RDDs
  • Graph operators
  • Pregel API
  • PageRank / Travelling Salesman Problem
  • Exercise: Get hands-on practice using GraphX (see the PageRank sketch following this outline)
  • Testing in a distributed environment
  • Testing Spark application
  • Debugging Spark application
  • Exercise: Lab practice for supporting Spark solutions, covering best practices for testing, debugging, and day-to-day production issues
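
The sketches below expand on the exercises above. First, a minimal look at RDD laziness and caching; the object name and data are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: transformations are lazy, actions trigger work
object RddBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddBasics").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Transformations (filter, map) only build a lineage graph; nothing runs yet
    val nums    = sc.parallelize(1 to 1000000)
    val evens   = nums.filter(_ % 2 == 0)
    val squares = evens.map(n => n.toLong * n)

    // cache() marks the RDD for reuse across multiple actions
    squares.cache()

    // Actions (count, take) trigger the actual distributed computation
    println(s"count = ${squares.count()}")
    println(s"first five = ${squares.take(5).mkString(", ")}")

    spark.stop()
  }
}
```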
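
Next, a sketch of DataFrame aggregations, joins, and the SQL API. The table and column names are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical example: aggregating and joining DataFrames, then querying with SQL
object DataFrameBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameBasics").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders = Seq((1, "books", 12.50), (2, "games", 60.00), (1, "books", 7.25))
      .toDF("customer_id", "category", "amount")
    val customers = Seq((1, "alice"), (2, "bob")).toDF("customer_id", "name")

    // Aggregation: total spend per customer and category
    val totals = orders.groupBy("customer_id", "category")
      .agg(sum("amount").as("total"))

    // Join the totals back to customer names
    totals.join(customers, "customer_id").show()

    // The same data queried through the SQL API
    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT category, COUNT(*) AS n FROM orders GROUP BY category").show()

    spark.stop()
  }
}
```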
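
For streaming, a minimal Structured Streaming word count over a local socket feed; the host, port, and `nc -lk 9999` test feed are assumptions for local experimentation:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: Structured Streaming treats a socket feed as an unbounded DataFrame
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingWordCount").master("local[*]").getOrCreate()
    import spark.implicits._

    // Start a test feed first with: nc -lk 9999
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Running word count over everything received so far
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Each micro-batch updates the running counts on the console
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```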
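
For the machine learning module, a sketch of classification with logistic regression using the spark.ml DataFrame API; the toy labels and feature vectors are made up:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical example: fit and apply a logistic regression model
object LogisticRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LrSketch").master("local[*]").getOrCreate()

    // Toy training set: spark.ml expects "label" and "features" columns
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)

    // Scoring appends prediction and probability columns
    model.transform(training).select("features", "probability", "prediction").show()

    spark.stop()
  }
}
```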
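
Finally, a GraphX sketch that builds a tiny link graph from vertex and edge RDDs and runs PageRank; the vertex names and links are invented:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Hypothetical example: PageRank over a three-page link graph
object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PageRankSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Vertex RDD (id, name) and edge RDD (src -> dst with a weight attribute)
    val vertices = sc.parallelize(Seq((1L, "home"), (2L, "about"), (3L, "blog")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(vertices, edges)

    // Run PageRank until scores converge within the given tolerance
    val ranks = graph.pageRank(0.0001).vertices

    // Join ranks back to page names and print them
    ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-6s $rank%.4f")
    }

    spark.stop()
  }
}
```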

Prerequisites

There are no mandatory prerequisites for this course; however, a basic understanding of Scala or Python would be beneficial. It is also recommended to complete the Fundamentals of DevOps course before taking the Apache Spark Big Data Boot Camp.

Who Should Attend

This boot camp is highly recommended for:

  • Developers and team leads
  • Software engineers
  • Business analysts
  • System analysts
  • Data analysts
  • Data scientists
  • Operations and DevOps engineers
  • Java developers
  • Big Data engineers

Interested in this course? Let’s connect!