Apache Spark for Data Scientists

Learn how to make the most of Apache Spark for data analysis

Course Code: 1341

$2295

Overview

This three-day course equips participants with the knowledge, skills, and tools to use Apache Spark effectively for data analysis. Participants get practical training in Apache Spark and learn to build comprehensive big data applications that integrate batch, streaming, and interactive analytics across all their data. The course also covers using Apache Spark for common data-related activities. With Spark, participants learn to write sophisticated parallel applications that support faster processing, better decisions, and real-time actions across a wide variety of use cases, architectures, and industries.

Course Delivery

This course is available in the following formats:

Live Classroom
Duration: 5 days

Live Virtual Classroom
Duration: 5 days

What You'll Learn

  • Introduction to Spark
  • Spark and data science
  • Build comprehensive big data applications by integrating batch, streaming, and interactive analytics on all data
  • Use Apache Spark for common data-related activities
  • Understand DataFrames and Spark SQL
  • Work with Spark MLlib, Spark GraphX, and Spark Streaming
  • Understand memory management

Outline

  • Data Science: The state of the art
  • Hadoop, YARN, and Spark
  • Architectural overview
  • Spark and Storm
  • MLlib and Mahout
  • Distributed vs. local run modes
  • Hello, Spark
  • Spark core
  • Spark SQL
  • Spark and Hive
  • MLlib
  • Mahout
  • Spark Streaming
  • Spark API
  • DataFrames and resilient distributed datasets (RDDs)
  • Partitions
  • DataFrame types
  • DataFrame operations
  • Map/Reduce with DataFrames
  • Spark SQL overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table definitions
  • ETL in Spark
  • Queries
  • MLlib overview
  • MLlib algorithms overview
  • Streaming overview
  • Real-time data ingestion
  • State
  • Window operations
  • GraphX overview
  • ETL with GraphX
  • Graph computation
  • Broadcast variables
  • Accumulators
  • Memory management
  • Standalone cluster
  • Masters and workers
  • Configurations
  • Working with large data sets

Prerequisites

There are no mandatory prerequisites for this course. However, completing the Foundations of Agile course beforehand would be beneficial.

Who Should Attend

This course is intended for systems administrators, testers, and people in technical data-related roles who need to use Spark for data analysis or data processing.

This course is highly recommended for:

  • Big data and analytics consultants
  • Data scientists
  • Data engineers
  • System administrators
  • System testers
  • Data analysts
