Developing with Spark for Big Data | Enterprise-Grade Spark Programming for the Hadoop & Big Data Ecosystem

Discover enterprise-grade Spark programming for developing data science solutions

Course Code : 1050


$2895

Overview

The Developing with Spark for Big Data course introduces participants to enterprise-grade Spark programming. It covers intermediate and advanced Spark concepts and enables participants to work with the key components of Apache Spark to develop data science solutions. The course equips participants with the skills and knowledge to work with Apache Spark in real-world enterprises and to make effective data-driven decisions. Hands-on exercises throughout the course ensure that participants gain a thorough understanding of all the concepts covered.

Schedule Classes

Delivery Format
Starting Date
Starting Time
Duration

Live Classroom
Monday, 16 September 2019
10:00 AM - 6:00 PM EST
5 Days

Looking for more sessions of this class?

Course Delivery

This course is available in the following formats:

Live Classroom
Duration: 5 days

Live Virtual Classroom
Duration: 5 days

What You'll Learn

  • Basics of Spark architecture and applications
  • Executing Spark programs
  • Creating and manipulating both RDDs (Resilient Distributed Datasets) and Unified DataFrames (UDFs)
  • Restoring data frames
  • Essential NoSQL access
  • Integrating machine learning into Spark applications
  • Using Spark Streaming and Kafka to create streaming applications

Outline

  • Hadoop ecosystem
  • Hadoop YARN vs. Mesos
  • Spark vs. MapReduce
  • Spark with MapReduce – Lambda Architecture
  • Spark in the enterprise data science architecture
  • Spark Shell
  • RDDs: Resilient Distributed Datasets
  • Data frames
  • Spark 2 unified DataFrames
  • Spark sessions
  • Functional programming
  • Spark SQL
  • MLlib
  • Structured streaming
  • Spark R
  • Spark and Python
  • Coding with RDDs
  • Transformation
  • Actions
  • Lazy evaluation and optimization
  • RDDs and MapReduce
  • RDDs vs. DataFrames
  • Unified DataFrames (UDF) in Spark 2.0
  • Partitioning
  • Spark sessions
  • Running applications
  • Logging
  • RDD persistence
  • DataFrame and Unified DataFrame persistence
  • Distributed Persistence
  • Streaming overview
  • Streams
  • Structured streaming
  • DStreams and Apache Kafka
  • Ingesting data
  • Parquet files
  • Relational databases
  • Graph databases (Neo4J, GraphX)
  • Interacting with Hive
  • Accessing Cassandra data
  • Document databases (MongoDB, CouchDB)
  • MapReduce and Lambda integration
  • Camel integration
  • Drools and Spark
  • MLlib and Mahout
  • Classification
  • Clustering
  • Decision trees
  • Decompositions
  • Pipelines
  • Spark packages
  • Spark SQL
  • SQL and DataFrames
  • Spark SQL and Hive
  • Spark SQL and JDBC
  • Graph APIs
  • GraphX
  • ETL in GraphX
  • Exploratory analysis
  • Graph computation
  • Pregel API overview
  • GraphX algorithms
  • Neo4J as an alternative
  • Using web notebooks (Zeppelin, Jupyter)
  • R on Spark
  • Python on Spark
  • Scala on Spark
  • Parallelizing Spark applications
  • Clustering concerns for developers
  • Monitoring Spark performance
  • Tuning memory
  • Tuning CPU
  • Tuning Data Locality
  • Troubleshooting

Prerequisites

Participants need to have experience working in a development role and have an understanding of the Big Data and Hadoop ecosystem.

Who Should Attend

The course is highly recommended for:

  • Developers
  • Architects
  • Big Data professionals
  • Hadoop professionals

Interested in this course? Let’s connect!