DEV 361 - Build and Monitor Apache Spark Applications

Part 2 of the Apache Spark training program, for developers and data analysts. This course covers pair RDDs, DataFrames, and monitoring Spark applications.

About this Course

This on-demand course is designed to be flexible to fit your schedule. Each lesson and quiz takes approximately 30 to 45 minutes to complete.

  • Option 1: Complete the course in one session, approximately 90 to 120 minutes
  • Option 2: Complete the course over 3 days, approximately 30 to 45 minutes per day

Lab activities take additional time and vary based on your system.

DEV 361 is the second in the Apache Spark series. You will learn to create and modify pair RDDs, perform aggregations, and control the layout of pair RDDs across nodes with data partitioning.
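
As a rough illustration of the pair RDD ideas named above, here is a minimal sketch in plain Python rather than Spark. The helper names `reduce_by_key` and `hash_partition` and the `sales` data are hypothetical. They only model aggregation by key and key-based partitioning; they are not the Spark API and are not taken from the course labs.

```python
def reduce_by_key(pairs, func):
    """Merge all values that share a key using func (models a per-key aggregation)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


# Hypothetical key-value data: (region, amount)
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]

totals = reduce_by_key(sales, lambda a, b: a + b)
partitions = hash_partition(sales, 2)

print(totals)
for index, partition in enumerate(partitions):
    print(index, partition)
```

Because the partition is chosen from the key alone, every pair with the same key lands in the same partition. That property is what lets a per-key aggregation run without moving data between partitions.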

This course also discusses Spark SQL and DataFrames, the programming abstraction of Spark SQL. You will learn the different ways to load data into DataFrames; perform operations on DataFrames using DataFrame functions, actions, and language-integrated queries; and create and use user-defined functions with DataFrames.

This course also describes the components of the Spark execution model and shows how to use the Spark Web UI to monitor Spark applications. The concepts are taught using scenarios in Scala that also form the basis of hands-on labs. Lab solutions are provided in Scala and Python.

Syllabus

Lesson 4 - Work with Pair RDD

  • Describe pair RDD
  • Why use pair RDD
  • Create pair RDD
  • Apply transformations and actions to pair RDD
  • Control partitioning across nodes
  • Change partitions
  • Determine the partitioner

Lesson 5 - Work with Spark DataFrames

  • Create Apache Spark DataFrames
  • Work with data in DataFrames
  • Create user-defined functions
  • Repartition DataFrame

Lesson 6 - Monitor a Spark Application

  • Describe the components of the Spark execution model
  • Use the Spark Web UI to monitor a Spark application
  • Debug & tune Spark applications

Prerequisites for Success in the Course

Review the following prerequisites carefully and decide if you are ready to succeed in this programming-oriented course. The instructor will move forward with lab exercises, assuming that you have mastered the skills listed below.

Required

  • Basic to intermediate Linux knowledge, including the ability to use a text editor such as vi, and familiarity with basic commands such as mv, cp, ssh, grep, cd, and useradd
  • Knowledge of application development principles
  • A Linux, Windows, or macOS computer with the MapR Sandbox installed (on-demand course)
  • Connection to a Hadoop cluster via SSH and web browser (ILT and vILT courses)

Recommended

  • Knowledge of functional programming
  • Knowledge of Scala or Python
  • Beginner fluency with SQL
  • HDE 100 - Hadoop Essentials Certification

This course is part of the preparation for the MapR Certified Spark Developer (MCSD) certification exam.

Curriculum

  • Pre-test
  • Lesson 4: Work with Pair RDD
  • Quiz 4
  • Lesson 5: Work with DataFrames
  • Quiz 5
  • Lesson 6: Monitor Apache Spark Applications
  • Quiz 6
  • Course Materials
  • Spark Developer Certification Study Guide
  • Slide Guide (Transcript)
  • Lab Guide
  • Join MapR Community Discussions
  • Lab Files
