DEV 362 - Create Data Pipelines Using Apache Spark

DEV 362 is the third course in the Apache Spark series. It covers the following Apache Spark libraries: Spark Streaming, Spark SQL, Spark MLlib and Spark GraphX.


About this course

This on-demand course is designed to be flexible to fit your schedule. Each lesson and quiz takes approximately 30 to 45 minutes to complete.

  • Option 1: Complete the course in one session, approximately 120 to 150 minutes
  • Option 2: Complete the course over four days, at 30 to 45 minutes per day

Lab activities take additional time and vary based on your system.

DEV 362 describes the benefits of the Apache Spark unified platform and how to build data pipeline applications using Spark Streaming, Spark SQL, Spark GraphX and MLlib. The concepts are taught using scenarios in Scala that also form the basis of the hands-on labs.

Syllabus

Lesson 7 – Introduction to Apache Spark Data Pipelines

  • Identify Components of Apache Spark Unified Stack
  • Benefits of the Apache Spark Unified Stack over the Hadoop ecosystem
  • Describe Data Pipeline Use Cases

Lesson 8 – Create an Apache Spark Streaming Application

  • Spark Streaming Architecture
  • Create DStreams
  • Create a simple Spark Streaming Application
    • Lab: Create a Spark Streaming Application
  • DStream Operations
    • Lab: Apply operations on DStreams
  • Apply DStream Operations
  • Use Spark SQL to query DStreams
  • Define Window Operations
    • Lab: Add windowing operations
  • Describe how DStreams are fault-tolerant

Lesson 9 – Use Apache Spark GraphX to Analyze Flight Data

  • Describe GraphX
  • Define a property graph
    • Lab: Create a Property Graph
  • Perform operations on Graphs
    • Lab: Apply Graph Operations

Lesson 10 – Use Apache Spark MLlib to Predict Flight Delays

  • Describe Spark MLlib
  • Describe a generic classification workflow
  • Describe common terms for supervised learning
  • Use a decision tree for Classification and Regression
    • Lab: Create a DecisionTree model to predict flight delays on streaming data

Prerequisites for Success in the Course

Review the following prerequisites carefully and decide if you are ready to succeed in this programming-oriented course. The instructor will move forward with lab exercises on the assumption that you have mastered the skills listed below.

Required
  • Basic to intermediate Linux knowledge, including the ability to use a text editor such as vi, and familiarity with basic commands such as mv, cp, ssh, grep, cd, useradd
  • Knowledge of application development principles
  • A Linux, Windows or macOS computer with the MapR Sandbox installed (for the on-demand course)
  • Connection to a Hadoop cluster via SSH and web browser (for the ILT and vILT course)
Recommended
  • Knowledge of functional programming
  • Knowledge of Scala or Python
  • Beginner fluency with SQL
  • HDE 100 - Hadoop Essentials Certification
  • This course is part of the preparation for the MapR Certified Spark Developer (MCSD) certification exam.

Curriculum

  • Lesson 7 - Introduction to Apache Spark Data Pipelines
  • Lesson 8 - Create Data Pipelines With Apache Spark
  • Lesson 8 Quiz
  • Lesson 9 - Use Apache Spark GraphX
  • Lesson 9 Quiz
  • Lesson 10 - Use Apache Spark MLlib
  • Lesson 10 Quiz
  • Course Materials
  • Spark Developer Certification Study Guide
  • Slide Guide (Transcript)
  • Lab Guide
  • Lab Files and Data
  • Lab Environment Connection Guide
  • Join course discussions in the MapR Academy Community
