This course is divided into three parts.

Module 1: Fundamental Big Data Engineering

This module explores introductory topics in data engineering, the field of developing data processing solutions, in the context of Big Data environments. Specifically, it covers concepts, techniques and technologies related to the processing and storage of Big Data datasets, including MapReduce and NoSQL. It highlights the unique challenges faced when processing and storing Big Data datasets and introduces the main components of Hadoop, the de facto platform for data processing and data storage within Big Data environments. The following primary topics are covered:

  • Big Data Engineering & Big Data Engineering Challenges

  • Big Data Storage Terminologies (including sharding, replication, CAP theorem, ACID, BASE)

  • Big Data Storage Requirements

  • On-Disk Storage (including distributed file systems & databases)

  • Introduction to NoSQL & NewSQL

  • NoSQL Rationale & Characteristics

  • NoSQL Database Types (including key-value, document, column-family and graph databases)

  • Big Data Processing Requirements

  • Big Data Processing (including batch mode and realtime mode)

  • Introduction to MapReduce for Big Data Processing (batch mode)

  • MapReduce Explained (including map, combine, partition, shuffle and sort, and reduce)
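
To make the MapReduce stages named in the last topic above more concrete, here is a minimal single-process sketch in Python that counts words. It is an illustration only and is not taken from the course material or from Hadoop: the function names (map_phase, combine, partition, shuffle_and_sort, reduce_phase, word_count), the use of a CRC32 hash for partitioning and the sample input are all assumptions chosen for readability.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combine: pre-aggregate one mapper's output before it is transferred."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Partition: choose a reducer for a key with a deterministic hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Shuffle and sort: route pairs to reducers, then sort and group them by key."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    grouped = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))
        grouped.append([
            (key, [value for _, value in group])
            for key, group in groupby(bucket, key=itemgetter(0))
        ])
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(splits, num_reducers=2):
    """Run every stage in order; each split stands in for one mapper's input."""
    mapper_outputs = []
    for split in splits:
        pairs = []
        for line in split:
            pairs.extend(map_phase(line))
        mapper_outputs.append(combine(pairs))
    result = {}
    for reducer_input in shuffle_and_sort(mapper_outputs, num_reducers):
        for key, values in reducer_input:
            word, total = reduce_phase(key, values)
            result[word] = total
    return result


if __name__ == "__main__":
    # Hypothetical input: two splits, as if handled by two mappers.
    sample_splits = [["big data big"], ["data engineering", "big"]]
    print(word_count(sample_splits))
```

In a real cluster each stage runs on different machines and the shuffle moves data over the network; the sketch only shows the order of the stages and what each one contributes.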

Module 2: Advanced Big Data Engineering

This module builds upon Module 1 by exploring advanced topics pertaining to the storage and processing of Big Data datasets. Specifically, it covers advanced Big Data engineering mechanisms, in-memory data storage and realtime data processing. It presents further considerations for developing MapReduce algorithms and introduces the Bulk Synchronous Parallel (BSP) processing engine, along with a discussion of graph data processing. The Big Data mechanisms required for developing Big Data pipelines, the stages of those pipelines and the design process involved in developing Big Data processing solutions are also explored. The following primary topics are covered:

  • Advanced Big Data Engineering Mechanisms (including serialization & compression engines)

  • In-Memory Storage Devices, In-Memory Data Grids & In-Memory Databases

  • Read-Through, Read-Ahead, Write-Through & Write-Behind Integration Approaches

  • Polyglot Persistence (including Explanation, Issues & Recommendations)

  • Realtime Big Data Processing Concepts (including Speed Consistency Volume (SCV), Event Stream Processing (ESP) & Complex Event Processing (CEP))

  • General Realtime Big Data Processing, and Realtime Big Data Processing & MapReduce

  • Advanced MapReduce Algorithm Design

  • Bulk Synchronous Parallel (BSP) Processing Engine & BSP vs. MapReduce

  • Graph Data & Graph Data Processing using BSP

  • Big Data Pipelines (including Definition and Stages)

  • Big Data with Extract-Load-Transform (ELT)

  • Big Data Solutions (including Characteristics, Design Considerations & Design Process)
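
As an illustration of the integration approaches listed among the topics above, the following Python sketch shows one way read-through, write-through and write-behind behavior could look when an in-memory cache sits in front of a slower store. It is a simplified assumption rather than the course's own design: the CachedStore class, its method names and the plain dictionary standing in for on-disk storage are hypothetical, and read-ahead is left out for brevity.

```python
class CachedStore:
    """In-memory cache in front of a backing store (a dict stands in for disk)."""

    def __init__(self, backing_store, write_behind=False):
        self.backing_store = backing_store
        self.cache = {}
        self.write_behind = write_behind
        self.pending = {}  # writes accepted by the cache but not yet persisted

    def read(self, key):
        # Read-through: on a cache miss, load from the backing store and keep a copy.
        if key not in self.cache:
            self.cache[key] = self.backing_store[key]
        return self.cache[key]

    def write(self, key, value):
        self.cache[key] = value
        if self.write_behind:
            # Write-behind: acknowledge immediately and persist later in a batch.
            self.pending[key] = value
        else:
            # Write-through: persist to the backing store as part of the same call.
            self.backing_store[key] = value

    def flush(self):
        # Persist every queued write-behind update, then clear the queue.
        self.backing_store.update(self.pending)
        self.pending.clear()


if __name__ == "__main__":
    disk = {"a": 1}
    store = CachedStore(disk, write_behind=True)
    store.write("b", 2)
    print("b" in disk)  # the write has not reached the backing store yet
    store.flush()
    print(disk["b"])
```

The trade-off the sketch exposes is that write-through keeps the backing store current at the cost of slower writes, while write-behind makes writes fast but leaves a window in which unflushed data exists only in memory.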

Module 3: Big Data Engineering Lab

This course module covers a series of exercises and problems designed to test the participant's ability to apply knowledge of the topics covered in Modules 1 and 2. Completing this lab helps highlight areas that require further attention and demonstrates hands-on proficiency in Big Data engineering practices as they are applied and combined to solve real-world problems.

As a hands-on lab, this module incorporates a set of detailed exercises that require participants to solve various interrelated problems, with the goal of fostering a comprehensive understanding of how different data engineering technologies, mechanisms and techniques can be applied to solve problems in Big Data environments.

For instructor-led delivery of this lab course, the Certified Trainer works closely with participants to ensure that all exercises are carried out completely and accurately. Attendees can voluntarily have exercises reviewed and graded as part of the class completion.