AWS Data Engineer Master’s Program

DASVM’s AWS Data Engineer Master Program is geared towards people who want to enhance their AWS skills to help organizations design and migrate their architecture to the cloud. The course is designed to equip learners with the essential skills and knowledge required to build, manage, and scale data pipelines, data lakes, and analytical solutions on the Amazon Web Services (AWS) platform. It is ideal for individuals who aim to become proficient in leveraging AWS for data engineering tasks such as data ingestion, storage, processing, analytics, and security. If you want to build your career in the AWS data engineering domain, this online course will teach you to implement data engineering concepts on the AWS platform. In addition, you will learn to build data engineering pipelines with the help of Lambda, Athena, Glue, EMR, and other services.

 
Course Objectives:
 

In this course, you will learn:

 
  • Data engineering concepts and AWS services
  • AWS essentials such as S3, IAM, and EC2
  • Managing AWS IAM users, groups, roles, and policies for Role-Based Access Control (RBAC)
  • Engineering Batch Data Pipelines using AWS Glue Jobs
  • Running Queries using AWS Athena, a serverless query engine service
  • Using AWS Elastic MapReduce (EMR) Clusters for reports and dashboards
  • Data Ingestion using AWS Lambda Functions
  • Engineering Streaming Pipelines using AWS Kinesis
  • Streaming Web Server logs using AWS Kinesis Firehose
  • Running AWS Athena queries and commands using the CLI
  • Creating an AWS Redshift Cluster, creating tables, and performing CRUD Operations
   

Course content

 

1. Python

 

Introduction to Python (8hrs)
  • Basics of Python
  • Data Structures in Python
  • Control Structures
  • Functions in Python
  • OOPS in Python
Programming in Python (3hrs)
  • Basic Coding
  • Lists
  • Strings
  • Other Data Structures
Python Practice Questions (2hrs)

 

2. Database Management System

 

Database Design (3hrs)
  • What is a Data Warehouse?
  • Structure of a Data Warehouse
  • Star Schema
  • OLAP vs OLTP
  • ETL
  • Entity Constraints
  • Referential Constraints
  • Semantic Constraints
  • ERDs
  • Star Schema: A Demonstration
Database Creation (2hrs)
  • Introduction to DDL and DML
  • DDL
  • DML
  • Modifying Columns
Querying in MySQL (2hrs)
  • Introduction
  • SQL Statements and Operators
  • Aggregate Functions
  • Ordering and Having Clause
  • Views
Joins (3hrs)
  • Types of Joins
  • Self Joins
  • Cross Joins
  • Set Operations
Advanced SQL (4hrs)
  • Window Functions
  • Case Statements
  • Stored Routines and Cursors
  • Query Optimisation techniques
Problem Solving using SQL (2hrs)
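
As a small preview of the window-function topic listed above, here is a hedged sketch that ranks orders per customer. It uses Python's built-in sqlite3 module (rather than MySQL) purely so the snippet is self-contained; the table and data are made up, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database with a toy orders table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 1, 120.0), ('alice', 2, 80.0),
        ('bob',   3, 50.0),  ('bob',   4, 200.0);
""")

# ROW_NUMBER() ranks each customer's orders by amount, highest first
rows = conn.execute("""
    SELECT customer, order_id, amount,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY customer, rnk
""").fetchall()

for customer, order_id, amount, rnk in rows:
    print(customer, order_id, amount, rnk)
```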

 

3. Python for Data Science

 

NumPy (2hrs)
  • Basics of NumPy
  • Operations on 1-D Arrays
  • Multidimensional Arrays
  • Computation Times in NumPy vs Python Lists
Pandas (2hrs)
  • Basics of Pandas
  • Pandas – Rows and Columns
  • Describing Data
  • Indexing and Slicing
  • Operations on Dataframes
  • Group by Aggregate Functions
  • Merging Data Frames
  • Pivot Tables
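
To give a feel for the Pandas topics listed above, here is a minimal sketch of a group-by aggregation, a merge, and a pivot table. The sales and target figures are invented for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["a", "b", "a", "b"],
    "revenue": [100, 150, 90, 210],
})
targets = pd.DataFrame({"region": ["north", "south"], "target": [300, 250]})

# Group-by aggregate: total revenue per region
totals = sales.groupby("region", as_index=False)["revenue"].sum()

# Merge the aggregate with another DataFrame on a common key
report = totals.merge(targets, on="region")

# Pivot table: revenue by region and product
pivot = sales.pivot_table(index="region", columns="product", values="revenue", aggfunc="sum")

print(report)
print(pivot)
```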

 

4. Data Visualisation in Python

 

Data Visualisation in Python (2hrs)
  • Industry Level Case Study
  • Matplotlib
  • Seaborn

 

5. Exploratory Data Analysis

 

EDA (3hrs)
  • Data Sourcing
  • Data Cleaning
  • Univariate Analysis
  • Bivariate and Multivariate Analysis
Industry Level Data Analysis (2hrs)

 

6. Data Management and Relational Modelling

 

Data Management and Relational Modelling (3hrs)
  • Data Management
  • E-R Models
  • Relational Models
  • Data Normalisation

 

7. Introduction to Cloud Computing & AWS Setup

 

Cloud Computing (3hrs)
  • Introduction to Cloud Computing
  • Benefits of Cloud Computing
  • Cloud-based Architecture & Deployment Models
  • Types of Cloud Services
AWS (3hrs)
  • Introduction to AWS
  • Virtual Machine on Cloud – EC2
  • EC2 – Login, File Transfer & Instance Termination
AWS EMR (3hrs)
  • Setting up an Amazon EMR cluster
  • EMR – Login & File Transfer
  • Practising Linux Commands
  • EMR – Instance Termination
AWS Services (6hrs)
  • AWS Glue
  • AWS Lambda
  • AWS S3
Virtual Machines (1hr)
  • Introduction to Virtualisation

 

8. Introduction to Hadoop and MapReduce Programming

 

Introduction to Hadoop (2hrs)
  • Introduction to Distributed Systems
  • Introduction to GFS and MapReduce
  • Introduction to Hadoop
  • Hadoop 2.x and YARN
  • Task Processing in Hadoop
  • Tools for Hadoop
Introduction to HDFS (2hrs)
  • File Storage in HDFS
  • Basic Commands in HDFS
  • Write Operation in HDFS
  • Rack Awareness in Hadoop
  • Read Operation in HDFS
  • Features and Limitations of HDFS
MapReduce Programming (3hrs)
  • Introduction to MapReduce Framework
  • Basic Implementation of MapReduce using Python
  • Hadoop Streaming
  • The Combiner
  • The Partitions
  • Job Scheduling and Fault Tolerance
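
The word-count sketch below illustrates the Hadoop Streaming pattern covered in this module: a mapper and a reducer that read from stdin and emit tab-separated key/value pairs. In a real job the two functions would live in separate scripts passed to the hadoop-streaming jar with -mapper and -reducer; this single file only shows the shape of the logic.

```python
import sys
from itertools import groupby


def mapper(lines):
    # Emit (word, 1) for every word in the input
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer(lines):
    # Hadoop sorts mapper output by key, so equal words arrive grouped together
    pairs = (line.strip().split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")


if __name__ == "__main__":
    # Run as: script.py map    < input.txt
    #     or: script.py reduce < sorted_mapper_output.txt
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```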

 

9. NoSQL Databases and Apache HBase

 

Introduction to NoSQL Databases and Apache HBase (2hrs)
  • Introduction
  • Why NoSQL Databases?
  • How Are NoSQL Databases Designed?
  • Types of NoSQL Databases and Use Cases
  • Introduction to HBase
  • Data Model of HBase
  • Setting up an EMR instance for HBase
  • HBase Shell Commands
Programming in HBase (2hrs)
  • Introduction
  • HappyBase – HBase Python API
  • HappyBase – Use case
How HBase Works (3hrs)
  • Introduction
  • HBase Architecture
  • Read Operation in HBase
  • Write Operation in HBase
  • HBase Schema Design
  • HBase Use Cases
  • HBase Advantages and Disadvantages
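
As a taste of the HappyBase API mentioned above, here is a hedged sketch that writes and reads a single HBase row. It assumes an HBase Thrift server is reachable on localhost, and the table and column-family names are placeholders.

```python
import happybase

connection = happybase.Connection("localhost")  # HBase Thrift server host (assumed)
table = connection.table("user_events")         # illustrative table name

# Write a row: keys and values are bytes, columns are "family:qualifier"
table.put(b"user#1001", {b"events:last_login": b"2024-01-15"})

# Read it back by row key
row = table.row(b"user#1001")
print(row.get(b"events:last_login"))

connection.close()
```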

 

10. Data Ingestion with Apache Sqoop and Apache Flume

 

Introduction to Data Ingestion (2hrs)
  • Introduction
  • Session Overview
  • Data Ingestion
  • Challenges in Data Ingestion
  • Key Steps of Data Ingestion
  • Tools for Data Ingestion
  • Types of Data and File Formats
Apache Sqoop – I (2hrs)
  • Session Overview
  • Introduction to Sqoop and its Architecture
  • Case Studies for Apache Sqoop
  • Apache Sqoop Set-Up and Database Set-Up
  • Exporting Data – Sqoop Export
  • Importing Data – Sqoop Import
  • Importing Data – Importing All Tables Using Sqoop
  • Importing Data – Handling NULL Values
  • Importing Data – Handling Mappers for a Sqoop Job
  • Importing Data – Importing in Various File Formats
  • Importing Data – Compression using Sqoop
  • Extra Coding Questions – Sqoop – I
Apache Sqoop – II (2hrs)
  • Session Overview
  • Importing Data – Importing Specific Rows in Sqoop
  • Importing Data – SQL Queries in Sqoop Import
  • Importing Data – Using Incremental Import in Sqoop
  • Sqoop Jobs
  • Tuning Sqoop
  • Extra Coding Questions – Sqoop – II
Apache Flume (2hrs)
  • Session Overview
  • Introduction to Apache Flume
  • Components of Flume
  • Characteristics and Use Cases of Flume
  • Case Study – Log Collection
  • Installation of Flume on Amazon EMR Instance
  • Flume Configuration Files
  • Flume Flows
  • Log Collection using Flume
  • Tuning Flume
  • Sqoop vs. Flume
  • Flume Practice Questions

 

11. Hive and Querying

 

Introduction to Hive (3hrs)
  • Module Mind Map
  • Session Overview
  • Introduction to Hive
  • Hive at Ola & Pinterest
  • Key Features of Hive
  • Use Cases of Hive
  • Architecture of Hive
  • Hive vs Relational Databases
  • Hive Data Models
  • Data Types in Hive
Basic Hive Queries (2hrs)
  • Session Overview
  • EMR and Hue Setup
  • Database Creation
  • Internal and External Tables I
  • Internal and External Tables II
  • Operations on Tables
  • Order By and Sort By
  • Distribute By and Cluster By
  • Indexing I
  • Indexing II
  • User-Defined Functions
  • Practice Question
Advanced Hive Queries (2hrs)
  • Introduction
  • Joins in Hive
  • Static Partitioning
  • Dynamic Partitioning and Dropping the Partitions
  • Bucketing
  • Practice Questions
Data Analysis using Hive (1hr)
  • Introduction
  • Load Amazon Review Data Set
  • External Table Creation
  • Data Analysis Without Partition
  • Data Analysis Using Partition
  • HBase-Hive Integration
  • Practice Questions

 

12. Amazon Redshift

 

Traditional Warehouse Vs. Amazon Redshift (1hr)
  • Module Introduction
  • Session Introduction
  • Recap: Data Warehousing
  • On-Premise vs Cloud Data Warehouses
  • Why Amazon Redshift?
  • Industrial Use Cases of Amazon Redshift
Redshift: Introduction and Architecture (2hrs)
  • Session Introduction
  • Introduction to Amazon Redshift
  • Redshift Architecture
  • Key Performance Features of Redshift
  • SORT Key I
  • SORT Key II & ZONE Maps
  • Data Distribution: DIST Key
Redshift Administration (2hrs)
  • Session Introduction
  • Creating a Redshift Cluster
  • Redshift Cluster: Node Types & Maintenance
  • Workload Management
  • Fault Tolerance and Security
Redshift Development (2hrs)
  • Session Introduction
  • Getting Started With Redshift Queries
  • Best Practices for Redshift Tables
  • Loading Data Into Redshift Tables
  • Data Analysis With Redshift
  • Custom Queries With Redshift
  • Query Optimisation in Redshift
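
A minimal sketch of the Redshift development topics above: connecting with psycopg2, creating a table with a DISTKEY and SORTKEY, and bulk-loading it with COPY from S3. The cluster endpoint, credentials, bucket, and IAM role ARN are placeholders, not values from the course.

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",
)

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            sale_id   INT,
            region    VARCHAR(32),
            amount    DECIMAL(10, 2)
        )
        DISTKEY (region)    -- distribution key, as discussed above
        SORTKEY (sale_id);  -- sort key used by zone maps
    """)
    # COPY is the recommended bulk-load path into Redshift tables
    cur.execute("""
        COPY sales
        FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV;
    """)

conn.close()
```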

 

13. Introduction to Apache Spark

 

Getting Started with Apache Spark (2hrs)
  • Module Introduction
  • Session Overview
  • Spark Overview
  • Spark vs. MapReduce
  • Spark Ecosystem
  • Spark Architecture
  • Spark APIs
Programming with Spark RDD (2hrs)
  • Session Overview
  • Spark Installation
  • Introduction to Spark RDDs
  • Creating RDDs
  • Operations on RDDs
  • Transformation Operations
  • Action Operations
  • Lazy Evaluation in Spark
Spark Structured APIs (5hrs)
  • Session Overview
  • Introduction to Structured APIs
  • DataFrames and Datasets
  • Catalyst Optimizer
  • Getting Started with DataFrame APIs
  • From Pandas Dataframe
  • DataFrame Operations
  • Spark SQL
ETL Project (3hrs)
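
A compact PySpark sketch of the extract-transform-load flow this module builds up to, using both the DataFrame API and Spark SQL. The input path, column names, and output location are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV data (placeholder path)
orders = spark.read.csv("s3://my-bucket/raw/orders/", header=True, inferSchema=True)

# Transform: filter bad rows and aggregate revenue per day
daily = (
    orders.filter(F.col("amount") > 0)
          .withColumn("order_date", F.to_date("order_ts"))
          .groupBy("order_date")
          .agg(F.sum("amount").alias("revenue"))
)

# The same transformation can be expressed in Spark SQL
orders.createOrReplaceTempView("orders")
spark.sql("SELECT to_date(order_ts) AS order_date, SUM(amount) AS revenue "
          "FROM orders WHERE amount > 0 GROUP BY to_date(order_ts)")

# Load: write the result as partitioned Parquet
daily.write.mode("overwrite").partitionBy("order_date").parquet("s3://my-bucket/curated/daily_revenue/")
```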

 

14. Optimising Spark for Large Scale Data Processing

 

Optimising Disk IO for Spark (2hrs)
  • Course Introduction
  • Module Introduction
  • Session Overview
  • Spinning Up a Spark EMR Cluster
  • Spark jobs – Can We Do Better?
  • Analysing a Spark job
  • Why Optimise a Spark job?
  • Understanding Disk IO in Spark
  • Using Various File Formats in Spark
  • Serialization and Deserialization in Spark
  • Spark Memory Management Parameters
  • Practice Coding Questions
Optimising Network IO for Spark (2hrs)
  • Session Overview
  • Understanding Network IO
  • Understanding Shuffles
  • Optimising Joins in Spark
  • Understanding Data Partitioning in Spark
  • Practice Coding Questions
Optimising the Spark Clusters (2hrs)
  • Session Overview
  • Why Optimise Cluster Utilisation for Spark?
  • Job Deployment Modes in Spark
  • Tuning Spark Memory and CPU Parameters
  • Cost and Performance Trade-Offs
  • Apache Spark in the Production Environment
  • Best Practices While Working with Apache Spark
  • The Optimised Spark Job!
  • Practice Coding Questions
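
The sketch below illustrates two of the optimisations discussed above: reading columnar Parquet to cut disk IO, and broadcasting a small dimension table so the join avoids a shuffle. Paths and column names are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimisation-sketch").getOrCreate()

# Parquet is columnar and compressed, so scans read far less data than CSV
events = spark.read.parquet("s3://my-bucket/events/")
countries = spark.read.parquet("s3://my-bucket/dim_country/")  # small dimension table

# Broadcasting the small table ships it to every executor, so the large
# table is joined locally instead of being shuffled across the network.
joined = events.join(broadcast(countries), on="country_code")

# Repartitioning before a wide write controls the number of output files
joined.repartition(64, "country_code").write.mode("overwrite").parquet("s3://my-bucket/joined/")
```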

 

15. Real-Time Data Streaming with Apache Kafka

 

Introduction to Kafka (2hrs)
  • Module Introduction
  • Session Introduction
  • Batch and Real-Time Processing
  • Traditional Messaging System
  • Kafka – Introduction and Features
  • Use-Cases of Kafka
  • Kafka Architecture
Kafka Internals (2hrs)
  • Session Introduction
  • Topics and Partitions
  • Producers and Consumers
  • Consumer Groups
  • Rebalancing
  • Topic Replication
Producer and Consumer Demo (2hrs)
  • Session Introduction
  • Starting Kafka Servers
  • Creating a Topic
  • Using CLI to Start Producers and Consumers
  • Python Code For Producers
  • Python Code For Consumers
Kafka Connect and Kafka Streams (2hrs)
  • Session Introduction
  • Introduction: Kafka Connect API
  • Intricacies of Kafka Connect
  • Demo: Kafka Connect – Fetching Tweets
  • Introduction: Kafka Streams
  • Stream Processing Topology
  • Kafka Streams: Word Count Application
  • Running Word Count Demo Application
  • Practice Problem – Kafka Connect/Streams
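
A minimal kafka-python sketch in the spirit of the producer/consumer demo above (the module also shows the same flow from the CLI). The broker address, topic name, and consumer group are assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialise dictionaries to JSON and send them to a topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": "alice", "page": "/home"})
producer.flush()

# Consumer: join a consumer group and read messages from the same topic
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="clickstream-readers",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    break  # stop after one message in this sketch
```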

 

16. Real-Time Data Processing using Spark Streaming

 

Introduction To Spark Streaming (3hrs)
  • Module Introduction
  • Session Overview
  • What Is Streaming?
  • Differences Between Streaming And Micro-Batching
  • What Is Spark Streaming?
Getting Started With Structured Streaming (2hrs)
  • Session Overview
  • What Is Structured Streaming?
  • First Spark Structured Streaming Application
  • Triggers And Output Modes
  • Implementing Triggers And Output Modes
  • Using Transformations And Aggregations
  • Joins With Streams
  • Implementing Joins In Structured Streaming
Advanced Structured Streaming (2hrs)
  • Session Overview
  • Windows
  • Implementing Windows
  • Late-Arriving Data and Watermarks
Spark Integration – Apache Kafka (2hrs)
  • Session Overview
  • Kafka Integration
  • Session Summary
  • Module Summary
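
A hedged sketch of a Structured Streaming job that reads from Kafka and applies a window with a watermark, tying together the streaming and Kafka-integration topics above. The broker, topic, and schema are placeholders, and the spark-sql-kafka connector package must be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a stream of Kafka records (key/value arrive as binary columns)
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "clickstream")
            .load())

events = raw.select(
    F.col("value").cast("string").alias("page"),
    F.col("timestamp"),
)

# 5-minute tumbling window, tolerating data that arrives up to 10 minutes late
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "page")
          .count())

# Write each micro-batch update to the console
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```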

 

17. Automating Data Pipelines using Apache Airflow

 

Introduction to Apache Airflow (2hrs)
  • Module Introduction
  • Session Introduction
  • Understanding Data Pipelines
  • Data Pipeline Use Case: Uber
  • How to Automate a Data Pipeline?
  • Introduction to Apache Airflow
  • DAGs: Data Pipelines in Airflow
  • Airflow Architecture
Hands-On with Apache Airflow (2hrs)
  • Session Introduction
  • Airflow Installation on EMR instance
  • Operators
  • Bash Operator
  • Python Operator
  • Sqoop Operator
  • Hive Operator
  • Spark Operator
Real-World Use Case of Airflow (2hrs)
  • Session Overview
  • Problem Statement
  • Coding Demonstration
  • DAG Construction
  • Spark Applications
  • Setting Task Dependencies
  • Running our DAG
  • Airflow Best Practices
  • Advantages and Limitations Of Airflow
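
A small Airflow DAG sketch showing the building blocks named above: a BashOperator, a PythonOperator, and a task dependency. It assumes Airflow 2.x; the DAG id, schedule, and callables are illustrative.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    print("transforming the extracted data")


with DAG(
    dag_id="daily_etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    load = PythonOperator(task_id="transform_and_load", python_callable=transform)

    extract >> load  # task dependency: extract runs before transform_and_load
```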

 

18. Analytics Using PySpark

 

Basic EDA Using Spark ML Library (2hrs)
  • Module Introduction
  • Session Introduction
  • MLLib Overview
  • Impute
  • Feature Transformer: Vector Assembler
  • Pipeline
Analysis using Spark (2hrs)
Capstone Project (3hrs)
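
A minimal sketch of the MLlib pieces listed above: an Imputer to fill missing values, a VectorAssembler to build the feature vector, and a Pipeline chaining the two. The toy data and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, VectorAssembler

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, 10.0, None), (2, None, 3.0), (3, 7.0, 4.0)],
    ["id", "f1", "f2"],
)

# Fill missing numeric values with the column mean
imputer = Imputer(inputCols=["f1", "f2"], outputCols=["f1_i", "f2_i"])

# Combine the imputed columns into a single feature vector
assembler = VectorAssembler(inputCols=["f1_i", "f2_i"], outputCol="features")

pipeline = Pipeline(stages=[imputer, assembler])
features = pipeline.fit(df).transform(df)
features.select("id", "features").show()
```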

 

19. Other AWS Services

 

Amazon Simple Storage Service (Amazon S3)
  • Introduction to S3
  • AWS Management Console, AWS CLI, Boto3
  • Putting Objects, Bucket Properties
  • S3 Multipart Upload, Storage Classes
  • S3 Security and Encryption
  • Amazon CloudFront, Edge Locations and Route 53
  • Demonstration and hands-on labs: creating S3 buckets, static website hosting, CloudFront
  • Database Engine Types
  • Relational Database Service (RDS)
  • Serverless Options
  • Lab: RDS Instances and Engines
  • AWS Elastic MapReduce (EMR)
  • Use Cases & Hands-on
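
A hedged boto3 sketch for the S3 topics above: creating a bucket, putting an object, and using the managed (multipart-capable) upload. The bucket and key names are placeholders, and credentials are assumed to come from the environment or an attached IAM role.

```python
import boto3

s3 = boto3.client("s3")

# Create a bucket (outside us-east-1 a LocationConstraint is also required)
s3.create_bucket(Bucket="my-demo-data-lake-bucket")

# Put a small object directly
s3.put_object(Bucket="my-demo-data-lake-bucket", Key="raw/hello.txt", Body=b"hello, s3")

# upload_file uses managed (multipart) transfer for larger files
s3.upload_file("local_data.csv", "my-demo-data-lake-bucket", "raw/local_data.csv")

# List what landed under the prefix
for obj in s3.list_objects_v2(Bucket="my-demo-data-lake-bucket", Prefix="raw/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```
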
AWS Security using IAM – Managing AWS Users, Roles and Policies
  • Creating AWS IAM Users with Programmatic and Web Console Access
  • Logging into AWS Management Console using AWS IAM User
  • Validate Programmatic Access to AWS IAM User via AWS CLI
  • Getting Started with AWS IAM Identity-based Policies
  • Managing AWS IAM User Groups
  • Managing AWS IAM Roles for Service Level Access
  • Overview of AWS Custom Policies to grant permissions to Users, Groups, and Roles
  • Managing AWS IAM Groups, Users, and Roles using AWS CLI
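
The boto3 sketch below mirrors the IAM tasks above: create a user, add it to a group, and attach an AWS managed policy to that group. All names are placeholders.

```python
import boto3

iam = boto3.client("iam")

iam.create_user(UserName="data-engineer-01")
iam.create_group(GroupName="data-engineers")
iam.add_user_to_group(GroupName="data-engineers", UserName="data-engineer-01")

# Grant read-only S3 access to everyone in the group via a managed policy
iam.attach_group_policy(
    GroupName="data-engineers",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
```
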
AWS Lambda
  • Introduction to AWS Lambda
  • Introduction to Data Collection and Getting Data Into AWS
  • Direct Connect, Snowball, Snowball Edge, Snowmobile
  • Database Migration Service
  • Data Pipeline
  • Lambda, API Gateway, and CloudFront
  • Features
  • Use Cases
  • Limitation
  • Hands on Labs
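
A minimal Lambda handler sketch for the ingestion pattern above: the function is assumed to be triggered by an S3 event notification and simply logs each object that arrives. The handler signature is the standard Lambda one; the rest is illustrative.

```python
import json


def lambda_handler(event, context):
    # S3 event notifications carry a list of records, one per object
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"new object: s3://{bucket}/{key}")

    return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}
```
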
AWS Glue
  • What is AWS Glue
  • ETL With Glue
  • Working in Python Shell
  • Working in Spark Shell
  • Checking Logs
  • AWS Glue Data Catalog
  • AWS Glue Jobs
  • Glue Job Demo
  • Job Bookmarks
  • AWS Glue Crawlers
  • ETL Project
  • Use Cases & Pricing
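
A hedged sketch of a PySpark Glue job in the shape this module teaches: read a crawled table from the Glue Data Catalog as a DynamicFrame, transform it as a DataFrame, and write curated Parquet back to S3. The database, table, and path names are assumptions, and the awsglue library is only available inside a Glue job run.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a crawled Data Catalog table as a DynamicFrame (placeholder names)
orders = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="raw_orders")

# Convert to a Spark DataFrame for transformations, then write curated Parquet
df = orders.toDF().filter("amount > 0")
df.write.mode("overwrite").parquet("s3://my-bucket/curated/orders/")

job.commit()
```
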
AWS Kinesis
  • Building Streaming Pipeline using Kinesis
  • Rotating Logs
  • Setup Kinesis Firehose Agent
  • Create Kinesis Firehose Delivery Stream
  • Planning the Pipeline
  • Create IAM Group and User
  • Granting Permissions to IAM User using Policy
  • Configure Kinesis Firehose Agent
  • Start and Validate Agent
  • Conclusion – Building a Simple Streaming Pipeline
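
A small boto3 sketch that pushes a record into a Kinesis Data Firehose delivery stream, similar to what the Firehose agent does with rotated web-server log lines. The stream name and record contents are placeholders.

```python
import json
import boto3

firehose = boto3.client("firehose")

log_line = {"ip": "10.0.0.1", "path": "/index.html", "status": 200}

# Firehose buffers records and delivers them to the configured destination (e.g. S3)
firehose.put_record(
    DeliveryStreamName="web-logs-stream",
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)
```
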
AWS Athena
  • What is Athena
  • Features
  • Use Cases
  • Creating Athena Tables
  • Using Glue Crawlers
  • Querying Athena Tables
  • When To Use Athena
  • Visualizations and Dashboards
  • Security and Authentication
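
A hedged boto3 sketch for querying an Athena table from Python: start the query, poll until it finishes, then read the results. The database, table, and S3 output location are placeholders.

```python
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page",
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query leaves the QUEUED/RUNNING states
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```
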
AWS Integrations and Use Cases
  • Amazon MQ
  • Amazon SNS
  • Amazon SQS
  • Amazon SWF
  • AWS Step Functions
  • Interview Preparations
  • Mock Interviews
  • Implementing End to End Project
  • Assessments
Introduction to Aurora
  • DB Clusters
  • Connection Management
  • Storage and Reliability
  • Security
  • High Availability and Global Databases for Aurora
  • Replication with Aurora
  • Setting Environment
  • Amazon RDS Aurora Architecture
  • Aurora Metrics, Logging & Events
  • Aurora Scaling and High Availability
  • Configuring Security

 


Course Prerequisites

 
  • Experience and expertise working with AWS services to design, build, secure, and maintain analytics solutions
  • Python & SQL knowledge
  • A fundamental level understanding of big data and Hadoop concepts

Who can attend

 
  • Existing AWS power users
  • Beginner Python developers curious about data science
  • Data engineers
  • BI/ETL developers
  • Data scientists and analysts
  • Anyone from a technical background who wants to learn data engineering on AWS
  • Professionals who wish to learn advanced ways of using AWS and building a data warehouse
  • Beginners can also learn from scratch, but will have to go through the extensive lectures

Number of Hours: 150

Certification

AWS Certified Data Engineer – Associate (DEA-C01)

Key features

  • One to One Training
  • Online Training
  • Fastrack & Normal Track
  • Resume Modification
  • Mock Interviews
  • Video Tutorials
  • Materials
  • Real Time Projects
  • Virtual Live Experience
  • Preparing for Certification

FAQs

DASVM Technologies offers 300+ IT training courses delivered by expert trainers with 10+ years of experience.

  • One to One Training
  • Online Training
  • Fastrack & Normal Track
  • Resume Modification
  • Mock Interviews
  • Video Tutorials
  • Materials
  • Real Time Projects
  • Preparing for Certification

Call now: +91-99003 49889 to know about the exciting offers available for you!

We work and coordinate with companies exclusively to help our students get placed. We have a placement cell focusing on training and placements in Bangalore, which helps more than 600 students per year.

Learn from experts active in their field, not out-of-touch trainers: leading practitioners who bring current best practices and case studies to sessions that fit into your work schedule. We have a pool of highly skilled and experienced trainers who support you in specific tasks and provide professional help, along with 24x7 learning support from mentors and a community of like-minded peers to resolve any conceptual doubts. Our trainers have contributed to the growth of our clients as well as individual professionals.

All of our highly qualified trainers are industry experts with at least 10-12 years of relevant teaching experience. Each of them has gone through a rigorous selection process which includes profile screening, technical evaluation, and a training demo before they are certified to train for us. We also ensure that only those trainers with a high alumni rating continue to train for us.

No worries. DASVM Technologies assures that no one misses a single lecture topic. We will reschedule classes at your convenience within the stipulated course duration wherever possible. If required, you can even attend that topic with another batch.

DASVM Technologies provides many suitable modes of training, such as:

  • Classroom training
  • One to One training
  • Fast track training
  • Live instructor-led online training
  • Customized training

Yes, access to the course material will be available for a lifetime once you have enrolled in the course.

You will receive a DASVM Technologies recognized course completion certificate, and our training will help you crack the global certification.

Yes, DASVM Technologies provides corporate training with course customization, learning analytics, cloud labs, certifications, and real-time projects, backed by 24x7 support.

Yes, DASVM Technologies provides group discounts for its training programs. Depending on the group size, we offer discounts as per the terms and conditions.

We accept all major payment options: cash, card (MasterCard, Visa, Maestro, etc.), wallets, net banking, cheques, and more.

DASVM Technologies has a no-refund policy; fees once paid will not be refunded. If a candidate is unable to attend a training batch, he/she can reschedule to a future batch. Any balance due should be cleared by the given date. If the trainer cancels or is unavailable to provide training, DASVM will arrange sessions with a backup trainer.

Your access to the support team is for a lifetime and is available 24/7. The team will help you resolve queries during and after the course.

Please contact our course advisor at +91-99003 49889, or share your queries through info@dasvmtechnologies.com.
