NIELIT Virtual Academy

National Institute of Electronics and Information Technology, Chennai
Autonomous Scientific Society of Ministry of Electronics & Information Technology (MeitY), Govt. of India
ISTE Complex, 25, Gandhi Mandapam Road, Chennai – 600025
Course Prospectus
NSQF Aligned
Mode: ONLINE (Blended)
PG Program in Data Engineering

Index
Objective of the Course
Outcome of the Course
Full Flow of Course
Course Structure
Course Fee Structure
Registration Fee
Eligibility
Number of Seats
How to Apply
Registration
Selection Criteria of candidates
Admission
Discontinuing the course
Location and how to reach
Important Dates
Examination & Certification

Course Name: PG Program in Data Engineering
Course Code: DS321
NSQF Level: 06
Duration: 960 Hours, 7 Months
Last Date of Registration: 13-12-2024
Date of publishing Provisional Selection List: 13-12-2024
Payment of first instalment fee: 16-12-2024
Course Start Date: 18-12-2024
Preamble:
Data Science refers to the extraction of knowledge from large volumes of structured or unstructured
data, and is a continuation of data mining and predictive analytics. It involves different
categories of analytical approaches for modelling various types of business scenarios and arriving at
solutions and strategies for optimal decision-making in marketing, finance, operations,
organizational behaviour and other managerial aspects. This new field of study breaks down into a
number of different areas, from constructing big data infrastructure and configuring the various
server tools that sit on top of the hardware, to performing the analysis and developing the right
transformations to generate useful results.
Objective of the Course:
The PG Program in Data Engineering is a 7-month program (960 hours, with 3 hours of theory and 5 hours
of practical work per day) offered by NIELIT Chennai. It is an excellent blend of knowledge and
practice in the field of Data Science and its industrial applications. The program is targeted at
creating qualified Data Science Engineers. The course progresses through the operating system,
concepts of data and its storage, programming for data science, and Big Data technology and its
implementation. Tools such as R and Python, along with MySQL, Apache Cassandra, Java programming
and the Hadoop framework, are used to solve critical business and analytics problems.
The program also offers six weeks of hands-on, real-life analytical projects so that participants
become equipped with strong analytical and programming skills, which makes them highly employable and
in demand on completion of the program. The course has been designed after a proper industry survey
and consultation with multiple industry leaders to ensure that participants learn exactly what
employers need.
The objective of this program is to prepare participants to take up roles such as Statistical Analyst,
Data Scientist, Data Analyst, Big Data Engineer and Hadoop Developer. There is a huge demand for
skilled manpower in Data Science, and there is a huge shortage of Data Science professionals
worldwide. Candidates who are interested in pursuing a career in this field therefore need to be
trained. Our objective is to create a pool of talent that can meet this demand. This course is meant
to sensitize students to computational statistics applications and usage, as well as to provide
hands-on experience with solving real-world data science problems.
Outcome of the Course:
On completion of the course, participants will have learned the concepts of Data Analytics using open
source statistical tools such as R and Python, along with visualization tools and techniques. They
will be able to implement an industry-oriented Data Analytics project.
Full Flow of Course
(1) Configuring Platform for Data Engineering
(2) Data Analytics & Machine Learning
(3) Understanding & Working on Big Data Technology
(4) Implementing Data Analytic Technique through Project

Course Structure
This course contains three modules. After completing the three modules, students have to do
a 120-hour project using any of the topics studied in the course.
S.No  Module Name                                   Th.   Pr.   Total
1.    Configuring Platform for Data Engineering      90   150     240
2.    Data Analytics and Machine Learning            90   150     240
3.    Big Data Analytics                             90   150     240
4.    Live Project / Presentation / Assignment        -     -     120
5.    Employability Skills (Internal Assessment)      -     -     120
6.    Major Project
      Duration (in Hours) / Total Marks             270   450     960
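As a quick cross-check of the totals in the Course Structure table, the following minimal Python sketch adds up the hours. The dictionary name and layout are illustrative only; the figures are copied from the table, and the Major Project row is left out because the table gives no hours for it.

```python
# Hours per component, copied from the Course Structure table.
# Each value is (theory_hours, practical_hours, total_hours);
# None stands for a cell the table shows as "-".
course_hours = {
    "Configuring Platform for Data Engineering": (90, 150, 240),
    "Data Analytics and Machine Learning": (90, 150, 240),
    "Big Data Analytics": (90, 150, 240),
    "Live Project / Presentation / Assignment": (None, None, 120),
    "Employability Skills (Internal Assessment)": (None, None, 120),
}

theory_total = sum(t for t, _, _ in course_hours.values() if t is not None)
practical_total = sum(p for _, p, _ in course_hours.values() if p is not None)
grand_total = sum(total for _, _, total in course_hours.values())

print(theory_total, practical_total, grand_total)  # 270 450 960
```

The three taught modules account for 720 hours, and the two 120-hour components bring the total to the stated 960 hours.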
Course Fee Structure: (Including GST)
Fee                             General        SC/ST   Last Date
Registration Fee                Rs. 1,000/-    NIL     13-12-2024
Tuition Fee, 1st Instalment     Rs. 22,000/-   NIL     16-12-2024
Tuition Fee, 2nd Instalment     Rs. 22,000/-   NIL     12-04-2025
Total                           Rs. 45,000/-   NIL
Tuition Fee includes NSQF Registration & Examination Fee.
*GST is applicable as per Govt. norms (currently 18%).
Registration Fee: Rs. 1,000/- (exemption for SC/ST candidates)
Registration Fee Refund Policy:
The registration fee is non-refundable if a candidate is selected for admission but does not join,
or if a candidate has applied but is not eligible.
However, the registration fee shall be refunded in the following special cases:
1) The candidate is eligible but not selected for admission.
2) The course is postponed and the new date is not convenient for the student.
3) The course is cancelled.



Eligibility
S. No.  Academic/Skill Qualification                                 Required Experience
        (with Specialization - if applicable)                        (with Specialization - if applicable)
1       • Pursuing first year of 2-year PG program in                -NA-
          Science/Maths/Statistics/Economics/Operations Research
          after completing 3-year UG degree in
          Science/Maths/Statistics/Economics/Operations Research
        • Pursuing PG diploma in Computer Application after
          3-year UG degree in CS or IT
        • Completed 4-year B.E./B.Tech
2       12th Grade Pass with 2 years of Vocational Education &       2 years relevant experience
        Training, e.g.
        • 12th Grade with 1 year NTC plus 1 year NAC/CITS
        • 12th Grade with 1 year NAC plus CITS
3       12th Grade Pass                                              4 years relevant experience
4       Previous relevant Qualification of NSQF Level 5.5            1.5 years relevant experience
5       Previous relevant Qualification of NSQF Level 5              3 years relevant experience
Number of Seats: 80(Eighty) – Total

Note: Seats are allocated based on the merit of the Qualification
How to Apply?
Candidates can apply online on our website http://nva.nielit.gov.in. The non-refundable
registration fee can be paid through any of the following modes:
• Payment Gateway
• Online transaction: Account No: 31185720641, Branch: Kottur (Chennai), IFS Code: SBIN0001669
• GPAY/any UPI, Credit Card
Note: The Institute will not be responsible for any mistakes made by either the bank
concerned or the depositor while remitting the amount into our account.
Last date of Registration: 13-12-2024
Registration Procedure
All interested candidates are required to fill in the online Registration form with all the necessary
information and pay the registration fee of Rs. 1,000/- (wherever applicable).
Selection Criteria:
Selection of candidates will be based on their marks in the qualifying examination, subject to
eligibility and availability of seats.
• The first list of Provisionally Selected Candidates will be published on the NIELIT Chennai
website (www.nielit.gov.in/chennai/index.php) on 13-12-2024 by 5:00 PM. In case of
vacancy, an additional selection list will be prepared and the selection will be intimated
by email only.
• The following documents of candidates will be verified:
o Qualifying Degree (Consolidated Marksheet/Degree Certificate/Course Completion Certificate),
10th and 12th mark sheets
o One passport-size photograph
o Self-attested copy of Govt.-issued photo ID card
o AADHAR copy
• All provisionally selected candidates have to pay the first instalment of Rs. 22,000/- on or
before 16-12-2024 by one of the payment modes mentioned above.
• Selected candidates are requested to upload the proof of remittance of fee on the registration
portal and also send the proof of remittance of fee by email to
[email protected]/[email protected]/[email protected]
Admission:
All provisionally selected candidates who have paid the fees (full or first instalment), once
verified by the accounts section of NIELIT Chennai, will get a welcome message at the login ID
provided during registration.
Note: All Provisionally Selected Candidates have to visit NIELIT Chennai for Certificate
Verification. Otherwise their candidature will be cancelled without any intimation.
The credentials and URL for the online portal will be shared through WhatsApp or email.
Discontinuing the course

• No fees shall be refunded under any circumstances to a student who has completed the
process of admission or who discontinues the course in between. No certificate shall be
issued for the classes attended.
• If candidates do not upload 3 consecutive assignments within the assigned time,
their candidature will be cancelled without any notice and all fees paid will be
forfeited.
• If candidates do not appear for any internal examination/practical, their
candidature will be cancelled without any notice and all fees paid will be forfeited.
Course Timings:
This program is practical-oriented and hence there shall be more lab sessions than
theory classes. The classes and labs are online and cloud-based, held from 10 am to 5:30 pm,
Monday to Friday. Within this window, any 04 hours can be fixed as class timings according to
the candidates' convenience and the faculty's availability, and students can do their lab
work in the remaining time.
Course enquiries
Students can enquire about the various courses either by telephone or in person
between 9.15 A.M. and 5.15 P.M. (lunch time 1.00 pm to 1.30 pm), Monday to
Friday.
Placement:
Placement guidance and career counselling will be given to students who have completed
the course successfully and qualified, to assist them in their interviews.
Important Dates
Last Date of Registration: 13-12-2024

Display of Provisional Selection List: 13-12-2024
Payment of first installment fee: 16-12-2024
Course Start Date: 18-12-2024
Payment of second instalment fee: 12-04-2025
Examination & Certification

• Final Certificates will be issued after successful completion of all the modules
including the mini project. To get the certificate, a candidate has to pass each module
individually with a minimum of 50% marks.
NSQF Examination Pattern:

• Means of assessment:
S. No  Examination Pattern                                  Modules Covered       Duration (Minutes)  Maximum Marks
1      Theory 1: Basic Linux, Java & Data Warehousing       Module 1              90                  100
2      Theory 2: Data Analytics & Machine Learning          Module 2              90                  100
3      Theory 3: Big Data Analytics                         Module 3              90                  100
4      Practical 1: Basic Linux, Java, Data Warehousing     Module 1 & Module 2   180                 90
       & Data Analytics
5      Practical 2: Big Data Analytics                      Module 3              180                 90
6      Internal Assessment                                  Module 1,2,3          -                   60
7      Assignment                                           Module 1,2,3          -                   60
8      Major Project                                        Module 1,2,3          -                   100
       Total                                                                                          700
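The maximum marks in the examination pattern can be checked against the stated total with a short Python sketch. The names below are illustrative. The helper applies the 50% pass mark to a single component, which is an assumption: the prospectus states the rule per module and does not spell out how component marks are combined.

```python
# Maximum marks per assessment component, copied from the examination pattern table.
MAX_MARKS = {
    "Theory 1": 100,
    "Theory 2": 100,
    "Theory 3": 100,
    "Practical 1": 90,
    "Practical 2": 90,
    "Internal Assessment": 60,
    "Assignment": 60,
    "Major Project": 100,
}

TOTAL_MARKS = sum(MAX_MARKS.values())


def passes_component(scored, maximum, pass_fraction=0.5):
    """Return True if the score meets the pass mark (50% by default).

    Applying the rule per component is an assumption made for illustration.
    """
    return scored >= pass_fraction * maximum


print(TOTAL_MARKS)  # 700
print(passes_component(45, MAX_MARKS["Practical 1"]))  # True: 45 is exactly 50% of 90
```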
Theory Papers
• Theory 1 – Configuring Platform for Data Engineering (Basic Linux, Java & Data
Warehousing)
• Theory 2 – Data Analytics and Machine Learning
• Theory 3 – Big Data Analytics
Practical Papers
• Practical 1 – Configuring Platform for Data Engineering & Machine Learning
(Basic Linux, Java, Data Warehousing & Data Analytics)
• Practical 2 – Big Data Analytics
Examination Centre: NIELIT Chennai, Mode: Online
Grading Scheme
The following grading scheme (on the basis of total marks) will be followed:
Grade                 S        A               B               C               D
Marks Range (in %)    >=85     >=75 and <85    >=65 and <75    >=55 and <65    >=50 and <55
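The grading bands can be read as a simple lookup from percentage to grade. The sketch below is one way to express that in Python; the function name is hypothetical, and returning None below 50% is an assumption, since the scheme lists no grade for that range.

```python
# Lower bound (in %) of each grade band, highest first, as given in the grading scheme.
GRADE_FLOORS = (("S", 85), ("A", 75), ("B", 65), ("C", 55), ("D", 50))


def grade_for(percent):
    """Map a total percentage to a grade letter.

    Returns None below 50%, where the scheme defines no grade (assumption).
    """
    for letter, floor in GRADE_FLOORS:
        if percent >= floor:
            return letter
    return None


# Example: 560 marks out of a total of 700 is 80%, which falls in the A band.
print(grade_for(100 * 560 / 700))  # A
```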
Detailed curriculum
Module 1: Configuring Platform for Data Engineering
Understanding Linux Environment & Basic Commands:
• Understanding Linux Environment:
o Introduction, Linux Architecture, Boot Process, Kernel, System Initialization, GUI and
CLI (access a shell prompt and issue commands with correct syntax)
• Commands:
o File handling commands, sort, tr, cut, find, grep, egrep, using filters, cat, mkdir, who and
other basic commands, vi editor
Linux Package Management and Process Monitoring:
• su login, sudo, apt-get, ps command, kill command and other related commands, single and
multi-user mode of Ubuntu
• Important Files and Directories
BASH Scripting:
• Introduction to BASH, Variables (System & User defined), Exporting Variables, Special Shell
Variables, Control Structures, Understanding execution mode of BASH script, Arrays, Functions,
BASH debugging
Case Study:
1. BASH Script for removing missing values
2. BASH Script for generating a Frequency Distribution Table from given data (consisting of
10000 records)
3. BASH Script for removing blank lines from a file
4. BASH Script to find the frequency of a word across several files
5. BASH Script to merge files based on some fields
Configuring Secure Shell & LAN
• LAN: Introduction, Configure LAN on Ubuntu
• Secure Shell:
o Understanding & Configuring Secure Shell, Access remote systems using ssh, SCP,
Passwordless SSH, Configure key-based authentication for SSH
User Administration
• User Management:
o Adding/Modifying/Deleting users, Understanding User ID and other related fields,
Understanding /etc/passwd and /etc/shadow, Password Aging Policies, Switching Accounts,
sudo access
• Group Management:
o User Private Groups, Group Administration
o Understanding SUID and SGID Executables, Sticky Bit, Default File
o chmod and chown commands
Virtualization
• Introduction to Virtualization
• Virtual Machine installation, Configuring Virtual Machines, Install Ubuntu/CentOS systems as
virtual guests, Configure systems to launch virtual machines at boot, Creating a clone of a
Virtual Machine and its restoration, Virtual LAN, Memory addressing, Paging, Memory mapping,
Virtual memory, Complexities and solutions of memory virtualization, VM configurations,
VM migrations, Migration types and process
• Basics of Information Security & Cloud
Java for Hadoop
Java Introduction:
• Concept of OOPs
• Introduction to Java
• Configure Java paths in the PATH variable and other related places in Linux
• Features of Java
• Working with Java Variables
• Declaring and Initializing Variables
• Primitive Data Types
• Class & Object Fundamentals
• Object Lifecycle
• Read and Write Java Object Fields
• Understanding JAR file and its working
Java Operators and Decision Constructs
Using Loop Constructs in Java
• while, for, switch case etc.
Array & String:
• Creating and using One-Dimensional Array
• Creating and using Multi-Dimensional Array
• String Class and related functions
Methods and Encapsulation:
• Java Method
• Static and Final Keyword
• Constructors and Access Modifiers in Java
• Encapsulation
Inheritance:
• Polymorphism, Casting and Super
• Abstract Class and Interfaces
Exception Handling:
• Types of Exceptions and Try-catch Statement
• Throws Statement and Finally Block
• Exception Classes
• Creating Custom Exception Classes
Work with Selected Classes
• String & StringBuffer
• Create and Manipulate Calendar Data
• Declare and Use ArrayList
Collection Framework
• Introduction to Collection Framework
• Core Collection in Java
• Core Collection framework
• Types of Collection
• Hierarchy of Collection Framework
• Commonly used methods of Collection interface
• Iterator Interface
• Methods of Iterator interface
File Handling and Serialization
• The Classes for Input and Output
• The Standard Streams
• Working with File Object
o File I/O Basics, Reading and Writing to Files
o Buffer and Buffer Management
• Read/Write Operations with FileChannel
• Serialization
Data Warehousing using MySQL
• Data warehousing concept
• Database Design using MySQL:
o Concept of RDBMS, Storage Engine, Structure of MySQL
o Creating Database, Data Types, Table etc.
• Relational Model and SQL:
o Relational Model, MySQL Query, Creating and Using a Database, Select, Operators, group
by, order by, Primary Key, etc.
• Database Design using the Relational Model:
o Making Relations between tables, Foreign Key, joins etc.
• Export & Import Data:
o Export and Import of External Data, Interacting with different tables, Backup and Recovery
Basics of NoSQL and Apache Cassandra
• Introduction to NoSQL and Cassandra:
o Understanding NoSQL, Types of NoSQL databases, Usage of NoSQL databases,
NoSQL Ecosystem, Overview of Cassandra, Features of Cassandra, Cassandra vs.
MongoDB
• Architecture of Apache Cassandra:
o Understanding high-level Cassandra architecture
o Peer-to-Peer design, Network topology, Virtual Node, Components of
Cassandra, Partitioner and Replication, Memtables and SSTables, Bloom Filters, Managers
and Services, Cassandra read and write process, Failure scenario
Apache Cassandra:
Installation & Configuration
• Versions of Apache Cassandra
• Understanding Pre-requisites for Installation
• Installing Cassandra
• Linux Commands to auto start Apache Cassandra
• Logging setup in Cassandra
• Understanding Replication Factor
• Cassandra Cluster
• Miscellaneous settings
Understanding Apache Cassandra Data Model:
• Introduction to Data Model
• Design between RDBMS and Cassandra
• Understanding Cassandra API (CQL API and Thrift API)
Cassandra Monitoring Tools:
• Introduction of Monitoring Tools
• Cluster Statistics
i. nodetool
ii. JConsole
iii. Table Statistics
• Table Statistics
• Thread Pool
• Compaction Metrics
Cassandra Cluster:
• Introduction to Cluster
• Layers of Cassandra Cluster
o Node Cluster
o Keyspace
o Column Families
o Rows
o Column
• Cluster Builder
Cassandra CQLSH
• Introduction to CQL
• Documented Shell Commands:
o Help, Version, Color, No Color
o DEBUG, Execute, File, U, P
o Exit, Describe, Expand etc.
• CQL: Data Definition Commands:
o Create Keyspace
o Use Keyspace
o Alter Keyspace
o Drop Keyspace
o Create Table
o CRUD Operation
o Alter Table
o Add Column to a table
o Drop a Column
o Truncate Table
o Drop Table
• CQL: Data Manipulation Commands:
o Insert Command
o Update Command
o Delete Command
o Batch Command
• CQL Clauses:
o Select
o where
o Order by
• Cassandra Data Types
• Built-in
o (boolean, blob, ascii, bigint, counter, decimal, double, float, inet, int, text, varchar,
timestamp, varint etc.)
• Collection Data Types
o List (Create, Insert, Update, Verify)
o Map (Create, Insert, Update, Verify)
o Set (Create, Insert, Update, Verify)
• User Defined Data Type
o Create
o Alter
o Add
o Drop
o Describe
• Database User and Roles
• Control Commands
• Complex query
• Built-in and User Defined Functions
• Run CQL Scripts from the command line
• JSON support
Indexes and Composite Columns:
• Overview of Index and benefit:
o Understanding Index
o Create Index
o Drop Index
• Index on Distributed Database
• Clustered Indexes vs Non-Clustered Indexes
• Secondary Index
• Composite Columns
• Data Partitioning
• Data Colocation
Cassandra Interfaces:
• Java interfaces to connect Cassandra
• ODBC interface to connect Cassandra
Module 2: Data Analytics & Machine Learning

Basic Concept of Data Analytics & Data Manipulation in R
• Introduction to Data Analytics
• Basic Features of R, Installation of RStudio and method of accessing through URL
• Basic Data Sets: Vector, Matrices, List, Array, Factors
• Data Frames, Data Types, Operators, Basic Constructs, R Functions, String Handling, R Packages
• Data Reshaping, Data Pipelines and Data Manipulation
Python Basics
• Features of Python
• Basic Syntax, Variables and Data Types, Operators
• Conditional Statements, Loops, Functions, File Handling in Python
OOPs Concept & Exception Handling in Python
• Concept of class, object and instances, Constructor, Inheritance
• Programming using OOPs support, Exception Handling
Understanding Data Frame in Python
• Working with Pandas data structures
• Series and Data Frames
• Accessing data: indexing, slicing, Boolean indexing, dropping
• Import from and Export to .csv Files, Output to and Input from Excel Files, selecting,
creating and combining rows and columns
• Pandas: XLS
• Pandas: JSON
• Missing Values
• Data Aggregation, group by etc.
• Reshaping and transforming data
Data Visualization using Python
• Understanding Data Visualization
• Using Python libraries for visualization: Matplotlib, Seaborn, plotly, ggplot, Geoplotlib, gleam etc.
• Pie Chart, Histogram, Box Plot and other visualisation techniques
Inferential Statistics in Python
• Introduction to Inferential Statistics
• Random Variable
• Measures of Central Tendency, SciPy package
• Understanding Mathematical Expectation
• Distribution Functions (Discrete and Continuous)
o Binomial, Poisson
o Normal Distribution
• Constructing a Statistical Model
• Fitting a Model to given data
• Testing of Hypothesis:
o Introduction to ToH
o Understanding Null and Alternative Hypotheses
o Critical Region
o Level of Significance, P Value
o T Test
o Z Test
o Goodness of Fit Test
o Chi Square Test
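To give a flavour of the Z Test listed under Testing of Hypothesis, here is a minimal sketch of a two-sided one-sample z-test. It uses only the Python standard library rather than the SciPy package named in the syllabus, and the sample values, population mean and population standard deviation are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, population_mean, population_sd):
    """Two-sided z-test of H0: the sample comes from a population with the given mean.

    Assumes the population standard deviation is known.
    """
    standard_error = population_sd / sqrt(len(sample))
    z = (mean(sample) - population_mean) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: ten observations, tested against a mean of 50 with known sd 4.
scores = [52, 48, 51, 49, 55, 53, 50, 54, 52, 56]
z, p_value = one_sample_z_test(scores, population_mean=50, population_sd=4)

alpha = 0.05  # level of significance
print(f"z = {z:.2f}, p = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With these made-up numbers the p-value is above 0.05, so the null hypothesis is not rejected at the 5% level of significance.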
Time Series Analysis using Python
• Time Series Introduction, Understanding Time Series Data
• Importing Time Series Data in Python
• Working with Time Series Libraries like autots, tsfresh, darts, atspy etc.
• Understanding Panel Data
• Visualisation of Time Series Data
• Patterns in a Time Series
• Additive and Multiplicative Time Series
• Decomposing a Time Series into its components
• Working with Seasonal & Non-seasonal Time Series
• Working with Stationary and Non-Stationary Time Series
• Test for Stationarity of Time Series Data
o Augmented Dickey-Fuller (ADF) Test
o Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
o Phillips-Perron (PP) Test
• Understanding noise in Time Series Data
o Understanding white noise and stationary series
• De-trending Time Series Data
• De-seasonalising Time Series Data
• Test for Seasonality of Data
• Handling Missing Values in Time Series Data
o Backward Fill
o Linear Interpolation
o Quadratic Interpolation
o Mean of Nearest Neighbours
o Mean of seasonal counterparts
• Smoothing Time Series Data
• Autocorrelation & Partial Autocorrelation Function
• Lag Plots
• Forecasting Time Series Data
• Causality Test for Time Series Data
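Two of the missing-value techniques named in this unit, Backward Fill and Linear Interpolation, can be sketched in a few lines of plain Python. This is an illustration only, with hypothetical function names and made-up readings; in coursework the equivalent pandas operations (`Series.bfill()` and `Series.interpolate()`) would be the more usual choice.

```python
def backward_fill(values):
    """Replace each missing value (None) with the next observed value to its right."""
    filled = list(values)
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen  # stays None if nothing is observed later
        else:
            next_seen = filled[i]
    return filled


def linear_interpolate(values):
    """Fill interior gaps on a straight line between the neighbouring observations.

    Gaps before the first or after the last observation are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical equally spaced readings with gaps marked as None.
readings = [10.0, None, None, 16.0, None, 20.0]
print(backward_fill(readings))       # [10.0, 16.0, 16.0, 16.0, 20.0, 20.0]
print(linear_interpolate(readings))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Backward fill repeats the next known reading, while linear interpolation assumes the series changes at a constant rate across each gap.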
Machine Learning
Introduction to Machine Learning:
• Basic Concepts of Machine Learning
• End-to-end Process of Investigating Data through a Machine Learning Lens
• Application of Machine Learning
Types of Machine Learning
Supervised
• Classification
• Regression:
o Linear Regression
o Generalized Linear Regression
o Logistic Regression
o Multiple Regression
o Poisson Regression
Unsupervised
• Clustering
o The k-Means Clustering
o The k-Medoids Clustering
o Hierarchical Clustering
o Density-based Clustering
• Dimensionality Reduction
o Principal Component Analysis
o K-nearest neighbour
o Discriminant Analysis
• Anomaly Detection
o KNN
• Outlier Detection, Association Rules
o Basics of Association Rules
o Association Rule
• Mining
o Text Mining
• Tree, Decision Tree, Splits, Entropy etc.
Neural Networks
• Introduction
o Understanding Neural Networks
o Building a simple Neural Network in Python
o Multiple Inputs & Outputs
o Use of NumPy to build a Neural Network
• Updating Weights in the Simplest Neural Network
o Simple Error analysis
o Working with 1 attribute
o Small Steps
o Extending the Simplest Neural Network to Multiple Inputs
o Extending to Multiple Outputs
o Combining Multiple Inputs and Outputs
• Extending the Neural Network to Complete Data Sets
o Extending the Neural Network to Use Multiple Samples
o Goodness of Fit Parameters
• Perceptron Learning & Binary Classification
• Back Propagation Learning
• Learning Feature Vectors for Words & Object Recognition
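The idea behind "Updating Weights in the Simplest Neural Network" with one attribute and small steps can be sketched as gradient descent on a single weight. The example below is a minimal illustration in plain Python, not the course's own material; the function name, starting weight, learning rate and data point are all assumptions.

```python
def train_single_weight(x, target, weight=0.5, learning_rate=0.05, steps=20):
    """Fit prediction = weight * x to one (x, target) pair by gradient descent."""
    for _ in range(steps):
        prediction = weight * x
        error = prediction - target
        gradient = 2 * error * x  # derivative of error**2 with respect to weight
        weight -= learning_rate * gradient  # take a small step against the gradient
    return weight


# Made-up example: the weight that maps x = 2.0 exactly onto 1.6 is 0.8.
learned = train_single_weight(x=2.0, target=1.6)
print(round(learned, 3))  # 0.8
```

Each step shrinks the remaining error by a constant factor here, which is why twenty small steps are enough to land very close to 0.8.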
Deep Learning Applications

• Artificial Neural Networks
• Introduction to Keras for Classification and Regression in Typical Data Science
Problems
• Creating a Neural Network, Training Models and Monitoring
• Introducing TensorFlow
• Neural Networks using TensorFlow, Debugging and Monitoring, Convolutional
Neural Networks
• CNN using TensorFlow
• Unsupervised Learning
• Working with PyTorch
• Case Studies
• Cassandra Python Connectivity
Module 3: Big Data Analytics
Introduction of Big Data Analytics
Introduction to Big Data:
• Big Data for Data Engineering
• Big Data Introduction
• Attributes of Big Data
• Other technologies vs Big Data
• Big Data & Data Science
• Processing Big Data
Introduction to Hadoop:
• Introduction to Hadoop Ecosystem
• Compare Hadoop vs. traditional systems
• Hadoop Architecture
• Understanding HDFS
Configuring Hadoop:
• Installing Hadoop
• Standalone Mode
• Pseudo Distributed Mode
• Fully Distributed Mode
• Understanding Hadoop Cluster
• Monitoring the Cluster Health
• Starting and Stopping the Nodes
HDFS Architecture
• Distributed Processing System
• Core Components of Hadoop
• HDFS Architecture, HDFS Design
• HDFS role in Hadoop
• Features of HDFS
• Daemons of Hadoop and their functionality: NameNode, DataNode
• Secondary NameNode
• Job Tracker, Task Tracker
• Anatomy of File Write & File Read
• Network Topology
• Heartbeat Signal
• How to Store the Data into HDFS
• How to Read the Data from HDFS
• CLI commands (Hadoop FS shell)
• Hadoop Administration & Admin Commands
Hadoop MapReduce using Python
• Concepts of HDFS Java API
• Overview of MapReduce Framework
• MapReduce Architecture & Daemons
• Job Tracker and Task Tracker
• YARN and its Processing Application
• YARN MR Application Execution Flow
• Data Flow in MapReduce
• Introduction to Hadoop Streaming
• Streaming Command Options
• Use cases of MapReduce, Anatomy of MapReduce Program
• Basic MapReduce API Concepts
• Writing MapReduce Driver, Mappers and Reducers, Unit Testing MapReduce Programs
etc., Case Studies
• Generic Command Options
• Basic MapReduce Sample Program-1
• Basic MapReduce Sample Program-2
• Chaining of MR Jobs
• Custom Combiner
• GenericOptionsParser
• Analysis of IRIS dataset
• Built-in and Custom Counters in Hadoop
• Custom Partitioner
• Hadoop Sequence File Format
• Read Write Sequence File
• Hadoop Data Types
• Processing of XML File
• Data Compression with Hadoop
• Data Serialization
• Use Cases
• Integration of Cassandra and Hadoop
• Hadoop MapReduce using Cassandra
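The mapper and reducer roles covered in this unit can be illustrated with a word count, the usual first MapReduce program. The sketch below simulates the map, shuffle-and-sort and reduce phases inside one Python process so that it runs without a cluster; the function names are illustrative. With Hadoop Streaming itself, the mapper and reducer would instead be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each word.

    Expects pairs sorted by key, which is the order Hadoop's
    shuffle-and-sort step delivers them to a reducer.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce locally."""
    return dict(reducer(sorted(mapper(lines))))


print(run_local(["big data", "Big Data analytics"]))
# {'analytics': 1, 'big': 2, 'data': 2}
```

Keeping the mapper and reducer as plain generator functions makes them easy to unit test before they are wrapped into streaming scripts.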
Apache HBase
• HBase Introduction
• HBase vs RDBMS (fixed vs flexible schema)
• Understanding HBase configuration files and their configuration in the Hadoop Ecosystem
o HBase run modes: Standalone and Distributed
o Understanding and working with ZooKeeper, Master and Region servers in
fully distributed mode
• HBase Architecture and Components
• HBase Data Model
• Understanding Conceptual View, Physical View, Namespace, Table, Row, Column
Family, Cells, Data Model Operations, Versions, Sort Order, Column Metadata, Joins,
ACID etc.
• HBase commands
o General Commands:
Status, Version, Table_help (scan, drop, get, put, disable, etc.), whoami etc.
o Table Management Commands:
Create, List, Describe, Disable, Disable_all, Enable, Enable_all, Drop, Drop_all, Show_filters,
Alter, Alter_status etc.
o Data Manipulation Commands:
count, put, get, delete, deleteall, truncate, scan etc.
o Cluster Replication Commands:
add_peer, remove_peer, start_replication, stop_replication etc.
o Understanding TTL (Time to Live)
• HBase Constraints
• Case Study - Log Data and Timeseries Data
• Case Study - Customer/Order
• HBase and MapReduce
• HBase Backup and Restore
Scala, Apache Spark, Kafka & Flume
• Introduction to Scala
• Scala REPL ("Read-Evaluate-Print-Loop")
• Basic Scala Operations
• Variable Types in Scala
• Control Structures in Scala
• Functions and Procedures
• Collections in Scala
• Array, ArrayBuffer, Map, Tuple, Lists etc.
• Object Oriented Programming and Functional Programming Concepts in Scala
• Methods, classes and objects in Scala
• Packages and package objects
• Traits and trait linearization
• Java Interoperability
• Introduction to functional programming
• Functional Scala for data science
• Importance of functional programming and Scala for learning Spark
• Pure functions and higher-order functions
• Using higher-order functions
• Error handling in functional Scala
• Functional programming and data mutability
• Scala Collection API & Scala Implicits
• Introducing Apache Spark
• Introduction of Apache Spark: Need of Apache Spark, Features of Apache Spark
• Understanding concept of Spark Cluster Modes on YARN
• Apache Spark Installation and Configuration
• Understanding Spark Cluster Modes on YARN
• Spark Applications
• The backbone of Spark - RDD (Resilient Distributed Dataset)
• Loading Data
• What is Lambda
• Using the Spark shell
• Actions and Transformations
• Associative Property
• Implant on Data
• Persistence
• Caching
• Loading and Saving data
• Operations of RDD
o Challenges in Existing Computing Methods
o Probable Solution & How RDD Solves the Problem
o Introduction to RDD, its operations, Transformations & Actions
o Data Loading and Saving Through RDDs
o Key-Value Pair RDDs, Other Pair RDDs, Two Pair RDDs
o RDD Lineage & RDD Persistence
o WordCount Program Using RDD Concepts
o RDD Partitioning to achieve Parallelization
o Using Accumulators
o Creating custom Accumulators
o Using Broadcast variables
o Passing Functions to Spark
• Data Frames and Spark SQL
o Introduction to Spark SQL & its architecture
o SQL Context in Spark SQL
o User Defined Functions
o Data Frames & Datasets
o Interoperating with RDDs
o JSON and Parquet File Formats
o Loading Data
o Spark-Hive Integration
• Writing and deploying Spark applications
o Creating the SparkContext
o Building a Spark Application using Scala
o The Spark Application Web UI
o Configuring Spark Properties
o Running Spark on Cluster
o Executing Parallel Operations
o Understanding Stages and Tasks
o Case Study
• Machine Learning using MLlib
o Introduction to MLlib
o Features of MLlib and MLlib Tool
o Various ML algorithms supported by MLlib
o Optimization Techniques
o Supervised Learning
▪ Linear Regression
▪ Logistic Regression
▪ Decision Tree
▪ Random Forest
o Unsupervised Learning
▪ K-Means Clustering
o Spark Algorithms for Machine Learning
o PySpark MLlib
o Integration with Hadoop, PySpark Environment and Configuration
