DS321 3
Course Prospectus
NSQF Aligned
Index
Topic
Eligibility
Number of Seats
Registration
Admission
Important Dates
Course Name: PG Program in Data Engineering
Course Code: DS321
NSQF Level: 06
Preamble:
Data Science is the extraction of knowledge from large volumes of structured or unstructured
data, and is a continuation of data mining and predictive analytics. It involves different
categories of analytical approaches for modelling various types of business scenarios and arriving at
solutions and strategies for optimal decision-making in marketing, finance, operations,
organizational behaviour and other managerial aspects. This new field of study breaks down into a
number of different areas, from constructing big data infrastructure and configuring the various
server tools that sit on top of the hardware, to performing the analysis and developing the right
transformations to generate useful results.
The PG Program in Data Engineering is a 7-month program (960 hours), with 3 hours of theory and
5 hours of practical work per day. Offered by NIELIT Chennai, it is an excellent blend of knowledge
and practice in the field of Data Science and its industrial applications. The program is targeted at
creating qualified Data Science Engineers. The course progresses through the Operating System,
concepts of Data and its storage, programming for data science, and Big Data Technology and its
implementation. Various advanced tools such as R and Python, along with MySQL, Apache
Cassandra, Java Programming and the Hadoop Framework, are used to solve critical business and
analytic problems.
The program also offers six weeks of hands-on, real-life analytical projects so that participants are
equipped with strong analytical and programming skills, which puts them in high demand and makes
them employable on completion of the program. The course has been designed after a proper industry
survey and consultation with multiple industry leaders to ensure that participants learn exactly what
employers need.
The objective of this program is to prepare participants to take up roles such as Statistical Analyst,
Data Scientist, Data Analyst, Big Data Engineer and Hadoop Developer. There is a huge demand for
skilled manpower in Data Science and a huge shortage of Data Science professionals worldwide, so
candidates who are interested in pursuing a career in this field need to be trained. Our objective is
to create a pool of talent who can meet this demand. This course is meant to sensitize students to the
applications and usage of computational statistics, as well as to provide hands-on experience in
solving real-world data science issues.
Outcome of the Course:
On completion of the course, participants will have learned the concepts of Data Analytics using open
source statistical tools such as R and Python, along with visualization tools and techniques. They
will be able to implement an industry-oriented Data Analytics project.
Course Structure
This course comprises three modules. After completing the three modules, students have to do
a 120-hour project using any of the topics studied in the course.
Tuition Fee
(Including NSQF Registration & Examination Fee)
(Non-refundable if a candidate is selected for admission but does not join, or if
a candidate has applied but is not eligible.)
However, the registration fee shall be refunded in a few special cases, as given below:
1) The candidate is eligible but not selected for admission.
2) The course is postponed and the new date is not convenient for the student.
3) The course is cancelled.
Eligibility
S. No. | Academic/Skill Qualification (with Specialization, if applicable) | Required Experience (with Specialization, if applicable)
Registration Procedure
All interested candidates are required to fill in the online Registration form with all the necessary
information and pay the registration fee of Rs. 1,000/- (wherever applicable).
Selection Criteria:
Selection of candidates will be based on their marks in the qualifying examination, subject to eligibility
and availability of seats.
The first list of provisionally selected candidates will be published on the NIELIT Chennai
website (www.nielit.gov.in/chennai/index.php) on 13-12-2024 by 5:00 PM. In case of
vacancies, an additional selection list will be prepared and the selection will be intimated
by email only.
The following documents of candidates will be verified:
• Qualifying Degree (Consolidated Marksheet/Degree Certificate/Course Completion Certificate), 10th and 12th mark sheets.
• One passport size photograph.
• Self-attested copy of Govt. issued photo ID card.
• Aadhaar copy.
All provisionally selected candidates must pay the first instalment of Rs. 22,000/- on or
before 16-12-2024 by the payment mode mentioned above.
Selected candidates are requested to upload proof of fee remittance on the registration
portal and also to send the proof by email to
[email protected]/[email protected]/[email protected]
Admission:
All provisionally selected candidates who have paid the fees (full or first instalment) will, once the
payment is verified by the accounts section of NIELIT Chennai, receive a welcome message at the
login ID provided during registration.
Course Timings:
This program is practice-oriented, so there will be more lab sessions than
theory classes. Classes and labs are online and cloud-based, from 10 am to 5:30 pm,
Monday to Friday. Any 4 hours within this window can be fixed as class timings, according to
the candidates' convenience and the faculty's availability, and students can do
their lab work in the remaining time.
Course enquiries
Students can enquire about the various courses either by telephone or in
person between 9.15 A.M. and 5.15 P.M. (lunch break 1.00 P.M. to 1.30 P.M.), Monday to
Friday.
Placement:
Students who complete the course successfully and qualify will be given placement
guidance and career counselling to assist them in their interviews.
Important Dates
Total 700
Theory Papers
Theory 1 – Configuring Platform for Data Engineering (Basic Linux, Java & Data
Warehousing)
Theory 2 – Data Analytics and Machine Learning
Theory 3 – Big Data Analytics
Practical Papers
Practical 1 – Configuring Platform for Data Engineering & Machine Learning
(Basic Linux, Java, Data Warehousing & Data Analytics)
Practical 2 – Big Data Analytics
Examination Centre: NIELIT Chennai, Mode: Online
Grading Scheme
The following grading scheme (on the basis of total marks) will be followed:
Grade | Marks Range (in %)
S | >=85%
A | >=75% and <85%
B | >=65% and <75%
C | >=55% and <65%
D | >=50% and <55%
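The grading bands translate directly into a threshold lookup. The following is a minimal Python sketch, not an official NIELIT tool: the function name `grade_for` and the "No grade" label for totals below 50% are assumptions introduced for illustration, since the scheme does not name an outcome below 50%.

```python
def grade_for(percent: float) -> str:
    """Map a total-marks percentage to a letter grade using the published bands."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # (lower bound, grade) pairs, checked from the highest band downwards.
    bands = [(85, "S"), (75, "A"), (65, "B"), (55, "C"), (50, "D")]
    for lower_bound, grade in bands:
        if percent >= lower_bound:
            return grade
    # Assumption: the scheme lists no grade below 50%, so a placeholder is returned.
    return "No grade"


print(grade_for(700 * 0.8 / 700 * 100))  # 80% of the total marks falls in band A
```

Checking from the highest band downwards means each band only needs its lower bound, which mirrors the ">= lower and < upper" layout of the scheme.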
……….many more.
Detailed curriculum
Module 1: Configuring Platform for Data Engineering
Understanding Linux Environment & Basic Commands:
• Understanding Linux Environment:
• Introduction, Linux Architecture, Boot Process, Kernel, System Initialization, GUI, and
CLI (access a shell prompt and issue commands with correct syntax).
• Commands:
• File handling commands, sort, tr, cut, find, grep, egrep, using filters, cat, mkdir, who and
other basic commands; vi editor
Linux Package management and Process Monitoring:
• su login, sudo, apt-get, ps command, kill command and other related commands, single and
multi-user mode of Ubuntu.
Important Files and Directories.
BASH Scripting:
• Introduction to BASH, Variables (System & User defined), Exporting Variables, Special Shell
Variables, Control Structures, Understanding execution mode of BASH script, Arrays, Functions,
BASH debugging
Case Study:
1. BASH Script for removing missing values.
2. BASH Script for generating Frequency Distribution Table from given data (consisting of
10000 records).
3. BASH Script for removing blank lines from a file.
4. BASH Script to find frequency of a word from several files.
5. BASH Script to merge files based on some fields.
Configuring Secure Shell & LAN
• LAN: Introduction, Configure LAN on Ubuntu
• Secure Shell:
Understanding & Configuring Secure Shell, Access remote systems using ssh, SCP, Passwordless SSH,
Configure key-based authentication for SSH
User Administration
• User Management:
• Adding/Modifying/Deleting new users, Understanding User Id and other related fields.
Understanding /etc/passwd and /etc/shadow, Password Aging Policies, Switching Accounts,
sudo access
• Group Management:
• User Private Groups, Group Administration.
• Understanding SUID and SGID Executable, Sticky Bit, Default File
chmod and chown command
Virtualization
• Introduction to Virtualization
Virtual Machine installation, Configuring Virtual Machines, Install Ubuntu/CentOS systems as virtual guests,
configure systems to launch virtual machines at boot, Creating a Clone of a Virtual Machine and its restoration,
virtual LAN, Memory addressing, Paging, Memory mapping, virtual memory, complexities and solutions of
memory virtualization, VM configurations, VM migrations, Migration types and process.
Basics of Information Security & Cloud
Java for Hadoop
Java Introduction:
Concept of OOPs
Introduction to Java
Configure JAVA PATHs in PATH variable and other related places in Linux.
Features of Java
Working with Java Variables
Declaring and Initializing Variables
Primitive Data Types
Class & Object Fundamentals
Object Lifecycle
Read and Write Java Object Fields
Understanding JAR file and its working
Java Operators and Decision Constructs
Using loop Constructs in Java
while, for, switch case etc.
Array & String:
Creating and using One-Dimensional Array
Creating and using Multi-Dimensional Array
String Class and related functions.
Methods and Encapsulation:
Java Method
Static and Final Keyword
Constructors and Access Modifiers in Java
Encapsulation
Inheritance:
Polymorphism Casting and Super
Abstract Class and Interfaces
Exception Handling:
Types of Exceptions and Try-catch Statement
Throws Statement and Finally Block
Exception Classes
Creating Custom Exception Classes
Work with Selected classes
String & String Buffer
Create and Manipulate Calendar Data
Declaring and Using ArrayList
Collection Framework
Introduction to Collection Framework
Core Collection in Java
Core Collection framework
Types of Collection,
Hierarchy of Collection Framework
Commonly used methods of Collection interface
Iterator Interface
Methods of Iterator interface
File Handling and Serialization
The Classes for Input and Output
The Standard Streams
Working with File Object
o File I/O Basics, Reading and Writing to Files
o Buffer and Buffer Management
Read/Write Operations with File Channel.
Serialization
Data warehousing using MySQL
• Data warehousing concept
• Data Base Design using MySQL:
• Concept of RDBMS, Storage Engine, Structure of MySQL
• Creating Database, Data Types, Table etc.
• Relational Model and SQL:
• Relational Model, MySQL Query, Creating and Using a Database, Select, Operators, group
by, order by, Primary Key, etc.
• Database Design using the Relational Model:
• Making Relation between tables, Foreign Key, joins etc.
• Export & Import Data
Export and Import of External Data, Interacting with different tables, Backup and Recovery.
Basics of NoSQL and Apache Cassandra
• Introduction to NoSQL and Cassandra:
• Understanding NoSQL, Types of NoSQL databases, Usage of NoSQL databases,
NoSQL Eco System, Overview of Cassandra, Features of Cassandra, Cassandra Vs.
MongoDB
• Architecture of Apache Cassandra:
• Understanding high level Cassandra architecture,
• Peer-to-Peer design, Network topology, Virtual Node, Components of
Cassandra, Partitioner and Replication, Memtables and SSTables, Bloom Filters, Managers
and Services,
Cassandra read and write process, Failure scenario.
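Among the architecture topics above, Bloom filters are the easiest to illustrate in isolation: on the read path, Cassandra consults a per-SSTable Bloom filter to skip SSTables that cannot contain the requested partition key. The following is a minimal Python sketch of the data structure itself, not Cassandra's implementation; the class name, the default sizes and the use of SHA-256 with a seed prefix are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, but false positives are possible."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; real filters pack the bits.
        self.bits = bytearray(size_bits)

    def _positions(self, key: str):
        # Derive num_hashes positions by hashing the key with different seed prefixes.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("user:42")
print(sstable_filter.might_contain("user:42"))  # always True for an added key
```

A "possibly present" answer still requires reading the underlying data, which is why the filter only saves work when it answers "definitely absent".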
Apache Cassandra:
Installation & Configuration
Versions of Apache Cassandra
Understanding Pre-requisite for Installation
Installing Cassandra
Linux Commands to auto start Apache Cassandra
Logging setup in Cassandra
Understanding Replication Factor
Cassandra Cluster
Miscellaneous setting
Understanding Apache Cassandra Data model:
• Introduction to Data Model
• Design between RDBMS and Cassandra
Understanding Cassandra API (CQL API and Thrift API)
Cassandra Monitoring Tools:
• Introduction of Monitoring Tools
• Cluster Statistics
i. nodetool
ii. JConsole
iii. Table Statistics
• Table Statistics
• Thread Pool
Compaction Metrics
Cassandra Cluster:
• Introduction to Cluster
• Layers of Cassandra Cluster
o Node Cluster
o Keyspace
o Column Families
o Rows
o Column
Cluster Builder
Cassandra CQLSH
• Introduction to CQL
• Documented Shell Commands:
• Help, Version, Color, No Color
• DEBUG, Execute, File, U,P
• Exit, Describe, Expand etc.
• CQL: Data Definition Commands:
o Create Keyspace
o Use Keyspace
o Alter Keyspace
o Drop Keyspace
o Create Table
o CRUD Operation
o Alter table
Add Column to a table
o Drop a Column
o Truncate Table
o Drop Table
• CQL: Data Manipulation Commands:
o Insert Command
o Update Command
o Delete Command
o Batch Command
• CQL Clauses:
o Select
o where
o Order by
• Cassandra Data types
• Built-in
o (boolean, blob, ascii, bigint, counter, decimal, double, float, inet, int, text, varchar, timestamp,
varint etc.)
• Collection data Type
o List (Create, Insert, Update, Verify)
o Map (Create, Insert, Update, Verify)
o Set (Create, Insert, Update, Verify)
• User Defined Data type
o Create
o Alter
o Add
o Drop
o Describe
• Database User and Roles
• Control Commands
• Complex query
• Built-in and User defined Function
• Run CQL Scripts from the command line
JSON support
Indexes and Composite Columns:
• Overview of Index and benefit:
o Understanding Index
o Create Index
o Drop Index
• Index on Distributed Database
• Clustered Indexes vs Non-Clustered Indexes
• Secondary Index
• Composite Columns
• Data Partitioning
Data Colocation
Cassandra Interfaces:
• Java interfaces to connect Cassandra
ODBC interface to connect Cassandra
o T Test
o Z Test
o Goodness of fit Test
Chi Square Test
Time Series Analysis using Python
• Time Series Introduction, Understanding Time Series Data.
• Importing Time Series Data in Python.
• Working with Time Series Libraries like autots, tsfresh, dart, atspy etc.
• Understanding Panel Data.
• Visualisation of Time Series Data
• Patterns in a Time Series
• Additive and Multiplicative Time Series
• Decomposing a time series into its components.
• Working with Seasonal & Nonseasonal Time Series.
• Working with Stationary and Non-Stationary Time series.
• Test for Stationarity of Time Series Data.
o Augmented Dickey-Fuller (ADF) Test
o Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
o Phillips-Perron (PP) Test
• Understanding noise in Time Series Data.
o Understanding white noise and stationary series.
• De-trending a Time series data.
• De-seasonalise a Time Series Data
• Test for Seasonality of Data
• Handling Missing Value in Time Series Data.
o Backward Fill
o Linear Interpolation
o Quadratic Interpolation
o Mean of Nearest Neighbours
o Mean of seasonal counterparts
• Smoothening a Time Series data.
• Autocorrelation & Partial autocorrelation function.
• Lag Plots
• Forecasting a Time Series Data.
Causality Test for Time Series Data.
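Two of the missing-value techniques listed above, backward fill and linear interpolation, can be sketched with the Python standard library alone. This is an illustrative sketch, not course material: the function names and the sample series are hypothetical, and in practice a library such as pandas (for example `Series.bfill()` and `Series.interpolate()`) would normally be used.

```python
from typing import List, Optional

Series = List[Optional[float]]  # None marks a missing observation


def backward_fill(values: Series) -> Series:
    """Replace each missing value with the next observed value."""
    filled = list(values)
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):  # walk from the end to the start
        if filled[i] is None:
            filled[i] = next_seen  # stays None if nothing is observed later
        else:
            next_seen = filled[i]
    return filled


def linear_interpolate(values: Series) -> Series:
    """Fill interior gaps on a straight line between the nearest observed neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # leading and trailing gaps are left as None


series = [10.0, None, None, 16.0, None, 20.0]
print(backward_fill(series))       # [10.0, 16.0, 16.0, 16.0, 20.0, 20.0]
print(linear_interpolate(series))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Backward fill repeats a future observation, so it flattens any trend inside the gap, while linear interpolation assumes the series changed at a constant rate between the two neighbours.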
Machine Learning
Introduction to Machine Learning:
Basic Concepts of Machine Learning
End-to-end Process of Investigating Data through a Machine Learning Lens.
Application of Machine Learning.
Types of Machine Learning
Supervised
• Classification
• Regression:
o Linear Regression
o Generalized Linear Regression
o Logistic Regression
o Multiple regression
o Poisson Regression
Unsupervised
• Clustering
o The k-Means Clustering
o The k-Medoids Clustering
o Hierarchical Clustering
o Density-based Clustering
• Dimensionality Reduction
o Principal Component Analysis
o K-nearest neighbour
o Discriminant Analysis
• Anomaly Detection
o KNN
• Outlier Detection, Association Rules
o Basics of Association Rules
o Association Rule
• Mining
o Text Mining
Tree, Decision Tree, Splits, Entropy etc.
Neural Networks
• Introduction
o Understanding Neural Networks
o Building simple Neural Network in Python
o Multiple Input & Outputs
o Use of NumPy to build Neural Network
• Updating Weights in Simplest Neural Network
o Simple Error analysis
o Working with 1 attribute
o Small Steps
o Extending Simplest Neural Network to Multiple Inputs
o Extending to Multiple Outputs
o Combining Multiple Input and Outputs
• Extending Neural network to Completed Data Sets
o Extending Neural Network to Use Multiple Samples
o Goodness of Fit Parameters
• Perceptron Learning & Binary Classification
• Back Propagation Learning
Learning Feature Vectors for Words & Object Recognition.
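The "Updating Weights in Simplest Neural Network" topics above (one attribute, simple error analysis, small steps) can be illustrated with a single-weight network trained by gradient descent. This is a minimal sketch under stated assumptions: a squared-error loss, a learning rate `alpha`, and the toy data are all chosen for illustration and are not specified in the syllabus.

```python
def train_single_weight(inputs, targets, weight=0.5, alpha=0.1, epochs=20):
    """Fit prediction = input * weight by repeated small corrections to the weight."""
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            prediction = x * weight
            delta = prediction - y         # signed error for this sample
            # Gradient of the squared error w.r.t. the weight is 2 * delta * x;
            # the constant 2 is absorbed into the learning rate alpha.
            weight -= alpha * delta * x    # take a small step against the gradient
    return weight


# Toy data generated by the rule target = 2 * input (an assumption for this demo).
inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]
learned = train_single_weight(inputs, targets)
print(round(learned, 4))  # close to 2.0, the slope used to generate the targets
```

With larger inputs or a larger `alpha` the same update can overshoot and diverge, which is the usual motivation for taking small steps.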
Heartbeat Signal
How to Store the Data into HDFS
How to Read the Data from HDFS
CLI commands (Hadoop FS shell)
Hadoop Administration & Admin Commands
Hadoop MapReduce using Python
Concepts of HDFS Java API
Overview of MapReduce Framework,
MapReduce Architecture & Daemons
Job tracker and Task tracker
YARN and its Processing Application
YARN MR Application Execution Flow
Data Flow In MapReduce
Introduction to Hadoop Streaming
Streaming Command Options
Use cases of MapReduce, Anatomy of MapReduce Program
Basic MapReduce API Concepts.
Writing MapReduce Driver, Mappers, and Reducers, Unit Testing MapReduce Programs
etc., Case Studies
Generic Command Options
Basic MapReduce Sample Program-1
Basic MapReduce Sample Program-2
Chaining of MR Jobs
Custom Combiner
Generic OptionParser
Analysis of IRIS dataset
Built-in and Custom Counters in Hadoop
Custom Partitioner
Hadoop Sequence File Format
Read Write Sequence File
Hadoop Data Types
Processing of XML File
Data Compression with Hadoop
Data Serialization
Use Cases
Integration of Cassandra and Hadoop
Hadoop Map Reduce using Cassandra
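The "Hadoop MapReduce using Python" and Hadoop Streaming topics above rest on a simple contract: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer aggregates consecutive lines that share a key. The sketch below simulates that flow locally in plain Python for a word count. The function names and sample input are hypothetical, and the `sorted()` call stands in for Hadoop's shuffle-and-sort phase.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: sorted() plays the role of the shuffle-and-sort step.
sample_lines = ["big data big", "data engineering"]
for record in reducer(sorted(mapper(sample_lines))):
    print(record)
```

In an actual Streaming job the mapper and reducer would be two separate scripts reading from standard input and writing to standard output; the reducer can rely on `groupby` only because its input arrives sorted by key.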
Apache HBase
HBase Introduction
HBase vs RDBMS (fixed Vs flexible schema)
Understanding HBase Configuration files and its configuration in Hadoop Eco System
o HBase run modes: Standalone and Distributed
o Understanding and working with Zookeeper, Master and Region servers in
fully distributed Mode.
HBase Architecture and Components,
HBase Data Model
Understanding Conceptual View, Physical View, Namespace, Table, Row, Column
Family, Cells, Data Model Operations, Versions, Sort Order, Column Metadata, Joins,
ACID etc.
HBase commands
o General Commands:
Status, Version, Table_help (scan, drop, get, put, disable, etc.), whoami etc.
o Table Management Commands:
Create, List, Describe, Disable, Disable_all, Enable, Enable_all, Drop, Drop_all, Show_filters,
Alter, Alter_status etc.
o Data manipulation commands:
count, put, get, delete, deleteall, truncate, scan etc.
o Cluster Replication Commands:
add_peer, remove_peer, start_replication, stop_replication etc.
o Understanding TTL(Time to Live)
HBase Constraints
Case Study - Log Data and Timeseries Data
Case Study - Customer/Order
HBase and MapReduce
HBase Backup and Restore
Scala, Apache Spark, Kafka & Flume
Introduction to Scala
Scala REPL (“Read-Evaluate-Print-Loop”)
Basic Scala Operations
Variable Types in Scala
Control Structures in Scala
Functions and Procedures
Collections in Scala
Array, ArrayBuffer, Map, Tuple, Lists etc.
Object Oriented Programming and Functional Programming Concepts in Scala
Methods, classes, and objects in Scala
Packages and package objects
Traits and trait linearization
Java Interoperability
Introduction to functional programming
Functional Scala for data science
Importance of functional programming and Scala for learning Spark
Pure functions and higher-order functions
Using higher-order functions
Error handling in functional Scala
Functional programming and data mutability
Scala Collection API & Scala Implicits
Introducing Apache Spark
Introduction of Apache Spark: Need of Apache Spark, Feature of Apache Spark.
Understanding concept of Spark Cluster Modes on YARN.
Apache Spark Installation and Configuration
Understanding Spark Cluster Modes on YARN
Spark Applications
The backbone of Spark - RDD (Resilient Distributed Dataset)
Loading Data
What is Lambda
Using the Spark shell
Actions and Transformations
Associative Property
Implant on Data
Persistence
Caching
Loading and Saving data
Operations of RDD
o Challenges in Existing Computing Methods
o Probable Solution & How RDD Solves the Problem
o Introduction to RDD, its operations, Transformations & Actions
o Data Loading and Saving Through RDDs
o Key-Value Pair RDDs, Other Pair RDDs, Two Pair RDDs
o RDD Lineage & RDD Persistence
o WordCount Program Using RDD Concepts
o RDD Partitioning to achieve Parallelization
o Using Accumulators
o Creating custom Accumulators
o Using Broadcast variables
o Passing Functions to Spark
Data Frames and Spark SQL
o Introduction to Spark SQL & its architecture
o SQL Context in Spark SQL
User Defined Functions