Da Handbook
I. COURSE PURPOSE:
A growing number of organizations use their data as a decision-support tool and to build
data-intensive products and services. The collection of skills that organizations require to
support these functions has been grouped under the term “Data Analytics”. This course
covers the basic concepts of data analytics and methodologies for analyzing structured and
unstructured data, with emphasis on the relationship between the data scientist and
business needs.
II. PRE-REQUISITES:
S. No. | Course Outcomes | Bloom’s Taxonomy Levels | Program Outcomes, Program Specific Outcomes
1 | Understand the impact of data analytics for business decisions and strategy | L1-Remembering, L2-Understanding, L5-Evaluating | PO1-PO6, PO9-PO12, PSO1-PSO3
2 | Carry out data analysis/statistical analysis | L3-Applying, L5-Evaluating | PO1-PO6, PO9-PO12, PSO1-PSO3
3 | Carry out standard data visualization and formal inference procedures | L4-Analyzing, L5-Evaluating | PO1-PO6, PO9-PO12, PSO1-PSO3
4 | Design Data Architecture | L4-Analyzing, L6-Creating, L1-Remembering | PO1-PO6, PO9-PO12, PSO1-PSO3
5 | Understand various Data Sources | L6-Creating, L1-Knowledge and L3-Applying | PO1-PO6, PO9-PO12, PSO1-PSO3
V. COURSE CONTENT:
UNIT – I
Data Management: Design Data Architecture and manage the data for analysis, understand
various sources of data such as Sensors/Signals/GPS etc. Data Management, Data Quality
(noise, outliers, missing values, duplicate data) and Data Pre-processing & Processing
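The data quality issues listed above can be made concrete with a small sketch. The example below handles duplicate data, missing values and outliers in that order; noise is not addressed. The function name `clean_readings`, the median imputation, the 1.5 * IQR outlier rule and the sample readings are all assumptions chosen for illustration, not methods prescribed by the course.

```python
from statistics import median, quantiles


def clean_readings(readings):
    """Return (cleaned_rows, outlier_rows) for a list of (sensor_id, value) pairs."""
    # Duplicate data: keep only the first occurrence of each identical row.
    unique = list(dict.fromkeys(readings))

    # Missing values: replace None with the median of the observed values.
    observed = [value for _, value in unique if value is not None]
    fill = median(observed)
    cleaned = [(sid, fill if value is None else value) for sid, value in unique]

    # Outliers: flag values more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = quantiles([value for _, value in cleaned], n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [row for row in cleaned if not (q1 - spread <= row[1] <= q3 + spread)]
    return cleaned, outliers


# Hypothetical sensor readings, made up purely for illustration.
sample = [("s1", 10.0), ("s2", 12.0), ("s2", 12.0), ("s3", None),
          ("s4", 11.0), ("s5", 13.0), ("s6", 95.0)]
cleaned, outliers = clean_readings(sample)
print(cleaned)
print(outliers)
```

Flagging outliers rather than deleting them keeps the decision with the analyst, since an extreme sensor value may be a fault or a genuine event.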
UNIT – II
UNIT – III
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics
applications to various Business Domains etc.
UNIT – IV
Time Series Methods: ARIMA, Measures of Forecast Accuracy, STL approach, Extract features
from the generated model such as Height, Average Energy etc. and analyze for prediction
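As an illustration of measures of forecast accuracy, the sketch below computes three widely used ones: mean absolute error (MAE), root mean squared error (RMSE) and mean absolute percentage error (MAPE). The syllabus does not name specific measures, so this selection, the function name `forecast_errors` and the sample numbers are assumptions.

```python
import math


def forecast_errors(actual, predicted):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the actual values, so it is undefined if any actual is zero.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}


# Hypothetical actual and forecast values, for illustration only.
print(forecast_errors([100.0, 200.0], [90.0, 220.0]))
```

RMSE penalizes large errors more heavily than MAE, while MAPE expresses the error relative to the size of the actual values.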
UNIT – V
Teaching methodologies (all units): Chalk and board, PPT presentation

S.NO | WEEK | TOPICS | Course Learning Outcomes | REFERENCES
UNIT-1
1 | 1 | Design Data Architecture | Understanding and remembering the basics of data architecture | T1
2 | 1 | Design Data Architecture | Applying the definition of data mining | T1
3 | 1 | Managing Data Analysis | Understanding data analysis | T1
4 | 2 | Understanding Various Sources of Data | Analyzing the different sources of data | T1
5 | 2 | Data Management | Understanding the management | T1
6 | 2 | Data Quality | Applying the data for quality processing | T1
7 | 3 | Data Processing | Analyzing data processing | T1
8 | 3 | Data Processing | Analyzing data processing | T1
9 | 3 | Revision of Unit-1 | - | T1
10 | 4 | MOCK TEST-1 | - | -
11 | 4 | Tutorial/bridge class #1 | - | T1
UNIT-2
12 | 4 | Introduction to Data Analytics | Understanding data analytics | T1,T2
13 | 5 | Introduction to Tools and Environment | Understanding and analyzing tools environment | T1,T2
14 | 5 | Application of Modeling in Business | Understanding real-time applications | T1,T2
15 | 5 | Application of Modeling in Business | Understanding real-time applications | T1,T2
16 | 6 | Databases & Types of Data and Variables | Understanding and analyzing databases | T1,T2
17 | 6 | Data Modeling techniques | Analyzing modeling techniques | T1,T2
18 | 6 | Data Modeling techniques | Analyzing modeling techniques | T1,T2
19 | 7 | Missing Imputations | Understanding the consequences of missing imputations | T1,T2
20 | 7 | Need for Business Modeling | Creating a representation of business model | T1,T2
21 | 7 | Revision of Unit-II | - | T1,T2
22 | 8 | Tutorial/bridge class #2 | - | -
I MID EXAMINATIONS (WEEK 9)
UNIT-3
23 | 8 | Regression Concepts | Understanding the concepts | T1,T2
24 | 8 | BLUE Property Assumptions | Evaluating the assumptions | T1,T2
25 | 9 | Least Square Estimation | Understanding the algorithm | T1,T2
26 | 9 | Variable Rationalization | Creating and understanding the concepts | T1,T2
27 | 9 | Model Building | Understanding model creation | T1,T2
28 | 10 | Logistic Regression | Evaluating model theory | T1,T2
29 | 10 | Model Fit Statistics | Evaluating model theory | T1,T2
30 | 10 | Model Construction | Understanding model construction | T1,T2
31 | 11 | Analytics Applications | Understanding applications to business domains | T1,T2
32 | 11 | Tutorial/bridge class #3 | - | -
UNIT-4
33 | 11 | Regression vs Segmentation | Understanding object segmentation | T1,T2
34 | 12 | Supervised and Unsupervised Learning | Evaluating algorithms | T1,T2
35 | 12 | Tree Building - Regression | Analyzing the concepts | -
36 | 12 | Classification & Overfitting | Understanding classification methods | T1,T2
37 | 13 | Pruning and Complexity | Understanding and evaluating the complexity | T1,T2
38 | 13 | Multiple Decision Trees | Evaluating decision trees | T1,T2
39 | 13 | Time Series Methods - ARIMA | Understanding algorithms | T1,T2
40 | 14 | Measures of Forecast Accuracy | Understanding and implementing real world examples | T1,T2
41 | 14 | STL Approach | Understanding various approaches | T1,T2
42 | 14 | Extract Features from Models | Analyzing extraction | T1,T2
UNIT-5
43 | 15 | Data Visualization | Understanding, analyzing and evaluating | T1,T2
44 | 15 | Geometric Projection | Understanding, analyzing and evaluating | T1,T2
45 | 15 | Icon-Based Techniques | Understanding, analyzing and evaluating | T1,T2
46 | 16 | MOCK TEST-2 | - | T1,T2
47 | 16 | Hierarchical Visualization | Understanding, analyzing and evaluating | T1,T2
48 | 16 | Complex Data Relationships | Understanding, analyzing and evaluating | T1,T2
49 | 16 | Tutorial/bridge class #6 | - | -
50 | 16 | Tutorial/bridge class #7 | - | -
II MID EXAMINATIONS (WEEK 17)
TEXT BOOKS:
1. Student’s Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann
Publishers.
REFERENCES:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison-Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira.
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs),
Jeffrey D. Ullman (Stanford Univ.).
Program Outcomes (PO) | Level | Proficiency assessed by
PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems related to Computer Science and Engineering. | 2.5 | Lectures, Assignments, Exams
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems related to Computer Science and Engineering and reach substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences. | 1.5 | Lectures, Assignments, Exams
PO3 Design/development of solutions: Design solutions for complex engineering problems related to Computer Science and Engineering and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations. | 3 | Lectures, Assignments, Exams
PO4 Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions. | 1.5 | Lectures, Assignments, Exams
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations. | 2.5 | Lectures, Assignments, Exams
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the Computer Science and Engineering professional engineering practice. | 1 | Lectures, Assignments, Exams
PO7 Environment and sustainability: Understand the impact of the Computer Science and Engineering professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development. | - | -
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice. | - | -
PO9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings. | 1.5 | Lectures, Assignments, Exams
PO10 Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions. | 2.0 | Lectures, Assignments, Exams
PO11 Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one’s own work, as a member and leader in a team, to manage projects and in multidisciplinary environments. | 1.5 | Lectures, Assignments, Exams
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change. | 2.5 | Lectures, Assignments, Exams
VIII. HOW PROGRAM SPECIFIC OUTCOMES ARE ASSESSED:
Program Specific Outcomes (PSO) | Level | Proficiency assessed by
PSO1 Foundation of mathematical concepts: To use mathematical methodologies to crack problems using suitable mathematical analysis, data structures and suitable algorithms. | 2.5 | Lectures, Assignments, Exams
PSO2 Foundation of Computer System: The ability to interpret the fundamental concepts and methodology of computer systems. Students can understand the functionality of hardware and software aspects of computer systems. | 3.0 | Lectures, Assignments, Exams
PSO3 Foundations of Software development: The ability to grasp the software development lifecycle and methodologies of software systems. Possess competent skills and knowledge of software design process. Familiarity and practical proficiency with a broad area of programming concepts and provide new ideas and innovations towards research. | 2.0 | Lectures, Assignments, Exams
DESCRIPTIVE QUESTIONS
UNIT-1
Short Answer Questions
QUESTIONS | Blooms taxonomy level | Course outcomes
1. What is Data Management? | Understand | 1
2. What is Big Data? | Understand | 1
3. List out Enterprise Requirements. | Knowledge | 1
4. What is workplace safety? | Knowledge | 1
5. What do you understand by Big Data tools? | Analyze | 1
UNIT-3
Short Answer Questions
QUESTIONS | Blooms taxonomy level | Course outcomes
1. State the BLUE property assumptions. | Understand | 3
2. What is variable rationalization? | Knowledge | 3
3. Explain theoretically an analytics application in a business domain. | Analysis | 3
4. How is an LSE regression line calculated? | Knowledge | 3
5. Explain OLS. | Understand | 3
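For question 4, a least-squares regression line with a single predictor can be computed from the sample means: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below is illustrative only; the function name `least_squares_line` and the sample points are assumptions, not part of the course material.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope*x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy: co-variation of x and y; Sxx: variation of x around its mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
print(least_squares_line([1, 2, 3, 4], [3, 5, 7, 9]))
```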
UNIT-5
Short Answer Questions
QUESTIONS | Blooms taxonomy level | Course outcomes
1. Name some frequently used 2-D space-filling curves. | Knowledge | 5
2. What is a scatter plot and a scatter-plot matrix? | Understand | 5
3. Specify the dimensionality of Chernoff faces. | Analysis | 5
4. Write a short note on hierarchical visualization techniques. | Understand | 5
5. Explain tag cloud briefly. | Understand | 5
Answer: A
2. All of the following accurately describe Hadoop, EXCEPT:
a. Open source
b. Real-time
c. Java-based
d. Distributed computing approach
Answer: B
3. __________ has the world’s largest Hadoop cluster.
a. Apple
b. Datamatics
c. Facebook
d. None of the mentioned
Answer: C
4. What are the five V’s of Big Data?
a. Volume
b. Velocity
c. Variety
Answer: D
5. _________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.
a. Scalding
b. Cascalog
c. Hcatalog
d. Hcalding
Answer: B
UNIT-3
1) True-False: Is Logistic regression a supervised machine learning algorithm?
A) TRUE
B) FALSE
Answer: A - True. Logistic regression is a supervised learning algorithm because it uses true
labels for training. A supervised learning algorithm requires input variables (x) and a target
variable (Y) when the model is trained.
Answer: B - Logistic regression is a classification algorithm; do not be misled by the word
"regression" in its name.
Answer: A - Yes, logistic regression can be applied to a 3-class classification problem by using
the one-vs-all method.
5) Which of the following methods do we use to best fit the data in Logistic Regression?
A) Least Square Error
B) Maximum Likelihood
C) Jaccard distance
D) Both A and B
Answer: B - Logistic regression is trained using maximum likelihood estimation.
6) Which of the following evaluation metrics cannot be applied in case of logistic regression
output to compare with target?
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error
Answer: D - Because logistic regression is a classification algorithm, its output is not a
continuous real value, so mean squared error is not used for evaluating it.
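To connect questions 5 and 6, the sketch below fits a one-feature logistic regression by maximum likelihood, using plain gradient ascent on the log-likelihood, and then evaluates it with log loss. The function names, the learning rate, the number of epochs and the sample data are assumptions for illustration. Libraries normally use more efficient optimizers.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Estimate weight w and bias b by gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(w * x + b)  # gradient of the log-likelihood term
            grad_w += residual * x
            grad_b += residual
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def log_loss(xs, ys, w, b):
    """Mean negative log-likelihood of the labels under the fitted model."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


# Hypothetical, deliberately non-separable training data.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(w, b, log_loss(xs, ys, w, b))
```

Maximizing the likelihood is the same as minimizing log loss, which is why log loss appears among the applicable evaluation metrics in question 6.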
UNIT-4
1. Choose the options that are correct regarding machine learning (ML) and artificial intelligence
(AI),
Answer: (D)
Answer: A
4. Supervised learning differs from unsupervised clustering in that supervised learning requires
A) Only 1
B) Both 1 and 2
C) Only 2
D) All of the statements
Solution: (C)
Integrated: In ARIMA time series analysis, the integrated component is denoted by d. Integration
is the inverse of differencing. When d=0, the series is already stationary and no differencing
is needed. When d=1, the series is not stationary and must be differenced once to make it
stationary. When d=2, the series has been differenced twice. Differencing more than twice is
usually not reliable.
Moving average component: MA stands for moving average, and its order is denoted by q. In
ARIMA, q=1 means the model includes one lagged error term, i.e., there is autocorrelation with
the error at lag one.
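The role of d can be illustrated with a short sketch of differencing and its inverse. The helper names `difference` and `integrate` and the sample series are assumptions for illustration. A full ARIMA fit would also estimate the autoregressive and moving average terms.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: y[t] - y[t-1]."""
    result = list(series)
    for _ in range(d):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


def integrate(diffs, first_value):
    """Undo one round of differencing by cumulative summation from first_value."""
    restored = [first_value]
    for step in diffs:
        restored.append(restored[-1] + step)
    return restored


# Hypothetical series whose increments themselves keep growing.
series = [3, 5, 8, 12, 17]
print(difference(series, d=1))  # increments of the series
print(difference(series, d=2))  # increments of the increments
print(integrate(difference(series, d=1), series[0]))  # recovers the original series
```

Each round of differencing shortens the series by one observation, which is one practical reason to keep d small.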
UNIT-5
1. Worlds-within-Worlds is also known as?
a. n-Vision
b. influence graph
c. binary attribute
d. 6-D dataset
Answer: a
WEBSITES:
1. Associate Analytics – II
https://satyasai2.files.wordpress.com/2017/02/associate-analytics-m1-sh.pdf
2. Associate Analytics – III
http://jntuhsd.in/uploads/programmes/Associate_Analytics_M3_final.pdf
LIST OF TOPICS FOR STUDENT SEMINARS (Optional):
1. Data Management
2. Data Analytics
3. Regression
4. Logistic Regression
5. Object Segmentation
6. Time Series Methods
7. Data Visualization
As the name suggests (no points for guessing), this data set provides the data on all the
passengers who were aboard the RMS Titanic when it sank on 15 April 1912 after colliding
with an iceberg in the North Atlantic Ocean. It is the most commonly used and referenced
data set for beginners in data science. With 891 rows and 12 columns, this data set provides a
combination of variables based on personal characteristics such as age, class of ticket and
sex, and tests one’s classification skills.