IA - EDA
SEMESTER III
2. What is Data Science? Discuss the role of Data Science in various Domains.
Data science is an interdisciplinary field that combines statistics, programming, and domain knowledge to extract insights from data. It plays a major role in society because of the explosion of available information, which creates opportunities for industries to grow in their own way. Fields like healthcare, finance, media, and many others are discovering the insights hidden in big data through data science and using them for decision-making and other activities.
Search Engines:
Data science powers search engines like Google, Yahoo, and Bing.
Algorithms analyze user behavior, search queries, and content relevance to provide faster and
more accurate search results.
Data visualization is a powerful way to represent information visually, making it easier for
viewers to understand patterns, trends, and insights. Let’s explore several essential data
visualization techniques along with their applications:
1. Pie Chart:
Ideal for illustrating proportions or part-to-whole comparisons.
Simple and easy to read.
Best suited for audiences unfamiliar with the data.
Example: Visualizing the distribution of different product categories in sales data.
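For instance, a minimal Python sketch of such a pie chart, assuming matplotlib and made-up sales figures for four hypothetical product categories:

import matplotlib.pyplot as plt

# Hypothetical sales share of four product categories (illustrative numbers)
categories = ["Electronics", "Clothing", "Groceries", "Books"]
sales = [40, 25, 20, 15]

plt.pie(sales, labels=categories, autopct="%1.0f%%")
plt.title("Sales by product category")
plt.show()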
2. Bar Chart:
Compares categories based on a measured value.
Effective for showing differences between groups.
Commonly used for sales, survey results, and market share analysis.
3. Histogram:
Displays the frequency distribution of continuous data.
Useful for understanding data distribution and identifying outliers.
Example: Analyzing the distribution of exam scores in a class.
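A minimal Python sketch of this example, assuming matplotlib and randomly generated scores standing in for real exam data:

import matplotlib.pyplot as plt
import numpy as np

# Randomly generated scores standing in for real exam results
scores = np.random.default_rng(0).normal(loc=68, scale=12, size=60)

plt.hist(scores, bins=10, edgecolor="black")
plt.xlabel("Exam score")
plt.ylabel("Number of students")
plt.title("Distribution of exam scores")
plt.show()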
4. Gantt Chart:
Depicts project timelines, tasks, and dependencies.
Essential for project management and scheduling.
Shows start and end dates for each task.
5. Heat Map:
Color-coded representation of data values on a grid.
Useful for visualizing correlations, patterns, and trends.
Often used in finance, biology, and geospatial analysis.
6. Box and Whisker Plot (Box Plot):
Displays the spread and central tendency of data.
Shows quartiles, outliers, and variability.
Useful for comparing distributions across different groups.
7. Waterfall Chart:
Illustrates cumulative effects of positive and negative values.
Commonly used for financial analysis and budgeting.
8. Area Chart:
Shows quantity over time.
Useful for visualizing trends and cumulative data.
Example: Tracking website traffic over months.
9. Scatter Plot:
Represents relationships between two continuous variables.
Helps identify correlations or clusters.
Used in scientific research, finance, and social sciences.
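A minimal Python sketch of a scatter plot, assuming matplotlib and made-up advertising-spend and sales figures with a rough linear relationship:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Made-up advertising spend and sales figures (illustrative only)
ad_spend = rng.uniform(10, 100, 80)
sales = 3.5 * ad_spend + rng.normal(0, 25, 80)

plt.scatter(ad_spend, sales, alpha=0.7)
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.title("Advertising spend vs. sales")
plt.show()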
10. Pictogram Chart:
Uses icons or symbols to represent quantities.
Engaging and intuitive for conveying information.
Example: Showing population sizes of different countries using flags.
11. Timeline:
Displays chronological events or milestones.
Useful for historical data, project timelines, and personal achievements.
12. Highlight Table:
Emphasizes specific data points within a table.
Useful for highlighting key metrics or outliers.
13. Bullet Graph:
Combines bar chart and reference lines.
Efficiently communicates performance against targets.
Commonly used in dashboards and KPI tracking.
14. Choropleth Map:
Represents data by shading regions on a map.
Useful for visualizing geographic patterns (e.g., population density, election results).
15. Word Cloud:
Displays word frequency using font size.
Great for summarizing text data (e.g., customer reviews, social media posts).
16. Network Diagram:
Shows relationships between nodes (entities).
Used in social network analysis, organizational structures, and flowcharts.
17. Correlation Matrix:
Displays pairwise correlations between variables.
Helps identify strong or weak relationships.
Commonly used in finance, biology, and machine learning.
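As an illustration of the last technique, a minimal Python sketch that computes a correlation matrix and draws it as a heat map, assuming pandas, seaborn, and a small made-up dataset:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Small made-up dataset with two related columns and one unrelated column
df = pd.DataFrame({"height": rng.normal(170, 10, 100)})
df["weight"] = 0.9 * df["height"] + rng.normal(0, 5, 100)
df["age"] = rng.integers(18, 60, 100)

corr = df.corr()  # pairwise Pearson correlations between the columns
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix")
plt.show()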
INTERNAL ASSIGNMENT SET - 2
4. What is feature selection? Discuss any two feature selection techniques used to get
optimal feature combinations.
Feature selection is the process of choosing, from all available variables, the subset of features that contributes most to the model. In machine learning, before training a model it is necessary to obtain a set of useful features: some candidate features are informative and some are not, and increasing the number of features does not necessarily improve the model's performance. Sometimes it only increases model complexity and reduces performance. The main families of feature selection techniques are:
a. Filter Method
b. Wrapper Method
c. Embedded Method
d. Hybrid Method
1. Filter Methods:
o Filter methods evaluate features based on their intrinsic properties using univariate
statistics. These methods are computationally efficient and faster than other
techniques.
o They do not rely on cross-validation performance but instead focus on individual
feature characteristics.
o Common filter methods include:
▪ Information Gain: This technique calculates the reduction in entropy when
transforming a dataset. It assesses the information gain of each variable
concerning the target variable.
▪ Chi-square Test: Used for categorical features, the chi-square test evaluates
the independence between features and the target variable.
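A minimal sketch of the two filter techniques above, using scikit-learn's SelectKBest on the built-in Iris dataset (the dataset and k=2 are only illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Chi-square filter: scores dependence between each feature and the target
chi2_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square scores:", chi2_selector.scores_)

# Information-gain style filter using mutual information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print("Mutual information scores:", mi_selector.scores_)

# Keep only the k best features according to the chi-square scores
X_reduced = chi2_selector.transform(X)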
2. Wrapper Methods:
o Wrapper methods are more computationally intensive because they explore various
feature combinations. These methods use model performance (e.g., accuracy, F1-
score) as the evaluation criterion.
o They are often described as “greedy” algorithms because they add or remove one
feature at a time while searching for the optimal feature subset.
o Some common wrapper methods include:
▪ Forward Feature Selection: An iterative approach where we start with an
empty set of features and gradually add the best-performing features against
the target variable.
▪ Backward Feature Elimination: The opposite of forward selection, this
method begins with all features and iteratively removes the least relevant
ones.
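A minimal sketch of forward selection and backward elimination, assuming scikit-learn's SequentialFeatureSelector with a logistic regression model (the model, the Iris dataset, and the target of two features are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and add the best feature each step
forward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="forward", cv=5).fit(X, y)
print("Forward-selected features:", forward.get_support())

# Backward elimination: start from all features and drop the weakest each step
backward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="backward", cv=5).fit(X, y)
print("Backward-selected features:", backward.get_support())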
3. Embedded Methods:
o Embedded methods combine the benefits of the filter and wrapper methods while
keeping the computational cost reasonable. At each iteration of model training, they
identify the features that contribute most to that training.
o Some common embedded techniques, sketched in code after this list, include:
▪ Lasso (L1) Regularization: Adds a penalty to the model's parameters to reduce
overfitting, and has the property of shrinking some coefficients exactly to zero;
the corresponding features can then be removed.
▪ Random Forest: A bagging algorithm that aggregates many decision trees. It ranks
features by how much they decrease node impurity (Gini impurity): features that
produce the largest decrease tend to be used near the top of the trees, while those
with little effect appear near the bottom, so pruning the trees yields a subset of the
most important features.
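A minimal sketch of both embedded techniques above, assuming scikit-learn, an L1-penalised logistic regression standing in for Lasso, and the built-in breast-cancer dataset as an illustrative example:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 (lasso-style) penalty: coefficients shrunk exactly to zero can be dropped
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("Features kept by the L1 model:", np.flatnonzero(l1_model.coef_[0]))

# Random forest: Gini-impurity-based importances rank the features
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Top 5 features by importance:", np.argsort(rf.feature_importances_)[::-1][:5])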
4. Hybrid Methods:
o A hybrid method combines any two of the above approaches to obtain a better
feature selection strategy.
1. Key Concepts:
o Observed Variables (Manifest Variables): These are the measured variables
in our dataset. For example, in a psychological study, observed variables could
be test scores, personality traits, or survey responses.
o Factors: Unobserved variables that explain the correlations among observed
variables. Factors represent underlying constructs or dimensions.
o Loading: The relationship between an observed variable and a factor. Loadings
indicate how much a variable contributes to a specific factor.
o Eigenvalues: Eigenvalues represent the variance explained by each factor.
Larger eigenvalues indicate more significant factors.
o Rotation Methods: Factor rotation improves interpretability. Techniques like
varimax, quartimax, and oblique rotation adjust factor loadings.
o Common Variance (Communality): The proportion of variance in an
observed variable explained by all the factors.
o Specific Variance (Uniqueness): The unique variance in an observed variable
not explained by the factors.
2. Types of Factor Analysis:
o Exploratory Factor Analysis (EFA):
▪ Used when we don’t have specific hypotheses about the underlying
factors.
▪ EFA identifies factors without preconceived notions.
▪ Researchers explore the data to find the best-fitting model.
o Confirmatory Factor Analysis (CFA):
▪ Used when we have specific hypotheses about the underlying factors.
▪ CFA tests whether the observed variables align with the proposed factor
structure.
▪ Researchers confirm or reject a predefined model.
3. Steps in Factor Analysis:
o Data Preparation: Clean and preprocess the data (e.g., handle missing values,
normalize variables).
o Factor Extraction: Use methods like principal component analysis (PCA) or
maximum likelihood to extract factors.
o Factor Rotation: Improve interpretability by rotating the factors (e.g., varimax,
oblique rotation).
o Interpretation: Examine factor loadings, eigenvalues, and communalities to
understand the factors’ meaning.
o Model Fit Assessment: In CFA, assess how well the proposed model fits the
data using goodness-of-fit indices.
4. Applications:
o Psychology: Identifying latent personality traits, intelligence factors, or
emotional dimensions.
o Marketing: Understanding consumer preferences and brand loyalty.
o Finance: Analyzing risk factors in stock returns.
o Healthcare: Identifying underlying health conditions from symptoms.
o Education: Evaluating the effectiveness of educational programs.
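A minimal sketch of the extraction and rotation steps described above, assuming scikit-learn's FactorAnalysis (which supports a varimax rotation) and the built-in Iris measurements as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # data preparation: standardise variables

# Factor extraction with a varimax rotation for easier interpretation
fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(X)

print("Factor loadings (rows = observed variables):")
print(fa.components_.T)
print("Specific variances (uniqueness):", fa.noise_variance_)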
Principal Component Analysis (PCA):
o Unsupervised Technique:
▪ PCA is an unsupervised method that focuses solely on the data’s
structure without considering class labels.
o Objective:
▪ The primary goal of PCA is to reduce dimensionality while
preserving as much variance as possible.
o Variance Retention:
▪ PCA constructs new variables (principal components) that are linear
combinations of the original features.
▪ These components capture the maximum variance present in the dataset.
▪ The first principal component explains the most variance, followed by
the second, and so on.
o Use Cases:
▪ Dimensionality Reduction: PCA reduces the number of features while
retaining critical information.
▪ Visualization: It simplifies high-dimensional data for visualization.
o Classification:
▪ PCA does not focus on class separability; it aims purely at variance
retention.
▪ It does not consider class labels during transformation.
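A minimal sketch of PCA for dimensionality reduction, assuming scikit-learn and the Iris measurements (the labels are deliberately ignored, since PCA is unsupervised):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)        # class labels are ignored by PCA
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)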
Linear Discriminant Analysis (LDA):
o Supervised Technique:
▪ LDA is a supervised method that considers class labels during
dimensionality reduction.
o Objective:
▪ LDA aims to optimize the separability between classes while reducing
dimensionality.
o Separation of Classes:
▪ LDA constructs a new linear axis (discriminant) that maximizes the
distance between class means while minimizing the within-class
variance.
▪ It seeks to find the best subspace for distinguishing classes.
o Use Cases:
▪ Classification: LDA is primarily used for classification tasks.
▪ Feature Extraction: It combines dimensionality reduction with class
separability.
o Difference from PCA:
▪ LDA considers class information, whereas PCA does not.
▪ LDA aims at both dimensionality reduction and class discrimination.
▪ LDA identifies the most discriminative features for classification.
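A minimal sketch of LDA on the same kind of data, assuming scikit-learn's LinearDiscriminantAnalysis; here the class labels are used, which is the key difference from the PCA sketch above:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # LDA uses the class labels

# Project onto at most (n_classes - 1) discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# The fitted model can also classify samples directly
print("Training accuracy:", lda.score(X, y))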
Summary: PCA is an unsupervised technique that reduces dimensionality by retaining the directions of maximum variance, whereas LDA is a supervised technique that reduces dimensionality while maximising the separation between classes. PCA is therefore preferred for general compression and visualization, and LDA when the end goal is classification.