
NAME PRIYANGA M

ROLL NUMBER 2214509443

PROGRAM MASTER OF BUSINESS ADMINISTRATION (MBA)

SEMESTER III

COURSE CODE & NAME DADS302 EXPLORATORY DATA ANALYSIS


INTERNAL ASSIGNMENT SET - 1
1. Explain various measures of dispersion in detail using specific examples.
A measure of dispersion is a non-negative real number that quantifies how scattered the data
points in a dataset are. Measures of dispersion represent the variability in data, and their most
significant use is in understanding the distribution of data. A measure of dispersion is zero
when all the data points in a dataset are equal.

Absolute Measures of Dispersion


1. Range:
The range is the simplest measure of dispersion. It represents the difference between
the largest and smallest values in a dataset.
Formula: Range = Largest Value - Smallest Value
Example: Consider a dataset of exam scores: {75, 80, 85, 90, 95}. The range is 95 - 75 = 20.
2. Mean Deviation:
The mean deviation is the average of the absolute differences between each data point and
the mean.
Formula: Mean Deviation = (Σ|Xi - Mean|) / n
Example: For the dataset {10, 12, 15, 18, 20}, the mean is 15, so the mean deviation is
(|10-15| + |12-15| + |15-15| + |18-15| + |20-15|) / 5 = (5 + 3 + 0 + 3 + 5) / 5 = 3.2.
3. Standard Deviation:
The standard deviation measures the spread of data points around the mean.
Formula: Standard Deviation (σ) = √[Σ(Xi - Mean)² / n]
Example: Given the heights (in cm) of five students, {160, 165, 170, 175, 180}: the mean is
170, so σ = √[(100 + 25 + 0 + 25 + 100) / 5] = √50 ≈ 7.07 cm.
4. Variance:
The variance is the average of the squared deviations from the mean.
Formula: Variance (σ²) = Σ(Xi - Mean)² / n
Example: For the dataset {8, 10, 12, 14, 16}, the mean is 12, so the variance is
[(8-12)² + (10-12)² + (12-12)² + (14-12)² + (16-12)²] / 5 = 40 / 5 = 8.
5. Quartile Deviation:
The quartile deviation is half of the difference between the third quartile (Q3) and the first
quartile (Q1).
Formula: Quartile Deviation = (Q3 - Q1) / 2
Example: For the dataset {20, 25, 30, 35, 40}, taking Q1 = 25 and Q3 = 35 (one common
quartile convention for five ordered values), the quartile deviation is (35 - 25) / 2 = 5.
Relative Measures of Dispersion
Coefficient of Range
It is the ratio of the difference between the maximum and minimum values in a data set to the
sum of the maximum and minimum values.
Coefficient of Range = (Xmax – Xmin) ⁄ (Xmax + Xmin)
Coefficient of Variation
The coefficient of variation is the ratio of the standard deviation to the mean of the given data
set, expressed as a percentage.
C.V. = (S.D. / Mean) × 100
Coefficient of Mean Deviation
It is the ratio of the mean deviation to the value of the central point (average) about which it
was computed.
Coefficient of Mean Deviation = Mean Deviation / Average
Coefficient of Quartile Deviation
It is defined as the ratio of the difference between the third quartile and the first quartile to the
sum of the third and first quartiles.
Coefficient of Quartile Deviation = (Q3 – Q1) ⁄ (Q3 + Q1)
Coefficient of Standard Deviation
The coefficient of standard deviation is defined as the ratio of the standard deviation to the
mean.
Coefficient of Standard Deviation = S.D. ⁄ Mean
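
All of these measures are simple to compute programmatically. Here is a minimal Python
sketch (assuming NumPy is installed; the dataset is the exam-scores example from the range
section) that reproduces the absolute and relative measures:

```python
import numpy as np

data = np.array([75, 80, 85, 90, 95])  # exam scores from the range example
mean = data.mean()

# Absolute measures
data_range = data.max() - data.min()          # Range = 20
mean_dev = np.abs(data - mean).mean()         # Mean deviation = 6.0
variance = ((data - mean) ** 2).mean()        # Population variance = 50.0
std_dev = np.sqrt(variance)                   # Standard deviation ~ 7.07
q1, q3 = np.percentile(data, [25, 75])        # NumPy's interpolated quartiles
quartile_dev = (q3 - q1) / 2                  # Quartile deviation = 5.0

# Relative measures
coeff_range = (data.max() - data.min()) / (data.max() + data.min())
coeff_variation = std_dev / mean * 100        # expressed as a percentage
coeff_quartile_dev = (q3 - q1) / (q3 + q1)

print(data_range, mean_dev, variance, round(std_dev, 2), quartile_dev)
print(round(coeff_range, 3), round(coeff_variation, 2), round(coeff_quartile_dev, 3))
```

Note that NumPy's percentile interpolation is only one of several quartile conventions;
textbook positional formulas can give slightly different quartile values.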

2. What is Data Science? Discuss the role of Data Science in various domains.
Data science is an interdisciplinary field that combines statistics, programming, and domain
knowledge to extract insights and knowledge from data. It plays a major role in society due to
the explosion of available information, creating the opportunity for industries to grow in their
own way. Fields like healthcare, finance, media, and many others use data science to uncover
insights from big data for decision-making and other activities.
Search Engines:
Data science powers search engines like Google, Yahoo, and Bing.
Algorithms analyze user behavior, search queries, and content relevance to provide faster and
more accurate search results.

Transport and Driverless Cars:


Data science plays a crucial role in driverless cars.
Analyzing real-time data helps optimize driving decisions, handle different road situations, and
reduce accidents.

Finance and Stock Market:


Data science aids financial industries in risk analysis, fraud detection, and stock market
predictions. Historical data is analyzed to predict future stock prices and guide investment
decisions.

Healthcare and Medical Research:


Data science improves patient outcomes by analyzing electronic health records, medical
images, and clinical data.
It helps identify disease patterns, predict epidemics, and personalize treatment plans.

Retail and Recommendation Systems:


E-commerce platforms use data science to recommend products based on user preferences and
past behavior.
Personalized recommendations enhance user experience and drive sales.

Social Media and Sentiment Analysis:


Data science techniques analyze social media data to understand public sentiment, trends, and
user engagement.
Sentiment analysis helps businesses adapt their strategies.

Marketing and Customer Segmentation:


Data science segments customers based on demographics, behavior, and preferences.
Targeted marketing campaigns yield better results.

Energy and Smart Grids:


Data science optimizes energy consumption, predicts demand, and enhances grid stability.
Smart meters and sensors generate vast data for analysis.

Environmental Monitoring and Climate Change:


Data science models analyze climate data, predict extreme events, and inform policy decisions.

Manufacturing and Quality Control:


Data science ensures product quality, detects defects, and optimizes production processes.

3. Discuss various techniques used for Data Visualization.

Data visualization is a powerful way to represent information visually, making it easier for
viewers to understand patterns, trends, and insights. Let’s explore several essential data
visualization techniques along with their applications (a short code sketch follows the list):
1. Pie Chart:
Ideal for illustrating proportions or part-to-whole comparisons.
Simple and easy to read.
Best suited for audiences unfamiliar with the data.
Example: Visualizing the distribution of different product categories in sales data.
2. Bar Chart:
Compares categories based on a measured value.
Effective for showing differences between groups.
Commonly used for sales, survey results, and market share analysis.
3. Histogram:
Displays the frequency distribution of continuous data.
Useful for understanding data distribution and identifying outliers.
Example: Analyzing the distribution of exam scores in a class.
4. Gantt Chart:
Depicts project timelines, tasks, and dependencies.
Essential for project management and scheduling.
Shows start and end dates for each task.
5. Heat Map:
Color-coded representation of data values on a grid.
Useful for visualizing correlations, patterns, and trends.
Often used in finance, biology, and geospatial analysis.
6. Box and Whisker Plot (Box Plot):
Displays the spread and central tendency of data.
Shows quartiles, outliers, and variability.
Useful for comparing distributions across different groups.
7. Waterfall Chart:
Illustrates cumulative effects of positive and negative values.
Commonly used for financial analysis and budgeting.
8. Area Chart:
Shows quantity over time.
Useful for visualizing trends and cumulative data.
Example: Tracking website traffic over months.
9. Scatter Plot:
Represents relationships between two continuous variables.
Helps identify correlations or clusters.
Used in scientific research, finance, and social sciences.
10. Pictogram Chart:
Uses icons or symbols to represent quantities.
Engaging and intuitive for conveying information.
Example: Showing population sizes of different countries using flags.
11. Timeline:
Displays chronological events or milestones.
Useful for historical data, project timelines, and personal achievements.
12. Highlight Table:
Emphasizes specific data points within a table.
Useful for highlighting key metrics or outliers.
13. Bullet Graph:
Combines bar chart and reference lines.
Efficiently communicates performance against targets.
Commonly used in dashboards and KPI tracking.
14. Choropleth Map:
Represents data by shading regions on a map.
Useful for visualizing geographic patterns (e.g., population density, election results).
15. Word Cloud:
Displays word frequency using font size.
Great for summarizing text data (e.g., customer reviews, social media posts).
16. Network Diagram:
Shows relationships between nodes (entities).
Used in social network analysis, organizational structures, and flowcharts.
17. Correlation Matrix:
Displays pairwise correlations between variables.
Helps identify strong or weak relationships.
Commonly used in finance, biology, and machine learning.
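
As referenced in the introduction, here is a minimal Python sketch (assuming matplotlib and
NumPy are installed, with made-up sample data) illustrating four of the techniques above: a
bar chart, a histogram, a scatter plot, and a box plot:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Bar chart: comparing categories by a measured value
axes[0, 0].bar(["A", "B", "C"], [23, 45, 12])
axes[0, 0].set_title("Bar Chart: Sales by Category")

# Histogram: frequency distribution of continuous data
axes[0, 1].hist(rng.normal(70, 10, 200), bins=15)
axes[0, 1].set_title("Histogram: Exam Scores")

# Scatter plot: relationship between two continuous variables
x = rng.uniform(0, 10, 100)
axes[1, 0].scatter(x, 2 * x + rng.normal(0, 2, 100))
axes[1, 0].set_title("Scatter Plot: Correlated Variables")

# Box plot: spread, quartiles, and outliers across groups
axes[1, 1].boxplot([rng.normal(60, 5, 50), rng.normal(75, 12, 50)])
axes[1, 1].set_xticklabels(["Group 1", "Group 2"])
axes[1, 1].set_title("Box Plot: Group Comparison")

plt.tight_layout()
plt.show()
```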
INTERNAL ASSIGNMENT SET - 2

4. What is feature selection? Discuss any two feature selection techniques used to get
optimal feature combinations.

Feature selection is the process of choosing, from the available input variables, the subset of
features that contributes most to a model's predictive performance. In machine learning,
before training a model, it is necessary to identify useful features: some features may be
useful and some may not. Increasing the number of features does not necessarily improve a
model's performance; sometimes it only increases model complexity, and performance may
even be reduced. The main families of feature selection techniques are:

a. Filter Method
b. Wrapper Method
c. Embedded Method
d. Hybrid Method
1. Filter Methods:
o Filter methods evaluate features based on their intrinsic properties using univariate
statistics. These methods are computationally efficient and faster than other
techniques.
o They do not rely on cross-validation performance but instead focus on individual
feature characteristics.
o Common filter methods include:
▪ Information Gain: This technique calculates the reduction in entropy when
transforming a dataset. It assesses the information gain of each variable
concerning the target variable.
▪ Chi-square Test: Used for categorical features, the chi-square test evaluates
the independence between features and the target variable.
2. Wrapper Methods:
o Wrapper methods are more computationally intensive because they explore various
feature combinations. These methods use model performance (e.g., accuracy, F1-
score) as the evaluation criterion.
o They are often called “greedy” algorithms because they add or remove one
feature at a time while searching for the optimal feature subset.
o Some common wrapper methods include:
▪ Forward Feature Selection: An iterative approach where we start with an
empty set of features and gradually add the best-performing features against
the target variable.
▪ Backward Feature Elimination: The opposite of forward selection, this
method begins with all features and iteratively removes the least relevant
ones.
3. Embedded Methods:
o This approach combines the benefits of the wrapper and filter methods while
maintaining a reasonable computational cost. It evaluates features during each
iteration of model training and extracts the features that contributed most to
that training.
o Some common embedded techniques include:
▪ Lasso (L1) Regularization: It adds a penalty to the machine learning
model's parameters to reduce overfitting. It has the property of
shrinking some coefficients exactly to zero; hence, those features can
be removed.
▪ Random Forest: It is a bagging algorithm that aggregates a number of
decision trees. It ranks features by how much they decrease node
impurity (e.g., Gini impurity): features producing the largest impurity
decrease appear near the top of the trees, while less important ones
appear near the leaves. Hence, pruning the trees below a chosen depth
yields a subset of the most important features.

4. Hybrid Methods:
o In this approach, you can combine any two of the approaches above to build a
better feature selection strategy, for example, filtering first and then applying a
wrapper on the reduced set. A code sketch of the filter and wrapper methods
follows.
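
Here is a minimal sketch of the two techniques discussed in detail above: a filter method (the
chi-square test via SelectKBest) and a wrapper method (forward feature selection via
SequentialFeatureSelector). It assumes scikit-learn is installed and uses its built-in Iris
dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score each feature independently with the chi-square
# statistic against the target (chi2 requires non-negative features).
filter_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square scores:", filter_selector.scores_)
print("Filter keeps features:", filter_selector.get_support(indices=True))

# Wrapper method: forward selection, starting from an empty set and
# greedily adding the feature that most improves cross-validated accuracy.
wrapper_selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=5,
).fit(X, y)
print("Wrapper keeps features:", wrapper_selector.get_support(indices=True))
```

The filter step never trains a model, which is why it is fast; the wrapper step retrains the
classifier for every candidate feature, which is why it is more expensive but tailored to the
model being used.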

5. Discuss in detail the concept of Factor Analysis.

Factor analysis (FA) is an exploratory data analysis technique used to uncover significant
underlying factors, or latent variables, from a set of observed variables. By lowering the
number of variables, it aids the interpretation of data: it extracts the maximum common
variance from all variables and puts it into a common score.

FA is frequently used in market research, advertising, psychology, finance, and operations
research.

1. Key Concepts:
o Observed Variables (Manifest Variables): These are the measured variables
in our dataset. For example, in a psychological study, observed variables could
be test scores, personality traits, or survey responses.
o Factors: Unobserved variables that explain the correlations among observed
variables. Factors represent underlying constructs or dimensions.
o Loading: The relationship between an observed variable and a factor. Loadings
indicate how much a variable contributes to a specific factor.
o Eigenvalues: Eigenvalues represent the variance explained by each factor.
Larger eigenvalues indicate more significant factors.
o Rotation Methods: Factor rotation improves interpretability. Techniques like
varimax, quartimax, and oblique rotation adjust factor loadings.
o Common Variance (Communality): The proportion of variance in an
observed variable explained by all the factors.
o Specific Variance (Uniqueness): The unique variance in an observed variable
not explained by the factors.
2. Types of Factor Analysis:
o Exploratory Factor Analysis (EFA):
▪ Used when we don’t have specific hypotheses about the underlying
factors.
▪ EFA identifies factors without preconceived notions.
▪ Researchers explore the data to find the best-fitting model.
o Confirmatory Factor Analysis (CFA):
▪ Used when we have specific hypotheses about the underlying factors.
▪ CFA tests whether the observed variables align with the proposed factor
structure.
▪ Researchers confirm or reject a predefined model.
3. Steps in Factor Analysis:
o Data Preparation: Clean and preprocess the data (e.g., handle missing values,
normalize variables).
o Factor Extraction: Use methods like principal component analysis (PCA) or
maximum likelihood to extract factors.
o Factor Rotation: Improve interpretability by rotating the factors (e.g., varimax,
oblique rotation).
o Interpretation: Examine factor loadings, eigenvalues, and commonalities to
understand the factors’ meaning.
o Model Fit Assessment: In CFA, assess how well the proposed model fits the
data using goodness-of-fit indices.
4. Applications:
o Psychology: Identifying latent personality traits, intelligence factors, or
emotional dimensions.
o Marketing: Understanding consumer preferences and brand loyalty.
o Finance: Analyzing risk factors in stock returns.
o Healthcare: Identifying underlying health conditions from symptoms.
o Education: Evaluating the effectiveness of educational programs.
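
To make the steps above concrete, here is a minimal sketch using scikit-learn's
FactorAnalysis on standardized Iris data (an illustrative assumption; dedicated packages
such as factor_analyzer expose rotations and diagnostics more fully):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Step 1: Data preparation - standardize the observed variables
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Steps 2-3: Factor extraction with a varimax rotation for interpretability
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
scores = fa.fit_transform(X_std)   # factor scores for each observation

# Step 4: Interpretation - loadings relate observed variables to factors
loadings = fa.components_.T        # rows: variables, columns: factors
print("Loadings:\n", np.round(loadings, 2))

# Communality: variance of each variable explained by the factors
print("Communalities:", np.round((loadings ** 2).sum(axis=1), 2))
```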

6. Differentiate between Principal Component Analysis and Linear Discriminant
Analysis.

Principal Component Analysis (PCA):
o Unsupervised Technique:
▪ PCA is an unsupervised method that focuses solely on the data’s
structure without considering class labels.
o Objective:
▪ The primary goal of PCA is to reduce dimensionality while
preserving as much variance as possible.
o Variance Retention:
▪ PCA constructs new variables (principal components) that are linear
combinations of the original features.
▪ These components capture the maximum variance present in the dataset.
▪ The first principal component explains the most variance, followed by
the second, and so on.
o Use Cases:
▪ Dimensionality Reduction: PCA reduces the number of features while
retaining critical information.
▪ Visualization: It simplifies high-dimensional data for visualization.
o Classification:
▪ PCA does not focus on class separability; it aims purely at variance
retention.
▪ It does not consider class labels during transformation.
Linear Discriminant Analysis (LDA):
o Supervised Technique:
▪ LDA is a supervised method that considers class labels during
dimensionality reduction.
o Objective:
▪ LDA aims to optimize the separability between classes while reducing
dimensionality.
o Separation of Classes:
▪ LDA constructs a new linear axis (discriminant) that maximizes the
distance between class means while minimizing the within-class
variance.
▪ It seeks to find the best subspace for distinguishing classes.
o Use Cases:
▪ Classification: LDA is primarily used for classification tasks.
▪ Feature Extraction: It combines dimensionality reduction with class
separability.
o Difference from PCA:
▪ LDA considers class information, whereas PCA does not.
▪ LDA aims at both dimensionality reduction and class discrimination.
▪ LDA identifies the most discriminative features for classification.

Summary:

o PCA: Focuses on variance retention and dimensionality reduction without
considering class labels.
o LDA: Optimizes class separability while reducing dimensions, making it
suitable for classification tasks.
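
The contrast is easy to demonstrate in code. Here is a minimal scikit-learn sketch (the Iris
dataset is an illustrative assumption) that reduces the same data to two dimensions with
PCA, which ignores the labels, and with LDA, which uses them:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised - fit on X alone, maximizing retained variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised - fit on X and y, maximizing class separability
# (with 3 classes, at most n_classes - 1 = 2 discriminant axes exist)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print("PCA projection shape:", X_pca.shape)
print("LDA projection shape:", X_lda.shape)
```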
