IA - EDA
SEMESTER III
2. What is Data Science? Discuss the role of Data Science in various Domains.
Data science is an interdisciplinary field that combines statistics, programming, and domain knowledge to extract insights from data. It plays a major role in society because of the explosion of available information, which creates opportunities for industries to grow in their own way. Fields like healthcare, finance, media, and many others are discovering the insights hidden in big data through data science and using them for decision-making and other activities.
Search Engines:
Data science powers search engines like Google, Yahoo, and Bing.
Algorithms analyze user behavior, search queries, and content relevance to provide faster and
more accurate search results.
Data visualization is a powerful way to represent information visually, making it easier for
viewers to understand patterns, trends, and insights. Let’s explore several essential data
visualization techniques along with their applications:
1. Pie Chart:
Ideal for illustrating proportions or part-to-whole comparisons.
Simple and easy to read.
Best suited for audiences unfamiliar with the data.
Example: Visualizing the distribution of different product categories in sales data.
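For instance, a minimal Python sketch of such a pie chart, assuming matplotlib and made-up sales figures for four hypothetical product categories:

import matplotlib.pyplot as plt

# Hypothetical sales share of four product categories (illustrative numbers)
categories = ["Electronics", "Clothing", "Groceries", "Books"]
sales = [40, 25, 20, 15]

plt.pie(sales, labels=categories, autopct="%1.0f%%")
plt.title("Sales by product category")
plt.show()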
2. Bar Chart:
Compares categories based on a measured value.
Effective for showing differences between groups.
Commonly used for sales, survey results, and market share analysis.
3. Histogram:
Displays the frequency distribution of continuous data.
Useful for understanding data distribution and identifying outliers.
Example: Analyzing the distribution of exam scores in a class.
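A minimal Python sketch of this example, assuming matplotlib and randomly generated scores standing in for real exam data:

import matplotlib.pyplot as plt
import numpy as np

# Randomly generated scores standing in for real exam results
scores = np.random.default_rng(0).normal(loc=68, scale=12, size=60)

plt.hist(scores, bins=10, edgecolor="black")
plt.xlabel("Exam score")
plt.ylabel("Number of students")
plt.title("Distribution of exam scores")
plt.show()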
4. Gantt Chart:
Depicts project timelines, tasks, and dependencies.
Essential for project management and scheduling.
Shows start and end dates for each task.
5. Heat Map:
Color-coded representation of data values on a grid.
Useful for visualizing correlations, patterns, and trends.
Often used in finance, biology, and geospatial analysis.
6. Box and Whisker Plot (Box Plot):
Displays the spread and central tendency of data.
Shows quartiles, outliers, and variability.
Useful for comparing distributions across different groups.
7. Waterfall Chart:
Illustrates cumulative effects of positive and negative values.
Commonly used for financial analysis and budgeting.
8. Area Chart:
Shows quantity over time.
Useful for visualizing trends and cumulative data.
Example: Tracking website traffic over months.
9. Scatter Plot:
Represents relationships between two continuous variables.
Helps identify correlations or clusters.
Used in scientific research, finance, and social sciences.
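A minimal Python sketch of a scatter plot, assuming matplotlib and made-up advertising-spend and sales figures with a rough linear relationship:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Made-up advertising spend and sales figures (illustrative only)
ad_spend = rng.uniform(10, 100, 80)
sales = 3.5 * ad_spend + rng.normal(0, 25, 80)

plt.scatter(ad_spend, sales, alpha=0.7)
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.title("Advertising spend vs. sales")
plt.show()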
10. Pictogram Chart:
Uses icons or symbols to represent quantities.
Engaging and intuitive for conveying information.
Example: Showing population sizes of different countries using flags.
11. Timeline:
Displays chronological events or milestones.
Useful for historical data, project timelines, and personal achievements.
12. Highlight Table:
Emphasizes specific data points within a table.
Useful for highlighting key metrics or outliers.
13. Bullet Graph:
Combines bar chart and reference lines.
Efficiently communicates performance against targets.
Commonly used in dashboards and KPI tracking.
14. Choropleth Map:
Represents data by shading regions on a map.
Useful for visualizing geographic patterns (e.g., population density, election results).
15. Word Cloud:
Displays word frequency using font size.
Great for summarizing text data (e.g., customer reviews, social media posts).
16. Network Diagram:
Shows relationships between nodes (entities).
Used in social network analysis, organizational structures, and flowcharts.
17. Correlation Matrix:
Displays pairwise correlations between variables.
Helps identify strong or weak relationships.
Commonly used in finance, biology, and machine learning.
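As an illustration of the last technique, a minimal Python sketch that computes a correlation matrix and draws it as a heat map, assuming pandas, seaborn, and a small made-up dataset:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Small made-up dataset with two related columns and one unrelated column
df = pd.DataFrame({"height": rng.normal(170, 10, 100)})
df["weight"] = 0.9 * df["height"] + rng.normal(0, 5, 100)
df["age"] = rng.integers(18, 60, 100)

corr = df.corr()  # pairwise Pearson correlations between the columns
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix")
plt.show()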
INTERNAL ASSIGNMENT SET - 2
4. What is feature selection? Discuss any two feature selection techniques used to get
optimal feature combinations.
Feature selection is the process of choosing, from all available variables, the subset of features that contributes most to the model. In machine learning, before training a model it is necessary to obtain a set of useful features: some candidate features are informative and some are not, and increasing the number of features does not necessarily improve the model's performance. Sometimes it only increases model complexity and reduces performance. The main families of feature selection techniques are:
a. Filter Method
b. Wrapper Method
c. Embedded Method
d. Hybrid Method
1. Filter Methods:
o Filter methods evaluate features based on their intrinsic properties using univariate
statistics. These methods are computationally efficient and faster than other
techniques.
o They do not rely on cross-validation performance but instead focus on individual
feature characteristics.
o Common filter methods include:
▪ Information Gain: This technique calculates the reduction in entropy when
transforming a dataset. It assesses the information gain of each variable
concerning the target variable.
▪ Chi-square Test: Used for categorical features, the chi-square test evaluates
the independence between features and the target variable.
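A minimal sketch of the two filter techniques above, using scikit-learn's SelectKBest on the built-in Iris dataset (the dataset and k=2 are only illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Chi-square filter: scores dependence between each feature and the target
chi2_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square scores:", chi2_selector.scores_)

# Information-gain style filter using mutual information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print("Mutual information scores:", mi_selector.scores_)

# Keep only the k best features according to the chi-square scores
X_reduced = chi2_selector.transform(X)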
2. Wrapper Methods:
o Wrapper methods are more computationally intensive because they explore various
feature combinations. These methods use model performance (e.g., accuracy, F1-
score) as the evaluation criterion.
o They are often described as “greedy” algorithms because they add or remove one
feature at a time while searching for the optimal feature subset.
o Some common wrapper methods include:
▪ Forward Feature Selection: An iterative approach where we start with an
empty set of features and gradually add the best-performing features against
the target variable.
▪ Backward Feature Elimination: The opposite of forward selection, this
method begins with all features and iteratively removes the least relevant
ones.
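A minimal sketch of forward selection and backward elimination, assuming scikit-learn's SequentialFeatureSelector with a logistic regression model (the model, the Iris dataset, and the target of two features are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and add the best feature each step
forward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="forward", cv=5).fit(X, y)
print("Forward-selected features:", forward.get_support())

# Backward elimination: start from all features and drop the weakest each step
backward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="backward", cv=5).fit(X, y)
print("Backward-selected features:", backward.get_support())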
3. Embedded Methods:
o Embedded methods combine the benefits of the filter and wrapper methods while
keeping the computational cost reasonable. At each iteration of model training, they
identify the features that contribute most to that training.
o Some common embedded techniques, sketched in code after this list, include:
▪ Lasso (L1) Regularization: Adds a penalty to the model's parameters to reduce
overfitting, and has the property of shrinking some coefficients exactly to zero;
the corresponding features can then be removed.
▪ Random Forest: A bagging algorithm that aggregates many decision trees. It ranks
features by how much they decrease node impurity (Gini impurity): features that
produce the largest decrease tend to be used near the top of the trees, while those
with little effect appear near the bottom, so pruning the trees yields a subset of the
most important features.
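A minimal sketch of both embedded techniques above, assuming scikit-learn, an L1-penalised logistic regression standing in for Lasso, and the built-in breast-cancer dataset as an illustrative example:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 (lasso-style) penalty: coefficients shrunk exactly to zero can be dropped
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("Features kept by the L1 model:", np.flatnonzero(l1_model.coef_[0]))

# Random forest: Gini-impurity-based importances rank the features
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Top 5 features by importance:", np.argsort(rf.feature_importances_)[::-1][:5])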
4. Hybrid Methods:
o A hybrid method combines any two of the above approaches to obtain a better
feature selection strategy.
1. Key Concepts:
o Observed Variables (Manifest Variables): These are the measured variables
in our dataset. For example, in a psychological study, observed variables could
be test scores, personality traits, or survey responses.
o Factors: Unobserved variables that explain the correlations among observed
variables. Factors represent underlying constructs or dimensions.
o Loading: The relationship between an observed variable and a factor. Loadings
indicate how much a variable contributes to a specific factor.
o Eigenvalues: Eigenvalues represent the variance explained by each factor.
Larger eigenvalues indicate more significant factors.
o Rotation Methods: Factor rotation improves interpretability. Techniques like
varimax, quartimax, and oblique rotation adjust factor loadings.
o Common Variance (Communality): The proportion of variance in an
observed variable explained by all the factors.
o Specific Variance (Uniqueness): The unique variance in an observed variable
not explained by the factors.
2. Types of Factor Analysis:
o Exploratory Factor Analysis (EFA):
▪ Used when we don’t have specific hypotheses about the underlying
factors.
▪ EFA identifies factors without preconceived notions.
▪ Researchers explore the data to find the best-fitting model.
o Confirmatory Factor Analysis (CFA):
▪ Used when we have specific hypotheses about the underlying factors.
▪ CFA tests whether the observed variables align with the proposed factor
structure.
▪ Researchers confirm or reject a predefined model.
3. Steps in Factor Analysis:
o Data Preparation: Clean and preprocess the data (e.g., handle missing values,
normalize variables).
o Factor Extraction: Use methods like principal component analysis (PCA) or
maximum likelihood to extract factors.
o Factor Rotation: Improve interpretability by rotating the factors (e.g., varimax,
oblique rotation).
o Interpretation: Examine factor loadings, eigenvalues, and communalities to
understand the factors’ meaning.
o Model Fit Assessment: In CFA, assess how well the proposed model fits the
data using goodness-of-fit indices.
4. Applications:
o Psychology: Identifying latent personality traits, intelligence factors, or
emotional dimensions.
o Marketing: Understanding consumer preferences and brand loyalty.
o Finance: Analyzing risk factors in stock returns.
o Healthcare: Identifying underlying health conditions from symptoms.
o Education: Evaluating the effectiveness of educational programs.
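A minimal sketch of the extraction and rotation steps described above, assuming scikit-learn's FactorAnalysis (which supports a varimax rotation) and the built-in Iris measurements as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # data preparation: standardise variables

# Factor extraction with a varimax rotation for easier interpretation
fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(X)

print("Factor loadings (rows = observed variables):")
print(fa.components_.T)
print("Specific variances (uniqueness):", fa.noise_variance_)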
Principal Component Analysis (PCA):
o Unsupervised Technique:
▪ PCA is an unsupervised method that focuses solely on the data’s
structure without considering class labels.
o Objective:
▪ The primary goal of PCA is to reduce dimensionality while
preserving as much variance as possible.
o Variance Retention:
▪ PCA constructs new variables (principal components) that are linear
combinations of the original features.
▪ These components capture the maximum variance present in the dataset.
▪ The first principal component explains the most variance, followed by
the second, and so on.
o Use Cases:
▪ Dimensionality Reduction: PCA reduces the number of features while
retaining critical information.
▪ Visualization: It simplifies high-dimensional data for visualization.
o Classification:
▪ PCA does not focus on class separability; it aims purely at variance
retention.
▪ It does not consider class labels during transformation.
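A minimal sketch of PCA for dimensionality reduction, assuming scikit-learn and the Iris measurements (the labels are deliberately ignored, since PCA is unsupervised):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)        # class labels are ignored by PCA
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)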
Linear Discriminant Analysis (LDA):
o Supervised Technique:
▪ LDA is a supervised method that considers class labels during
dimensionality reduction.
o Objective:
▪ LDA aims to optimize the separability between classes while reducing
dimensionality.
o Separation of Classes:
▪ LDA constructs a new linear axis (discriminant) that maximizes the
distance between class means while minimizing the within-class
variance.
▪ It seeks to find the best subspace for distinguishing classes.
o Use Cases:
▪ Classification: LDA is primarily used for classification tasks.
▪ Feature Extraction: It combines dimensionality reduction with class
separability.
o Difference from PCA:
▪ LDA considers class information, whereas PCA does not.
▪ LDA aims at both dimensionality reduction and class discrimination.
▪ LDA identifies the most discriminative features for classification.
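A minimal sketch of LDA on the same kind of data, assuming scikit-learn's LinearDiscriminantAnalysis; here the class labels are used, which is the key difference from the PCA sketch above:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # LDA uses the class labels

# Project onto at most (n_classes - 1) discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# The fitted model can also classify samples directly
print("Training accuracy:", lda.score(X, y))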
Summary: PCA is an unsupervised technique that reduces dimensionality by retaining the directions of maximum variance, whereas LDA is a supervised technique that reduces dimensionality while maximising the separation between classes. PCA is therefore preferred for general compression and visualization, and LDA when the end goal is classification.