
Ex. NO: 1 Find the skewness and kurtosis for an attribute in the given dataset.
1. Find the skewness and kurtosis for an attribute in the given dataset.
AIM:
To print and plot the first five rows, the histogram & KDE of the selected attribute, the
probability plot of the selected attribute, and the skewness and kurtosis of the selected
attribute from the Iris dataset using a Python script.
ALGORITHM:
i. Load the data
ii. Select an attribute from the data set and print the First five records
iii. Plot the Histogram and Kernel Density Estimation of the selected
attribute.
iv. Print the probability plot for the selected attribute
v. Find the skewness for the selected attribute
vi. Find the kurtosis for the selected attribute
PROGRAM:
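The full listing is not reproduced here; the following is a minimal Python sketch that follows the algorithm above (it assumes pandas, seaborn, scipy and plotly.express are available; the use of seaborn's histplot and scipy's probplot is one possible implementation, and variable names are illustrative):

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

df = px.data.iris()                 # load the Iris dataset as a pandas DataFrame
print(df.head())                    # first five records

col = df["sepal_length"]            # selected attribute

sns.histplot(col, kde=True)         # histogram with Kernel Density Estimation
plt.show()

stats.probplot(col, dist="norm", plot=plt)    # probability plot
plt.show()

print("Skewness for sepal length =", col.skew())
print("Kurtosis for sepal length =", col.kurtosis())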
SAMPLE INPUT:
Import the Iris data from ‘plotly.express’ library and name it as ‘px’
SAMPLE OUTPUT:
1. Printing the First Five Records:

2. Print the Histogram and Kernel Density Estimation for the attribute “Sepal
Length”

3. Print the probability plot of “Sepal Length”

4. Print the skewness of the attribute “Sepal Length”


Skewness for sepal length = 0.3149109566369728

5. Print the Kurtosis of the attribute “Sepal Length”


Kurtosis for sepal length = -0.5520640413156395
RESULT:
The first five rows, the histogram & KDE of the selected attribute, the probability plot of
the selected attribute, and the skewness and kurtosis of the selected attribute are
evaluated and plotted successfully using a Python script.
Ex. NO : 2 Perform the statistical analysis on the Iris data set.


Load Iris data set & print the following


 first 10 records,
 Total number of rows & columns in the data set
 Column names or data list.
 Find the mean of all the attributes

AIM:
To perform and print the basic statistical analysis such as the first 10 records, the
number of rows and columns, the column names and mean of all the attributes on
Iris data set using Python.
ALGORITHM:
i. Load the desired data set
ii. Print the first 10 records
iii. Find the length of the dataset and print the number of rows and
columns
iv. Extract the column names from the dataset and print them separately
v. Find the mean of all the numerical attributes
SAMPLE INPUT:
Import the Iris data from ‘plotly.express’ library and name it as ‘px’
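A minimal Python sketch of the program (assuming pandas and plotly.express are installed; the steps mirror the algorithm above):

import plotly.express as px

df = px.data.iris()                      # load the Iris dataset
print(df.head(10))                       # first 10 records
print(df.shape)                          # (rows, columns)
print(list(df.columns))                  # column / attribute names
print(df.mean(numeric_only=True))        # mean of every numerical attribute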
SAMPLE OUTPUT:
1. Print the first 10 instances.

2. Print the Number of Rows and Columns of the dataset.


(150, 6)

3. Print the Column names [Attribute Names] of the dataset.


['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species','species_id']

4. Mean of all the Numerical Attributes:


Sepal length - Mean = 5.843333333333334
Sepal width - Mean = 3.0540000000000003
Petal length - Mean = 3.758666666666666
Petal width - Mean = 1.1986666666666668
RESULT:
The basic statistical analysis such as the first 10 records, the number of rows and
columns, the column names or Attribute names of the data set and mean of all the
attributes on Iris data set are evaluated and printed successfully using Python script.
Ex. NO : 3 Perform the statistical analysis on a data set.

Perform the statistical analysis on a data set.


Under each attribute
 Count the total number of records based on the values.
 Plot the Normal Distribution of each attribute and print the SD and Mean
of the same set of attributes.
 Plot the distribution of each attribute using Histogram.
AIM:
To perform and plot basic statistical outcomes such as the total number of records, the
Normal Distribution of each attribute, the Standard Deviation and Mean of the attributes,
and the distribution of each attribute as a Histogram, using a Python script.
ALGORITHM:
i. Load the desired dataset
ii. Count the total number of instances based on the values of the selected
attribute.
iii. Repeat step ii for all the attributes
iv. Find the normal distribution of a selected attribute.
v. Print the SD and Mean of the same attribute
vi. Repeat step iv and v for all the remaining attributes.
vii. Using for loop, print the distribution of the attributes using Histogram.
PROGRAM:
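A minimal Python sketch following the algorithm above (assuming pandas, matplotlib, numpy, scipy and plotly.express are installed; the column list is specific to the Iris dataset):

import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

df = px.data.iris()
print(df.shape)                                  # total number of records

numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

for col in numeric_cols:
    print(df[col].value_counts())                # count of records per value

    mean, sd = df[col].mean(), df[col].std()
    print(col, "Mean =", mean, "SD =", sd)

    x = np.linspace(df[col].min(), df[col].max(), 200)
    plt.plot(x, stats.norm.pdf(x, mean, sd))     # normal distribution curve
    plt.title(col)
    plt.show()

df[numeric_cols].hist()                          # histogram of each attribute
plt.show()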
SAMPLE INPUT:
Import the Iris data from ‘plotly.express’ library and name it as ‘px’
SAMPLE OUTPUT:
1. Total Number of Records.
(150, 6)
2. Print the Normal distribution of the Numerical Attributes

3. Print the Mean and Standard deviation of the Numerical Attributes

Attribute Name      Mean                    Standard Deviation
Sepal_length        5.843333333333334       0.8253
Sepal_width         3.0540000000000003      0.43215
Petal_length        3.758666666666666       1.75853
Petal_width         1.1986666666666668      0.76061

RESULT:
The basic statistical outcomes such as the total number of records, the Normal Distribution
of each attribute, the Standard Deviation and Mean of the attributes, and the distribution of
each attribute as a Histogram are evaluated and plotted successfully using a Python script.
Ex. NO: 4 Perform the following statistical analysis on a data set.
Generate
 The statistical description of Iris data set
 The Box plot for any one attribute and compare it with the relevant statistical data
 The dependency Curve of the attribute considered for constructing the box plot.

Perform the following statistical analysis on a data set.


Generate
 The statistical description of Iris data set
 The Box plot for any one attribute and compare it with the relevant
statistical data
 The dependency Curve of the attribute considered for constructing the
box plot.
AIM:
To perform and plot the statistical outcomes such as statistical description, Box plot of
an attribute and dependency curve of the attribute using Python.
ALGORITHM:
i. Load the desired data set
ii. Find the statistical description of the data set
iii. Select any one attribute and Construct the Box plot for it.
iv. Compare the statistical description of the selected attribute with the
Box plot.
v. Draw the dependency curve of the selected attribute

PROGRAM:
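A minimal Python sketch of the steps above (assuming pandas, seaborn, matplotlib and plotly.express are installed; the “dependency curve” is drawn here as a kernel density curve, which is an assumption about what the manual intends):

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

df = px.data.iris()

print(df["petal_width"].describe())      # statistical description of the attribute

sns.boxplot(y=df["petal_width"])         # box plot of the attribute
plt.show()

sns.kdeplot(df["petal_width"])           # density ("dependency") curve of the attribute
plt.show()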
SAMPLE INPUT:
Import the Iris data from ‘plotly.express’ library and name it as ‘px’
SAMPLE OUTPUT:
1. Print the Statistical Description for the attribute “Petal Width”
petal_width
count 150.000000
mean 1.198667
std 0.763161
min 0.100000
25% 0.300000
50% 1.300000
75% 1.800000
max 2.500000
Name: petal_width, dtype: float64
2. Print the Box Plot of the attribute “Petal Width”

3. Print the dependency curve of the attribute

RESULT:
The statistical outcomes such as statistical description, Box plot of an attribute and
dependency curve of the attribute are evaluated and plotted successfully using Python
script.
Ex. NO: 5 Parameter Estimation Process [Maximum Likelihood Estimation Process]

AIM:
To perform the parameter estimation process, i.e., the Maximum Likelihood Estimation
process, using Python.
ALGORITHM:
i. Generate 100 values between -10 and 30 and refer to them as X.
ii. Find the Y using the function y=10+4x+e
iii. Put the X and Y values in a dataframe.
iv. Plot the generated values using regplot() function.
v. Find the OLS regression results using OLS model
vi. Calculate the SD for the residuals
vii. Construct the MLE model using the L-BFGS-B (limited-memory BFGS with
bounds) optimization algorithm.
viii. Compare MLE parameters with OLS Parameters
PROGRAM:
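A minimal Python sketch of the process (assuming numpy, pandas, seaborn, statsmodels and scipy are installed; the noise standard deviation of 5 and the starting values for the optimizer are arbitrary choices for illustration):

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from scipy.optimize import minimize

x = np.linspace(-10, 30, 100)                 # independent variable
e = np.random.normal(0, 5, 100)               # random noise (SD chosen for illustration)
y = 10 + 4 * x + e                            # dependent variable
df = pd.DataFrame({"x": x, "y": y})

sns.regplot(x="x", y="y", data=df)            # scatter plot with fitted line

X = sm.add_constant(df["x"])
ols = sm.OLS(df["y"], X).fit()                # OLS regression results
print(ols.summary())
print("Residual SD =", ols.resid.std())       # SD of the residuals

def neg_log_likelihood(params):
    intercept, slope, sd = params
    mu = intercept + slope * df["x"]
    return -np.sum(stats.norm.logpdf(df["y"], mu, sd))

mle = minimize(neg_log_likelihood, x0=[1.0, 1.0, 1.0],
               method="L-BFGS-B",
               bounds=[(None, None), (None, None), (1e-6, None)])
print(mle)                                    # compare MLE parameters with OLS parameters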
SAMPLE INPUT:
1. Synthesize the input.
Generate 100 values for the independent variable ‘x’ in the range -10 to 30,
using x = np.linspace(-10, 30, 100)
Generate 100 dependent values for the variable ‘y’ using the following formula,
y = 10 + 4*x + e
2. Sample Input:

SAMPLE OUTPUT:
message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
success: True
status: 0
fun: 303.3932659422116
x: [ 2.021e+01 3.998e+00 5.028e+00]
nit: 28
jac: [ 1.705e-05 5.684e-06 5.684e-06]
nfev: 140
njev: 35
hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>

RESULT:
The parameter estimation process, i.e., the Maximum Likelihood Estimation process, is
completed successfully using a Python script.
Ex. NO: 6 Data Aggregation Process
a. Perform the aggregation process for a single dimensional data
b. Perform the aggregation process for a 2D data
c. Perform the aggregation process for a n-D data

AIM:
To perform the data Aggregation for 1D, 2D and n-D data using Python.

ALGORITHM:
Consider the below mentioned Aggregation process:
count() Total number of items
first(), last() First and last item
mean(), median() Mean and median
min(), max() Minimum and maximum
std(), var() Standard deviation and variance
mad() Mean absolute deviation
prod() Product of all items
sum() Sum of all items
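PROGRAM:
A short pandas/NumPy sketch of the aggregations listed above (the random seed, the 2D values, and the use of seaborn's 'planets' dataset for the n-D case are assumptions for illustration):

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.RandomState(42)               # seed chosen for illustration

ser = pd.Series(rng.rand(5))                  # 1D data
print(ser.sum(), ser.mean(), ser.count(), ser.min(), ser.max())

df = pd.DataFrame({"A": rng.rand(5), "B": rng.rand(5)})   # 2D data
print(df.sum())                               # column-wise sum
print(df.sum(axis=1))                         # row-wise sum
print(df.mean())                              # column-wise mean
print(df.shape)                               # count of rows and columns
print(df.min(), df.max())                     # column-wise minimum and maximum

planets = sns.load_dataset("planets")         # n-D data (assumed 'planets' dataset)
print(planets.groupby("method")["orbital_period"].median())   # group by method and orbital period
print(planets.groupby("method")["year"].describe())           # statistics by method and year
print(planets.groupby("method")["orbital_period"].std())      # standard deviation by method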
SAMPLE INPUT:
1. 1D Data
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64

2. 2D Data
A B

0 0.183405 0.611853

1 0.304242 0.139494

2 0.524756 0.292145

3 0.431945 0.366362

4 0.291229 0.456070

3. n-D Data
Load the dataset ‘planet’ with multiple instances.
Sample Dataset:

SAMPLE OUTPUT:
1. Aggregation of 1D
Sum = 2.811925491708157
Mean = 0.5623850983416314
Count = 5
Minimum = 0.156018
Maximum = 0.950714
2. Aggregation of 2D
Sum (Columnwise)
A 1.735577
B 1.865923
Sum (Rowwise)

0 0.397629
1 0.221868
2 0.408451
3 0.399153
4 0.373650
Mean
A 0.477888
B 0.443420
Count
(5, 2)
Minimum
A 0.183405
B 0.139494
Maximum
A 0.524756
B 0.611853
3. Aggregation of n-D data.
Group by ‘Method’ and ‘orbital Period’

Shape or Size

Statistical values based on ‘Method’ and ‘year’

Standard Deviation based on ‘Method’


RESULT:
The data Aggregation for 1D, 2D and n-D are completed successfully using Python
script.
Ex. NO: 7 LDA by calculating the basic requirements such as eigenvalues and eigenvectors

AIM:
To perform LDA by calculating the basic requirements such as eigenvalues and
eigenvectors.
ALGORITHM:
i. Calculate the between-class variance.
ii. Calculate the within-class variance.
iii. Compute the eigenvectors and the corresponding eigenvalues.
iv. Put the eigenvalues in decreasing order and select k eigenvectors with the largest
eigenvalues.
v. Create a k dimensional matrix containing the eigenvectors.

PROGRAM:
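A minimal Python sketch using scikit-learn (an assumption: LinearDiscriminantAnalysis solves the between-class/within-class eigenvalue problem internally rather than exposing each step of the algorithm above):

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

wine = load_wine()
X, y = wine.data, wine.target

print("Number of classes:", len(wine.target_names))
print("Number of features:", X.shape[1])

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)               # project onto the top discriminants

print("Variance Ratio:", lda.explained_variance_ratio_)

plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y)    # LDA scatter plot
plt.xlabel("LD1"); plt.ylabel("LD2")
plt.show()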
SAMPLE INPUT:
Load the Wine dataset from sklearn
Sample Data:

SAMPLE OUTPUT:
Number of classes and Features:
Number of classes: 3
Number of features: 13
Variance Ratio:
[0.72817751 0.27182251]
LDA Scatter Plot:

RESULT:
LDA evaluation and scatter plot by eigenvalues and eigenvectors are done
successfully using Python Script.
Ex. NO: 8 Principal Component Analysis (PCA) Implementation for a given data set

PROBLEM:
Principal Component Analysis (PCA) Implementation for a given data set
AIM:
To implement the PCA for the given dataset using Python Script
ALGORITHM:
1. Load the data
2. Standardize the features
3. Make the PCA with n=2, where n is the number of components
4. Plot the data with the new principal components
5. Display the variance among the 2 components

PROGRAM:
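A minimal Python sketch of the algorithm above (assuming scikit-learn, matplotlib and plotly.express are installed; the feature list and the use of species_id for colouring are specific to the Iris dataset):

import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = px.data.iris()
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

X = StandardScaler().fit_transform(df[features])   # standardize the features
print(X[:7])                                       # first standardized rows

pca = PCA(n_components=2)
components = pca.fit_transform(X)                  # project onto 2 principal components

plt.scatter(components[:, 0], components[:, 1], c=df["species_id"])
plt.xlabel("PC1"); plt.ylabel("PC2")               # 2-component plot
plt.show()

print("Variance Ratio:", pca.explained_variance_ratio_)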
SAMPLE INPUT:
Take the Iris dataset
Sample Input Data

SAMPLE OUTPUT:
1. Standardize the data
[-9.00681170e-01 1.03205722e+00 -1.34127240e+00 -1.31297673e+00]
[-1.14301691e+00 -1.24957601e-01 -1.34127240e+00 -1.31297673e+00]
[-1.38535265e+00 3.37848329e-01 -1.39813811e+00 -1.31297673e+00]
[-1.50652052e+00 1.06445364e-01 -1.28440670e+00 -1.31297673e+00]
[-1.02184904e+00 1.26346019e+00 -1.34127240e+00 -1.31297673e+00]
[-5.37177559e-01 1.95766909e+00 -1.17067529e+00 -1.05003079e+00]
[-1.50652052e+00 8.00654259e-01 -1.34127240e+00 -1.18150376e+00]

2. PCA for the given dataset

3. 2 Component Plot
4. Variance Ratio:
[0.72770452 0.23030523]

RESULT:
The PCA for the given dataset is found successfully using Python Script.
Ex. NO: 9 H-Plot construction for the given data set

PROBLEM:
H-Plot construction for the given data set

AIM:
To construct the H-Plot for the given data set.
ALGORITHM:
1. Prepare/download the data
2. Select the attributes for constructing the Horizontal Bar chart.
3. Plot the Bar and H-bar chart for the points/data among the selected attributes

PROGRAM:
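A minimal Python sketch (assuming matplotlib; the product/quantity values below are the sample input data listed next):

import matplotlib.pyplot as plt

product = ['computer', 'monitor', 'laptop', 'printer', 'tablet']
quantity = [320, 450, 300, 120, 280]

plt.bar(product, quantity)        # bar plot
plt.show()

plt.barh(product, quantity)       # horizontal bar (H) plot
plt.show()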
SAMPLE INPUT:
Take the following data as Input.
product = ['computer', 'monitor', 'laptop', 'printer', 'tablet']
quantity = [320, 450, 300, 120, 280]
SAMPLE OUTPUT:

Bar Plot H Plot

RESULT:
The H-Plot for the given data set is constructed successfully using Python script.
Ex. NO: 10 Clustering the data using any one clustering Algorithm

PROBLEM:
Clustering the data using any one clustering Algorithm
AIM:
To cluster the given data by applying the K-Means algorithm using Python Script.
ALGORITHM:
1. Create/Load the data
2. Standardize the features
3. Plot the standardized features
4. Apply K-Means clustering with n=3
5. Plot the clusters with their centre points
6. Find the value of K for the selected data set using Elbow method
7. Apply K-Means clustering with the recommended Elbow value as k.
8. Plot the clusters with their centre points
SAMPLE INPUT:
Generate synthesized data of size 200
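A minimal Python sketch of the algorithm above (assuming scikit-learn and matplotlib; the use of make_blobs with 3 centres to synthesize the 200 points is an assumption for illustration):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)   # synthesized data
plt.scatter(X[:, 0], X[:, 1]); plt.show()                      # original data plot

Xs = StandardScaler().fit_transform(X)                         # standardized features
plt.scatter(Xs[:, 0], Xs[:, 1]); plt.show()                    # standardized data plot

inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(Xs).inertia_
           for k in range(1, 10)]
plt.plot(range(1, 10), inertia, marker='o'); plt.show()        # elbow graph

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(Xs)  # k from the elbow
plt.scatter(Xs[:, 0], Xs[:, 1], c=km.labels_)                  # clusters
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c='red', marker='x', s=200)                        # cluster centre points
plt.show()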
SAMPLE OUTPUT:
1. Original Data Plot:

2. Standardized Data plot:

3. Elbow Graph for identifying the number of clusters:

4. Scatter plot to show the Clustering:


RESULT:
The given data is clustered successfully using K-Means algorithm by Python Script.
MINI PROJECT
Instruction for Mini Project:
1. It should be considered as the last experiment of the Lab Experiment List
2. Implementation duration is 2 weeks.
3. The content should be prepared as below.
- Aim
- Algorithm
- Architecture Diagram
- Program
- Input and Output
- Conclusion
Tentative Titles [Not limited to] For doing Statistical Analysis:
1. Price Recommendation using Machine Learning
2. Sales Forecasting
3. Building a Recommender System
4. Employee Access-Challenge as a Classification Problem
5. Survival Prediction using Machine Learning
6. Personalized Medicine Recommending System
7. Loan Default Prediction
8. Fraud Detection as a Classification Problem
9. Credit Analysis
10. Model Insurance Claim Severity
11. House Price Prediction using Machine Learning
12. Recommendation System for Retail Stores
13. Fake News Detection
14. Human Activity Recognition
15. Stock Market Prediction
SAMPLE VIVA QUESTIONS
1. What is the difference between an error of type I and an error of type II?

 When the null hypothesis is rejected even though it is correct, a type 1 error
occurs. False positives are also known as type 1 errors.
 When the null hypothesis is not rejected despite being incorrect, a type 2 error
occurs. This is also known as a false negative.

2. How does one define statistical interaction?

 A statistical interaction occurs when the effect of one input variable on the
output variable depends on the level of another input variable.
 In real life, for example, the interaction of adding sugar with the stirring of tea is
an example of statistical interaction. Neither variable alone produces the
sweetness of the tea, but the two variables combine to produce sweetness.

3. What are some examples of data sets with non-Gaussian distributions?

When data follows a non-normal distribution, it is frequently non-Gaussian. A
non-Gaussian distribution is often seen in many statistical processes. This occurs
when data is naturally clustered on one side or the other on a graph. For
instance, bacterial growth follows an exponential or non-Gaussian distribution,
which is non-normal.

4. How does linear regression work?

Linear regression is a statistical technique that models the relationship between
one or more predictor variables and one outcome variable. For example, linear
regression may be used to study the connection between predictors such as age,
gender, heredity, and diet, and an outcome such as height.

5. What are the necessary conditions for a Binomial Distribution?

The three most important characteristics of a Binomial Distribution are listed
below.

1. The number of observations must be fixed in advance. In other words, one can
only determine the probability of an event happening a specific number of
times if a fixed number of trials are performed.
2. It is important that each trial is independent of the others. This means that
the probability of each subsequent trial should not be affected by previous
trials.
3. The probability of success remains the same from trial to trial, no matter how
many times the trial is repeated.

6. What is the difference between a sample and a population?


The subset of the population from which numbers are obtained is known as the
sample. The numbers obtained from the population are known as parameters,
while the numbers obtained from the sample are known as statistics. It is
through sample data that conclusions may be made about the population.

Population                                      Sample

A parameter is an observable quality that       A statistic is an observable quality that
can be measured.                                can be measured.

Every element of the population is a            A subset of the population is used to
unique individual.                              explore some aspects of the population.

An opinion report is a true representation      The reported values have a confidence
of what happened.                               level and an error margin.

All members of a group are included in          A particular portion of the population is
the list.                                       represented by that subset.

7. What are the different kinds of variables or levels of measurement?

A variable can be categorized as one of four types or levels of measurement:
Nominal, Ordinal, Interval, or Ratio. The terms Scale and Continuous are
sometimes used to describe the Interval and Ratio levels of measurement.

8. What is the difference between Descriptive and Inferential Statistics?


Descriptive                                     Inferential

Describes the data in terms of its key          Is used to draw conclusions about the
characteristics.                                population.

Data can be organised, analysed, and            Data is analysed to compare groups and
presented in a meaningful way using             make predictions through hypotheses.
charts.

Information is presented using charts,          Conclusions are reached using
tables, and graphs.                             probability.

9. What does standard deviation mean?

Standard deviation measures how spread out a set of data points is around the
mean. A low standard deviation indicates that the data points are close to the
mean, while a high standard deviation indicates that the data points are spread
far from the mean.

10. What are the characteristics of a bell-curve distribution?

The characteristic bell curve shape of a normal distribution is what gives it its
name. We can perceive the bell curve as we look at the distribution.

11. What is your definition of skewness?

Skewed data distribution has a non-symmetrical pattern relative to the mean, the
mode, and the median. The skewness of data indicates that there are significant
differences between the mean, the mode, and the median. Data that is skewed
cannot be used to create a normal distribution.

12. How do you define kurtosis?

Outliers are detected in a data distribution using kurtosis. It measures the extent
to which the tail values diverge from the central portion of the distribution. The
higher the kurtosis, the higher the number of outliers in the data. To reduce their
effect, we may either include more data or eliminate the outliers.

13. What is the definition of correlation?

 The degree to which variables move together is measured by both covariance and
correlation. In contrast to covariance, correlation indicates how closely linked
two variables are on a standardized scale. Values for correlation range from -1
to +1, with -1 indicating a strong negative correlation and +1 indicating a
strong positive correlation.
 A value of -1 represents a strong negative correlation: if one variable increases,
the other decreases. A value of +1 represents a strong positive correlation: an
increase in one variable is accompanied by an increase in the other. A value of
0 indicates no correlation; values between 0 and +1 indicate weaker positive
correlation, and values between 0 and -1 indicate weaker negative correlation.
 If the statistical model is affected negatively by two variables that are strongly
correlated, then one of them should be removed.

14. Left-skewed and right-skewed distributions exist, what are they?

 In a left-skewed distribution, the left tail is longer than the right tail. It is
critical to note here that the mean < the median < the mode.
 In contrast to a left-skewed distribution, a right-skewed distribution is one
where the right tail is longer than the left one. Here, the mean > the median >
the mode.

15. How does the term covariance relate to understanding?


Covariance measures how closely two random variables fluctuate together in a
random process. It indicates whether there is a systematic connection between
one variable of a random pair and the other, and if so, in which direction the
variables move together.

16. What are inferential statistics used for?

In inferential statistics, we use some sample data to draw conclusions about a
population. From government operations to quality control and quality
assurance teams in multinational corporations, inferential statistics are used in a
variety of fields.

17. How are mean and median related in a normal distribution?

In a normal distribution, the mean and the median of a dataset are equal.
Checking whether a dataset’s mean and median agree is therefore a quick first
indication of whether its distribution is approximately normal.

18. What is the relationship between standard error and the margin of error?

The margin of error is proportionally influenced by the standard error. In other
words, the margin of error is computed using the standard error. As the standard
error increases, the margin of error also rises.

19. Principal Component Analysis (PCA) is an example of?


o a) Supervised Learning
o b) Unsupervised Learning
o c) Semi-Supervised Learning
o Answer: b) Unsupervised Learning
o Explanation: PCA is a dimension reduction technique and falls under the
category of unsupervised learning.
20. What is the importance of using PCA before clustering? Choose the most complete
answer.
o a) Find which dimension of data maximizes feature variance.
o b) Find good features to improve clustering score.
o c) Avoid bad features.
o d) Find the explained variance.
o Answer: b) Find good features to improve clustering score
o Explanation: PCA helps identify relevant features for better clustering
performance.
21. Why is it important to standardize data before running PCA’s algorithm?
o a) Find the features that best predict the target variable.
o b) Standardized data allows better understanding of your work.
o c) Use best practices of data wrangling.
o d) Make training time faster.
o Answer: d) Make training time faster
o Explanation: Standardizing data ensures consistent scales and speeds up PCA
computation.
22. What does PCA do?
o a) Reduces dimensionality of data and creates new features.
o b) Predicts the target with high efficiency.
o c) Creates clusters to identify classes.
o d) Provides the highest number of features possible.
o Answer: a) Reduces dimensionality of data and creates new features
o Explanation: PCA transforms data into a lower-dimensional space while retaining
information.
23. When should you use PCA?
o a) To find latent features and reduce dimensionality.
o b) Every time before using a Machine Learning algorithm.
o c) When dealing with overfitting.
o d) When data is small and has few features.
o Answer: a) To find latent features and reduce dimensionality
o Explanation: PCA is useful for dimensionality reduction and feature extraction.
24. What is Latent Dirichlet Allocation (LDA)?
LDA is a generative probabilistic model used for topic modelling. It discovers latent
topics in a corpus of documents.
25. How LDA Works?
Document-Topic Distribution:
Each document is represented as a distribution over topics.
A document can belong to multiple topics with certain probabilities.
Topic-Word Distribution:
Each topic is represented as a distribution over words.
A topic is defined by a set of words, each associated with a probability.
Generative Process:
LDA uses a generative process to create documents:
1. Choose a distribution of topics for each document.
2. For each word in the document, select a topic based on the topic
distribution.
3. Choose a word from the selected topic’s word distribution.
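As a brief illustration of topic modelling in practice, the following sketch uses scikit-learn's LatentDirichletAllocation (the tiny corpus, the number of topics, and all parameter values are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets rose sharply today",
        "investors bought shares and bonds"]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)                       # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X))                          # document-topic distribution
words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):      # topic-word distribution
    top = topic.argsort()[-3:][::-1]
    print("Topic", k, [words[i] for i in top])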
26. List the Applications of LDA.
Topic Modeling: LDA uncovers underlying themes in text data.
Content Recommendation: Identifies related content based on topic distributions.
Document Clustering: Groups similar documents by topics.
Information Retrieval: Enhances search results by considering latent topics.
27. What are the Challenges with LDA?
Choosing the Number of Topics: Determining the optimal number of topics.
Interpreting Topics: Understanding the meaning of discovered topics.
Model Evaluation

28. Define Hierarchical Clustering.


o Definition: Hierarchical clustering organizes data into a tree-like structure
(dendrogram) based on similarity.
29. Compare Agglomerative vs. Divisive.
 Agglomerative (Bottom-up): Starts with individual data points and merges
them into clusters.
 Divisive (Top-down): Starts with all data points as one cluster and
recursively splits them.
30. List some Distance Metrics.
 Common distance metrics include Euclidean distance, Manhattan
distance, and correlation.
 The choice of metric affects the resulting clusters.
31. Define the Dendrogram.
o A dendrogram visually represents the hierarchy of clusters.
o Each leaf node represents an individual data point.
o The height of each branch indicates the dissimilarity between merged clusters.
32. List the various Linkage Methods.
o Single Linkage: Measures the shortest distance between any pair of points in two
clusters.
o Complete Linkage: Measures the longest distance between any pair of points in
two clusters.
o Average Linkage: Computes the average distance between all pairs of points in
two clusters.
o Ward’s Method: Minimizes the increase in variance after merging clusters.
33. List the Applications of Hierarchical Clustering.
o Biology: Grouping genes based on expression profiles.
o Marketing: Customer segmentation based on purchasing behavior.
o Image Segmentation: Identifying regions of interest in images.
34. List the Challenges with Hierarchical Clustering:
o Scalability: Computationally expensive for large datasets.
o Choosing the Number of Clusters: Determining the optimal number of clusters.
o Interpreting Dendrograms: Understanding the hierarchy and making decisions.
35. Define Multivariate Normal Distribution.
Definition: The multivariate normal distribution extends the univariate normal
distribution to multiple dimensions.
36. Define properties of multivariate distribution.
Properties
Characterized by a mean vector (μ) and a covariance matrix (Σ).
Symmetric and bell-shaped in higher dimensions.
Each component follows a univariate normal distribution.
37. Define the applications of multivariate distribution.
Applications
Modeling correlated data (e.g., stock returns, sensor measurements).
Statistical inference in multivariate settings.
38. Define the Covariance Matrix?
o Definition: The covariance matrix (Σ) captures the relationships between
variables.
39. List the elements of the covariance matrix.
 Diagonal elements: Variances of individual variables.
 Off-diagonal elements: Covariances between pairs of variables.
40. What is positive and negative covariance?
 Positive covariance: Variables move together.
 Negative covariance: Variables move in opposite directions.
41. List the Multivariate Sampling Techniques
o Multivariate Random Variables:
 A collection of random variables (X₁, X₂, …, Xₙ).
 Joint probability density function (pdf) describes their distribution.
o Sampling from Multivariate Normal Distribution:
 Use Cholesky decomposition to generate correlated samples.
 Transform univariate normal samples using the covariance matrix.
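A brief NumPy illustration of the Cholesky approach (the mean vector and covariance matrix below are arbitrary example values):

import numpy as np

mu = np.array([0.0, 2.0])                      # mean vector (example values)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])                 # covariance matrix (example values)

L = np.linalg.cholesky(Sigma)                  # Sigma = L @ L.T
z = np.random.standard_normal((1000, 2))       # independent univariate normal samples
samples = mu + z @ L.T                         # correlated multivariate normal samples

print(np.cov(samples, rowvar=False))           # should be close to Sigma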
42. Multivariate Hypothesis Testing:
o Hotelling’s T² Test:
 Compares means of multivariate samples.
 Detects differences in means across multiple variables.
o MANOVA (Multivariate Analysis of Variance):
 Extends ANOVA to multiple dependent variables.
 Tests whether group means differ significantly.
43. What category does multivariate linear regression belong to?
Supervised learning
44. In multivariate regression, what does the hypothesis look like?
(h(X) = X^T\theta), where (\theta) is the parameter vector.
45. What does (X^{(i)}) represent in multivariate regression?
A feature vector denoting the independent variables in the (i)th example.
46. Is there an upper bound on the number of independent variables?
No, there is no upper bound on the number of independent variables.
47. Is there an upper bound on the number of target variables?
No, there is no upper bound on the number of target variables.
48. How to find the number of clusters (K) based on the cluster chart created.
The choice of the number of clusters (K) depends on the specific problem and data.
Common methods include the elbow method, silhouette score, or domain knowledge.
For example, if the elbow point occurs at K=3, we would choose 3 clusters.
49. Determine the similarities (distance relationships) between the variables’ values to the
initial center points by calculating the sum of the differences between the dataset
values and the initial center values using the formula Euclidean Distance.
The Euclidean distance between a data point and a centroid is calculated as:
Euclidean Distance = √( Σ (xi − ci)² )
where (xi) represents the value of the variable for the data point, and (ci) represents the
value of the centroid for the same variable.
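A one-line NumPy illustration of this formula (the data point and centroid values are arbitrary examples):

import numpy as np

x = np.array([5.1, 3.5])          # data point (example values)
c = np.array([6.0, 3.0])          # centroid (example values)

distance = np.sqrt(np.sum((x - c) ** 2))
print(distance)                   # equivalent to np.linalg.norm(x - c)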
50. Specify the termination condition of a clustering process.
The clustering algorithm iteratively updates the centroids and assigns data points to
the nearest centroid. The process continues until convergence (when the centroids no
longer change significantly) or until a specified number of iterations is reached.
