1. Find the skewness and kurtosis for an attribute in the given dataset.
2. Print the Histogram and Kernel Density Estimation for the attribute “Sepal Length”
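A minimal sketch of these two tasks (assuming the Iris data from plotly.express and seaborn for the KDE overlay):

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

df = px.data.iris()

# 1. Skewness and kurtosis of one attribute
print("Skewness:", df['sepal_length'].skew())
print("Kurtosis:", df['sepal_length'].kurtosis())

# 2. Histogram with a Kernel Density Estimate overlaid
sns.histplot(df['sepal_length'], kde=True)
plt.show()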
AIM:
To perform and print the basic statistical analysis such as the first 10 records, the
number of rows and columns, the column names and mean of all the attributes on
Iris data set using Python.
ALGORITHM:
i. Load the desired data set
ii. Print the first 10 records
iii. Find the length of the dataset and print the number of rows and
columns
iv. Extract the column names from the dataset and print them separately
v. Find the mean of all the numerical attributes
SAMPLE INPUT:
Import the ‘plotly.express’ library as ‘px’ and load the bundled Iris dataset.
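A minimal sketch of the algorithm above, using the Iris loader bundled with plotly.express:

import plotly.express as px

# i. Load the data set
df = px.data.iris()

# ii. First 10 records
print(df.head(10))

# iii. Number of rows and columns
print("Rows:", df.shape[0], "Columns:", df.shape[1])

# iv. Column names
print(list(df.columns))

# v. Mean of all numerical attributes
print(df.mean(numeric_only=True))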
SAMPLE OUTPUT:
1. Print the first 10 instances.
RESULT:
The basic statistical outcomes, such as the total number of records, the normal
distribution of each attribute, and the standard deviation and mean of the attributes,
are evaluated, and the distributions are plotted as histograms successfully using a
Python script.
Ex. NO: 4 Perform the following statistical analysis on a data set. Generate:
a. The statistical description of the Iris data set
b. The Box plot for any one attribute and compare it with the relevant statistical data
c. The dependency curve of the attribute considered for constructing the box plot.
PROGRAM:
SAMPLE INPUT:
Import the ‘plotly.express’ library as ‘px’ and load the bundled Iris dataset.
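A minimal sketch of the three tasks (the ‘dependency curve’ is read here as the density curve of the attribute; seaborn is assumed for that step):

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

df = px.data.iris()

# 1. Statistical description of 'petal_width'
print(df['petal_width'].describe())

# 2. Box plot of the same attribute
px.box(df, y='petal_width').show()

# 3. Density curve of the attribute
sns.kdeplot(df['petal_width'])
plt.show()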
SAMPLE OUTPUT:
1. Print the Statistical Description for the attribute “Petal Width”
petal_width
count 150.000000
mean 1.198667
std 0.763161
min 0.100000
25% 0.300000
50% 1.300000
75% 1.800000
max 2.500000
Name: petal_width, dtype: float64
2. Print the Box Plot of the attribute “Petal Width”
RESULT:
The statistical outcomes such as statistical description, Box plot of an attribute and
dependency curve of the attribute are evaluated and plotted successfully using Python
script.
Ex. NO: 5 Parameter Estimation Process [Maximum Likelihood Estimation Process]
AIM:
To perform the parameter estimation process (i.e., the Maximum Likelihood
Estimation process) using Python.
ALGORITHM:
i. Generate 100 values in the range -10 to 30 and refer to them as X.
ii. Compute Y using the function y = 10 + 4x + e, where e is random noise.
iii. Put the X and Y values in a dataframe.
iv. Plot the generated values using regplot() function.
v. Find the OLS regression results using OLS model
vi. Calculate the standard deviation of the residuals.
vii. Construct the MLE model using the L-BFGS-B limited-memory optimization
algorithm.
viii. Compare MLE parameters with OLS Parameters
PROGRAM:
SAMPLE INPUT:
1. Synthesize the input.
Generate 100 values for the independent variable ‘x’ in the range -10 to 30,
using x = np.linspace(-10, 30, 100)
Generate 100 dependent values for the variable ‘y’ using the following formula,
y = 10 + 4*x + e
2. Sample Input:
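A minimal sketch of the whole process, assuming a Gaussian noise scale of 5 for e (the original value is not stated):

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy.optimize import minimize
from scipy.stats import norm

# Synthesize the data: y = 10 + 4x + e
np.random.seed(0)
x = np.linspace(-10, 30, 100)
e = np.random.normal(0, 5, 100)          # noise scale is an assumption
df = pd.DataFrame({'x': x, 'y': 10 + 4 * x + e})
sns.regplot(x='x', y='y', data=df)       # plot the generated values

# OLS regression results and the residual standard deviation
ols = sm.OLS(df['y'], sm.add_constant(df['x'])).fit()
print(ols.params, ols.resid.std())

# Negative log-likelihood of the linear model with Gaussian errors
def neg_log_likelihood(params):
    b0, b1, sd = params
    return -norm(b0 + b1 * df['x'], sd).logpdf(df['y']).sum()

# MLE via the L-BFGS-B limited-memory algorithm; compare x with the OLS parameters
result = minimize(neg_log_likelihood, x0=[1.0, 1.0, 1.0], method='L-BFGS-B',
                  bounds=[(None, None), (None, None), (1e-6, None)])
print(result)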
SAMPLE OUTPUT:
message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
success: True
status: 0
fun: 303.3932659422116
x: [ 2.021e+01 3.998e+00 5.028e+00]
nit: 28
jac: [ 1.705e-05 5.684e-06 5.684e-06]
nfev: 140
njev: 35
hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>
RESULT:
The parameter estimation process (i.e., the Maximum Likelihood Estimation process)
is done successfully using a Python script.
Ex. NO: 6 Data Aggregation Process
a. Perform the aggregation process for a single dimensional data
b. Perform the aggregation process for a 2D data
c. Perform the aggregation process for a n-D data
AIM:
To perform the data Aggregation for 1D, 2D and n-D data using Python.
ALGORITHM:
Consider the below mentioned Aggregation process:
count() Total number of items
first(), last() First and last item
mean(), median() Mean and median
min(), max() Minimum and maximum
std(), var() Standard deviation and variance
mad() Mean absolute deviation
prod() Product of all items
sum() Sum of all items
SAMPLE INPUT:
1. 1D Data
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64
2. 2D Data
A B
0 0.183405 0.611853
1 0.304242 0.139494
2 0.524756 0.292145
3 0.431945 0.366362
4 0.291229 0.456070
3. n-D Data
Load the ‘planets’ dataset (available via seaborn) with multiple instances.
Sample Dataset:
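A minimal sketch covering all three cases, assuming NumPy's RandomState(42) for the synthetic values and seaborn's 'planets' dataset for the n-D case:

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.RandomState(42)

# 1D: a Series of five random values
ser = pd.Series(rng.rand(5))
print(ser.sum(), ser.mean(), ser.count(), ser.min(), ser.max())

# 2D: a two-column DataFrame, aggregated column-wise and row-wise
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
print(df.sum())                  # column-wise
print(df.mean(axis='columns'))   # row-wise
print(df.shape, df.min(), df.max())

# n-D: the seaborn 'planets' dataset, grouped by discovery method
planets = sns.load_dataset('planets')
print(planets.groupby('method')['orbital_period'].median())
print(planets.shape)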
SAMPLE OUTPUT:
1. Aggregation of 1D
Sum = 2.811925491708157
Mean = 0.5623850983416314
Count = 5
Minimum = 0.156018
Maximum = 0.950714
2. Aggregation of 2D
Sum (Columnwise)
A 1.735577
B 1.865923
Mean (Rowwise)
0 0.397629
1 0.221868
2 0.408451
3 0.399153
4 0.373650
Mean
A 0.477888
B 0.443420
Count
(5, 2)
Minimum
A 0.183405
B 0.139494
Maximum
A 0.524756
B 0.611853
3. Aggregation of n-D data.
Group by ‘method’ and aggregate ‘orbital_period’
Shape or Size
Ex. NO: 7 Linear Discriminant Analysis (LDA) Implementation for a given data set
AIM:
To perform LDA by calculating the basic requirements such as eigenvalues and
eigenvectors.
ALGORITHM:
i. Calculate the between-class variance.
ii. Calculate the within-class variance.
iii. Compute the eigenvectors and the corresponding eigenvalues.
iv. Put the eigenvalues in decreasing order and select k eigenvectors with the largest
eigenvalues.
v. Create a k dimensional matrix containing the eigenvectors.
PROGRAM:
SAMPLE INPUT:
Load the Wine dataset from sklearn
Sample Data:
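A minimal sketch using scikit-learn's LDA, which solves the eigenproblem on the within-class and between-class scatter internally:

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

wine = load_wine()
X, y = wine.data, wine.target
print("Number of classes:", len(wine.target_names))
print("Number of features:", X.shape[1])

# LDA keeps at most (n_classes - 1) discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("Variance Ratio:", lda.explained_variance_ratio_)

# Scatter plot in the discriminant space
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.title('LDA Scatter Plot')
plt.show()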
SAMPLE OUTPUT:
Number of classes and Features:
Number of classes: 3
Number of features: 13
Variance Ratio:
[0.72817751 0.27182251]
LDA Scatter Plot:
RESULT:
LDA evaluation and scatter plot by eigenvalues and eigenvectors are done
successfully using Python Script.
Ex. NO: 8 Principal Component Analysis (PCA) Implementation for a given data set
PROBLEM:
Principal Component Analysis (PCA) Implementation for a given data set
AIM:
To implement the PCA for the given dataset using Python Script
ALGORITHM:
1. Load the data
2. Standardize the features
3. Make the PCA with n=2, where n is the number of components
4. Plot the data with the new principal components
5. Display the variance between the 2 components
PROGRAM:
SAMPLE INPUT:
Take the Iris dataset
Sample Input Data
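A minimal sketch with scikit-learn, standardizing the Iris features before projecting onto two principal components:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1-2. Load and standardize the features
iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
print(X_std[:7])

# 3. PCA with n = 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# 4. Plot the data in the new principal-component space
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

# 5. Variance explained by the 2 components
print("Variance Ratio:", pca.explained_variance_ratio_)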
SAMPLE OUTPUT:
1. Standardize the data
[-9.00681170e-01 1.03205722e+00 -1.34127240e+00 -1.31297673e+00]
[-1.14301691e+00 -1.24957601e-01 -1.34127240e+00 -1.31297673e+00]
[-1.38535265e+00 3.37848329e-01 -1.39813811e+00 -1.31297673e+00]
[-1.50652052e+00 1.06445364e-01 -1.28440670e+00 -1.31297673e+00]
[-1.02184904e+00 1.26346019e+00 -1.34127240e+00 -1.31297673e+00]
[-5.37177559e-01 1.95766909e+00 -1.17067529e+00 -1.05003079e+00]
[-1.50652052e+00 8.00654259e-01 -1.34127240e+00 -1.18150376e+00]
2. 2-Component Plot
3. Variance Ratio:
[0.72770452 0.23030523]
RESULT:
The PCA for the given dataset is found successfully using Python Script.
Ex. NO: 9 H-Plot construction for the given data set
PROBLEM:
H-Plot construction for the given data set
AIM:
To construct the H-Plot for the given data set.
ALGORITHM:
1. Prepare/download the data
2. Select the attributes for constructing the Horizontal Bar chart.
3. Plot the Bar and H-bar chart for the points/data among the selected attributes
PROGRAM:
SAMPLE INPUT:
Take the following data as Input.
product = ['computer', 'monitor', 'laptop', 'printer', 'tablet']
quantity = [320, 450, 300, 120, 280]
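A minimal matplotlib sketch for the horizontal bar (H-bar) chart:

import matplotlib.pyplot as plt

product = ['computer', 'monitor', 'laptop', 'printer', 'tablet']
quantity = [320, 450, 300, 120, 280]

# Horizontal bar chart: one bar per product
plt.barh(product, quantity)
plt.xlabel('Quantity')
plt.ylabel('Product')
plt.title('Horizontal Bar Chart')
plt.show()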
SAMPLE OUTPUT:
RESULT:
The H-Plot for the given data set is constructed successfully using Python script.
Ex. NO: 10 Clustering the data using any one clustering Algorithm
PROBLEM:
Clustering the data using any one clustering Algorithm
AIM:
To cluster the given data by applying the K-Means algorithm using Python Script.
ALGORITHM:
1. Create/Load the data
2. Standardize the features
3. Plot the standardized features
4. Apply K-Means clustering with n=3
5. Plot the clusters with their centre points
6. Find the value of K for the selected data set using Elbow method
7. Apply K-Means clustering with the recommended Elbow value as k.
8. Plot the clusters with their centre points
SAMPLE INPUT:
Generate synthesized data of size 200
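A minimal sketch, assuming make_blobs for the 200 synthesized points:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1-3. Synthesize 200 points, standardize, and plot them
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
X_std = StandardScaler().fit_transform(X)
plt.scatter(X_std[:, 0], X_std[:, 1])
plt.title('Original Data')
plt.show()

# 4-5. K-Means with k=3; plot clusters with their centre points
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_std)
plt.scatter(X_std[:, 0], X_std[:, 1], c=km.labels_)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c='red', marker='X', s=200)
plt.show()

# 6. Elbow method: inertia for k = 1..9
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_std).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, 'o-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()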
SAMPLE OUTPUT:
1. Original Data Plot:
When the null hypothesis is rejected even though it is true, a Type I error
occurs. Type I errors are also known as false positives.
When the null hypothesis is not rejected despite being false, a Type II error
occurs. This is also known as a false negative.
Population: using charts, tables, and graphs to present information.
Sample: probability is responsible for achieving this goal.
A low standard deviation indicates that the data points are close to the mean,
while a high standard deviation indicates that the data points are spread far
away from the mean.
The characteristic bell-curve shape of a normal distribution is what gives it its
name: the bell curve is immediately visible when the distribution is plotted.
A skewed data distribution has a non-symmetrical pattern relative to the mean,
the mode, and the median. Skewness in the data indicates that there are significant
differences among the mean, the mode, and the median. Skewed data cannot be
used to form a normal distribution.
Kurtosis is used to detect outliers in a data distribution. It measures the extent
to which the tail values diverge from the central portion of the distribution: the
higher the kurtosis, the more outliers the data contains. To reduce their effect,
we may either collect more data or eliminate the outliers.
In a left-skewed distribution, the left tail is longer than the right tail. It is
critical to note here that the mean < the median < the mode.
In contrast to a left-skewed distribution, in which the left tail is longer than the
right one, a right-skewed distribution is one where the right tail is longer than
the left one. Here, the mean > the median > the mode.
If a dataset’s distribution is normal, the mean and the median of the dataset
agree. Checking whether the mean and the median coincide is therefore a quick
first test of whether a distribution may be normal, although it does not prove
normality on its own.
18. What is the relationship between standard error and the margin of error?