FDS Lab 1 Manuel .1..1new
Ex .No.: 1
WORKING WITH PANDAS AND DATAFRAMES
Date:
AIM:
ALGORITHM:
PROGRAM :
print(df)
THEORY:
Pandas is a powerful Python library used for data manipulation and analysis. It provides two
primary data structures:
Series: A one-dimensional labeled array that holds values of a single data type.
DataFrame: A two-dimensional labeled data structure with columns of potentially different types,
similar to a table in a database or a spreadsheet.
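A minimal sketch of the two structures, assuming pandas is installed. The names, ages, and marks are hypothetical sample values and are not taken from the original program.

```python
import pandas as pd

# A Series: one-dimensional labeled values (hypothetical sample data)
marks = pd.Series([85, 92, 78], index=["Asha", "Ben", "Chen"], name="marks")

# A DataFrame: several columns of different types sharing one row index
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "age": [20, 21, 19],
    "marks": [85.0, 92.5, 78.0],
})

print(marks)
print(df)
print(df.dtypes)            # each column keeps its own data type
print(df["marks"].mean())   # a single column of a DataFrame is a Series
```

Selecting one column with `df["marks"]` returns a Series, which is why the two structures are usually introduced together.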
OUTPUT:
RESULT:
Thus the program to work with Pandas and DataFrames using Python has been executed and
the output was verified successfully.
Ex .No.: 2
WORKING WITH NUMPY ARRAYS
Date:
AIM:
ALGORITHM:
THEORY:
PROGRAM :
OUTPUT :
ZEROS ARRAY :
[[0 0 0]
 [0 0 0]
 [0 0 0]]
ONES ARRAY:
[[1 1 1 1]
 [1 1 1 1]]
(B) Generate a Random NumPy Array
import numpy as np
random_array = np.random.rand(3, 3)
print("Random Array:")
print(random_array)
OUTPUT:
Random Array:
0.87597885 0.53328317]
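import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
# Element-wise arithmetic on two arrays of the same shape
addition = arr1 + arr2
subtraction = arr1 - arr2
multiplication = arr1 * arr2
division = arr1 / arr2
print("Addition:")
print(addition)
print("Subtraction:")
print(subtraction)
print("Multiplication:")
print(multiplication)
print("Division:")
print(division)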
OUTPUT:
Addition:
[[ 6  8]
 [10 12]]
Subtraction:
[[-4 -4]
[-4 -4]]
Multiplication:
[[ 5 12]
[21 32]]
Division:
[[0.2 0.33333333]
[0.42857143 0.5 ]]
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Original Array:")
print(arr)
transposed_arr = np.transpose(arr)
print("Transposed Array:")
print(transposed_arr)
OUTPUT:
Original Array:
[[1 2 3]
[4 5 6]]
Transposed Array:
[[1 4]
[2 5]
[3 6]]
(E) Find the index of the maximum and minimum element along an axis
import numpy as np
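The rest of this program is missing from the manual. One possible sketch uses `np.argmax` and `np.argmin` with the `axis` argument. The array values are hypothetical.

```python
import numpy as np

# Hypothetical sample array (the original values are not available)
arr = np.array([[3, 7, 1],
                [9, 2, 8]])

# axis=0 works down each column and returns a row index per column
max_idx_cols = np.argmax(arr, axis=0)
min_idx_cols = np.argmin(arr, axis=0)

# axis=1 works across each row and returns a column index per row
max_idx_rows = np.argmax(arr, axis=1)
min_idx_rows = np.argmin(arr, axis=1)

print("Index of max along axis 0:", max_idx_cols)
print("Index of min along axis 0:", min_idx_cols)
print("Index of max along axis 1:", max_idx_rows)
print("Index of min along axis 1:", min_idx_rows)
```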
OUTPUT:
RESULT:
Thus the implementation of Python functions using NumPy arrays is successfully done.
Ex .No.: 3
BASIC PLOTS USING MATPLOTLIB
Date:
AIM:
To create basic plots using Matplotlib to visualize data distributions and trends.
ALGORITHM:
THEORY:
Line plot: Displays data points connected by a line, often used to show trends over time.
Scatter plot: Shows the relationship between two variables, with each point representing an observation.
Histogram: Displays the distribution of a dataset by grouping data into bins and counting the
number of observations in each bin.
Bar plot: Shows the count or value of different categories.
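The binning step behind a histogram can be sketched without drawing anything. The snippet below assumes NumPy and uses `np.histogram`, with a seeded random sample standing in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the run is repeatable
data = rng.normal(size=1000)            # hypothetical sample of 1000 values

# Group the data into 30 equal-width bins and count observations per bin
counts, bin_edges = np.histogram(data, bins=30)

print("Observations per bin:", counts)
print("Number of bin edges:", len(bin_edges))   # always bins + 1
print("Total counted:", counts.sum())
```

`plt.hist` performs the same counting before drawing the bars, so the counts here match the bar heights of a histogram with the same bins.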
PROGRAM:
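import numpy as np
import matplotlib.pyplot as plt

# Line plot (the x range is assumed; the original values are missing)
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.title('Sine Wave')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()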
x = np.random.rand(100)
y = np.random.rand(100)
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
categories = ['A', 'B', 'C', 'D']  # assumed labels; the original list is missing
values = [4, 7, 1, 8]
plt.bar(categories, values)
plt.title('Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
data = np.random.randn(1000)
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
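labels = ['A', 'B', 'C', 'D']  # assumed labels; the original list is missing
sizes = [15, 30, 45, 10]  # assumed values; only the final 10 survives in the original
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140)
plt.show()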
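x = np.linspace(0, 10, 100)  # assumed range; the original values are missing
y1 = np.sin(x)
y2 = np.cos(x)
# Create plot with different styles (style arguments are assumed; the originals are missing)
plt.plot(x, y1, linestyle='--')
plt.plot(x, y2, linestyle=':')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
fig, axs = plt.subplots(2, 1)  # assumed layout; the original call is missing
axs[0].plot(x, y1)
axs[0].set_title('Sine')
axs[1].plot(x, y2)
axs[1].set_title('Cosine')
plt.show()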
OUTPUT :
RESULT:
Ex .No.: 4
FREQUENCY DISTRIBUTION, AVERAGES AND VARIABILITY
Date:
AIM:
To understand the distribution of values within the dataset by finding the frequency of each
unique value, the central tendency of the data, and the dispersion of the data around that center.
ALGORITHM:
Step7: Count how many times each number appears in the dataset.
Step8: Analyze the results for frequency distributions, averages, and variability.
THEORY:
A frequency distribution is a summary of how often different values occur within a data set.
Averages are measures of central tendency that summarize a set of data with a single value
representing the center of the data. Variability refers to how spread out the data values are in a data
set.
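The three ideas can be worked through by hand on the dataset used in this exercise. This is a sketch using only the standard library. It divides by n - 1 (sample variance), which is an assumption chosen because it agrees with the variance listed in the OUTPUT section.

```python
from collections import Counter
import math

data = [10, 20, 30, 40, 50, 20, 30, 40, 20, 10]

# Frequency distribution: how often each value occurs
frequency = Counter(data)

# Average: the arithmetic mean
n = len(data)
mean = sum(data) / n

# Variability: sample variance and standard deviation from their definitions
squared_deviations = [(x - mean) ** 2 for x in data]
sample_variance = sum(squared_deviations) / (n - 1)
std_dev = math.sqrt(sample_variance)

print("Frequency:", dict(frequency))
print("Mean:", mean)
print("Variance:", sample_variance)
print("Standard Deviation:", std_dev)
```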
PROGRAM :
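import statistics

def frequency_distribution(data):
    freq_dist = {}
    for item in data:
        if item in freq_dist:
            freq_dist[item] += 1
        else:
            freq_dist[item] = 1
    return freq_dist

# The helper bodies below are assumed (standard-library statistics functions);
# the original definitions are missing.
def calculate_mean(data):
    return statistics.mean(data)

def calculate_median(data):
    return statistics.median(data)

def calculate_mode(data):
    return statistics.mode(data)

def calculate_range(data):
    return max(data) - min(data)

def calculate_variance(data):
    return statistics.variance(data)

def calculate_std_dev(data):
    return statistics.stdev(data)

data = [10, 20, 30, 40, 50, 20, 30, 40, 20, 10]
freq_dist = frequency_distribution(data)
print("Frequency Distribution:", freq_dist)
mean = calculate_mean(data)
print("Mean:", mean)
median = calculate_median(data)
print("Median:", median)
mode = calculate_mode(data)
print("Mode:", mode)
data_range = calculate_range(data)
print("Range:", data_range)
variance = calculate_variance(data)
print("Variance:", variance)
std_dev = calculate_std_dev(data)
print("Standard Deviation:", std_dev)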
OUTPUT:
Mean: 27
Median: 25.0
Mode: 20
Range: 40
Variance: 178.88888888888889
Standard Deviation: 13.37493509849258
RESULT:
Ex .No.: 5
NORMAL CURVES, CORRELATION AND SCATTER PLOTS, CORRELATION COEFFICIENT
Date:
AIM:
To visualize the distribution of data points along a bell-shaped curve, to visually represent the
relationship between two variables by plotting their data points, and to quantify the strength and
direction of the linear relationship between two variables.
ALGORITHM:
Step2: Normal Curves: Generate an array of evenly spaced x-values ranging from min to max with
n points.
Step4: Correlation Coefficient Algorithm: Calculate the mean of x and of y. Calculate the
difference between each value and its mean. Multiply the differences for each pair of (x, y)
values. Sum up the products. Calculate the standard deviations of x and y. Divide the sum of
products by n - 1 to obtain the sample covariance. Divide the result by the product of the
standard deviations. The result is the correlation coefficient.
Step5: Stop the Program
THEORY:
A normal curve (or Gaussian distribution) is a bell-shaped curve that represents the distribution of a
set of data. It is symmetric around the mean and characterized by its mean (μ) and standard
deviation (σ). A scatter plot is a graph that shows the relationship between two variables using
Cartesian coordinates. The correlation coefficient (often denoted as r) quantifies the strength and
direction of the linear relationship between two variables.
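Both quantities can be computed directly from their definitions. The sketch below uses only the standard library. The function names `normal_pdf` and `pearson_r` and the sample values are hypothetical, and the sample (n - 1) form of the standard deviation is an assumption.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed step by step."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]                    # differences from the mean
    dy = [y - mean_y for y in ys]
    sum_products = sum(a * b for a, b in zip(dx, dy))
    std_x = math.sqrt(sum(a * a for a in dx) / (n - 1))
    std_y = math.sqrt(sum(b * b for b in dy) / (n - 1))
    covariance = sum_products / (n - 1)
    return covariance / (std_x * std_y)

peak = normal_pdf(0.0)                               # top of the standard bell curve
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])      # hypothetical paired data
print("Peak of the standard normal curve:", peak)
print("Correlation coefficient:", r)
```

A value of r close to +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.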
PROGRAM :
import seaborn as sns
= 0 std_dev = 1 data =
np.random.seed(0) x = np.random.rand(100) y
plt.legend()
plt.show()
plt.show()
OUTPUT:
RESULT:
Ex .No. : 6
Z – TEST
Date:
AIM:
To implement the z-test to determine whether the mean of a sample is statistically significantly
different from a known or hypothesized population mean.
ALGORITHM:
Step 2: Calculate means, standard deviations, and lengths for each group.
Step 6: Test the hypothesis based on the p-value and a significance level of 0.05.
Step 7: Plot a histogram for both groups, with labels, title, and legend.
THEORY:
A Z-test is a statistical test used to determine whether there is a significant difference between
sample and population means or between the means of two samples. It is applicable when the data
is approximately normally distributed and the sample size is large (typically n>30).
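A two-sample z-test can be sketched with the standard library alone. The function name `two_sample_z_test` and the generated groups are hypothetical. Using the sample variances in place of known population variances is an assumption that is reasonable only for large samples.

```python
import math
import random
import statistics

def two_sample_z_test(sample1, sample2):
    """Return (z, two-sided p-value) for the difference between two sample means."""
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    z = (mean1 - mean2) / standard_error
    # Two-sided p-value under the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

rng = random.Random(0)                                   # seeded for repeatability
group1 = [rng.gauss(30, 5) for _ in range(40)]           # hypothetical data
group2 = [rng.gauss(40, 5) for _ in range(40)]

z, p_value = two_sample_z_test(group1, group2)
print("z statistic:", z)
print("p-value:", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")
```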
PROGRAM :
31.5, 33, 33.5, 35, 35.5, 37, 38, 37, 40] mean1,
print("Reject the null hypothesis: There is a significant difference between the means.") else:
print("Fail to reject the null hypothesis: There is no significant difference between the means.")
plt.legend(loc='upper right')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
RESULT:
Ex .No. : 7
T- TEST
Date:
AIM:
ALGORITHM:
Step 1: Define helper functions for variance checking and performing t-tests.
Step 2: Generate random data for one-sample, independent two-sample, and paired scenarios.
Step 6: Visualize the distributions of the two independent groups using histograms with labels,
title, and legend.
THEORY:
A t-test is a statistical test used to determine if there is a significant difference between the means of
two groups. It helps to ascertain whether the observed differences in sample means are likely to
reflect real differences in the populations from which the samples were drawn, or if they might
have occurred by random chance.
Types of t-tests:
1. Independent Samples t-test: Compares the means of two independent groups (e.g., treatment
vs. control groups).
2. Paired Samples t-test: Compares the means of the same group at two different times (e.g.,
before and after treatment).
3. One-Sample t-test: Compares the sample mean to a known value (e.g., a population mean).
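The t statistic for each of the three types can be sketched with the standard library. The function names and sample values are hypothetical, and the independent test here uses the pooled-variance (equal variances) form, which is an assumption. Turning t into a p-value needs the t distribution from a statistics library or a table, so that step is left out.

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - popmean) / standard_error, n - 1

def independent_t(group1, group2):
    """Return (t, degrees of freedom) using the pooled variance of both groups."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(group1) - statistics.mean(group2)) / standard_error, n1 + n2 - 2

def paired_t(before, after):
    """A paired t-test is a one-sample t-test on the pairwise differences."""
    differences = [a - b for a, b in zip(after, before)]
    return one_sample_t(differences, 0.0)

before = [72, 65, 80, 70, 68]     # hypothetical scores before treatment
after = [75, 70, 82, 74, 69]      # hypothetical scores after treatment

t_paired, df_paired = paired_t(before, after)
t_ind, df_ind = independent_t(before, after)
print("Paired t:", t_paired, "df:", df_paired)
print("Independent t:", t_ind, "df:", df_ind)
```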
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def check_variance(group, threshold=1e-5):
variance = np.var(group) if
'independent':
print("Reject the null hypothesis: The means are significantly different.") else:
print("Fail to reject the null hypothesis: The means are not significantly different.")
5, 30)
check_variance(one_sample_data)
check_variance(group1)
check_variance(group2)
check_variance(before)
check_variance(after)
RESULT:
Thus the implementation of t-test is successfully done.
Ex .No. : 8
ANOVA
Date:
AIM:
ALGORITHM:
Step1: Start the program
Step2: Formulate Hypotheses
Step3: Set Significance Level
Step4: Calculate Group Means
Step5: Calculate Overall Mean
Step6: Calculate Sum of Squares
Step7: Calculate F-Statistic
Step8: Determine Critical Value or P-value
Step9: Make Decision
Step10: Stop the program
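Steps 4 to 7 can be sketched for a one-way ANOVA using only the standard library. The function name `one_way_anova` and the three small groups are hypothetical.

```python
import statistics

def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = statistics.mean(all_values)                  # overall mean
    group_means = [statistics.mean(group) for group in groups]

    # Sum of squares between groups and within groups
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

group1 = [1.0, 2.0, 3.0]      # hypothetical observations
group2 = [2.0, 3.0, 4.0]
group3 = [5.0, 6.0, 7.0]

f_stat, df_between, df_within = one_way_anova(group1, group2, group3)
print("F statistic:", f_stat)
print("Degrees of freedom:", df_between, df_within)
```

For the decision step, the F statistic is compared with the critical value for the chosen significance level and these degrees of freedom, or converted to a p-value with a statistics library.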
THEORY:
Types of ANOVA:
1. One-Way ANOVA: Tests for differences among group means in a single factor (e.g.,
comparing test scores of students across different teaching methods).
2. Two-Way ANOVA: Examines the effect of two different factors simultaneously and can
also assess the interaction between the two factors (e.g., studying the effect of teaching
method and study time on test scores).
PROGRAM:
np.random.seed(0) group1 =
pd.DataFrame({
})
OUTPUT:
RESULT:
Thus the implementation of the ANOVA (Analysis of Variance) is successfully done.
Ex .No. : 9
BUILDING AND VALIDATING LINEAR MODELS
Date:
AIM:
ALGORITHM:
Step 1: Import Libraries: Import necessary libraries for numpy, matplotlib, and scikit-learn.
Step 2: Generate Data: Set random seed. Generate feature matrix X and target vector y.
Step 4: Train Model: Initialize and train a linear regression model using training data.
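The build-and-validate flow can be sketched with NumPy alone. This is not the scikit-learn program the steps refer to: `np.linalg.lstsq` stands in for the model fit, and the 80/20 split is an assumption. The data generation follows the lines that survive in the PROGRAM section.

```python
import numpy as np

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Hold out the last 20 rows for validation (the 80/20 split is an assumption)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit y = intercept + slope * x by least squares on the training rows
A_train = np.hstack([np.ones_like(X_train), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
intercept, slope = coef.ravel()

# Validate on the held-out rows with the mean squared error
y_pred = intercept + slope * X_test
mse = float(np.mean((y_test - y_pred) ** 2))

print("Intercept:", intercept)
print("Slope:", slope)
print("Test MSE:", mse)
```

The fitted intercept and slope should land near the values 4 and 3 used to generate the data, and the test error should be close to the noise variance.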
PROGRAM:
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
plt.legend()
plt.show()
OUTPUT:
RESULT:
Thus the implementation of building and validating linear models is successfully done.
Ex .No. : 10
BUILDING AND VALIDATING LOGISTIC MODELS
Date:
AIM:
ALGORITHM:
Step 2: Generate Data: Set random seed. Generate feature matrix X and target vector y using random
values.
Step 3: Split Data: Split data into training and test sets using train_test_split.
Step 4: Train Model: Initialize and train a linear regression model using LinearRegression.
Step 5: Predict: Use the trained model to predict target values for the test data.
Step 6: Evaluate Model: Calculate and print the Mean Squared Error (MSE) using
mean_squared_error.
THEORY:
Building and validating logistic regression models involves preparing the data, specifying the
logistic function, estimating model parameters, and fitting the model. Model validation ensures that
the model performs well on new, unseen data by evaluating it using various metrics,
cross-validation, and diagnostic tests. Regularization techniques are employed to mitigate overfitting
and improve generalizability.
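The pieces named above (logistic function, parameter estimation, regularization, validation on unseen data) can be sketched with NumPy. This is an illustrative sketch, not the program for this exercise. The function names, the synthetic data, the learning rate, and the L2 strength are all assumptions.

```python
import numpy as np

def sigmoid(z):
    """The logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000, l2=0.01):
    """Estimate weights by batch gradient descent on the L2-regularized log loss."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (p - y) / len(y)
        grad[1:] += l2 * w[1:]                      # the bias term is not regularized
        w -= lr * grad
    return w

def predict(X, w):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)             # synthetic binary labels

# Validate on rows the model never saw during training
idx = rng.permutation(200)
train, test = idx[:160], idx[160:]
w = fit_logistic(X[train], y[train])
accuracy = float(np.mean(predict(X[test], w) == y[test]))
print("Test accuracy:", accuracy)
```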
PROGRAM:
OUTPUT:
RESULT:
Thus the implementation of building and validating logistic models is successfully done.
Ex .No. : 11
TIME SERIES ANALYSIS
Date:
AIM:
ALGORITHM:
Step 2: Create Data: Create a dictionary with date range and corresponding values.
Step 3: Create Data Frame: Convert dictionary to Data Frame. Convert date column to datetime. Set
date column as index.
Step 4: Plot Data: Plot the time series data. Customize plot with labels, title, legend, and grid.
Step 5: Display Plot: Show the plot.
Step 6: Generate Summary Statistics: Print summary statistics of the Data Frame. Print the first few
rows of the Data Frame.
THEORY:
Time series analysis involves methods for analyzing time-ordered data points to extract
meaningful statistics, identify patterns, and forecast future values. It is widely used in
various fields, such as economics, finance, environmental science, and engineering, where
understanding temporal dynamics is crucial.
Stationarity:
• Definition: A stationary time series has statistical properties, such as mean and variance,
that are constant over time.
• Importance: Many time series models assume stationarity. Non-stationary series often
need to be transformed to stationary (e.g., differencing).
Autocorrelation:
• Definition: The correlation of a time series with its own past values.
• ACF and PACF: Autocorrelation Function (ACF) and Partial Autocorrelation Function
(PACF) help identify the presence of autocorrelation and determine appropriate model
parameters.
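Stationarity, differencing, and autocorrelation can be illustrated together. The sketch below assumes NumPy. The `autocorrelation` helper and the simulated random walk are hypothetical and are not part of the original program.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive lag."""
    centered = np.asarray(series, dtype=float) - np.mean(series)
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

rng = np.random.default_rng(0)
noise = rng.normal(size=500)
random_walk = np.cumsum(noise)         # non-stationary: its level drifts over time
differenced = np.diff(random_walk)     # first difference, which recovers the noise

acf_walk = autocorrelation(random_walk, 1)
acf_diff = autocorrelation(differenced, 1)
print("Lag-1 autocorrelation of the random walk:", acf_walk)
print("Lag-1 autocorrelation after differencing:", acf_diff)
```

The random walk is strongly correlated with its own past values, while the differenced series shows almost no autocorrelation, which is the behavior expected of a stationary noise series.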
PROGRAM:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
print("Summary Statistics:")
print(df.describe())
print("\nOutput:")
print(df.head())
OUTPUT:
RESULT:
Thus the implementation of time series analysis is successfully done.
Ex .No. : 12
REGRESSION
Date:
AIM:
To model and predict the value of the dependent variable based on the values of the independent
variables.
ALGORITHM:
Step4: Optionally, calculate other statistics such as the coefficient of determination (R²).
Step5:
THEORY:
Regression is a statistical method used to examine the relationship between two or more variables.
The primary goal is to model the relationship and make predictions.
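For one independent variable, the model and a prediction can be sketched with the least-squares formulas and no external libraries. The function names and the small dataset are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: share of the variance in y explained by the line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

hours = [1, 2, 3, 4, 5]       # hypothetical independent variable
scores = [2, 4, 5, 4, 5]      # hypothetical dependent variable

slope, intercept = fit_line(hours, scores)
r2 = r_squared(hours, scores, slope, intercept)
predicted = intercept + slope * 6     # prediction for a new value x = 6

print("Slope:", slope)
print("Intercept:", intercept)
print("R squared:", r2)
print("Prediction for x = 6:", predicted)
```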
PROGRAM:
OUTPUT:
RESULT: