Data Analysis and Visualization Experiment NO. 1
Aim : Getting Introduced to Data Analytics Libraries in Python & R.
Function Description
1) Pandas library
read_csv(): This function is used to retrieve data from CSV files in the form of a dataframe.
import pandas as pd

# reading the CSV file into a dataframe (any further read_csv arguments in the original were cut off)
df = pd.read_csv('people.csv', header=0)
# printing dataframe
print(df.head())
uses
The function is useful for importing tabular data from CSV files into a DataFrame, allowing
for further data manipulation and analysis
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# calling head() method
# storing in new variable
data_top = data.head()
# display
data_top
uses
The head() function is particularly useful when you want to quickly inspect the data in a
DataFrame or when you need to test if the DataFrame has the expected structure
# calling tail() method
# storing in new variable
data_bottom = data.tail()
# display
data_bottom
uses
The tail() function is particularly useful when you want to quickly inspect the data in a
DataFrame or when you need to test if the DataFrame has the expected structure
# importing pandas as pd
import pandas as pd

# generating one random row
row1 = data.sample(n=1)
# display
row1

# generating another row
row2 = data.sample(n=1)
# display
row2
uses
The sample() function is particularly useful when you want to randomly select a subset of rows or columns from a DataFrame for further analysis or testing.
# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# Print the dataframe
df

# concise summary of the dataframe (the info() call itself is implied by the description below)
df.info()
USES:
The info() function is particularly useful during exploratory analysis, offering a quick and
informative overview of the dataset, and it is an essential step in the data analysis
workflow
2) Matplotlib library
FUNCTIONS:-
1. plt.plot(x, y, label='label')
- Explanation: This function is used to create a line plot, connecting data points specified by `x` and `y`. The optional `label` parameter adds a label to the line for legend representation.
- Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y, label='Linear Function')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()
Output-
2. plt.scatter(x, y, c='color', marker='marker')
- Explanation: This function creates a scatter plot, representing individual data points with markers. The `c` and `marker` parameters allow customization of color and marker style.
- Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# scatter call reconstructed; only the label fragment "Points')" survived in the source
plt.scatter(x, y, label='Data Points')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()
Output-
3. plt.bar(x, height)
- Explanation: This function creates a bar chart, where `x` represents the bar positions, and `height` specifies the bar heights. Optional parameters like `width` and `align` customize the bar width and alignment.
- Example:
import matplotlib.pyplot as plt

# data reconstructed; only the tail "2, 5]" of the original values survived
x = ['A', 'B', 'C', 'D']
height = [3, 7, 2, 5]
plt.bar(x, height, label='Values')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.legend()
plt.show()
Output-
4. plt.xlabel('label')
- Explanation: This function sets the label for the x-axis in the plot, providing context
for the data displayed.
- Example:
import matplotlib.pyplot as plt

plt.xlabel('X-axis Label')
Output-
5. plt.title('title')
- Explanation: This function adds a title to the plot, providing an overall description or
name for the visual representation.
- Example:
import matplotlib.pyplot as plt

# title text reconstructed; only the fragment "Plot')" survived
plt.title('Sample Plot')
Output-
3) SciPy library
What is SciPy?
SciPy contains a variety of sub-packages which help to solve the most common issues related to scientific computation.
The SciPy package in Python is the most used scientific library, second only to the GNU Scientific Library for C/C++ or MATLAB's toolboxes.
It is easy to use and understand, and offers fast computational power.
SciPy Functions:
1. Integrate
Description:
The scipy.integrate module is a part of the SciPy library and provides functions for numerical integration, including single, double, and triple integration, as well as solving ordinary differential equations.
Usage:
It is used to compute definite integrals and solve ODEs numerically.
Syntax:
scipy.integrate.quad(func, a, b)
Code:
from scipy import integrate
import numpy as np

# integrand (only this function body survived; the surrounding quad call is reconstructed)
def f(x):
    return np.exp(-x)

result, error = integrate.quad(f, 0, np.inf)
print(result)
Output:
2. Linear Algebra
Description:
The scipy.linalg module is a part of the SciPy library and is used for common linear
algebra operations, such as solving linear systems, singular value decomposition,
eigenvalue problems, and matrix factorization.
Usage:
It is used when SciPy is built using the optimized ATLAS LAPACK and BLAS
libraries, providing fast linear algebra capabilities.
Syntax:
scipy.linalg.det(a)
Code:
import numpy as np
from scipy.linalg import solve

# coefficient matrix A reconstructed (lost in the source); B survives as-is
A = np.array([[3, 2], [1, 2]])
B = np.array([[1], [2]])
x = solve(A, B)
print(x)
Output:
3. Special
Description:
The scipy.special module provides special mathematical functions such as the gamma function, Bessel functions, and combinatorial functions like factorial.
Usage:
It is used when such special functions are needed in scientific and engineering computations.
Syntax:
scipy.special.gamma(x)
Code:
import numpy as np
from scipy.special import factorial

n = 5
result = factorial(n)
print(result)
Output:
4. Stats
Description:
The scipy.stats module is a part of the SciPy library and provides a large number of
probability distributions, summary and frequency statistics, correlation functions, and
statistical tests.
Usage:
It is used to perform various statistical operations on the given data. The functions
in this module are universal functions, which means they can accept NumPy arrays
as input arguments as well as single numbers.
Syntax:
import numpy as np
from scipy.stats import rankdata

# array values after 9 were lost in the source; the list is closed here for a runnable example
my_array = np.array([3, 5, 2, 1, 9])
ranks = rankdata(my_array)
print(ranks)
Output:
5. Signal
Description:
The scipy.signal module is a part of the SciPy library and provides tools for signal
processing, such as filtering, Fourier transforms, and wavelets. It is built on top of
the SciPy library and offers a comprehensive set of functions for working with
different types of signals, including 1D and 2D arrays, as well as multi-channel
signals.
Usage:
The module is used for a wide range of signal processing tasks, including
filtering, noise reduction, spectral analysis, time-frequency analysis, and
more.
Syntax:
import numpy as np
from scipy import signal

sequence1 = np.array([1, 2, 3, 4, 5, 6])
# second operand and the signal call did not survive; a convolution is shown as a representative use
sequence2 = np.array([1, 0, 1])
result = signal.convolve(sequence1, sequence2)
print(result)
Output:
4) SKLEARN
What is Sklearn?
The scikit-learn project began as scikits.learn, a Google Summer of Code project by French research scientist David Cournapeau. Its name refers to the idea that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed extension to SciPy. Later, other programmers rewrote the core codebase.
Implementation of Sklearn
Scikit-learn is mainly coded in Python and heavily utilizes the NumPy library for highly
efficient array and linear algebra computations. Some fundamental algorithms are also built
in Cython to enhance the efficiency of this library. Support vector machines are implemented by a Cython wrapper around LIBSVM, while logistic regression and linear SVMs use a similar wrapper around LIBLINEAR. In such cases, extending these routines in pure Python may not be viable.
Scikit-learn works nicely with numerous other Python packages, including SciPy, Pandas
data frames, NumPy for array vectorization, Matplotlib, seaborn and plotly for plotting
graphs, and many more.
Functions of SKLEARN
1.Datasets
Scikit-learn comes with several inbuilt datasets such as the iris dataset, house prices
dataset, diabetes dataset, etc. The main functions of these datasets are that they are easy
to understand and you can directly implement ML models on them. These datasets are
good for beginners.
Python Code:
from sklearn import datasets
import pandas as pd

dataset = datasets.load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
print(df.head())
2. Data Splitting
Sklearn provided the functionality to split the dataset for training and testing. Splitting the
dataset is essential for an unbiased evaluation of prediction performance. We can define
what proportion of our data to be included in train and test datasets.
With the help of train_test_split, we have split the dataset such that the train set has 80%
and the test set has 20% data.
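A minimal sketch of this split (assuming a feature matrix X and labels y prepared beforehand):

from sklearn.model_selection import train_test_split

# test_size=0.2 holds out 20% of the rows for testing, leaving 80% for training
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)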
3. Linear Regression
This supervised ML model is used when the output variable is continuous and it follows
linear relation with dependent variables. It can be used to forecast sales in the coming
months by analyzing the sales data for previous months.
With the help of sklearn, we can easily implement the Linear Regression model as follows:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

regression_model = LinearRegression()
regression_model.fit(x_train, y_train)
y_predicted = regression_model.predict(x_test)
rmse = mean_squared_error(y_test, y_predicted) ** 0.5  # square root gives RMSE (the original stored plain MSE in rmse)
r2 = r2_score(y_test, y_predicted)
4. Logistic Regression
Logistic Regression is also a supervised regression algorithm just like linear regression. The
only difference is that the output variable is categorical. It can be used to predict whether a
patient has heart disease or not.
With the help of sklearn, we can easily implement the Logistic Regression model as follows:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
cm = confusion_matrix(y_test, y_pred)  # renamed so the confusion_matrix function is not shadowed
print(cm)
print(classification_report(y_test, y_pred))
The confusion matrix and classification report are used to check the accuracy of classification models.
5. Decision Trees
A Decision Tree is a powerful tool that can be used for both classification and regression
problems. It uses a tree-like model to make decisions and predict the output. It consists of
roots and nodes. Roots
represent the decision to split and nodes represent an output variable value. A decision
tree is an important concept.
Decision trees are useful when the dependent variables do not follow a linear relationship with the independent variable, i.e., when linear regression does not give accurate results.
Decision tree implementation:
# imports reconstructed from the surviving fragments (export_graphviz, Image, StringIO, graph_from_dot_data)
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from six import StringIO
from IPython.display import Image
from pydot import graph_from_dot_data

dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data)
(graph,) = graph_from_dot_data(dot_data.getvalue())
y_pred = dt.predict(x_test)
We fit the model with the DecisionTreeClassifier() object and further code is used to
visualize the decision trees implementation in python.
5) NumPy
NumPy (Numerical Python) is a perfect tool for scientific computing
and performing basic and advanced array operations.
The library offers many handy features for performing operations on n-dimensional arrays and matrices in Python. It helps to process arrays that store values of the same data type and makes performing math operations on arrays (and their vectorization) easier. In fact, the vectorization of mathematical operations on the NumPy array type increases performance and accelerates execution time.
Syntax:
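A minimal illustration of basic NumPy array syntax:

import numpy as np

arr = np.array([1, 2, 3, 4])
# one vectorized call operates on every element at once
print(arr * 2)   # [2 4 6 8]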
NumPy array manipulation functions allow us to modify or rearrange NumPy arrays.
np.reshape() is a function that modifies the arrays.
Syntax: np.reshape(array1, (2, 3))
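For instance (assuming array1 holds six elements, matching the 2x3 target shape):

import numpy as np

array1 = np.array([1, 2, 3, 4, 5, 6])
# rearrange the six elements into 2 rows and 3 columns
result = np.reshape(array1, (2, 3))
print(result)  # [[1 2 3] [4 5 6]]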
NumPy provides us with various statistical functions to perform
statistical data analysis.These statistical functions are useful to find
basic statistical concepts like mean, median, variance, etc. It is also
used to find the maximum or the minimum element in an array.
a. Mean
b. Median
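A quick illustration of both on a small sample array:

import numpy as np

arr = np.array([2, 4, 6, 8, 10])
print(np.mean(arr))    # 6.0
print(np.median(arr))  # 6.0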
Application:
In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray, and it provides a lot of supporting functions that make working with ndarray very easy. Arrays are very frequently used in data science, where speed and resources are very important.
R programming language
The R programming language has a rich ecosystem of libraries for data
analytics. Here are some key libraries commonly used for data analytics
in R:
1. filter()
Purpose: Select rows from a data frame that meet specified conditions.
Example (representative):
filter(df, numeric_column > 10)
2. mutate()
Purpose: Create new columns or modify existing ones based on specified transformations.
Example (representative):
mutate(df, new_column = numeric_column * 2)
3. group_by()
Purpose: Group data by one or more variables for
subsequent operations. Example:
group_by(df, category_column)
4. summarize()
Purpose: Generate summary statistics for groups of data.
Example (representative):
summarize(grouped_df, mean_value = mean(numeric_column))
5. arrange()
Purpose: Order rows based on one or more variables.
Example:
arrange(df, desc(numeric_column))
ggplot2: A widely used library for creating graphics based on the grammar of graphics.
Functions:
1. ggplot()
Purpose: Initialize a ggplot object and define the data and aesthetic mappings.
Example (representative):
ggplot(df, aes(x = x_column, y = y_column))
2. geom_point()
Purpose: Add points to a plot.
Example:
geom_point()
3. geom_line()
Purpose: Add lines to a plot, connecting data points.
Example:
geom_line()
4. facet_wrap()
Purpose: Create small multiples (faceted plots) based on a categorical variable.
Example:
facet_wrap(~category_variable)
5. labs()
Purpose: Customize axis labels, plot title, and other annotations.
Example (representative):
labs(title = "Plot Title", x = "X Label", y = "Y Label")
tidyr: Provides functions for tidying and reshaping data.
Functions:
1. gather()
Purpose: Collect multiple columns into key-value pairs.
Example (representative):
gather(df, key = "key_column", value = "value_column", col1, col2)
2. spread()
Purpose: Spread key-value pairs into separate columns.
Example:
spread(df, key = "existing_key_column", value = "existing_value_column")
3. separate()
Purpose: Split a single column into multiple columns based on a delimiter.
Example (representative):
separate(df, combined_column, into = c("col1", "col2"), sep = "_")
4. unite()
Purpose: Combine multiple columns into a single column.
Example (representative):
unite(df, new_column, col1, col2, sep = "_")
5. complete()
Purpose: Ensure that all combinations of a set of columns have a row, filling in missing values.
Example (representative):
complete(df, col1, col2)
readr: Offers efficient methods for reading and parsing data from various formats.
Functions:
1. read_csv()
Purpose: Read a CSV file into a data frame.
Example:
read_csv("file_path.csv")
2. read_table()
Purpose: Read a delimited text file into a data frame.
Example (representative):
read_table("file_path.txt")
3. read_excel()
Purpose: Read data from an Excel file into a data frame (note: this function comes from the readxl package rather than readr).
Example:
read_excel("file_path.xlsx", sheet = "Sheet1")
4. read_fwf()
Purpose: Read a fixed-width format file into a data frame.
Example:
read_fwf("file_path.txt", fwf_widths(c(10, 15, 20)))
5. read_delim()
Purpose: Read a delimited text file into a data frame, allowing customization
of delimiters.
Example:
read_delim("file_path.txt", delim = ";")
caret: Provides a unified interface for training and evaluating machine learning models.
Functions:
1. train()
Purpose: Train a model using a specified algorithm.
Example (representative):
train(response_variable ~ ., data = df, method = "rf")
2. predict()
Purpose: Generate predictions from a trained machine learning model.
Example:
predict(model, new_data)
3. confusionMatrix()
Purpose: Compute a confusion matrix for model evaluation.
Example:
confusionMatrix(predicted_values, actual_values)
4. featurePlot()
Purpose: Create feature plots to visualize the distribution of features by class.
Example:
featurePlot(x = df[, predictors], y = df$target_variable, plot = "box")
5. caretList()
Purpose: Train multiple models with different algorithms and parameters (from the caretEnsemble package).
Example:
caretList(response_variable ~ ., data = df, methodList = c("lm", "rf", "svm"))
EXP 2: Implement Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent
variable and one dependent variable. The equation for simple linear regression is:
y=\beta_{0}+\beta_{1}X
where:
# Y is the dependent variable
# X is the independent variable
# β0 is the intercept
# β1 is the slope
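Although the derivation is not part of this write-up, the standard least-squares estimates of these coefficients are
\beta_{1}=\frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i}(X_{i}-\bar{X})^{2}},\qquad\beta_{0}=\bar{Y}-\beta_{1}\bar{X}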
This involves more than one independent variable and one dependent variable. The equation for multiple linear regression is:
y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\dots+\beta_{n}X_{n}
where:
# Y is the dependent variable
Linear regression is a powerful tool for understanding and predicting the behavior of a variable; however, it needs to meet a few conditions in order to produce accurate and dependable results.
1) Linearity: The independent and dependent variables have a linear relationship with one another. This implies that changes in the dependent variable follow those in the independent variable(s) in a linear fashion, meaning a straight line can be drawn through the data points. If the relationship is not linear, then linear regression will not be an accurate model.
2) Independence:
The observations in the dataset are independent of each other. This means that the
value of the dependent variable for one observation does not depend on the value of
the dependent variable for another observation. If the observations are not
independent, then linear regression will not be an accurate model.
3) Homoscedasticity:
Across all levels of the independent variable(s), the variance of the errors is
constant. This indicates that the amount of the independent variable(s) has no
impact on the variance of the errors. If the variance of the residuals is not
constant, then linear regression will not be an accurate model.
Code:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

file_path = 'sample_data/california_housing_test.csv'
df = pd.read_csv(file_path)

df_sampled = df.sample(n=20, random_state=42)
X = df_sampled[['total_rooms', 'median_income']]
y = df_sampled['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

plt.scatter(X_test['median_income'], y_test, color='black')  # Adjust the x-axis for visualization if needed
plt.plot(X_test['median_income'], y_pred, color='blue', linewidth=3)
plt.title('Linear Regression')
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.show()
Output:
Conclusion: Thus, we successfully implemented simple linear regression on the given data set.
EXP 3 : Implement Multiple Linear Regression
Assumptions:
5. No multicollinearity: Independent variables are not highly correlated with each other.
Modeling steps:
3. Model Specification: Choose the independent variables and their functional form.
4. Parameter Estimation: Use methods like Ordinary Least Squares (OLS) to estimate the coefficients.
5. Model Evaluation: Assess the fit using measures such as R-squared, residual plots, etc.
6. Inference and Interpretation: Interpret the estimated coefficients and their significance, and use the model to make predictions on new data.
Challenges:
-Overfitting: Including irrelevant variables or too many predictors can lead to overfitting.
-Underfitting: Having too few predictors or oversimplifying the model can lead to underfitting.
-Collinearity: High multicollinearity among predictors can lead to unstable
estimates of coefficients.
Multiple linear regression is a versatile tool widely used in various fields for
understanding relationships between variables and making predictions. However, careful
attention to model assumptions and data quality is crucial for its successful application.
Simple vs. multiple linear regression:
- Visualization: Simple linear regression is typically visualized with a 2D scatter plot and a line of best fit; multiple linear regression requires 3D or multi-dimensional space, often represented using partial regression plots.
- Risk of Overfitting: Lower for simple regression, as it deals with only one predictor; higher for multiple regression, especially if too many predictors are used without adequate data.
- Multicollinearity Concern: Not applicable to simple regression, as there is only one predictor; a primary concern for multiple regression, since correlated predictors can affect the model's accuracy and interpretation.
- Applications: Simple regression suits basic research, simple predictions, and understanding a singular relationship; multiple regression suits complex research, multifactorial predictions, and studying interrelated systems.
Code :
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

def generate_dataset(n):
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)

x, y = generate_dataset(200)

mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# plotting call reconstructed: the original lines between here and plt.show() were lost
ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5)
ax.legend()
plt.show()
Output :
Experiment 4
THEORY:
Logistic regression is the appropriate regression analysis to conduct when the
dependent variable is dichotomous (binary). Like all regression analyses, logistic
regression is a predictive analysis.
It is used to describe data and to explain the relationship between one dependent
binary variable and one or more nominal, ordinal, interval or ratio-level independent
variables.
Types of Logistic Regression
Binary logistic regression
Binary logistic regression is used to predict the probability of a binary outcome, such as yes or no, true or false, or 0 or 1. For example, it could be used to predict whether a customer will churn or not, whether a patient has a disease or not, or whether a loan will be repaid or not.
Multinomial logistic regression
Multinomial logistic regression is used to predict the probability of one of three or
more possible outcomes, such as the type of product a customer will buy, the rating
a customer will give a product, or the political party a person will vote for.
Here the threshold value is 0.5, which means if the value of h(x) is greater than 0.5
then we predict malignant tumor (1) and if it is less than 0.5 then we predict benign
tumor (0). Everything seems okay here but now let’s change it a bit, we add some
outliers in our dataset, now this best fit line will shift to that point. Hence the line will
be somewhat like this:
The blue line represents the old threshold and the yellow line represents the new
threshold which is maybe 0.2 here. To keep our predictions right we had to lower our
threshold value. Hence we can say that linear regression is prone to outliers. Now
here if h(x) is greater than 0.2 then only this regression will give correct outputs.
Another problem with linear regression is that the predicted values may be out of
range. We know that probability can be between 0 and 1, but if we use linear
regression this probability may exceed 1 or go below 0.
To overcome these problems we use Logistic Regression, which converts this
straight best fit line in linear regression to an S-curve using the sigmoid function,
which will always give values between 0 and 1.
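In the notation used earlier, the sigmoid gives the hypothesis
h(x)=\frac{1}{1+e^{-(\beta_{0}+\beta_{1}X)}}
which maps any real-valued input into the interval (0, 1), so its output can be read directly as a probability.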
2. Train the model: We teach the model by showing it the training data. This
involves finding the values of the model parameters that minimize the error
in the training data.
3. Evaluate the model: The model is evaluated on the held-out test data to
assess its performance on unseen data.
4. Use the model to make predictions: After the model has been trained and
assessed, it can be used to forecast outcomes on new data.
CODE:
import numpy
from sklearn import linear_model

# Reshaped for Logistic function.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

log_odds = logr.coef_
odds = numpy.exp(log_odds)

# predict if tumor is cancerous where the size is 3.46mm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1, 1))
print("Predicted: ")
print(predicted)
print("\nOdds: ")
print(odds)
print("\n")

def logit2prob(logr, X):
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

print("Probability that each tumor is cancerous is \n")
print(logit2prob(logr, X))
OUTPUT:
Experiment 5
Core Components:
● Axes:
o X-axis (Time Axis): Represents the chronological order of the data points. The scale on the x-axis can vary depending on the data, ranging from seconds to decades.
o Y-axis (Measurement Axis): Represents the values being measured. The scale on the y-axis depends on the specific variable being plotted.
● Data Points: Each data point signifies a specific measurement at a corresponding time instance. These points are typically plotted using circles, squares, or other markers.
● Connecting Lines (Optional): Lines are often used to connect consecutive data points, highlighting the trends or evolution of the variable over time.
Types of Time-Series Graphs:
● Line Graph: The most common type, it uses lines to connect data points, ideal for visualizing trends and seasonality.
● Scatter Plot: Plots each observation as an individual marker without connecting lines.
Autocorrelation Function (ACF):
The ACF measures the linear correlation between a time series and its lagged versions. In simpler
terms, it estimates how much a value in the series is related to values at previous time
steps (lags). The ACF is calculated for various lags, and the resulting plot depicts the
correlation coefficient at each lag.
Key characteristics of ACF plot:
● Decaying pattern: An ACF exhibiting a decaying pattern suggests that the influence of past values on current values diminishes over time.
● Peaks at specific lags: Significant spikes at specific lags indicate a potential relationship between the current value and the value at that particular lag.
Partial Autocorrelation Function (PACF):
While the ACF captures the overall correlation, the PACF goes a step further. It measures the partial autocorrelation between a time series and its lagged versions, controlling for the influence of intervening lags. In essence, the PACF isolates the unique effect of a specific lag on the current value, independent of the correlations at shorter lags.
Key characteristic of PACF plot:
● Significant spikes at a few lags: A PACF with significant spikes only at a few lags suggests an autoregressive (AR) model might be suitable for forecasting. The lags with spikes indicate the order of the AR model.
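Both functions can be plotted directly with statsmodels (a short sketch, assuming a pandas Series named series such as the one loaded in the code below):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

plot_acf(series, lags=30)   # overall correlation at lags 1..30
plot_pacf(series, lags=30)  # correlation at each lag, controlling for shorter lags
plt.show()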
Applications:
● Anomaly detection: Unusual spikes or dips stand out visually, prompting further investigation into potential causes.
● Forecasting: By analyzing historical patterns, time-series graphs can be used to make predictions about future values.
● Comparison: Multiple time series can be plotted on the same graph to compare trends across different variables or categories.
Additional Considerations:
● Time Scale: The chosen time scale on the x-axis significantly impacts the interpretation of the data. Ensure the scale provides an adequate view of the relevant time period.
● Data Aggregation: Depending on the frequency of data collection, data points might be aggregated (e.g., daily averages) for a clearer visualization on the graph.
● Missing Data: Strategies exist to handle missing data points, such as leaving gaps or using interpolation techniques.
Code:
from pandas import read_csv
import matplotlib.pyplot as plt

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True)
# squeeze the single-column DataFrame into a Series (the original did not assign the result)
series = series.squeeze('columns')
print(series.head())
series.plot()
plt.show()
Output:
Experiment 6
Applications of ARIMA Models:
ARIMA models find application in various domains due to their effectiveness in forecasting:
● Business: Sales forecasting, inventory management, demand prediction.
● Finance: Stock price prediction, market trend analysis, risk assessment.
● Economics: Macroeconomic forecasting (GDP, inflation), resource management.
● Social Sciences: Population forecasting, resource allocation, trend analysis.
Advantages of ARIMA:
● Interpretability: The model components (AR, I, MA) are statistically well-understood, aiding in interpreting the results.
● Flexibility: ARIMA can be adapted to various time series data by adjusting its parameters.
● Effectiveness: For stationary time series, ARIMA can generate accurate forecasts.
Limitations of ARIMA:
● Stationarity Assumption: Relies heavily on the assumption of stationarity in the data.
● Limited Model Complexity: May not capture complex non-linear relationships in the data.
● Data-Driven Nature: Reliant on historical data, potentially leading to poor forecasts for significant changes.
Code / Output:
from pandas import read_csv
from datetime import datetime

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo.csv', header=0, parse_dates=[0], index_col=0, date_parser=parser)
# imports assumed; the original import cell did not survive
from statsmodels.tsa.arima.model import ARIMA
from pandas import DataFrame

model = ARIMA(series, order=(5,1,0))
model_fit = model.fit()
print(model_fit.summary())
residuals = DataFrame(model_fit.resid)
residuals.plot()
from sklearn.metrics import mean_squared_error
from math import sqrt
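The walk-forward evaluation loop itself did not survive in this copy; a minimal sketch of the usual approach (train/test split ratio assumed, ARIMA order matching the fit above):

X = series.values
size = int(len(X) * 0.66)
train, test = X[:size], X[size:]
history = list(train)
predictions = []
for t in range(len(test)):
    # refit on everything seen so far and forecast one step ahead
    model = ARIMA(history, order=(5, 1, 0))
    model_fit = model.fit()
    yhat = model_fit.forecast()[0]
    predictions.append(yhat)
    history.append(test[t])
rmse = sqrt(mean_squared_error(test, predictions))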
print('Test RMSE: %.3f' % rmse)
# plot forecasts against expected values
Experiment 7
5. Text Feature Engineering: Creating numerical features from the text data to facilitate
analysis.
This might involve techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
6. Text Analysis Techniques: Depending on the objective, various techniques
are employed. Here are a few examples:
o Sentiment Analysis: Classifying the polarity of text (positive, negative, neutral)
to understand opinions or emotions.
o Topic Modeling: Identifying underlying themes or topics discussed within a
corpus of text documents.
o Entity Recognition: Identifying and classifying named entities like
people, organizations, or locations.
7. Visualization and Reporting: Presenting the results of the text analysis in a clear and concise manner.
The overall workflow is:
1. Define the Objective: What insights do you want to extract from the text data?
2. Data Collection and Preprocessing: Gather the relevant textual data and clean it for
analysis.
3. Feature Engineering: Create features suitable for the chosen text analysis technique.
4. Model Training and Analysis: Depending on the objective, train a model or apply
the chosen text analysis technique.
5. Evaluation and Refinement: Assess the results and refine the model or techniques as
needed.
6. Visualization and Reporting: Present the insights gleaned from the text analysis.
By following these steps and leveraging the seven practice areas, text analytics empowers
businesses and organizations to:
□ Understand customer sentiment: Analyze reviews, social media posts, and
surveys to gauge customer satisfaction and identify areas for improvement.
□ Gain market insights: Analyze news articles, social media trends, and
online discussions to understand market trends and competitor strategies.
□ Improve communication: Analyze communication patterns and
identify areas where communication can be more effective.
□ Categorize documents: Automate document classification based on
content for efficient information retrieval.
Spam Filtering:
Spam filtering combines techniques from text analytics and machine learning to
distinguish between legitimate emails and unwanted spam messages. Here's a breakdown
of the key concepts involved:
o Naive Bayes: A popular and efficient algorithm for text classification tasks like spam filtering.
o Support Vector Machines (SVMs): Can be effective for handling high-dimensional text data.
o Random Forests: Ensemble learning method that combines multiple decision trees for improved accuracy.
Machine Learning Model Training:
1. Data Preparation: A large corpus of labeled emails (spam and legitimate) is assembled.
2. Feature Extraction: Text analytics techniques are applied to extract features from the emails.
3. Model Training: The chosen machine learning algorithm is trained on the labeled data, learning to distinguish spam from legitimate emails based on the extracted features.
Filtering and Evaluation:
□ Incoming emails are processed through the trained model, and a spam probability score is generated.
□ Emails exceeding a predefined threshold are classified as spam and filtered or directed to a spam folder.
□ The model's performance is continuously monitored and evaluated on new data. Periodic retraining might be necessary to maintain effectiveness as spammers adapt their tactics.
Additional Considerations:
□ False Positives and False Negatives: No system is perfect.
Occasionally, legitimate emails might be flagged as spam (false positives) and vice
versa (false negatives). The filtering system needs to be balanced to minimize both
types of errors.
□ Real-time Updates: Spammers continuously evolve their techniques. Regular updates to the training data and potentially the machine learning model are crucial to maintain effectiveness.
Code / Output:
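The notebook's import and data-loading cell did not survive extraction; judging from the cells below, it presumably resembled the following (the file name is hypothetical; columns 'Text' and 'Spam' are the ones actually used):

import pandas as pd
import matplotlib.pyplot as plt
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# nltk.download('punkt') and nltk.download('stopwords') may be needed on first run
ps = PorterStemmer()
dataset = pd.read_csv('emails.csv')  # hypothetical file name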
dataset.shape
dataset.head()
dataset.shape
dataset.head()
dataset.isnull().sum()
dataset.duplicated().sum()
dataset.head()
dataset.shape
dataset.duplicated().sum()
dataset['Total_chars'] = dataset['Text'].apply(len)
dataset.head()
dataset['Total_words'] = dataset['Text'].apply(lambda x: len(nltk.word_tokenize(x)))
dataset.head()
dataset['Total_sentences'] = dataset['Text'].apply(lambda x: len(nltk.sent_tokenize(x)))
dataset.head()
dataset[dataset['Spam'] == 0].iloc[:,2:].describe()
dataset[dataset['Spam'] == 1].iloc[:,2:].describe()
def text_transformation(text):
    # Converting to Lower Case
    text = text.lower()
    # Tokenization
    text = nltk.word_tokenize(text)
    # Removing stop words and punctuation
    lst = []  # list initialisation (missing from this copy of the cell)
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            lst.append(i)
    # Stemming
    text = lst[:]
    lst = []
    for i in text:
        lst.append(ps.stem(i))
    return " ".join(lst)  # return reconstructed; the cell is cut off here in the source
dataset['Transformed_text'] = dataset['Text'].apply(text_transformation)  # assignment target inferred from later cells
dataset.head()
spam_words = []
spam_msgs = dataset[dataset['Spam'] == 1]['Transformed_text'].tolist()
for i in spam_msgs:
    for j in i.split():
        spam_words.append(j)

from collections import Counter
c1 = pd.DataFrame(Counter(spam_words).most_common(50))
c1.head()
plt.xticks(rotation='vertical')

ham_words = []
ham_msgs = dataset[dataset['Spam'] == 0]['Transformed_text'].tolist()
for i in ham_msgs:
    for j in i.split():
        ham_words.append(j)

c2 = pd.DataFrame(Counter(ham_words).most_common(50))  # fixed: the original counted spam_words here
c2.head()
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)
mnb = MultinomialNB()
bnb = BernoulliNB()

X_cv = cv.fit_transform(dataset['Transformed_text']).toarray()
X_cv[10]
X_cv.shape
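The cells that split the data and fit the first (Gaussian) model are missing here; presumably something like the following, with split proportions assumed:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(
    X_cv, dataset['Spam'].values, test_size=0.2, random_state=2)

gnb = GaussianNB()
gnb.fit(X_train_cv, y_train_cv)
y_pred_gnb_cv = gnb.predict(X_test_cv)
print('Accuracy Score for GNB & cv = ', accuracy_score(y_test_cv, y_pred_gnb_cv))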
print('Confusion Matrix for GNB & cv is : \n', confusion_matrix(y_test_cv, y_pred_gnb_cv))
print('Precision Score for GNB & cv = ', precision_score(y_test_cv, y_pred_gnb_cv))

mnb.fit(X_train_cv, y_train_cv)
y_pred_mnb_cv = mnb.predict(X_test_cv)
print('Accuracy Score for MNB & cv = ', accuracy_score(y_test_cv, y_pred_mnb_cv))
print('Confusion Matrix for MNB & cv is : \n', confusion_matrix(y_test_cv, y_pred_mnb_cv))
print('Precision Score for MNB & cv = ', precision_score(y_test_cv, y_pred_mnb_cv))

bnb.fit(X_train_cv, y_train_cv)
y_pred_bnb_cv = bnb.predict(X_test_cv)
print('Accuracy Score for BNB & cv = ', accuracy_score(y_test_cv, y_pred_bnb_cv))
print('Confusion Matrix for BNB & cv is : \n', confusion_matrix(y_test_cv, y_pred_bnb_cv))
print('Precision Score for BNB & cv = ', precision_score(y_test_cv, y_pred_bnb_cv))
X_tfidf.shape
bnb.fit(X_train_tfidf, y_train_tfidf)
y_pred_bnb_tfidf = bnb.predict(X_test_tfidf)
print('Accuracy Score for BNB & tfidf = ', accuracy_score(y_test_tfidf, y_pred_bnb_tfidf))
print('Confusion Matrix for BNB & tfidf is : \n', confusion_matrix(y_test_tfidf, y_pred_bnb_tfidf))
print('Precision Score for BNB & tfidf = ', precision_score(y_test_tfidf, y_pred_bnb_tfidf))
Conclusion: Thus we have successfully performed text analytics in the case of spam filters.
Experiment no.8
Aim : Exploring different Libraries for visualization in Python & R
Theory :
When exploring different libraries for visualization in Python and R, it's
essential to consider factors such as ease of use, flexibility, variety of
plot types, interactivity, performance, and integration with other tools.
Both Python and R offer a rich ecosystem of visualization libraries,
each with its own strengths and weaknesses.
Python Libraries:
Matplotlib:
Theory: Matplotlib is a versatile plotting library that allows you to
create static, interactive, and animated visualizations in Python. It
provides a MATLAB-like interface for creating basic plots quickly and
supports a wide range of plot types.
Code:
import matplotlib.pyplot as plt
# Example usage
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.title('Title')
plt.show()
Seaborn:
Theory: Seaborn is built on top of Matplotlib and provides a higher-
level interface for creating attractive and informative statistical
graphics. It simplifies the process of creating complex plots such as
violin plots, pair plots, and heatmaps.
Code:
import seaborn as sns

# Example usage (this loading line is an assumed addition; the original leaves iris undefined)
iris = sns.load_dataset('iris')
sns.scatterplot(x='sepal_length', y='sepal_width', data=iris)
Plotly:
Theory: Plotly is a powerful library for creating interactive and web-
based visualizations. It supports a wide range of plot types and offers
interactivity features such as zooming, panning, and tooltips.
Code:
import plotly.express as px

# Example usage (df assumed; using plotly's built-in iris sample data)
df = px.data.iris()
fig = px.scatter(df, x='sepal_length', y='sepal_width', color='species')
fig.show()
Conclusion:
Both Python and R offer a variety of powerful visualization libraries,
each catering to different needs and preferences. Matplotlib and
ggplot2 are widely used for creating static plots with high
customization. Seaborn and Plotly provide higher-level interfaces with
additional features for statistical visualization and interactivity.
Ultimately, the choice of library depends on factors such as the type of
visualization needed, ease of use, and integration with other tools in
your workflow.
Experiment 9
Theory:
The iris dataset is a well-known dataset in R, containing information on the sepal length, sepal width, petal length, and petal width of 150 iris flowers, divided into three species.
Code & Output:
1. Using ggplot2
2. Using MASS
3. Using ggvis
4. Using lattice:
Conclusion:
We explored diverse visualization techniques on the iris dataset using R libraries like ggplot2, MASS, ggvis, and lattice, providing insights into the dataset's characteristics and variable relationships. These visualizations offer valuable tools for data exploration and analysis in data science and machine learning.
Experiment 10
Theory:
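The theory notes and database-connection cell did not survive in this copy; a minimal sketch of fetching the sales data used below (connection string, table, and column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# hypothetical MySQL connection string and table
engine = create_engine('mysql+pymysql://user:password@localhost/sales_db')
sales_data = pd.read_sql('SELECT product_category, sales_amount FROM sales', engine)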
# Example: Creating a bar plot of sales by product category
sales_by_category = sales_data.groupby('product_category')['sales_amount'].sum()
sales_by_category.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Total Sales Amount')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Conclusion:
We connected to a MySQL database, fetched sales data, and visualized the total
sales by product category using Python libraries. The choice of database and
visualization techniques can greatly impact the efficiency and effectiveness of
your data analysis and presentation. It's essential to carefully consider your
project requirements and select the most suitable tools for the task at hand.
Additionally, Python provides a wide range of libraries and tools, making it a
versatile choice for data analysis and visualization tasks.
Experiment 11
Theory:
Microsoft Power BI is a unified platform that empowers users to connect to, visualise, and
analyse data from various sources. It caters to both self-service BI (Business Intelligence)
needs for individual users and enterprise-wide BI deployments for organizations.
Here's a comprehensive breakdown of Power BI and its functionalities:
Core Features:
●Data Connectivity: Power BI bridges the gap between data and insights by connecting to a
wide range of sources. This includes relational databases, cloud storage services
(like Azure Blob Storage), Excel spreadsheets, and even social media platforms.
●Data Transformation and Cleaning: Power BI provides tools for data transformation and
cleaning. Users can import, filter, transform, and enrich their data to ensure its accuracy
and relevance for analysis.
Data Visualization Techniques in Power BI:
● Basic Visualizations: These include bar charts, column charts, line charts, pie charts, and scatter plots. They are ideal for conveying basic trends, comparisons, and relationships within the data.
● Advanced Visualizations: Power BI offers more sophisticated visualizations like:
o Heatmaps: Reveal patterns and correlations between two categorical variables.
o Funnel Charts: Depict stages in a process and identify drop-off points.
o Treemaps: Visualize hierarchical data structures, showcasing how parts contribute to the whole.
o Gauge Charts: Communicate performance against targets or KPIs (Key Performance Indicators).
o Cards: Display key metrics in a concise and easy-to-understand format.
o KPIs (Power KPI): Combine multiple metrics into a single visual, providing a holistic view of performance.
● Custom Visualizations: As mentioned earlier, Power BI's extensibility allows users to leverage custom visuals for specific needs. These can include advanced charts, network graphs, and other specialized visualizations.
Implementation:
Conclusion: Thus, we have performed visualisation and gained insights on data using PowerBI.