Data Analysis and Visualization Experiments

DAV EXP NO. 1

Aim: Getting Introduced to Data Analytics libraries in Python & R.

1) Pandas read_csv()
This function is used to retrieve data from a CSV file in the form of a DataFrame.

import pandas as pd

# reading selected columns of a CSV file into a DataFrame
df = pd.read_csv('people.csv',
                 header=0,
                 usecols=["First Name", "Sex", "Email"])

# printing dataframe
print(df.head())

Uses:
The function is useful for importing tabular data from CSV files into a DataFrame, allowing for further data manipulation and analysis.

Pandas head()
This function is used to return the top n (5 by default) rows of a DataFrame or Series.

# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# calling head() method, storing in new variable
data_top = data.head()

# display
data_top

Uses:
The head() function is particularly useful when you want to quickly inspect the data in a DataFrame or when you need to test if the DataFrame has the expected structure.

Pandas tail()
This method is used to return the bottom n (5 by default) rows of a DataFrame or Series.

# importing pandas module
import pandas as pd

# making data frame


data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# calling tail() method, storing in new variable
data_bottom = data.tail()

# display
data_bottom

Uses:
The tail() function is particularly useful when you want to quickly inspect the data in a DataFrame or when you need to test if the DataFrame has the expected structure.

Pandas sample()
This method is used to generate a random sample of rows (or columns) from the data frame.

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# generating one row

row1 = data.sample(n = 1)

# display

row1

# generating another row
row2 = data.sample(n = 1)

# display
row2

Uses:
The sample() function is particularly useful when you want to randomly select a subset of rows or columns from a DataFrame for further analysis or testing.

Pandas info()
This method is used to generate a summary of the DataFrame, including the column names, their datatypes, and the number of non-null (non-missing) values.

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# Print the summary of the dataframe
df.info()

Uses:
The info() function is particularly useful during exploratory analysis, offering a quick and informative overview of the dataset; it is an essential step in the data analysis workflow.

2) Matplotlib library

Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. It was introduced by John Hunter in the year 2002. One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram, etc.

Some of the sample plots that are covered :-

□ Matplotlib Line Plot

□ Matplotlib Bar Plot

□ Matplotlib Histograms Plot

□ Matplotlib Scatter Plot

□ Matplotlib Pie Charts

□ Matplotlib Area Plot
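
The Functions section below demonstrates the line, scatter, and bar plots from this list. The pie chart and area plot can be sketched in the same style; the following is a minimal sketch with made-up categories and values (not taken from any example in this manual):

import matplotlib.pyplot as plt

# Pie chart from hypothetical category shares
labels = ['A', 'B', 'C', 'D']
sizes = [30, 25, 25, 20]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()

# Area plot: a line plot with the region below it filled in
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.fill_between(x, y, alpha=0.4)
plt.title('Area Plot')
plt.show()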

FUNCTIONS:-

1. plt.plot(x, y, label='label')

- Explanation: This function is used to create a line plot, connecting data points specified by `x` and `y`. The optional `label` parameter adds a label to the line for legend representation.
- Example:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y, label='Linear Function')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()

plt.show()

Output-

2. plt.scatter(x, y, c='color', marker='marker')

- Explanation: This function creates a scatter plot, representing individual data points
with markers. The `c` and `marker` parameters allow customization of color and
marker style.
- Example:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.scatter(x, y, c='blue', marker='o', label='Data Points')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()

Output-


3. plt.bar(x, height, width=width, align='center')

- Explanation: This function creates a bar chart, where `x` represents the bar positions,
and `height` specifies the bar heights. Optional parameters like `width` and `align`
customize the bar width and alignment.

- Example:

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
values = [3, 7, 2, 5]

plt.bar(categories, values, width=0.6, align='center', color='green', label='Bar Chart')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.legend()
plt.show()

Output-


4. plt.xlabel('label')

- Explanation: This function sets the label for the x-axis in the plot, providing context
for the data displayed.
- Example:

import matplotlib.pyplot as plt

plt.xlabel('X-axis Label')

Output-


5. plt.title('title')

- Explanation: This function adds a title to the plot, providing an overall description or
name for the visual representation.
- Example:

import matplotlib.pyplot as plt

plt.title('Title for the Plot')

Output-

3) SciPy library

What is SciPy?

SciPy (Scientific Python) is an open-source scientific computing module for Python. Based on NumPy, SciPy includes tools to solve scientific problems. Scientists created this library to address their growing needs for solving complex issues.

Why use SciPy?

SciPy contains a variety of subpackages that help solve the most common issues related to scientific computation.
The SciPy package is one of the most used scientific libraries in Python, second only to the GNU Scientific Library for C/C++ and MATLAB's toolboxes.
It is easy to use and understand, offers fast computational power, and can operate on NumPy arrays.

Scipy Functions:

1. Integrate

Description:

The scipy.integrate module is a part of the SciPy library and provides functions for numerical integration, including single, double, and triple integration, as well as solving ordinary differential equations.

Usage:

It is used to perform numerical integration of a given function. The functions in this module are universal functions, which means they can accept NumPy arrays as input arguments as well as single numbers.

Syntax:

One of the functions provided by scipy.integrate is scipy.integrate.quad(), which computes a definite integral. The syntax for computing the definite integral of a function f between the limits a and b is:
scipy.integrate.quad(f, a, b)


Code:
import numpy as np
from scipy.integrate import quad

def f(x):
    return np.exp(-x)

result, error = quad(f, 0, 1)
print(f"The integral of f(x) from 0 to 1 is {result} with an error of {error}")

Output:

2. Linear Algebra

Description:

The scipy.linalg module is a part of the SciPy library and is used for common linear
algebra operations, such as solving linear systems, singular value decomposition,
eigenvalue problems, and matrix factorization.

Usage:
It is used when SciPy is built using the optimized ATLAS LAPACK and BLAS
libraries, providing fast linear algebra capabilities.
Syntax:

One of the functions provided by scipy.linalg is linalg.det(), which finds the determinant of a matrix. The syntax for finding the determinant of a matrix a is:

scipy.linalg.det(a)

Code:

import numpy as np
from scipy.linalg import solve

A = np.array([[3, 2], [1, 4]])
B = np.array([[1], [2]])

x = solve(A, B)
print("The solution for the linear system is:")
print(x)

Output:
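
Since the syntax above refers to linalg.det() while the code demonstrates solve(), a minimal sketch of the determinant call is added here as well, reusing the same matrix values (this sketch is not part of the original listing):

import numpy as np
from scipy.linalg import det

a = np.array([[3, 2], [1, 4]])
print("The determinant of a is:", det(a))   # 3*4 - 2*1 = 10.0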

3. Special

Description:

The scipy.special module is a part of the SciPy library and provides a collection of special functions, such as Bessel functions, elliptic functions, gamma functions, hypergeometric functions, and many more.

Usage:

It is used to perform mathematical operations on the given data. The functions in this module are universal functions, which means they can accept NumPy arrays as input arguments as well as single numbers.

Syntax:

One of the functions provided by scipy.special is scipy.special.gamma(), which computes the gamma function. The syntax for computing the gamma function of a number x is:

scipy.special.gamma(x)

Code:

import numpy as np
from scipy.special import factorial

n = 5
result = factorial(n)
print(f"The factorial of {n} is {result}")

Output:
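
Since the syntax above refers to scipy.special.gamma() while the code demonstrates factorial(), a minimal sketch of the gamma call follows (gamma(n + 1) equals n!, which makes the two outputs easy to compare; this sketch is not part of the original listing):

from scipy.special import gamma

x = 6
print(f"The gamma function of {x} is {gamma(x)}")   # gamma(6) = 5! = 120.0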

4. Stats

Description:

The scipy.stats module is a part of the SciPy library and provides a large number of
probability distributions, summary and frequency statistics, correlation functions, and
statistical tests.
Usage:

It is used to perform various statistical operations on the given data. The functions
in this module are universal functions, which means they can accept NumPy arrays
as input arguments as well as single numbers.

Syntax:

One of the functions provided by scipy.stats is scipy.stats.describe(), which computes descriptive statistics of a given array. The syntax for computing descriptive statistics of an array a is:
scipy.stats.describe(a)
Code:

import numpy as np
from scipy.stats import rankdata

my_array = np.array([3, 5, 2, 1, 9, 9])
ranks = rankdata(my_array)
print(ranks)

Output:
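
The code above demonstrates rankdata(); a minimal sketch of the scipy.stats.describe() call named in the syntax, applied to the same array, could look like this (not part of the original listing):

from scipy.stats import describe

# my_array is the array created in the rankdata example above
stats = describe(my_array)
print("mean:", stats.mean)
print("variance:", stats.variance)
print("min and max:", stats.minmax)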

5. Signal

Description:

The scipy.signal module is a part of the SciPy library and provides tools for signal
processing, such as filtering, Fourier transforms, and wavelets. It is built on top of
the SciPy library and offers a comprehensive set of functions for working with
different types of signals, including 1D and 2D arrays, as well as multi-channel
signals.
Usage:

The module is used for a wide range of signal processing tasks, including
filtering, noise reduction, spectral analysis, time-frequency analysis, and
more.
Syntax:

One of the functions provided by scipy.signal is scipy.signal.convolve(), which computes the convolution of two arrays. The syntax for computing the convolution of arrays x and y is:
scipy.signal.convolve(x, y)
Code:

import numpy as np

from scipy.signal import correlate

sequence1 = np.array([1, 2, 3, 4, 5, 6])

sequence2 = np.array([4, 3, 2, 1])

correlation = correlate(sequence1, sequence2)

print(f"The cross-correlation between the two sequences is:\n{correlation}")

Output:
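
The code above demonstrates correlate(); a minimal sketch of the scipy.signal.convolve() call named in the syntax, using two short hypothetical sequences, is given here (not part of the original listing):

import numpy as np
from scipy.signal import convolve

x = np.array([1, 2, 3])
y = np.array([0, 1, 0.5])
print(convolve(x, y))   # [0.  1.  2.5 4.  1.5]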

4) SKLEARN
What is Sklearn?
The scikit-learn project began as scikits.learn, a Google Summer of Code project by French research scientist David Cournapeau. Its name refers to the idea that it is a "SciKit" (SciPy Toolkit), an independently created and distributed extension to SciPy. Later, other programmers rewrote the core codebase.
Implementation of Sklearn

Scikit-learn is mainly coded in Python and heavily utilizes the NumPy library for highly
efficient array and linear algebra computations. Some fundamental algorithms are also built
in Cython to enhance the efficiency of this library. Support vector machines, logistic
regression, and linear SVMs are performed using wrappers coded in Cython for LIBSVM and
LIBLINEAR, respectively. Expanding these routines with Python might not be viable in such
circumstances.

Scikit-learn works nicely with numerous other Python packages, including SciPy, Pandas
data frames, NumPy for array vectorization, Matplotlib, seaborn and plotly for plotting
graphs, and many more.
Functions of SKLEARN

1.Datasets

Scikit-learn comes with several inbuilt datasets such as the iris dataset, house prices
dataset, diabetes dataset, etc. The main functions of these datasets are that they are easy
to understand and you can directly implement ML models on them. These datasets are
good for beginners.

You can import the iris dataset as follows:

Python Code:

import sklearn
from sklearn import datasets
import pandas as pd

dataset = datasets.load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
print(df.head())


2. Data Splitting

Sklearn provided the functionality to split the dataset for training and testing. Splitting the
dataset is essential for an unbiased evaluation of prediction performance. We can define
what proportion of our data to be included in train and test datasets.

We can split the dataset as follows:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

With the help of train_test_split, we have split the dataset such that the train set has 80% and the test set has 20% of the data.

3. Linear Regression

This supervised ML model is used when the output variable is continuous and it follows
linear relation with dependent variables. It can be used to forecast sales in the coming
months by analyzing the sales data for previous months.

With the help of sklearn, we can easily implement the Linear Regression model as follows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

regression_model = LinearRegression()
regression_model.fit(x_train, y_train)

y_predicted = regression_model.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
r2 = r2_score(y_test, y_predicted)

4. Logistic Regression

Logistic Regression is also a supervised regression algorithm just like linear regression. The
only difference is that the output variable is categorical. It can be used to predict whether a
patient has heart disease or not.

With the help of sklearn, we can easily implement the Logistic Regression model as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

y_pred = logreg.predict(x_test)

conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
print(classification_report(y_test, y_pred))

The confusion matrix and classification report are used to check the accuracy of classification models.

5.Decision Trees

A Decision Tree is a powerful tool that can be used for both classification and regression
problems. It uses a tree-like model to make decisions and predict the output. It consists of
roots and nodes. Roots

represent the decision to split and nodes represent an output variable value. A decision
tree is an important concept.
Decision trees are useful when the dependent variables do not follow a linear relationship with the independent variable, i.e., when linear regression does not give accurate results.
Decision tree implementation for classification:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from io import StringIO  # io.StringIO replaces the deprecated sklearn.externals.six import
from IPython.display import Image
from pydot import graph_from_dot_data

dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)

dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
(graph, ) = graph_from_dot_data(dot_data.getvalue())

y_pred = dt.predict(x_test)

We fit the model with the DecisionTreeClassifier() object and further code is used to
visualize the decision trees implementation in python.

5) NumPy
NumPy (Numerical Python) is a perfect tool for scientific computing and for performing basic and advanced array operations.
The library offers many handy features for performing operations on n-dimensional arrays and matrices in Python. It helps to process arrays that store values of the same data type and makes performing mathematical operations on arrays (and their vectorization) easier. In fact, the vectorization of mathematical operations on the NumPy array type increases performance and accelerates execution time.

Syntax:

The syntax of NumPy functions in Python typically follows this structure:
numpy.function_name(array, parameters)
where "function_name" is the name of the NumPy function, "array" is the NumPy array on which the function is to be applied, and "parameters" are any additional parameters required by the function.
Functions:
1) NumPy Array Creation Functions
Array creation functions allow us to create new NumPy arrays.

2) NumPy Array Manipulation Functions

NumPy array manipulation functions allow us to modify or rearrange NumPy arrays.
np.reshape() is a function that reshapes an array.
Syntax: np.reshape(array1, (2, 3))

3) NumPy Array Mathematical Functions

In NumPy, there are tons of mathematical functions to perform on arrays.
np.add() is a function that adds two arrays.
Syntax: np.add(array1, array2)

4) NumPy Array Statistical Functions

NumPy provides us with various statistical functions to perform statistical data analysis. These statistical functions are useful for computing basic statistical quantities like the mean, median, variance, etc. They can also be used to find the maximum or the minimum element in an array. For example (see the sketch after this list):
a. Mean

b. Median
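
A minimal sketch tying the four groups of functions above together (array names and values here are illustrative only):

import numpy as np

# 1) Array creation
array1 = np.array([1, 2, 3, 4, 5, 6])
array2 = np.array([10, 20, 30, 40, 50, 60])

# 2) Array manipulation: reshape the 1-D array into 2 rows x 3 columns
reshaped = np.reshape(array1, (2, 3))

# 3) Mathematical function: element-wise addition of the two arrays
added = np.add(array1, array2)

# 4) Statistical functions
print("Reshaped:\n", reshaped)
print("Sum:", added)
print("Mean:", np.mean(array1))      # 3.5
print("Median:", np.median(array1))  # 3.5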

Application:
In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray, and it provides a lot of supporting functions that make working with ndarray very easy. Arrays are very frequently used in data science, where speed and resources are very important.

R programming language
The R programming language has a rich ecosystem of libraries for data
analytics. Here are some key libraries commonly used for data analytics
in R:

dplyr: Provides a set of functions for data manipulation, including filtering, grouping, summarizing, and arranging data.
Functions:

1. filter()
Purpose: Select rows from a data frame that meet specified conditions.
Example:

filter(df, column_name > 10)

2. mutate()
Purpose: Create new columns or modify existing ones based on specified
transformations.
Example:

mutate(df, new_column = existing_column * 2)

3. group_by()
Purpose: Group data by one or more variables for subsequent operations.
Example:

group_by(df, category_column)

4. summarize()
Purpose: Generate summary statistics for groups of data.
Example:

summarize(df, mean_value = mean(numeric_column))

5. arrange()
Purpose: Order rows based on one or more variables.
Example:

arrange(df, desc(numeric_column))

ggplot2: A powerful and flexible package for creating static, interactive, and layered plots.
Functions:
1. ggplot()

Purpose: Initialize a ggplot object and define the data and aesthetic mappings.
Example:

ggplot(data = df, aes(x = variable1, y = variable2))

2. geom_point()
Purpose: Add points to a plot.
Example:

geom_point()

3. geom_line()
Purpose: Add lines to a plot, connecting data points.
Example:

geom_line()

4. facet_wrap()
Purpose: Create small multiples (faceted plots) based on a categorical variable.
Example:

facet_wrap(~category_variable)

5. labs()
Purpose: Customize axis labels, plot title, and other annotations.
Example:

labs(title = "Custom Title", x = "X-axis Label", y = "Y-axis Label")

tidyr: Focuses on data tidying tasks, helping reshape and clean data for analysis.
Functions:
1. gather()
Purpose: Reshape data from wide to long format by gathering columns into key-value pairs.
Example:

gather(df, key = "new_key_column", value = "new_value_column", -excluded_column)

2. spread()
Purpose: Spread key-value pairs into separate columns.
Example:

spread(df, key = "existing_key_column", value = "existing_value_column")

3. separate()
Purpose: Split a single column into multiple columns based on a delimiter.
Example:

separate(df, column_to_split, into = c("new_column1", "new_column2"), sep = "_")

4. unite()
Purpose: Combine multiple columns into a single column.
Example:

unite(df, new_column, column1, column2, sep = "_")

5. complete()
Purpose: Ensure that all combinations of a set of columns have a row, filling in missing values.
Example:

complete(df, column1, column2)

readr: Offers efficient methods for reading and parsing data from various formats.
Functions:
1. read_csv()
Purpose: Read a CSV file into a data frame.
Example:

read_csv("file_path.csv")

2. read_table()
Purpose: Read a delimited text file into a data frame.
Example:

read_table("file_path.txt", delimiter = "\t")

3. read_excel()
Purpose: Read data from an Excel file into a data frame (provided by the companion readxl package).
Example:
read_excel("file_path.xlsx", sheet = "Sheet1")

4. read_fwf()
Purpose: Read a fixed-width format file into a data frame.

Example:
read_fwf("file_path.txt", fwf_widths(c(10, 15, 20)))

5. read_delim()
Purpose: Read a delimited text file into a data frame, allowing customization of delimiters.
Example:
read_delim("file_path.txt", delim = ";")

caret: Provides a unified interface for various machine learning algorithms and facilitates model training and evaluation.
Functions:
1. train()
Purpose: Train a machine learning model using specified algorithms and parameters.
Example:
train(response_variable ~ ., data = df, method = "lm")

2. predict()
Purpose: Generate predictions from a trained machine learning model.
Example:
predict(model, new_data)

3. confusionMatrix()
Purpose: Compute a confusion matrix for model evaluation.
Example:
confusionMatrix(predicted_values, actual_values)

4. featurePlot()
Purpose: Create feature plots to visualize the distribution of features by class.
Example:
featurePlot(x = df[, predictors], y = df$target_variable, plot = "box")

5. caretList()
Purpose: Train multiple models with different algorithms and parameters (provided by the caretEnsemble extension package).
Example:
caretList(response_variable ~ ., data = df, methodList = c("lm", "rf", "svm"))

EXP 2: Implement Simple Linear Regression

Aim: To Implement Simple Linear Regression on a given data set

Theory: What is Linear Regression?


Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between a dependent variable and one or more independent features. When the number of independent features is 1, it is known as univariate (simple) linear regression; in the case of more than one feature, it is known as multivariate (multiple) linear regression.

Types of Linear Regression

There are two main types of linear regression:

Simple Linear Regression

This is the simplest form of linear regression, and it involves only one independent
variable and one dependent variable. The equation for simple linear regression is:
y = \beta_0 + \beta_1 X

where:

- Y is the dependent variable
- X is the independent variable
- β0 is the intercept
- β1 is the slope

Multiple Linear Regression

This involves more than one independent variable and one dependent variable. The equation for multiple linear regression is:
y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n

where:

- Y is the dependent variable
- X1, X2, ..., Xn are the independent variables
- β0 is the intercept
- β1, β2, ..., βn are the slopes

Assumptions of Simple Linear Regression

Linear regression is a powerful tool for understanding and predicting the behavior of a variable; however, it needs to meet a few conditions in order to be accurate and dependable.
1) Linearity: The independent and dependent variables have a linear relationship with one another. This implies that changes in the dependent variable follow those in the independent variable(s) in a linear fashion, meaning that there should be a straight line that can be drawn through the data points. If the relationship is not linear, then linear regression will not be an accurate model.

2) Independence:
The observations in the dataset are independent of each other. This means that the
value of the dependent variable for one observation does not depend on the value of
the dependent variable for another observation. If the observations are not
independent, then linear regression will not be an accurate model.
3) Homoscedasticity:
Across all levels of the independent variable(s), the variance of the errors is
constant. This indicates that the amount of the independent variable(s) has no
impact on the variance of the errors. If the variance of the residuals is not
constant, then linear regression will not be an accurate model.

Code:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

file_path = 'sample_data/california_housing_test.csv'
df = pd.read_csv(file_path)

df_sampled = df.sample(n=20, random_state=42)
X = df_sampled[['total_rooms', 'median_income']]
y = df_sampled['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

plt.scatter(X_test['median_income'], y_test, color='black')  # Adjust the x-axis for visualization if needed
plt.plot(X_test['median_income'], y_pred, color='blue', linewidth=3)
plt.title('Linear Regression')
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.show()

Output:

Conclusion: Thus we successfully implemented Simple Linear Regression on the given data set.

EXP 3 : Implement Multiple Linear Regression

Aim : To Implement Multiple Linear Regression on a given data set

Theory: Explain Multiple Linear Regression in detail.


Multiple linear regression is a statistical method used to model the relationship
between two or more independent variables (predictors) and a dependent variable
(response) by fitting a linear equation to observed data. It extends simple linear
regression, which deals with only one independent variable and one dependent
variable, to scenarios where multiple predictors are involved.
In multiple linear regression, the model equation can be represented as:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + \varepsilon

Where:

- Y is the dependent variable.

- X_1, X_2, ..., X_n are the independent variables.

- \beta_0 is the intercept (constant term).

- \beta_1, \beta_2, ..., \beta_n are the coefficients of the independent variables, representing the change in Y for a one-unit change in the corresponding X, holding other predictors constant.

- \varepsilon is the error term, representing the difference between the observed and predicted values of Y.

The goal in multiple linear regression is to estimate the coefficients (the β values) that minimize the sum of squared differences between the observed and predicted values of the dependent variable.

Assumptions:

1. Linearity: The relationship between independent and dependent variables is linear.

2. Independence: Observations are independent of each other.

3. Homoscedasticity: The variance of errors is constant across all levels of the


independent

variables.

4. Normality: Residuals (errors) are normally distributed.

5. No multicollinearity: Independent variables are not highly correlated with each other.

Application of Multiple Linear Regression :

1. Economics : Predicting factors affecting GDP growth, inflation, etc.

2. Finance : Analyzing factors influencing stock prices, interest rates, etc.

3. Marketing : Predicting sales based on advertising expenditure, pricing strategies, etc.

4. Medicine : Predicting patient outcomes based on various medical parameters.

5. Social Sciences : Studying factors influencing human behavior, education outcomes,


etc.

Steps in Multiple Linear Regression Analysis:

1. Data Collection: Gather data on the variables of interest.

2. Data Preprocessing: Handle missing values, outliers, and transform variables if necessary.

3. Model Specification: Choose appropriate independent variables and their functional form.

4. Parameter Estimation: Use methods like Ordinary Least Squares (OLS) to estimate the coefficients.

5. Model Evaluation: Assess the goodness of fit using measures like R-squared, adjusted R-squared, residual plots, etc.

6. Inference and Interpretation: Interpret the estimated coefficients and their significance.

7. Prediction: Use the model to make predictions on new data.

Considerations:

-Overfitting: Including irrelevant variables or too many predictors can lead to overfitting.

-Underfitting: Having too few predictors or oversimplifying the model can lead to
underfitting.

-Model Assumption: Assumptions such as linearity and normality should be checked.

-Collinearity: High multicollinearity among predictors can lead to unstable
estimates of coefficients.

Multiple linear regression is a versatile tool widely used in various fields for
understanding relationships between variables and making predictions. However, careful
attention to model assumptions and data quality is crucial for its successful application.

Differentiate Simple Linear Regression & Multiple Linear Regression.

Definition
- Simple Regression: Models the relationship between one dependent and one independent variable.
- Multiple Regression: Models the relationship between one dependent and two or more independent variables.

Equation
- Simple Regression: Y = C0 + C1X + e
- Multiple Regression: Y = C0 + C1X1 + C2X2 + C3X3 + ... + CnXn + e

Complexity
- Simple Regression: Simpler, dealing with one relationship.
- Multiple Regression: More complex due to multiple relationships.

Use Cases
- Simple Regression: Suitable when there is one clear predictor.
- Multiple Regression: Suitable when multiple factors affect the outcome.

Assumptions
- Simple Regression: Linearity, independence, homoscedasticity, normality.
- Multiple Regression: Same as simple linear regression, with the added concern of multicollinearity.

Visualization
- Simple Regression: Typically visualized with a 2D scatter plot and a line of best fit.
- Multiple Regression: Requires 3D or multi-dimensional space, often represented using partial regression plots.

Risk of Overfitting
- Simple Regression: Lower, as it deals with only one predictor.
- Multiple Regression: Higher, especially if too many predictors are used without adequate data.

Multicollinearity Concern
- Simple Regression: Not applicable, as there is only one predictor.
- Multiple Regression: A primary concern; having correlated predictors can affect the model's accuracy and interpretation.

Applications
- Simple Regression: Basic research, simple predictions, understanding a singular relationship.
- Multiple Regression: Complex research, multifactorial predictions, studying interrelated systems.

Code:
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

def generate_dataset(n):
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)

x, y = generate_dataset(200)

mpl.rcParams['legend.fontsize'] = 12

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5)
ax.legend()
ax.view_init(45, 0)

plt.show()
Output :
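
The listing above only generates and visualizes the data in 3D; a minimal sketch of actually estimating the multiple regression coefficients from that data is given below. It assumes scikit-learn (already used in Experiment 1) and is not part of the original listing:

# Fit a multiple linear regression on the generated data.
# x already contains a leading column of 1s, so drop it and let
# LinearRegression estimate its own intercept.
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x[:, 1:], y)

print("Intercept:", model.intercept_)   # should be close to 1
print("Coefficients:", model.coef_)     # close to random_x1 and random_x2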

Conclusion: Thus we successfully implemented Multiple Linear Regression on the given data set.

Experiment 4

AIM: Implement Logistic Regression in Python

THEORY:
Logistic regression is the appropriate regression analysis to conduct when the
dependent variable is dichotomous (binary). Like all regression analyses, logistic
regression is a predictive analysis.
It is used to describe data and to explain the relationship between one dependent
binary variable and one or more nominal, ordinal, interval or ratio-level independent
variables.
Types of Logistic Regression

Binary logistic regression
Binary logistic regression is used to predict the probability of a binary outcome, such as yes or no, true or false, or 0 or 1. For example, it could be used to predict whether a customer will churn or not, whether a patient has a disease or not, or whether a loan will be repaid or not.
Multinomial logistic regression
Multinomial logistic regression is used to predict the probability of one of three or
more possible outcomes, such as the type of product a customer will buy, the rating
a customer will give a product, or the political party a person will vote for.

Ordinal logistic regression
Ordinal logistic regression is used to predict the probability of an outcome that falls into a predetermined order, such as the level of customer satisfaction, the severity of a disease, or the stage of cancer.

Why do we use Logistic Regression rather than Linear Regression?


After reading the definition of logistic regression we now know that it is only used
when our dependent variable is binary and in linear regression this dependent
variable is continuous.
The second problem is that if we add an outlier in our dataset, the best fit line in linear
regression shifts to fit that point.
Now, if we use linear regression to find the best fit line which aims at minimizing
the distance between the predicted value and actual value, the line will be like
this:


Here the threshold value is 0.5, which means if the value of h(x) is greater than 0.5
then we predict malignant tumor (1) and if it is less than 0.5 then we predict benign
tumor (0). Everything seems okay here but now let’s change it a bit, we add some
outliers in our dataset, now this best fit line will shift to that point. Hence the line will
be somewhat like this:

The blue line represents the old threshold and the yellow line represents the new
threshold which is maybe 0.2 here. To keep our predictions right we had to lower our
threshold value. Hence we can say that linear regression is prone to outliers. Now
here if h(x) is greater than 0.2 then only this regression will give correct outputs.
Another problem with linear regression is that the predicted values may be out of
range. We know that probability can be between 0 and 1, but if we use linear
regression this probability may exceed 1 or go below 0.
To overcome these problems we use Logistic Regression, which converts this
straight best fit line in linear regression to an S-curve using the sigmoid function,
which will always give values between 0 and 1.
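
For reference, the sigmoid (logistic) function applied to the linear combination z = \beta_0 + \beta_1 x is

\sigma(z) = \frac{1}{1 + e^{-z}}

which maps any real-valued z into the interval (0, 1), so its output can be read directly as a probability.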

How does Logistic Regression work?


Logistic regression works in the following steps:
1. Prepare the data: The data should be in a format where each row
represents a single observation and each column represents a different
variable. The target variable (the variable you want to predict) should be
binary (yes/no, true/false, 0/1).

2. Train the model: We teach the model by showing it the training data. This
involves finding the values of the model parameters that minimize the error
in the training data.
3. Evaluate the model: The model is evaluated on the held-out test data to
assess its performance on unseen data.
4. Use the model to make predictions: After the model has been trained and
assessed, it can be used to forecast outcomes on new data.

CODE:

import numpy
from sklearn import linear_model

# Reshaped for Logistic function.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

log_odds = logr.coef_
odds = numpy.exp(log_odds)

# predict if tumor is cancerous where the size is 3.46mm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1, 1))
print("Predicted: ")
print(predicted)
print("\nOdds: ")
print(odds)
print("\n")

def logit2prob(logr, X):
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

print("Probability that each tumor is cancerous is \n")
print(logit2prob(logr, X))

OUTPUT:


CONCLUSION: Thus we have successfully implemented logistic regression in Python.

Experiment 5

Aim: To implement time series analysis in Python


Theory:
Time-series graphs are a fundamental tool for visualizing data collected at regular intervals
over time. They play a vital role in various fields, from finance and economics to science
and engineering.
Here's a detailed breakdown of time-series graphs:

Core Components:
● Axes:
  o X-axis (Time Axis): Represents the chronological order of the data points. The scale on the x-axis can vary depending on the data, ranging from seconds to decades.
  o Y-axis (Measurement Axis): Represents the values being measured. The scale on the y-axis depends on the specific variable being plotted.
● Data Points: Each data point signifies a specific measurement at a corresponding time instance. These points are typically plotted using circles, squares, or other markers.
● Connecting Lines (Optional): Lines are often used to connect consecutive data points, highlighting the trend or evolution of the variable over time.

Types of Time-Series Graphs:
● Line Graph: The most common type, it uses lines to connect data points, ideal for visualizing trends and seasonality.
● Scatter Plot: Useful when the order of data points is not chronologically significant.
● Histogram: Can be used for time-series data to show the distribution of values at different time intervals.

Autocorrelation Function (ACF):
The ACF measures the linear correlation between a time series and its lagged versions. In simpler terms, it estimates how much a value in the series is related to values at previous time steps (lags). The ACF is calculated for various lags, and the resulting plot depicts the correlation coefficient at each lag.
Key characteristics of an ACF plot:
● Decaying pattern: An ACF exhibiting a decaying pattern suggests that the influence of past values on current values diminishes over time.
● Peaks at specific lags: Significant spikes at specific lags indicate a potential relationship between the current value and the value at that particular lag.

Partial Autocorrelation Function (PACF):

While the ACF captures the overall correlation, the PACF goes a step further. It measures the partial

autocorrelation between a time series and its lagged versions, controlling for the influence of

intervening lags. In essence, the PACF isolates the unique effect of a specific lag on the current value, independent of the correlations at shorter lags.

Key characteristic of a PACF plot:
● Significant spikes at a few lags: A PACF with significant spikes only at a few lags suggests that an autoregressive (AR) model might be suitable for forecasting. The lags with spikes indicate the order of the AR model.
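
As a complement to the theory, a minimal sketch of drawing these two plots in Python is given below; it assumes the statsmodels library is installed and that `series` is the pandas Series loaded in the code later in this experiment:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(series, lags=30)    # autocorrelation up to 30 lags
plot_pacf(series, lags=30)   # partial autocorrelation up to 30 lags
plt.show()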

Functionality and Applications:
● Identifying Trends: Time-series graphs effectively reveal trends in data, such as growth, decline, seasonality, or cyclical patterns.
● Identifying Anomalies: Deviations from the expected trend can be easily spotted, prompting further investigation into potential causes.
● Forecasting: By analyzing historical patterns, time-series graphs can be used to make predictions about future values.
● Comparison: Multiple time series can be plotted on the same graph to compare trends across different variables or categories.

Additional Considerations:
● Time Scale: The chosen time scale on the x-axis significantly impacts the interpretation of the data. Ensure the scale provides an adequate view of the relevant time period.
● Data Aggregation: Depending on the frequency of data collection, data points might be aggregated (e.g., daily averages) for a clearer visualization on the graph.
● Missing Data: Strategies exist to handle missing data points, such as leaving gaps or using interpolation techniques.
Code:
import matplotlib.pyplot as plt
from pandas import read_csv

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True)
series = series.squeeze('columns')
print(series.head())

series.plot()
plt.show()

Output:

Conclusion: Thus, we have successfully implemented time series analysis in Python.

Experiment 6

Aim: To implement ARIMA model in Python


Theory:
The ARIMA (Autoregressive Integrated Moving Average) model is a powerful statistical
tool used for forecasting and understanding time series data. It leverages past
observations of a series to predict future values and analyze its underlying structure.
Model Components:
ARIMA is represented by the notation ARIMA(p, d, q), where each term signifies a
specific component:
●AR (Autoregressive): This term captures the dependence of the current
value on p past values of the series. Essentially, it models how much the past
values (lags) influence the current value.
●I (Integrated): This component deals with non-stationarity in the data.
Differencing is applied d times to make the series stationary, meaning its
statistical properties (mean, variance) remain constant over time.
●MA (Moving Average): This term incorporates the random noise or
errors from past forecasts into the model. It considers the influence of the q past
forecast errors on the current value.

Understanding the Parameters:

● p (AR order): The number of past values influencing the current value. Choosing the optimal p involves balancing model complexity and accuracy.
● d (Differencing order): The number of times the data needs to be differenced to achieve stationarity. Statistical tests help determine the appropriate d.
● q (MA order): The number of past forecast errors included in the model. Similar to p, choosing q involves finding a balance between complexity and accuracy.
The ARIMA Process:
1. Identify Stationarity: Analyze the data for stationarity. Differencing might be necessary if it is non-stationary.
2. Model Selection: Use tools like the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) to identify the appropriate values for p and q.
3. Model Fitting: Estimate the model parameters using statistical methods like maximum likelihood estimation.
4. Evaluation and Diagnostics: Evaluate the model's performance using metrics like mean squared error (MSE) and assess the residual errors.
5. Forecasting: Once satisfied with the model, use it to predict future values of the time series.
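
Step 1 mentions statistical tests for stationarity; a minimal sketch using the Augmented Dickey-Fuller test from statsmodels is shown below (it assumes `series` is the pandas Series loaded in the code section of this experiment and is not part of the original listing):

from statsmodels.tsa.stattools import adfuller

result = adfuller(series)
print("ADF statistic:", result[0])
print("p-value:", result[1])   # a small p-value (e.g. < 0.05) suggests the series is stationary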

Applications of ARIMA Models:
ARIMA models find application in various domains due to their effectiveness in forecasting:
● Business: Sales forecasting, inventory management, demand prediction.
● Finance: Stock price prediction, market trend analysis, risk assessment.
● Economics: Macroeconomic forecasting (GDP, inflation), resource management.
● Social Sciences: Population forecasting, resource allocation, trend analysis.

Advantages of ARIMA:
● Interpretability: The model components (AR, I, MA) are statistically well-understood, aiding in interpreting the results.
● Flexibility: ARIMA can be adapted to various time series data by adjusting its parameters.
● Effectiveness: For stationary time series, ARIMA can generate accurate forecasts.

Limitations of ARIMA:
● Stationarity Assumption: Relies heavily on the assumption of stationarity in the data.
● Limited Model Complexity: May not capture complex non-linear relationships in the data.
● Data-Driven Nature: Reliant on historical data, potentially leading to poor forecasts for significant changes.
Code / Output:

from datetime import datetime
from pandas import read_csv

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo.csv', header=0, parse_dates=[0],
                  index_col=0, date_parser=parser)

from pandas import DataFrame
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(series, order=(5,1,0))
model_fit = model.fit()

print(model_fit.summary())

residuals = DataFrame(model_fit.resid)
residuals.plot()

from sklearn.metrics import mean_squared_error
from math import sqrt

# split into train and test sets
X = series.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()

# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(5,1,0))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))

print('Test RMSE: %.3f' % rmse)
# plot forecasts against expected values

Conclusion: Thus, we have successfully implemented the ARIMA model in Python.

Experiment 7

Aim: Implement text analytics: Spam filter in Python


Theory:
Data analytics deals with analyzing vast amounts of data to extract valuable insights and inform decision-making. Text analytics, on the other hand, specifically focuses on unstructured textual data. It is a subfield of data analytics that employs a combination of linguistic techniques, statistical methods, and machine learning algorithms to unlock the hidden meaning within textual data.

Here's a breakdown of text analytics in relation to data analytics:
□ Data Source: Text analytics works with unstructured textual data, which is
the vast majority of data generated today. This can include emails, social
media posts, customer reviews,
documents, and more. In contrast, traditional data analytics often deals with
structured data stored in databases, spreadsheets, or other formats with a
predefined organization.
□ Techniques: Text analytics leverages techniques from natural language
processing (NLP) to understand the nuances of human language. This includes
tasks like sentiment analysis, topic modeling, and entity recognition. Traditional
data analytics often relies on statistical methods
and data visualization techniques.
Seven Practice Areas of Text Analytics:
1. Text Preprocessing: This stage involves cleaning the text data by removing irrelevant characters, fixing typos, and converting text to lowercase.
2. Tokenization: Breaking down the text into smaller units like words or phrases.
3. Normalization: Stemming or lemmatization reduces words to their root form, ensuring consistency.
4. Stop Word Removal: Eliminating common words like "the" and "a" that don't contribute much meaning.
5. Text Feature Engineering: Creating numerical features from the text data to facilitate analysis. This might involve techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
6. Text Analysis Techniques: Depending on the objective, various techniques are employed. Here are a few examples:
  o Sentiment Analysis: Classifying the polarity of text (positive, negative, neutral) to understand opinions or emotions.
  o Topic Modeling: Identifying underlying themes or topics discussed within a corpus of text documents.
  o Entity Recognition: Identifying and classifying named entities like people, organizations, or locations.
7. Visualization and Reporting: Presenting the results of the text analysis in a clear and concise manner using charts, graphs, or reports.


Steps Involved in Text Analytics:

1. Define the Objective: What insights do you want to extract from the text data?

2. Data Collection and Preprocessing: Gather the relevant textual data and clean it for
analysis.
3. Feature Engineering: Create features suitable for the chosen text analysis technique.
4. Model Training and Analysis: Depending on the objective, train a model or apply
the chosen text analysis technique.
5. Evaluation and Refinement: Assess the results and refine the model or techniques as
needed.
6. Visualization and Reporting: Present the insights gleaned from the text analysis.
By following these steps and leveraging the seven practice areas, text analytics empowers
businesses and organizations to:
□ Understand customer sentiment: Analyze reviews, social media posts, and
surveys to gauge customer satisfaction and identify areas for improvement.
□ Gain market insights: Analyze news articles, social media trends, and
online discussions to understand market trends and competitor strategies.
□ Improve communication: Analyze communication patterns and
identify areas where communication can be more effective.
□ Categorize documents: Automate document classification based on
content for efficient information retrieval.
Spam Filtering:
Spam filtering combines techniques from text analytics and machine learning to
distinguish between legitimate emails and unwanted spam messages. Here's a breakdown
of the key concepts involved:

Text Analytics Techniques:

□ Preprocessing: Incoming emails are cleaned by removing irrelevant characters, HTML tags, and converting text to lowercase.
□ Tokenization: The email content (subject line, body) is broken down into words or phrases (tokens) for analysis.
□ Feature Engineering: Features are extracted from the tokens to represent the email content numerically. Common features include:
  o Term Frequency (TF): How often a word appears in the email.
  o Inverse Document Frequency (IDF): How rare a word is across a vast email corpus. Words like "the" have a low IDF.
  o Presence of Blacklist Words: Certain words or phrases commonly found in spam are identified.
  o Part-of-Speech Tags: Identifying the grammatical function of words (nouns, verbs) can provide clues.

Machine Learning Algorithms:
□ Supervised Learning: Spam filtering heavily relies on supervised machine learning algorithms. These algorithms are trained on a labeled dataset where emails are already categorized as spam or legitimate. Common algorithms include:

o Naive Bayes: A popular and efficient algorithm for text classification tasks like spam filtering.
o Support Vector Machines (SVMs): Can be effective for handling high-dimensional text data.
o Random Forests: An ensemble learning method that combines multiple decision trees for improved accuracy.
Machine Learning Model Training:
1. Data Preparation: A large corpus of labeled emails (spam and legitimate) is assembled.
2. Feature Extraction: Text analytics techniques are applied to extract features from the emails.
3. Model Training: The chosen machine learning algorithm is trained on the labeled data, learning to distinguish spam from legitimate emails based on the extracted features.

Filtering and Evaluation:
□ Incoming emails are processed through the trained model, and a spam probability score is generated.
□ Emails exceeding a predefined threshold are classified as spam and filtered or directed to a spam folder.
□ The model's performance is continuously monitored and evaluated on new data. Periodic retraining might be necessary to maintain effectiveness as spammers adapt their tactics.
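
The code section of this experiment builds this pipeline step by step. As a compact sketch of the same idea (bag-of-words features plus Multinomial Naive Bayes), assuming a DataFrame `dataset` with a text column 'Text' and a 0/1 label column 'Spam' as used later:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Bag-of-words features and labels (column names assumed as described above)
X = CountVectorizer().fit_transform(dataset['Text'])
y = dataset['Spam'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

clf = MultinomialNB()
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))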
Additional Considerations:
□ False Positives and False Negatives: No system is perfect. Occasionally, legitimate emails might be flagged as spam (false positives) and vice versa (false negatives). The filtering system needs to be balanced to minimize both types of errors.
□ Real-time Updates: Spammers continuously evolve their techniques. Regular updates to the training data and potentially the machine learning model are crucial to maintain effectiveness.
Code / Output:

dataset.shape

dataset.head()

dataset.shape

dataset.head()

dataset.isnull().sum()

dataset.duplicated().sum()

dataset.head()

dataset.shape

dataset.duplicated().sum()

plt.pie(dataset['Spam'].value_counts(), labels=labels, colors=colors, autopct='%0.2f')
plt.show()

dataset['Total_chars'] = dataset['Text'].apply(len)
dataset.head()

dataset['Total_words'] = dataset['Text'].apply(lambda x: len(nltk.word_tokenize(x)))
dataset.head()

dataset['Total_sentences'] = dataset['Text'].apply(lambda x: len(nltk.sent_tokenize(x)))
dataset.head()

dataset[dataset['Spam'] == 0].iloc[:,2:].describe()

dataset[dataset['Spam'] == 1].iloc[:,2:].describe()


sns.pairplot(dataset , hue = 'Spam')


sns.heatmap(dataset.corr(), annot = True)


from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

# Function to do all the following transformations to the text to make it compatible for the model
# Lower case
# Tokenization
# Removing special characters
# Removing stop words and punctuation
# Stemming

def text_transformation(text):

    # Converting to Lower Case
    text = text.lower()

    # Tokenization
    text = nltk.word_tokenize(text)

    # Removing Special Characters
    lst = []
    for i in text:
        if i.isalnum():
            lst.append(i)

    # Removing stop words and punctuation
    text = lst[:]
    lst = []
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            lst.append(i)

    # Stemming
    text = lst[:]
    lst = []
    for i in text:
        lst.append(ps.stem(i))

    return ' '.join(lst)


# Apply the Transformation Function
dataset['Transformed_text'] = dataset['Text'].apply(text_transformation)
dataset.head()


spam_words = []
spam_msgs = dataset[dataset['Spam'] == 1]['Transformed_text'].tolist()
for i in spam_msgs:
    for j in i.split():
        spam_words.append(j)

from collections import Counter
c1 = pd.DataFrame(Counter(spam_words).most_common(50))
c1.head()


plt.xticks(rotation = 'vertical')

ham_words = []
ham_msgs = dataset[dataset['Spam'] == 0]['Transformed_text'].tolist()
for i in ham_msgs:
    for j in i.split():
        ham_words.append(j)

c2 = pd.DataFrame(Counter(ham_words).most_common(50))
c2.head()


sns.lineplot(x = c2[0] , y = c2[1], data = c2, color = 'Blue')


plt.xticks(rotation = 'vertical')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)

mnb = MultinomialNB()
bnb = BernoulliNB()

X_cv = cv.fit_transform(dataset['Transformed_text']).toarray()
X_cv[10]

X_cv.shape


y_cv = dataset['Spam'].values y_cv[:10]

X_train_cv , X_test_cv , y_train_cv , y_test_cv =


train_test_split(X_cv, y_cv , test_size = 0.2, random_state =
2) gnb.fit(X_train_cv , y_train_cv)
y_pred_gnb_cv = gnb.predict(X_test_cv)
print('Accuracy Score for GNB & cv = ', accuracy_score(y_test_cv
,

is : \n',
confusion_matrix(y_test_cv, y_pred_gnb_cv ))
print('Precision Score for GNB & cv = ',
precision_score(y_test_cv , y_pred_gnb_cv ))

mnb.fit(X_train_cv
y_pred_mnb_cv = mnb.predict(X_test_cv )
print('Accuracy Score for MNB & cv = ', accuracy_score(y_test_cv ,
y_pred_mnb_cv ))
print('Confusion Matrix for MNB & cv is : \n',
confusion_matrix(y_test_cv, y_pred_mnb_cv ))
print('Precision Score for MNB & cv = ', precision_score(y_test_cv ,
y_pred_mnb_cv ))

bnb.fit(X_train_cv, y_train_cv)
y_pred_bnb_cv = bnb.predict(X_test_cv)
print('Accuracy Score for BNB & cv = ', accuracy_score(y_test_cv, y_pred_bnb_cv))
print('Confusion Matrix for BNB & cv is : \n', confusion_matrix(y_test_cv, y_pred_bnb_cv))
print('Precision Score for BNB & cv = ', precision_score(y_test_cv, y_pred_bnb_cv))
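
To make the comparison easier to read, the three sets of scores could optionally be collected into a small table (an illustrative convenience, not part of the original notebook):

# Hypothetical convenience: tabulate the CountVectorizer results
results_cv = pd.DataFrame({
    'Model': ['GNB', 'MNB', 'BNB'],
    'Accuracy': [accuracy_score(y_test_cv, y_pred_gnb_cv),
                 accuracy_score(y_test_cv, y_pred_mnb_cv),
                 accuracy_score(y_test_cv, y_pred_bnb_cv)],
    'Precision': [precision_score(y_test_cv, y_pred_gnb_cv),
                  precision_score(y_test_cv, y_pred_mnb_cv),
                  precision_score(y_test_cv, y_pred_bnb_cv)]
})
print(results_cv)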


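The TF-IDF feature matrix and its train/test split are not shown above, although the code below uses them; the following sketch assumes they mirror the CountVectorizer pipeline (the variable names are inferred from the code that uses them, and the test size and random state are assumptions):

# Hedged reconstruction: assumed to mirror the CountVectorizer pipeline above
X_tfidf = tfidf.fit_transform(dataset['Transformed_text']).toarray()
y_tfidf = dataset['Spam'].values
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    X_tfidf, y_tfidf, test_size = 0.2, random_state = 2)   # split parameters assumed
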
X_tfidf.shape


bnb.fit(X_train_tfidf, y_train_tfidf)
y_pred_bnb_tfidf = bnb.predict(X_test_tfidf)
print('Accuracy Score for BNB & tfidf = ', accuracy_score(y_test_tfidf, y_pred_bnb_tfidf))
print('Confusion Matrix for BNB & tfidf is : \n', confusion_matrix(y_test_tfidf, y_pred_bnb_tfidf))
print('Precision Score for BNB & tfidf = ', precision_score(y_test_tfidf, y_pred_bnb_tfidf))

Conclusion: Thus, we have successfully performed text analytics for spam filtering, comparing Naive Bayes classifiers on CountVectorizer and TF-IDF features.

Experiment no.8
Aim : Exploring different Libraries for visualization in Python & R
Theory :
When exploring different libraries for visualization in Python and R, it's
essential to consider factors such as ease of use, flexibility, variety of
plot types, interactivity, performance, and integration with other tools.
Both Python and R offer a rich ecosystem of visualization libraries,
each with its own strengths and weaknesses.
Python Libraries:
Matplotlib:
Theory: Matplotlib is a versatile plotting library that allows you to
create static, interactive, and animated visualizations in Python. It
provides a MATLAB-like interface for creating basic plots quickly and
supports a wide range of plot types.
Code:
import matplotlib.pyplot as plt

# Example usage
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.title('Title')
plt.show()

Seaborn:
Theory: Seaborn is built on top of Matplotlib and provides a higher-
level interface for creating attractive and informative statistical
graphics. It simplifies the process of creating complex plots such as
violin plots, pair plots, and heatmaps.
Code:
import seaborn as sns
import matplotlib.pyplot as plt

# Example usage (iris is assumed to be a DataFrame; Seaborn's built-in copy is loaded here)
iris = sns.load_dataset('iris')
sns.scatterplot(x='sepal_length', y='sepal_width', data=iris)
plt.show()

Plotly:
Theory: Plotly is a powerful library for creating interactive and web-
based visualizations. It supports a wide range of plot types and offers
interactivity features such as zooming, panning, and tooltips.
Code:
import plotly.express as px

# Example usage (df is assumed to be the iris DataFrame; px.data.iris() ships with Plotly)
df = px.data.iris()
fig = px.scatter(df, x='sepal_length', y='sepal_width', color='species')
fig.show()

Conclusion:
Both Python and R offer a variety of powerful visualization libraries,
each catering to different needs and preferences. Matplotlib and
ggplot2 are widely used for creating static plots with high
customization. Seaborn and Plotly provide higher-level interfaces with
additional features for statistical visualization and interactivity.
Ultimately, the choice of library depends on factors such as the type of
visualization needed, ease of use, and integration with other tools in
your workflow.

Experiment 9

Aim: Selecting appropriate database and applying various


visualization techniques on it using R libraries

Theory:
The iris dataset is a well-known dataset in R, containing information on the sepal length, sepal width, petal length, and petal width of 150 iris flowers, divided into three species.
Code & Output:
1. Using ggplot2

2. Using MASS

3. Using ggvis

4. Using lattice:

Conclusion:
This experiment explored diverse visualization techniques on the iris dataset using R libraries such as ggplot2, MASS, ggvis, and lattice, providing insights into the dataset's characteristics and variable relationships. These visualizations offer valuable tools for data exploration and analysis in data science and machine learning.

Experiment 10

Aim: Selecting appropriate database and applying various visualisation techniques on it using Python libraries

Theory:

Selecting the appropriate database for your project depends on various factors such as the size of the data, the data structure, scalability, and the specific requirements of your application. Commonly used databases include relational databases like MySQL, PostgreSQL, and SQLite, as well as NoSQL databases like MongoDB, Cassandra, and Redis.

When it comes to visualization in Python, libraries like Matplotlib, Seaborn, Plotly, and Bokeh are widely used. Each library offers different features and capabilities for creating visualizations, ranging from simple plots to interactive dashboards.
Code & Output:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector

# Connect to MySQL database
db_connection = mysql.connector.connect(
    host="localhost",
    user="username",
    password="password",
    database="sales_db"
)

# Fetch data from the database
query = "SELECT * FROM sales_data"
sales_data = pd.read_sql(query, con=db_connection)

# Close the database connection
db_connection.close()
# Explore the data
print(sales_data.head())

# Visualize the data

# Example: Creating a bar plot of sales by product category
sales_by_category = sales_data.groupby('product_category')['sales_amount'].sum()
sales_by_category.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Total Sales Amount')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
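
The theory above also mentions Seaborn; as an optional, illustrative alternative (not part of the original code), the same sales_by_category Series could be drawn with it:

import seaborn as sns

# Hypothetical alternative rendering of the same aggregation with Seaborn
sns.barplot(x=sales_by_category.index, y=sales_by_category.values, color='skyblue')
plt.title('Total Sales by Product Category (Seaborn)')
plt.xlabel('Product Category')
plt.ylabel('Total Sales Amount')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()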

Conclusion:
We connected to a MySQL database, fetched sales data, and visualized the total
sales by product category using Python libraries. The choice of database and
visualization techniques can greatly impact the efficiency and effectiveness of
your data analysis and presentation. It's essential to carefully consider your
project requirements and select the most suitable tools for the task at hand.
Additionally, Python provides a wide range of libraries and tools, making it a
versatile choice for data analysis and visualization tasks.

Experiment 11

Aim: To perform visualisation and gain insights on data using PowerBI

Theory:

Microsoft Power BI is a unified platform that empowers users to connect to, visualise, and
analyse data from various sources. It caters to both self-service BI (Business Intelligence)
needs for individual users and enterprise-wide BI deployments for organizations.
Here's a comprehensive breakdown of Power BI and its functionalities:

Core Features:
●Data Connectivity: Power BI bridges the gap between data and insights by connecting to a
wide range of sources. This includes relational databases, cloud storage services
(like Azure Blob Storage), Excel spreadsheets, and even social media platforms.
●Data Transformation and Cleaning: Power BI provides tools for data transformation and
cleaning. Users can import, filter, transform, and enrich their data to ensure its accuracy
and relevance for analysis.

●Interactive Visualizations: Power BI boasts a rich set of built-in visualizations, including bar charts, line charts, pie charts, scatter plots, maps, and more. These visualizations are interactive, allowing users to drill down, filter, and explore the data from different perspectives.
●Custom Visualizations: Beyond the built-in options, Power BI extends its
capabilities through custom visuals. These can be downloaded from the
Microsoft AppSource marketplace, catering to specific needs and functionalities
not covered by the standard set.
●Dashboards and Reports: Power BI empowers users to create interactive
dashboards and reports. Dashboards provide a high-level overview of key metrics
and trends, while reports delve deeper into specific aspects of the data.
●Collaboration and Sharing: Power BI fosters collaboration by allowing users to share dashboards and reports with colleagues. They can collaborate on insights and make data-driven decisions together.
●Mobile BI: Power BI offers mobile applications for various platforms,
enabling users to access and interact with their data and reports on the go.

Data Visualization Techniques in Power BI:
●Basic Visualizations: These include bar charts, column charts, line charts, pie charts, and scatter plots. They are ideal for conveying basic trends, comparisons, and relationships within the data.
●Advanced Visualizations: Power BI offers more sophisticated visualizations like:
o Heatmaps: Reveal patterns and correlations between two categorical variables.
o Funnel Charts: Depict stages in a process and identify drop-off points.
o Treemaps: Visualize hierarchical data structures, showcasing how parts
contribute to the whole.
o Gauge Charts: Communicate performance against targets or KPIs (Key
Performance Indicators).
o Cards: Display key metrics in a concise and easy-to-understand format.
o KPIs (Power KPI): Combine multiple metrics into a single visual, providing a holistic view of performance.
●Custom Visualizations: As mentioned earlier, Power BI's extensibility allows users to
leverage custom visuals for specific needs. These can include advanced
charts, network graphs, and other specialized visualizations.

Implementation:

Conclusion: Thus, we have performed visualisation and gained insights on data using PowerBI.

