Dfs Manual


GOVERNMENT COLLEGE OF ENGINEERING, ERODE – 638 316

RECORD NOTE BOOK


Register number:

Certified that this is the bonafide record of work done by Selvan / Selvi
_____________________________________ of SEVENTH Semester of B.E Electrical and
Electronics Engineering branch during the Academic Year 2024 – 2025 in the OCS353 – DATA
SCIENCE FUNDAMENTALS

Staff In-Charge Head of the Department

Submitted for the Anna University practical examination on __________________ at


Government College of Engineering, Erode – 638 316
Date: ________________________

Internal Examiner External Examiner

LIST OF EXPERIMENTS
EX. NO  DATE  EXPERIMENT NAME  MARKS  SIGN

1  Working with NumPy
2  Write a NumPy program to convert a Python dictionary to a NumPy ndarray
3  Working with Pandas DataFrame
4  Basic Plots using Matplotlib
5  Univariate Analysis
6  Using the diabetes data set from UCI to apply bivariate analysis
7  Statistical and Probability measures on the Iris data set (This program requires iris.csv file)
8  Supervised and Unsupervised learning with Python program
9  Apply and explore various plotting functions on any data set
Ex.no: 1
Date:

Working with NumPy


Aim
To write Python programs to work with NumPy.

Program
i. NumPy program to create a null vector of size 10 and update sixth value to 11
Ans:
import numpy as np
vector = np.zeros(10)
vector[5] = 11
print(vector)
output:
[ 0. 0. 0. 0. 0. 11. 0. 0. 0. 0.]
ii. NumPy program to convert an array to a float type
import numpy as np
array = np.array([1, 2, 3, 4, 5])
float_array = array.astype(float)
print(float_array)
output:
[1. 2. 3. 4. 5.]
iii. NumPy program to create a 3 * 3 matrix with values ranging from 2 to 10
Ans:
import numpy as np
matrix = np.arange(2, 11).reshape(3, 3)
print(matrix)
output:
[[ 2 3 4]
[ 5 6 7]
[ 8 9 10]]

iv. Write a NumPy program to convert a list and tuple into arrays
Ans:
import numpy as np
lst = [1, 2, 3, 4]
tpl = (5, 6, 7, 8)
array_from_list = np.array(lst)
array_from_tuple = np.array(tpl)
print(array_from_list)
print(array_from_tuple)
output:
[1 2 3 4]
[5 6 7 8]
v. Write a NumPy program to convert the values of Centigrade degrees into Fahrenheit degrees
and vice versa. Values have to be stored into a NumPy array.
Ans
import numpy as np
centigrade = np.array([0, 20, 37, 100])
fahrenheit = (centigrade * 9/5) + 32
print("Centigrade to Fahrenheit:", fahrenheit)
fahrenheit_to_centigrade = (fahrenheit - 32) * 5/9
print("Fahrenheit to Centigrade:", fahrenheit_to_centigrade)
output:
Centigrade to Fahrenheit: [ 32. 68. 98.6 212. ]
Fahrenheit to Centigrade: [ 0. 20. 37. 100.]
vi. Write a NumPy program to perform the basic arithmetic operations
Ans:
import numpy as np
array1 = np.array([10, 20, 30, 40])
array2 = np.array([1, 2, 3, 4])
addition = np.add(array1, array2)
subtraction = np.subtract(array1, array2)
multiplication = np.multiply(array1, array2)
division = np.divide(array1, array2)

print("Addition:", addition)
print("Subtraction:", subtraction)
print("Multiplication:", multiplication)
print("Division:", division)
Output:
Addition: [11 22 33 44]
Subtraction: [ 9 18 27 36]
Multiplication: [ 10 40 90 160]
Division: [10. 10. 10. 10.]
vii. Write a NumPy program to transpose an array
Ans:
import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6]])
transpose_array = np.transpose(array)
print("Original array:")
print(array)
print("Transposed array:")
print(transpose_array)
Output:
Original array:
[[1 2 3]
[4 5 6]]
Transposed array:
[[1 4]
[2 5]
[3 6]]
viii. Use NumPy, create an array with 5 dimensions and verify that it has 5 dimensions
Ans:
import numpy as np
array_5d = np.ones((2, 2, 2, 2, 2))
print("Number of dimensions:", array_5d.ndim)

Output:
Number of dimensions: 5
ix. Write a NumPy program to merge three given NumPy arrays of same shape
Ans:
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
array3 = np.array([7, 8, 9])
merged_array = np.concatenate((array1, array2, array3))
print("Merged array:", merged_array)
output:
Merged array: [1 2 3 4 5 6 7 8 9]
x. Create two arrays of six elements, write a NumPy program to count the number of instances
of a value occurring in one array on the condition of another array.
Ans:
import numpy as np
array1 = np.array([1, 2, 3, 2, 4, 2])
array2 = np.array([5, 6, 7, 6, 8, 6])
value_to_count = 2
condition_value = 6
count = np.sum((array1 == value_to_count) & (array2 == condition_value))
print("Number of instances:", count)
output:
Number of instances: 3
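
As a cross-check on the conditional count in program x, the same result can be obtained with np.count_nonzero on a Boolean mask. This is only a minimal sketch; the names mask and positions are illustrative and not part of the original record.

```python
import numpy as np

array1 = np.array([1, 2, 3, 2, 4, 2])
array2 = np.array([5, 6, 7, 6, 8, 6])

# Boolean mask: True where array1 is 2 AND the paired element of array2 is 6
mask = (array1 == 2) & (array2 == 6)

count = np.count_nonzero(mask)    # number of True entries in the mask
positions = np.flatnonzero(mask)  # indices where both conditions hold

print("Number of instances:", count)
print("Positions:", positions)
```

Keeping the mask in its own variable makes it possible to inspect which positions matched, not just how many.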

Result
Thus, the Python programs to work with NumPy have been executed successfully.

Ex.no: 2
Date:

Write a NumPy program to convert a Python dictionary to a NumPy ndarray
Aim
To write python NumPy program to convert a python dictionary to a NumPy ndarray.

Program:
import numpy as np
# Original dictionary
data_dict = {
'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}
}
# Convert the dictionary to a NumPy ndarray
ndarray = np.array([list(col.values()) for col in data_dict.values()]).T
print("Original dictionary:")
print(data_dict)
print("Type:")
print("ndarray:")
print(ndarray)
print("Type:", type(ndarray))

Sample output:
Original dictionary:
{'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}}
Type:
ndarray:
[[1. 0. 0. 2.]
[3. 1. 0. -1.]
[4. 1. 5. -1.]
[3. -1. -1. -1.]]
Type: <class 'numpy.ndarray'>

output:
Original dictionary:
{'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0}, 'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0}, 'column2': {'a':
4, 'b': 1, 'c': 5.0, 'd': -1.0}, 'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}}
Type:
ndarray:
[[ 1. 3. 4. 3.]
[ 0. 1. 1. -1.]
[ 0. 0. 5. -1.]
[ 2. -1. -1. -1.]]
Type: <class 'numpy.ndarray'>
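
The list-comprehension conversion above depends on every inner dictionary listing its keys in the same order. As a hedged alternative, assuming pandas is available, the nested dictionary can be passed through a DataFrame, which aligns values by key before converting. The names frame and ndarray_via_pandas are illustrative.

```python
import numpy as np
import pandas as pd

data_dict = {
    'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
    'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
    'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
    'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0},
}

# Outer keys become column labels; inner keys become row labels
frame = pd.DataFrame(data_dict)

# Convert the aligned table to a float ndarray
ndarray_via_pandas = frame.to_numpy(dtype=float)

print(frame)
print(ndarray_via_pandas)
```

The resulting array has the same layout as the program's actual output: one column per outer key.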

Result
Thus, python NumPy program to convert a python dictionary to a NumPy ndarray is executed
successfully.

Ex.no: 3
Date:

Working with Pandas DataFrame


Aim
To write python program to work with Pandas DataFrame.

Program
i. Create your own simple Pandas DataFrame and print its values.
import pandas as pd
# Creating a simple DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Printing DataFrame values
print("DataFrame values:")
print(df.values)
output:
DataFrame values:
[['Alice' 24 'New York']
['Bob' 27 'Los Angeles']
['Charlie' 22 'Chicago']
['David' 32 'Houston']]
ii. Perform appending, slicing, addition and deletion of rows with a pandas dataframe.
import pandas as pd
# Initial DataFrame
data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# 1. Append a new row
new_row = pd.DataFrame([{'Name': 'Eve', 'Age': 29, 'City': 'San Francisco'}])
df = pd.concat([df, new_row], ignore_index=True)
# 2. Slice rows (e.g., select rows 1 to 3)
sliced_df = df.iloc[1:4]
print("Sliced DataFrame (rows 1 to 3):")
print(sliced_df)
# 3. Add rows (concatenate with another DataFrame)
additional_data = pd.DataFrame({
'Name': ['Frank', 'Grace'],
'Age': [30, 25],
'City': ['Seattle', 'Austin']
})
df = pd.concat([df, additional_data], ignore_index=True)
# 4. Delete a row by index (e.g., delete row with index 2)
df = df.drop(index=2)
print("\nDataFrame after appending, adding, and deleting rows:")
print(df)
Output:
Sliced DataFrame (rows 1 to 3):
Name Age City
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
3 David 32 Houston
DataFrame after appending, adding, and deleting rows:
Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
3 David 32 Houston
4 Eve 29 San Francisco
5 Frank 30 Seattle
6 Grace 25 Austin

iii. Using Pandas, Create a DataFrame with a list of dictionaries, row indices, and column
indices
Program
import pandas as pd
# List of dictionaries
data = [
{'Name': 'Alice', 'Age': 24, 'City': 'New York'},
{'Name': 'Bob', 'Age': 27, 'City': 'Los Angeles'},
{'Name': 'Charlie', 'Age': 22, 'City': 'Chicago'},
{'Name': 'David', 'Age': 32, 'City': 'Houston'}
]
# Specifying row indices and column order
df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4'], columns=['Name', 'Age', 'City'])
print("DataFrame with specified row and column indices:")
print(df)
Output:
DataFrame with specified row and column indices:
Name Age City
row1 Alice 24 New York
row2 Bob 27 Los Angeles
row3 Charlie 22 Chicago
row4 David 32 Houston
iv. Write a Pandas program to get the powers of array values element-wise.
Note: First array elements raised to powers from second array
Sample data:
{'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]}
Expected Output:
X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83

Program
import pandas as pd
import numpy as np
# Sample data as a dictionary
data = {'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]}
df = pd.DataFrame(data)
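# Element-wise power: X raised to the power of Y.
# Note: X and Y are int64 columns, so results this large exceed the int64 range
# and wrap around silently; that is why most values in the output are 0.
# Casting to float (for example df['X'].astype(float)) avoids the wrap-around.
df['Power_X_Y'] = np.power(df['X'], df['Y'])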
print("Original DataFrame:")
print(df[['X', 'Y', 'Z']])
print("\nDataFrame with element-wise power of X^Y:")
print(df[['X', 'Y', 'Z', 'Power_X_Y']])
Output:
Original DataFrame:
X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83
DataFrame with element-wise power of X^Y:
X Y Z Power_X_Y
0 78 84 86 0
1 85 94 97 4551265826121030281
2 96 89 96 0
3 80 83 72 0
4 86 86 83 0
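
The zeros in the Power_X_Y column are a symptom of int64 overflow rather than the true powers. The sketch below, which is not part of the original record, contrasts the wrapped integer result with a float64 computation and with Python's arbitrary-precision integers. The names wrapped, as_float and exact are illustrative.

```python
import numpy as np

x = np.array([78, 85, 96, 80, 86], dtype=np.int64)
y = np.array([84, 94, 89, 83, 86], dtype=np.int64)

# int64 arithmetic wraps around silently when the result does not fit
wrapped = np.power(x, y)

# float64 keeps the magnitude, at the cost of limited precision
as_float = np.power(x.astype(float), y)

# Python integers are arbitrary precision, so these values are exact
exact = [int(a) ** int(b) for a, b in zip(x, y)]

print("int64 (wrapped):", wrapped)
print("float64:", as_float)
print("exact 78**84:", exact[0])
```

For values of this size, float64 is usually the practical choice inside a DataFrame; exact integers would require an object-dtype column.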

v. Write a Pandas Program to get the numeric representation of an array by identifying distinct
values of a given column of a DataFrame
Sample output:
Original DataFrame:
Name Date_Of_Birth Age
0 Alberto Franco 17/05/2002 18.5
1 Gino Mcnell 16/02/1999 21.2
2 Ryan Parkes 25/09/1998 22.5
3 Eesha Hinton 11/05/2002 22.0
4 Gino Mcnell 15/09/1997 23.0
Numeric representation of an array by identifying distinct values:
[0 1 2 3 1]
Index(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton'], dtype='object')
Program
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton', 'Gino Mcnell'],
'Date_Of_Birth': ['17/05/2002', '16/02/1999', '25/09/1998', '11/05/2002', '15/09/1997'],
'Age': [18.5, 21.2, 22.5, 22.0, 23.0]
}
df = pd.DataFrame(data)
# Getting the numeric representation of 'Name' column by identifying distinct values
df['Name_numeric'] = pd.factorize(df['Name'])[0]
print("Original DataFrame:")
print(df[['Name', 'Date_Of_Birth', 'Age']])
print("\nNumeric representation of an array by identifying distinct values:")
print(df['Name_numeric'].values)
print("\nUnique names with their numeric index mapping:")
print(pd.Index(df['Name'].unique()))

Output:
Original DataFrame:
Name Date_Of_Birth Age
0 Alberto Franco 17/05/2002 18.5
1 Gino Mcnell 16/02/1999 21.2
2 Ryan Parkes 25/09/1998 22.5
3 Eesha Hinton 11/05/2002 22.0
4 Gino Mcnell 15/09/1997 23.0
Numeric representation of an array by identifying distinct values:
[0 1 2 3 1]
Unique names with their numeric index mapping:
Index(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton'], dtype='object')
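
pd.factorize returns two objects, the integer codes and the distinct values, and the program above uses only the first. A minimal sketch of how the two fit together, using the same names as the sample data (the variables codes, uniques and restored are illustrative):

```python
import pandas as pd

names = pd.Series(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes',
                   'Eesha Hinton', 'Gino Mcnell'])

# codes: one integer per row; uniques: distinct values in order of first appearance
codes, uniques = pd.factorize(names)

# Indexing the uniques with the codes reconstructs the original column
restored = uniques[codes]

print(codes)
print(list(restored))
```

Because the repeated name receives the same code both times, the mapping is reversible.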

vi. Write a Pandas program to count the number of rows and columns of a DataFrame.
Sample python dictionary data and list labels:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Expected Output:
Number of Rows: 10
Number of Columns: 4
Program
import pandas as pd
import numpy as np
exam_data = {
'name': ['BarathKumar', 'TamilSelvan', 'Dharshan', 'Saravanan', 'SudhanKumar', 'EsaiVani',
'KalaiVani', 'Rupriya', 'Abirami', 'Murugan'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']

}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
# Creating the DataFrame with row labels
df = pd.DataFrame(exam_data, index=labels)
# Counting rows and columns
num_rows = df.shape[0]
num_columns = df.shape[1]
print("Number of Rows:", num_rows)
print("Number of Columns:", num_columns)
Output:
Number of Rows: 10
Number of Columns: 4
vii. Write a Pandas program to check a given column is present in a DataFrame or not
Sample data:
Original DataFrame
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 12
3 4 9 1
4 7 5 11
Col4 is not present in DataFrame.
Col1 is present in DataFrame.
Program
import pandas as pd
data = {
'col1': [1, 2, 3, 4, 7],
'col2': [4, 5, 6, 9, 5],
'col3': [7, 8, 12, 1, 11]
}
df = pd.DataFrame(data)
def check_column_presence(df, column_name):
    # Membership test against the DataFrame's column labels
    if column_name in df.columns:
        print(f"{column_name} is present in DataFrame.")
    else:
        print(f"{column_name} is not present in DataFrame.")
check_column_presence(df, 'col4')
check_column_presence(df, 'col1')
Output:
col4 is not present in DataFrame.
col1 is present in DataFrame.

Result
Thus, the python programs to work with Pandas DataFrame are executed successfully.

Ex.no: 4
Date:

Basic Plots using Matplotlib


Aim
To write python programs to plot basic plots using Matplotlib
i. Using the ‘concrete strength’ dataset, explore relationships between two continuous variables
with Scatterplots
Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
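# Set a random seed for reproducibility
np.random.seed(42)
# The data-generation step is missing from this record. The synthetic dataset
# below is an assumed stand-in; only the column names come from the plots that follow.
n_samples = 200
cement = np.random.uniform(100, 500, n_samples)  # kg/m³
water = np.random.uniform(120, 250, n_samples)  # kg/m³
noise = np.random.normal(0, 2, n_samples)
# Assumed parabolic dependence of strength on cement, reduced by water content
strength = 10 + 0.15 * cement - 0.0002 * cement**2 - 0.05 * water + noise
df = pd.DataFrame({'Cement': cement, 'Water': water, 'Strength': strength})
# Save the DataFrame as a CSV file
file_path = '/content/concrete_strength_parabolic.csv'
df.to_csv(file_path, index=False)
print(f"Concrete strength dataset saved to {file_path}")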
# Plotting relationships between continuous variables
# Scatterplot between 'Cement' and 'Strength'
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Cement', y='Strength', color='blue')
plt.title('Relationship between Cement and Concrete Strength')
plt.xlabel('Cement (kg/m³)')
plt.ylabel('Concrete Strength (MPa)')
plt.show()
# Scatterplot between 'Water' and 'Strength'
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Water', y='Strength', color='green')
plt.title('Relationship between Water and Concrete Strength')
plt.xlabel('Water (kg/m³)')
plt.ylabel('Concrete Strength (MPa)')

plt.show()

Output:

ii. Draw a Scatter Plot for the following Pandas DataFrame with Team name and Rank Points
as x and y axis,
[‘Australia’, 2500], [‘Bangladesh’, 1000], [‘England’, 2000], [‘India’, 3000], [‘Srilanka’, 1500]
Program
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Create the DataFrame with team names and rank points
data = {
'Team': ['Australia', 'Bangladesh', 'England', 'India', 'Srilanka'],
'Rank Points': [2500, 1000, 2000, 3000, 1500]
}
df_teams = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame:")
print(df_teams)
# Plotting the scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_teams, x='Team', y='Rank Points', color='Pink', s=100)
# Adding labels and title
plt.title('Scatter Plot of Team Rank Points')
plt.xlabel('Team')
plt.ylabel('Rank Points')
plt.show()

Output:

iii. Make a three-dimensional plot with 50 randomly generated data points for x, y, and z. Set the
point colour to red and the point size to 50.
Ans:
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
# Generating random data for x, y, and z axes
np.random.seed(42)
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)
# Creating a 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
# Plotting the points with specified color and size

ax.scatter(x, y, z, color='red', s=50)
# Adding labels for clarity
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
ax.set_title('3D Scatter Plot with Random Data Points')
plt.show()
Output:

Result
Thus, the Python programs to plot the basic plots using Matplotlib have been executed successfully.

Ex.no: 5
Date:

Univariate Analysis
Aim
To write python programs to apply Univariate Analysis
Use the diabetes data set from Pima Indians Diabetes data set for performing the following:
Apply Univariate analysis:
a. Frequency
b. Mean
c. Median
d. Mode
e. Variance
f. Standard Deviation
g. Skewness and Kurtosis

Program
import pandas as pd
import numpy as np
from scipy import stats
# Load the dataset
file_path = "/content/diabetes.csv" # Replace with your actual file path if different
data = pd.read_csv(file_path)
# Filter data for Outcome = 0 and Outcome = 1
data_0 = data[data['Outcome'] == 0]
data_1 = data[data['Outcome'] == 1]
# Dictionary to store the results
analysis_results = {
"Outcome = 0": {
"Pregnancies Frequency": data_0["Pregnancies"].value_counts(),
"Glucose Mean": np.mean(data_0["Glucose"]),
"BloodPressure Median": np.median(data_0["BloodPressure"]),
"SkinThickness Mode": stats.mode(data_0["SkinThickness"])[0],
"Insulin Variance": np.var(data_0["Insulin"]),
"BMI Standard Deviation": np.std(data_0["BMI"]),

"DiabetesPedigreeFunction Skewness": stats.skew(data_0["DiabetesPedigreeFunction"]),
"Age Kurtosis": stats.kurtosis(data_0["Age"])
},

"Outcome = 1": {
"Pregnancies Frequency": data_1["Pregnancies"].value_counts(),
"Glucose Mean": np.mean(data_1["Glucose"]),
"BloodPressure Median": np.median(data_1["BloodPressure"]),
"SkinThickness Mode": stats.mode(data_1["SkinThickness"])[0],
"Insulin Variance": np.var(data_1["Insulin"]),
"BMI Standard Deviation": np.std(data_1["BMI"]),
"DiabetesPedigreeFunction Skewness": stats.skew(data_1["DiabetesPedigreeFunction"]),
"Age Kurtosis": stats.kurtosis(data_1["Age"])
}
}
# Display the analysis for both outcomes
for outcome, stats_dict in analysis_results.items():
    print(f"\nStatistical Analysis for {outcome}:")
    for stat_name, value in stats_dict.items():
        print(f"{stat_name}: {value}")

output:

Statistical Analysis for Outcome = 0:


Pregnancies Frequency: Pregnancies
1 106
2 84
0 73
3 48
4 45
5 36
6 34
7 20

8 16
10 14
9 10
13 5
12 5
11 4
Name: count, dtype: int64
Glucose Mean: 109.98
BloodPressure Median: 70.0
SkinThickness Mode: 0
Insulin Variance: 9754.796735999955
BMI Standard Deviation: 7.682161307861215
DiabetesPedigreeFunction Skewness: 2.00021791479704
Age Kurtosis: 1.9318725201269862
Statistical Analysis for Outcome = 1:

Pregnancies Frequency: Pregnancies


0 38
1 29
3 27
7 25
4 23
8 22
5 21
2 19
9 18
6 16
10 10
11 7
13 5
12 4
14 2

15 1
17 1
Name: count, dtype: int64
Glucose Mean: 141.25746268656715
BloodPressure Median: 74.0
SkinThickness Mode: 0
Insulin Variance: 19162.902149699297
BMI Standard Deviation: 7.249404266473003
DiabetesPedigreeFunction Skewness: 1.7127179440927176
Age Kurtosis: -0.36378456012609117

Result
Thus, the Python programs to apply Univariate Analysis have been executed successfully.

Ex.no: 6
Date:

Use the diabetes data set from UCI data set for performing the following:
Apply Bivariate Analysis
Program:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load the dataset
file_path = "/content/diabetes.csv" # Update with your actual path if needed
data = pd.read_csv(file_path)
# Display dataset info
print("Dataset Info:")
print(data.info())
print("\nDataset Head:")
print(data.head())
# Multiple Regression Analysis - Logistic Regression for 'Outcome' Prediction
# Define predictors and target variable
X = data.drop(columns=["Outcome"]) # Independent variables
y = data["Outcome"] # Dependent variable
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit logistic regression model
logistic_model = LogisticRegression(max_iter=200)
logistic_model.fit(X_train, y_train)
# Predict on the test set
y_pred = logistic_model.predict(X_test)
# Model Evaluation

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Logistic Regression Model Evaluation:")
print(f"Accuracy: {accuracy:.3f}")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Logistic Regression Summary using StatsModels for detailed statistics
X_train_sm = sm.add_constant(X_train) # Adding constant for intercept in statsmodels
logit_model = sm.Logit(y_train, X_train_sm)
result = logit_model.fit()
print("\nLogistic Regression Analysis Summary (StatsModels):")
print(result.summary())

Output:
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64

dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
Dataset Head:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

Logistic Regression Model Evaluation:


Accuracy: 0.736
Confusion Matrix:
[[120 31]
[ 30 50]]

Classification Report:
precision recall f1-score support

0 0.80 0.79 0.80 151


1 0.62 0.62 0.62 80

accuracy 0.74 231


macro avg 0.71 0.71 0.71 231

weighted avg 0.74 0.74 0.74 231

Optimization terminated successfully.


Current function value: 0.459388
Iterations 6

Logistic Regression Analysis Summary (StatsModels):


Logit Regression Results
==============================================================================
Dep. Variable: Outcome No. Observations: 537
Model: Logit Df Residuals: 528
Method: MLE Df Model: 8
Date: Sat, 26 Oct 2024 Pseudo R-squ.: 0.2905
Time: 17:33:50 Log-Likelihood: -246.69
converged: True LL-Null: -347.71
Covariance Type: nonrobust LLR p-value: 2.378e-39
============================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------
const -9.4451 0.915 -10.321 0.000 -11.239 -7.651
Pregnancies 0.0580 0.039 1.477 0.140 -0.019 0.135
Glucose 0.0359 0.005 7.714 0.000 0.027 0.045
BloodPressure -0.0108 0.007 -1.584 0.113 -0.024 0.003
SkinThickness -0.0015 0.008 -0.179 0.858 -0.018 0.015
Insulin -0.0010 0.001 -0.884 0.377 -0.003 0.001
BMI 0.1090 0.019 5.740 0.000 0.072 0.146
DiabetesPedigreeFunction 0.4215 0.357 1.182 0.237 -0.278 1.120
Age 0.0359 0.012 3.106 0.002 0.013 0.059
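
The program above fits a multi-predictor logistic regression. For a strictly bivariate view (one variable against one other), a simpler pairwise summary may also be useful. The sketch below uses synthetic values as a stand-in for two columns of diabetes.csv; the data, the threshold of 130 and the name toy are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the Glucose and Outcome columns (values are made up)
glucose = rng.normal(120, 30, 300)
outcome = (glucose + rng.normal(0, 25, 300) > 130).astype(int)
toy = pd.DataFrame({'Glucose': glucose, 'Outcome': outcome})

# Pearson correlation between the two variables
pearson_r = toy['Glucose'].corr(toy['Outcome'])

# Mean of one variable within each level of the other
group_means = toy.groupby('Outcome')['Glucose'].mean()

print("Pearson r:", round(pearson_r, 3))
print(group_means)
```

With the real file, the same two calls could be applied after pd.read_csv to any pair of columns.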

Result
Thus, the Python program to apply bivariate analysis to the diabetes data set from UCI has been
executed successfully.

Ex.no: 7
Date:

Statistical and Probability measures on the Iris data set (This program
requires iris.csv file)
Aim
To write a python program to apply statistical and probability measures on any data set

Program
import pandas as pd
import matplotlib.pyplot as plt
# Load the Iris dataset from a local CSV file or from the web
# 1. Reading data from a local CSV file (the option used in this record)
df_web = pd.read_csv('/content/iris.csv')
# 2. Reading data directly from a URL (web)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
# Uncomment to load from the URL instead of the local file.
# Note: the URL version stores species as text, so df_web.corr() below
# would then need numeric_only=True.
#df_web = pd.read_csv(url, header=None, names=column_names)
# Displaying the first few rows to verify data load
print("First five rows of the Iris dataset:")
print(df_web.head())
# Descriptive Analytics on the Iris dataset
# 1. Basic information about the dataset
print("\nDataset Information:")
print(df_web.info())
# 2. Summary statistics
print("\nSummary Statistics:")
print(df_web.describe())
# 3. Checking for unique species
print("\nUnique Species in the dataset:")

print(df_web['species'].unique())
# 4. Count of each species
print("\nCount of each species:")
print(df_web['species'].value_counts())
# 5. Mean, median, and standard deviation of Sepal Length
print("\nMean Sepal Length:", df_web['sepal_length'].mean())
print("Median Sepal Length:", df_web['sepal_length'].median())
print("Standard Deviation of Sepal Length:", df_web['sepal_length'].std())
# 6. Correlation matrix to see relationships between variables
print("\nCorrelation Matrix:")
print(df_web.corr())
# 7. Grouping data by species and calculating mean values
print("\nMean values by species:")
print(df_web.groupby('species').mean())
# 8. Plotting pairplot for visual analysis
# (requires an environment with plotting capability)
import seaborn as sns
sns.pairplot(df_web, hue="species")
plt.show()
Output:
First five rows of the Iris dataset:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype

0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
None
Summary Statistics:
sepal_length sepal_width petal_length petal_width species
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333 1.000000
std 0.828066 0.435866 1.765298 0.762238 0.819232
min 4.300000 2.000000 1.000000 0.100000 0.000000
25% 5.100000 2.800000 1.600000 0.300000 0.000000
50% 5.800000 3.000000 4.350000 1.300000 1.000000
75% 6.400000 3.300000 5.100000 1.800000 2.000000
max 7.900000 4.400000 6.900000 2.500000 2.000000
Unique Species in the dataset:
[0 1 2]
Count of each species:
species
0 50
1 50
2 50
Name: count, dtype: int64
Mean Sepal Length: 5.843333333333334
Median Sepal Length: 5.8
Standard Deviation of Sepal Length: 0.8280661279778629
Correlation Matrix:
sepal_length sepal_width petal_length petal_width species
sepal_length 1.000000 -0.117570 0.871754 0.817941 0.782561

sepal_width -0.117570 1.000000 -0.428440 -0.366126 -0.426658
petal_length 0.871754 -0.428440 1.000000 0.962865 0.949035
petal_width 0.817941 -0.366126 0.962865 1.000000 0.956547
species 0.782561 -0.426658 0.949035 0.956547 1.000000
Mean values by species:
sepal_length sepal_width petal_length petal_width
species
0 5.006 3.428 1.462 0.246
1 5.936 2.770 4.260 1.326
2 6.588 2.974 5.552 2.026
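
The program above covers the statistical measures; the probability side of the exercise can be approached with relative frequencies. The sketch below is only an illustration on a tiny hand-made table (the eight rows and the 5.8 threshold are assumptions, not values from iris.csv), showing an empirical probability and a conditional probability per species.

```python
import pandas as pd

# Tiny hand-made table with the same column names as the Iris data
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.0, 6.4, 5.5, 7.1, 6.3, 5.0],
    'species':      [0,   0,   1,   1,   1,   2,   2,   0],
})

# Empirical probability of each species (relative frequency)
p_species = toy['species'].value_counts(normalize=True).sort_index()

# Empirical P(sepal_length > 5.8)
is_long = toy['sepal_length'] > 5.8
p_long = is_long.mean()

# Conditional probability P(sepal_length > 5.8 | species)
p_long_given_species = is_long.groupby(toy['species']).mean()

print(p_species)
print("P(sepal_length > 5.8):", p_long)
print(p_long_given_species)
```

The same three expressions could be applied to df_web once the real file is loaded.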

Result
Thus, the python program to apply statistical and probability measures on any data set is
executed successfully.

Ex.no: 8
Date:

Supervised and Unsupervised learning with python program


Aim
To implement Supervised and unsupervised learning with python program
Program
i. Supervised learning
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
url="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names=['sepal-length','sepal-width','petal-length','petal-width','Class']
dataset=pd.read_csv(url,names=names)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
from sklearn.preprocessing import LabelEncoder
from matplotlib.colors import ListedColormap
# Encode the class labels as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
# Use only the first two features for a 2D decision boundary plot
X_two_features = X[:, :2]
X_train_2D, X_test_2D, y_train_2D, y_test_2D = train_test_split(X_two_features, y_encoded,
test_size=0.20, random_state=42)
# Fit KNN model on 2D data
knn_2D = KNeighborsClassifier(n_neighbors=5)
knn_2D.fit(X_train_2D, y_train_2D)
# Create a mesh grid for plotting the decision boundary
h = .02

x_min, x_max = X_two_features[:, 0].min() - 1, X_two_features[:, 0].max() + 1
y_min, y_max = X_two_features[:, 1].min() - 1, X_two_features[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the labels for each point in the mesh grid
Z = knn_2D.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.figure(figsize=(8, 6))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
plt.contourf(xx, yy, Z, cmap=cmap_light)
# Plot the original data points
plt.scatter(X_two_features[:, 0], X_two_features[:, 1], c=y_encoded, cmap=cmap_bold,
edgecolor='k', s=20)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title("KNN Decision Boundary (2 features)")
plt.show()
output:
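
The supervised program plots a decision boundary but does not report how well the classifier performs on held-out data. A minimal sketch of that step follows. It assumes scikit-learn's bundled copy of the Iris data (load_iris) in place of the URL download, and it uses all four features rather than two.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled Iris data: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Fit KNN on the training split and score it on the test split
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = accuracy_score(y_test, knn.predict(X_test))

print(f"Test accuracy: {accuracy:.3f}")
```

Fixing random_state keeps the split, and therefore the reported accuracy, reproducible between runs.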

ii. Unsupervised learning Implementation of K-means Clustering Algorithm
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
np.random.seed(0)
X = np.random.randn(200, 2) + np.array([2, 2])
X = np.vstack((X, np.random.randn(200, 2) + np.array([-2, -2])))
X = np.vstack((X, np.random.randn(200, 2) + np.array([2, -2])))
X = np.vstack((X, np.random.randn(200, 2) + np.array([-2, 2])))
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='*', s=200,
color='black')
plt.show()

OUTPUT
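
The K-means program fixes n_clusters=4 because the synthetic data was built from four blobs. When the number of clusters is not known in advance, one common heuristic is to compare the within-cluster sum of squares (the inertia_ attribute) across several values of k. The sketch below is an illustration, not part of the original record; the range 1 to 6 and the n_init and random_state settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same four-blob construction as the program above
np.random.seed(0)
centers = [(2, 2), (-2, -2), (2, -2), (-2, 2)]
X = np.vstack([np.random.randn(200, 2) + np.array(c) for c in centers])

# Fit K-means for several k and record the within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 1))
```

Plotting inertias against k and looking for the point where the curve flattens is often called the elbow method.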

Result
Thus, the Python programs to implement supervised and unsupervised learning have been executed
successfully.

Ex. No: 9
Date:

Apply and explore various plotting functions on any data set.


Aim
To apply and explore various plotting functions on any data set
i. Apply and explore various plotting functions on UCI data set for performing the following:
i. Normal Value
ii. Density and contour plots
iii. Three-dimensional plotting
Program
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import norm

# Load the dataset


file_path = "/content/diabetes.csv" # Update path as needed
data = pd.read_csv(file_path)

# 1. Normal Value Distribution Plot


# Let's take 'Glucose' and 'BMI' as examples
plt.figure(figsize=(14, 6))

# Plot for Glucose


plt.subplot(1, 2, 1)
sns.histplot(data['Glucose'], kde=True, stat="density", line_kws={'linestyle':'--'}, color="skyblue")
x_vals = np.linspace(data['Glucose'].min(), data['Glucose'].max(), 100)
plt.plot(x_vals, norm.pdf(x_vals, data['Glucose'].mean(), data['Glucose'].std()), color="red",
linestyle="--")
plt.title("Normal Distribution of Glucose")
plt.xlabel("Glucose")

plt.ylabel("Density")

# Plot for BMI


plt.subplot(1, 2, 2)
sns.histplot(data['BMI'], kde=True, stat="density", line_kws={'linestyle':'--'}, color="orange")
x_vals = np.linspace(data['BMI'].min(), data['BMI'].max(), 100)
plt.plot(x_vals, norm.pdf(x_vals, data['BMI'].mean(), data['BMI'].std()), color="red", linestyle="--")
plt.title("Normal Distribution of BMI")
plt.xlabel("BMI")
plt.ylabel("Density")
plt.show()

# 2. Density and Contour Plots


# Using Glucose vs. Insulin for example
plt.figure(figsize=(8, 6))
sns.kdeplot(x=data['Glucose'], y=data['Insulin'], cmap="coolwarm", fill=True, thresh=0.05)
plt.title("Density and Contour Plot of Glucose vs Insulin")
plt.xlabel("Glucose")
plt.ylabel("Insulin")
plt.show()

# 3. Three-Dimensional Plotting
# 3D plot of Age, BMI, and Glucose colored by Outcome
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot
sc = ax.scatter(data['Age'], data['BMI'], data['Glucose'], c=data['Outcome'], cmap="viridis", s=50,
alpha=0.7)
ax.set_xlabel("Age")
ax.set_ylabel("BMI")
ax.set_zlabel("Glucose")
ax.set_title("3D Plot of Age, BMI, and Glucose")

plt.colorbar(sc, label="Outcome")
plt.show()
Output:

ii. Apply and explore various plotting functions on UCI data set for performing the following:
i. Correlation and scatter plots
ii. Histograms
iii. Three-dimensional plotting
Program:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Load the dataset


file_path = "/content/diabetes.csv"
data = pd.read_csv(file_path)

# i. Correlation and Scatter Plots


# Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

# Pairplot for Scatter Plots (pairwise relationships between variables)


sns.pairplot(data, hue="Outcome", diag_kind="kde")
plt.suptitle("Scatter Plots for Pairwise Relationships", y=1.02)
plt.show()

# ii. Histograms
# Plot histograms for continuous variables
data.hist(bins=15, figsize=(15, 10), color="skyblue", edgecolor="black")
plt.suptitle("Histograms of Diabetes Dataset Features", y=0.95)
plt.show()

# iii. Three-Dimensional Plotting
# 3D plot of Age, BMI, and Glucose colored by Outcome
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot
sc = ax.scatter(data['Age'], data['BMI'], data['Glucose'], c=data['Outcome'], cmap="viridis", s=50,
alpha=0.7)
ax.set_xlabel("Age")
ax.set_ylabel("BMI")
ax.set_zlabel("Glucose")
ax.set_title("3D Plot of Age, BMI, and Glucose")
plt.colorbar(sc, label="Outcome")
plt.show()

output:

Result
Thus, the Python programs to apply and explore various plotting functions on a data set have been
executed successfully.
