Dfs Manual
Certified that this is the bonafide record of work done by Selvan / Selvi
_____________________________________ of SEVENTH Semester of B.E Electrical and
Electronics Engineering branch during the Academic Year 2024 – 2025 in the OCS353 – DATA
SCIENCE FUNDAMENTALS
LIST OF EXPERIMENTS
EX. NO    DATE    EXPERIMENT NAME        PAGE NO    MARKS    SIGN
5                 Univariate Analysis    22
Ex.no: 1
Date:
Program
i. NumPy program to create a null vector of size 10 and update sixth value to 11
Ans:
import numpy as np
vector = np.zeros(10)
vector[5] = 11
print(vector)
output:
[ 0. 0. 0. 0. 0. 11. 0. 0. 0. 0.]
ii. NumPy program to convert an array to a float type
import numpy as np
array = np.array([1, 2, 3, 4, 5])
float_array = array.astype(float)
print(float_array)
output:
[1. 2. 3. 4. 5.]
iii. NumPy program to create a 3 * 3 matrix with values ranging from 2 to 10
Ans:
import numpy as np
matrix = np.arange(2, 11).reshape(3, 3)
print(matrix)
output:
[[ 2 3 4]
[ 5 6 7]
[ 8 9 10]]
iv. Write a NumPy program to convert a list and tuple into arrays
Ans:
import numpy as np
lst = [1, 2, 3, 4]
tpl = (5, 6, 7, 8)
array_from_list = np.array(lst)
array_from_tuple = np.array(tpl)
print(array_from_list)
print(array_from_tuple)
output:
[1 2 3 4]
[5 6 7 8]
v. Write a NumPy program to convert the values of Centigrade degrees into Fahrenheit degrees
and vice versa. Values have to be stored into a NumPy array.
Ans:
import numpy as np
centigrade = np.array([0, 20, 37, 100])
fahrenheit = (centigrade * 9/5) + 32
print("Centigrade to Fahrenheit:", fahrenheit)
fahrenheit_to_centigrade = (fahrenheit - 32) * 5/9
print("Fahrenheit to Centigrade:", fahrenheit_to_centigrade)
output:
Centigrade to Fahrenheit: [ 32. 68. 98.6 212. ]
Fahrenheit to Centigrade: [ 0. 20. 37. 100.]
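A quick way to check the two conversion formulas is a round trip: converting to Fahrenheit and back should reproduce the original values. The sketch below assumes two helper functions, c_to_f and f_to_c, which are illustrative names and not part of the program above.

```python
import numpy as np

# Hypothetical helper names, introduced only for this illustration
def c_to_f(celsius):
    """Convert Centigrade values to Fahrenheit."""
    return np.asarray(celsius, dtype=float) * 9 / 5 + 32

def f_to_c(fahrenheit):
    """Convert Fahrenheit values to Centigrade."""
    return (np.asarray(fahrenheit, dtype=float) - 32) * 5 / 9

centigrade = np.array([0, 20, 37, 100])
round_trip = f_to_c(c_to_f(centigrade))

# allclose tolerates the tiny floating-point error of the two conversions
round_trip_ok = np.allclose(round_trip, centigrade)
print("Round trip matches:", round_trip_ok)
```

np.allclose is used instead of == because floating-point arithmetic may not return the exact starting values.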
vi. Write a NumPy program to perform the basic arithmetic operations
Ans:
import numpy as np
array1 = np.array([10, 20, 30, 40])
array2 = np.array([1, 2, 3, 4])
addition = np.add(array1, array2)
subtraction = np.subtract(array1, array2)
multiplication = np.multiply(array1, array2)
division = np.divide(array1, array2)
print("Addition:", addition)
print("Subtraction:", subtraction)
print("Multiplication:", multiplication)
print("Division:", division)
Output:
Addition: [11 22 33 44]
Subtraction: [ 9 18 27 36]
Multiplication: [ 10 40 90 160]
Division: [10. 10. 10. 10.]
vii. Write a NumPy program to transpose an array
Ans:
import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6]])
transpose_array = np.transpose(array)
print("Original array:")
print(array)
print("Transposed array:")
print(transpose_array)
Output:
Original array:
[[1 2 3]
[4 5 6]]
Transposed array:
[[1 4]
[2 5]
[3 6]]
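The same transpose is also available through the .T attribute. The short sketch below, using the same 2 x 3 array, checks that both routes agree and that the transpose is a view on the original data rather than a copy.

```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])

# .T is shorthand for np.transpose(array) on a 2-D array
transposed = array.T
print(array.shape, "->", transposed.shape)

# Both routes give the same values
same_result = np.array_equal(transposed, np.transpose(array))

# The transpose is a view: it shares memory with the original array
shares_memory = np.shares_memory(array, transposed)
print("Same result:", same_result, "| shares memory:", shares_memory)
```

Because the transpose is a view, modifying transposed in place also changes array; call .copy() if an independent array is needed.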
viii. Use NumPy, create an array with 5 dimensions and verify that it has 5 dimensions
Ans:
import numpy as np
array_5d = np.ones((2, 2, 2, 2, 2))
print("Number of dimensions:", array_5d.ndim)
Output:
Number of dimensions: 5
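As an alternative sketch, the ndmin argument of np.array can raise an existing list to five dimensions without spelling out the full shape. The input list here is an arbitrary example.

```python
import numpy as np

# ndmin pads the shape with leading axes of length 1 until there are 5 dimensions
array_5d = np.array([1, 2, 3], ndmin=5)
print("Shape:", array_5d.shape)
print("Number of dimensions:", array_5d.ndim)
```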
ix. Write a NumPy program to merge three given NumPy arrays of same shape
Ans:
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
array3 = np.array([7, 8, 9])
merged_array = np.concatenate((array1, array2, array3))
print("Merged array:", merged_array)
output:
Merged array: [1 2 3 4 5 6 7 8 9]
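Depending on what "merge" should mean, the three arrays can also be kept as separate rows or columns instead of being joined end to end. The sketch below contrasts np.concatenate with np.stack and np.column_stack on the same inputs.

```python
import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
array3 = np.array([7, 8, 9])

# Join end to end along the existing axis: shape (9,)
flat = np.concatenate((array1, array2, array3))

# Stack along a new first axis so each input becomes a row: shape (3, 3)
stacked = np.stack((array1, array2, array3))

# Place each input as a column: shape (3, 3)
columns = np.column_stack((array1, array2, array3))

print(flat.shape, stacked.shape, columns.shape)
```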
x. Create two arrays of six elements, write a NumPy program to count the number of instances
of a value occurring in one array on the condition of another array.
Ans:
import numpy as np
array1 = np.array([1, 2, 3, 2, 4, 2])
array2 = np.array([5, 6, 7, 6, 8, 6])
value_to_count = 2
condition_value = 6
count = np.sum((array1 == value_to_count) & (array2 == condition_value))
print("Number of instances:", count)
output:
Number of instances: 3
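The same count can be obtained with np.count_nonzero, and np.flatnonzero shows where the matches occur. This is a minimal sketch on the arrays used above.

```python
import numpy as np

array1 = np.array([1, 2, 3, 2, 4, 2])
array2 = np.array([5, 6, 7, 6, 8, 6])

# True only where array1 is 2 and array2 is 6 at the same position
mask = (array1 == 2) & (array2 == 6)

count = np.count_nonzero(mask)    # number of True entries
positions = np.flatnonzero(mask)  # indices of the True entries

print("Number of instances:", count)
print("Positions:", positions)
```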
Result
Thus, the Python programs to work with NumPy are executed successfully.
Ex.no: 2
Date:
Program:
import numpy as np
# Original dictionary
data_dict = {
'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}
}
# Convert the dictionary to a NumPy ndarray
ndarray = np.array([list(col.values()) for col in data_dict.values()]).T
print("Original dictionary:")
print(data_dict)
print("Type:")
print("ndarray:")
print(ndarray)
print("Type:", type(ndarray))
Sample output:
Original dictionary:
{'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}}
Type:
ndarray:
[[1. 0. 0. 2.]
[3. 1. 0. -1.]
[4. 1. 5. -1.]
[3. -1. -1. -1.]]
Type: <class 'numpy.ndarray'>
output:
Original dictionary:
{'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0}, 'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0}, 'column2': {'a':
4, 'b': 1, 'c': 5.0, 'd': -1.0}, 'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}}
Type:
ndarray:
[[ 1. 3. 4. 3.]
[ 0. 1. 1. -1.]
[ 0. 0. 5. -1.]
[ 2. -1. -1. -1.]]
Type: <class 'numpy.ndarray'>
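An alternative sketch routes the conversion through a Pandas DataFrame, which aligns the inner keys ('a' to 'd') as rows and the outer keys as columns before handing back an ndarray. This is one possible approach, not the method the exercise prescribes.

```python
import pandas as pd

data_dict = {
    'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
    'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
    'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
    'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}
}

# Outer keys become columns, inner keys become the row index
frame = pd.DataFrame(data_dict)

# Convert to a float ndarray; rows follow 'a'..'d', columns follow column0..column3
ndarray = frame.to_numpy(dtype=float)
print(ndarray)
print("Type:", type(ndarray))
```

Going through a DataFrame matches values by key, so it still lines up correctly if an inner dictionary lists its keys in a different order.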
Result
Thus, the Python NumPy program to convert a Python dictionary to a NumPy ndarray is executed
successfully.
Ex.no: 3
Date:
Program
i. Create your own simple Pandas DataFrame and print its values.
import pandas as pd
# Creating a simple DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Printing DataFrame values
print("DataFrame values:")
print(df.values)
output:
DataFrame values:
[['Alice' 24 'New York']
['Bob' 27 'Los Angeles']
['Charlie' 22 'Chicago']
['David' 32 'Houston']]
ii. Perform appending, slicing, addition and deletion of rows with a pandas dataframe.
import pandas as pd
# Initial DataFrame
data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
# 1. Append a new row
new_row = pd.DataFrame([{'Name': 'Eve', 'Age': 29, 'City': 'San Francisco'}])
df = pd.concat([df, new_row], ignore_index=True)
# 2. Slice rows (e.g., select rows 1 to 3)
sliced_df = df.iloc[1:4]
print("Sliced DataFrame (rows 1 to 3):")
print(sliced_df)
# 3. Add rows (concatenate with another DataFrame)
additional_data = pd.DataFrame({
'Name': ['Frank', 'Grace'],
'Age': [30, 25],
'City': ['Seattle', 'Austin']
})
df = pd.concat([df, additional_data], ignore_index=True)
# 4. Delete a row by index (e.g., delete row with index 2)
df = df.drop(index=2)
print("\nDataFrame after appending, adding, and deleting rows:")
print(df)
Output:
Sliced DataFrame (rows 1 to 3):
Name Age City
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
3 David 32 Houston
DataFrame after appending, adding, and deleting rows:
Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
3 David 32 Houston
4 Eve 29 San Francisco
5 Frank 30 Seattle
6 Grace 25 Austin
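Note that the output above skips index 2 after the deletion. If consecutive row labels are wanted, reset_index can renumber them; the sketch below uses a shortened, illustrative copy of the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32]
})

# Deleting a row keeps the remaining labels as they were: 0, 1, 3
df = df.drop(index=2)
print(list(df.index))

# drop=True discards the old labels instead of keeping them as a new column
renumbered = df.reset_index(drop=True)
print(renumbered)
```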
iii. Using Pandas, Create a DataFrame with a list of dictionaries, row indices, and column
indices
Program
import pandas as pd
# List of dictionaries
data = [
{'Name': 'Alice', 'Age': 24, 'City': 'New York'},
{'Name': 'Bob', 'Age': 27, 'City': 'Los Angeles'},
{'Name': 'Charlie', 'Age': 22, 'City': 'Chicago'},
{'Name': 'David', 'Age': 32, 'City': 'Houston'}
]
# Specifying row indices and column order
df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4'], columns=['Name', 'Age', 'City'])
print("DataFrame with specified row and column indices:")
print(df)
Output:
DataFrame with specified row and column indices:
Name Age City
row1 Alice 24 New York
row2 Bob 27 Los Angeles
row3 Charlie 22 Chicago
row4 David 32 Houston
iv. Write a Pandas program to get the powers of array values element-wise.
Note: First array elements raised to powers from second array
Sample data:
{'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]}
Expected Output:
X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83
Program
import pandas as pd
import numpy as np
# Sample data as a dictionary
data = {'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]}
df = pd.DataFrame(data)
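# Element-wise power: X raised to the power of Y
# Caution: X and Y are int64 columns, so results this large overflow and wrap
# around, which is why most values in the output below are 0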
df['Power_X_Y'] = np.power(df['X'], df['Y'])
print("Original DataFrame:")
print(df[['X', 'Y', 'Z']])
print("\nDataFrame with element-wise power of X^Y:")
print(df[['X', 'Y', 'Z', 'Power_X_Y']])
Output:
Original DataFrame:
X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83
DataFrame with element-wise power of X^Y:
X Y Z Power_X_Y
0 78 84 86 0
1 85 94 97 4551265826121030281
2 96 89 96 0
3 80 83 72 0
4 86 86 83 0
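The zeros in this output come from 64-bit integer overflow: values such as 78 raised to 84 are far beyond the int64 range. One possible workaround, sketched below, is to cast the base column to float before calling np.power; the results are then approximate floating-point values rather than exact integers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86]})

# Casting the base to float64 avoids the int64 wrap-around
df['Power_X_Y'] = np.power(df['X'].astype(float), df['Y'])
print(df)
```

If exact integers are required, Python's built-in arbitrary-precision int can be used instead, for example [x ** y for x, y in zip(df['X'].tolist(), df['Y'].tolist())].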
v. Write a Pandas Program to get the numeric representation of an array by identifying distinct
values of a given column of a DataFrame
Sample output:
Original DataFrame:
Name Date_Of_Birth Age
0 Alberto Franco 17/05/2002 18.5
1 Gino Mcnell 16/02/1999 21.2
2 Ryan Parkes 25/09/1998 22.5
3 Eesha Hinton 11/05/2002 22.0
4 Gino Mcnell 15/09/1997 23.0
Numeric representation of an array by identifying distinct values:
[0 1 2 3 1]
Index(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton'], dtype='object')
Program
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton', 'Gino Mcnell'],
'Date_Of_Birth': ['17/05/2002', '16/02/1999', '25/09/1998', '11/05/2002', '15/09/1997'],
'Age': [18.5, 21.2, 22.5, 22.0, 23.0]
}
df = pd.DataFrame(data)
# Getting the numeric representation of 'Name' column by identifying distinct values
df['Name_numeric'] = pd.factorize(df['Name'])[0]
print("Original DataFrame:")
print(df[['Name', 'Date_Of_Birth', 'Age']])
print("\nNumeric representation of an array by identifying distinct values:")
print(df['Name_numeric'].values)
print("\nUnique names with their numeric index mapping:")
print(pd.Index(df['Name'].unique()))
Output:
Original DataFrame:
Name Date_Of_Birth Age
0 Alberto Franco 17/05/2002 18.5
1 Gino Mcnell 16/02/1999 21.2
2 Ryan Parkes 25/09/1998 22.5
3 Eesha Hinton 11/05/2002 22.0
4 Gino Mcnell 15/09/1997 23.0
Numeric representation of an array by identifying distinct values:
[0 1 2 3 1]
Unique names with their numeric index mapping:
Index(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton'], dtype='object')
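pd.factorize returns two objects: the integer codes and the distinct values they refer to. The sketch below unpacks both and uses the codes to rebuild the original names, which shows how the numeric representation maps back.

```python
import pandas as pd

names = pd.Series(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes',
                   'Eesha Hinton', 'Gino Mcnell'])

# codes: one integer per row; uniques: distinct values in order of first appearance
codes, uniques = pd.factorize(names)
print(codes)
print(uniques)

# Indexing the uniques with the codes reconstructs the original column
decoded = uniques[codes]
print(list(decoded))
```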
vi. Write a Pandas program to count the number of rows and columns of a DataFrame.
Sample python dictionary data and list labels:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Expected Output:
Number of Rows: 10
Number of Columns: 4
Program
import pandas as pd
import numpy as np
exam_data = {
'name': ['BarathKumar', 'TamilSelvan', 'Dharshan', 'Saravanan', 'SudhanKumar', 'EsaiVani',
'KalaiVani', 'Rupriya', 'Abirami', 'Murugan'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
# Creating the DataFrame with row labels
df = pd.DataFrame(exam_data, index=labels)
# Counting rows and columns
num_rows = df.shape[0]
num_columns = df.shape[1]
print("Number of Rows:", num_rows)
print("Number of Columns:", num_columns)
Output:
Number of Rows: 10
Number of Columns: 4
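The row count from shape includes rows with missing values. The sketch below, on a reduced two-column version of the exam data, unpacks shape in one step and contrasts it with count(), which skips NaN entries.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
num_rows, num_columns = df.shape

# count() ignores NaN, so it can be smaller than the number of rows
non_missing_scores = df['score'].count()

print("Rows:", num_rows, "Columns:", num_columns)
print("Non-missing scores:", non_missing_scores)
```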
vii. Write a Pandas program to check a given column is present in a DataFrame or not
Sample data:
Original DataFrame
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 12
3 4 9 1
4 7 5 11
col4 is not present in DataFrame.
col1 is present in DataFrame.
Program
import pandas as pd
data = {
'col1': [1, 2, 3, 4, 7],
'col2': [4, 5, 6, 9, 5],
'col3': [7, 8, 12, 1, 11]
}
df = pd.DataFrame(data)
def check_column_presence(df, column_name):
    if column_name in df.columns:
        print(f"{column_name} is present in DataFrame.")
    else:
        print(f"{column_name} is not present in DataFrame.")
check_column_presence(df, 'col4')
check_column_presence(df, 'col1')
Output:
col4 is not present in DataFrame.
col1 is present in DataFrame.
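When several columns need to be checked at once, the same membership test can be applied in a list comprehension. The list of wanted column names below is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 7],
    'col2': [4, 5, 6, 9, 5],
    'col3': [7, 8, 12, 1, 11]
})

# Hypothetical list of columns that later code expects to find
wanted = ['col1', 'col4', 'col3']

# Collect every expected column that the DataFrame does not have
missing = [name for name in wanted if name not in df.columns]
print("Missing columns:", missing)
```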
Result
Thus, the python programs to work with Pandas DataFrame are executed successfully.
Ex.no: 4
Date:
plt.show()
Output:
ii. Draw a Scatter Plot for the following Pandas DataFrame with Team name and Rank Points
as x and y axis,
['Australia', 2500], ['Bangladesh', 1000], ['England', 2000], ['India', 3000], ['Srilanka', 1500]
Program
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Create the DataFrame with team names and rank points
data = {
'Team': ['Australia', 'Bangladesh', 'England', 'India', 'Srilanka'],
'Rank Points': [2500, 1000, 2000, 3000, 1500]
}
df_teams = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame:")
print(df_teams)
# Plotting the scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_teams, x='Team', y='Rank Points', color='Pink', s=100)
# Adding labels and title
plt.title('Scatter Plot of Team Rank Points')
plt.xlabel('Team')
plt.ylabel('Rank Points')
plt.show()
Output:
iii. Make a three-dimensional plot with 50 randomly generated data points for x, y, and z. Set the
point colour to red and the point size to 50.
Ans:
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
# Generating random data for x, y, and z axes
np.random.seed(42)
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)
# Creating a 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
# Plotting the points with specified color and size
ax.scatter(x, y, z, color='red', s=50)
# Adding labels for clarity
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
ax.set_title('3D Scatter Plot with Random Data Points')
plt.show()
Output:
Result
Thus, the Python programs to plot the basic plots using Matplotlib are executed successfully.
Ex.no: 5
Date:
Univariate Analysis
Aim
To write python programs to apply Univariate Analysis
Use the Pima Indians Diabetes data set to perform the following:
Apply Univariate analysis:
a. Frequency
b. Mean
c. Median
d. Mode
e. Variance
f. Standard Deviation
g. Skewness and Kurtosis
Program
import pandas as pd
import numpy as np
from scipy import stats
# Load the dataset
file_path = "/content/diabetes.csv" # Replace with your actual file path if different
data = pd.read_csv(file_path)
# Filter data for Outcome = 0 and Outcome = 1
data_0 = data[data['Outcome'] == 0]
data_1 = data[data['Outcome'] == 1]
# Dictionary to store the results
analysis_results = {
"Outcome = 0": {
"Pregnancies Frequency": data_0["Pregnancies"].value_counts(),
"Glucose Mean": np.mean(data_0["Glucose"]),
"BloodPressure Median": np.median(data_0["BloodPressure"]),
"SkinThickness Mode": stats.mode(data_0["SkinThickness"])[0],
"Insulin Variance": np.var(data_0["Insulin"]),
"BMI Standard Deviation": np.std(data_0["BMI"]),
"DiabetesPedigreeFunction Skewness": stats.skew(data_0["DiabetesPedigreeFunction"]),
"Age Kurtosis": stats.kurtosis(data_0["Age"])
},
"Outcome = 1": {
"Pregnancies Frequency": data_1["Pregnancies"].value_counts(),
"Glucose Mean": np.mean(data_1["Glucose"]),
"BloodPressure Median": np.median(data_1["BloodPressure"]),
"SkinThickness Mode": stats.mode(data_1["SkinThickness"])[0],
"Insulin Variance": np.var(data_1["Insulin"]),
"BMI Standard Deviation": np.std(data_1["BMI"]),
"DiabetesPedigreeFunction Skewness": stats.skew(data_1["DiabetesPedigreeFunction"]),
"Age Kurtosis": stats.kurtosis(data_1["Age"])
}
}
# Display the analysis for both outcomes
for outcome, stats_dict in analysis_results.items():
    print(f"\nStatistical Analysis for {outcome}:")
    for stat_name, value in stats_dict.items():
        print(f"{stat_name}: {value}")
output:
8 16
10 14
9 10
13 5
12 5
11 4
Name: count, dtype: int64
Glucose Mean: 109.98
BloodPressure Median: 70.0
SkinThickness Mode: 0
Insulin Variance: 9754.796735999955
BMI Standard Deviation: 7.682161307861215
DiabetesPedigreeFunction Skewness: 2.00021791479704
Age Kurtosis: 1.9318725201269862
Statistical Analysis for Outcome = 1:
15 1
17 1
Name: count, dtype: int64
Glucose Mean: 141.25746268656715
BloodPressure Median: 74.0
SkinThickness Mode: 0
Insulin Variance: 19162.902149699297
BMI Standard Deviation: 7.249404266473003
DiabetesPedigreeFunction Skewness: 1.7127179440927176
Age Kurtosis: -0.36378456012609117
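When comparing these figures with other tools, note that np.var and np.std divide by N by default (population statistics, ddof=0), whereas the Pandas methods .var() and .std() divide by N - 1 (sample statistics, ddof=1). The sketch below shows the difference on a small made-up series, not on the diabetes data.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow (mean = 5)
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# NumPy default: population variance, sum of squared deviations / N
population_var = np.var(values.to_numpy())

# Pandas default: sample variance, sum of squared deviations / (N - 1)
sample_var = values.var()

# Passing ddof=1 to NumPy reproduces the Pandas result
numpy_sample_var = np.var(values.to_numpy(), ddof=1)

print("Population variance:", population_var)
print("Sample variance:", sample_var)
```

For large groups the two values are close, but the choice should still be stated when reporting variance or standard deviation.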
Result
Thus, the Python programs to apply Univariate Analysis are executed successfully.
Ex.no: 6
Date:
Use the diabetes data set from the UCI repository to perform the following:
Apply Bivariate Analysis
Program:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load the dataset
file_path = "/content/diabetes.csv" # Update with your actual path if needed
data = pd.read_csv(file_path)
# Display dataset info
print("Dataset Info:")
print(data.info())
print("\nDataset Head:")
print(data.head())
# Multiple Regression Analysis - Logistic Regression for 'Outcome' Prediction
# Define predictors and target variable
X = data.drop(columns=["Outcome"]) # Independent variables
y = data["Outcome"] # Dependent variable
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit logistic regression model
logistic_model = LogisticRegression(max_iter=200)
logistic_model.fit(X_train, y_train)
# Predict on the test set
y_pred = logistic_model.predict(X_test)
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Logistic Regression Model Evaluation:")
print(f"Accuracy: {accuracy:.3f}")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Logistic Regression Summary using StatsModels for detailed statistics
X_train_sm = sm.add_constant(X_train) # Adding constant for intercept in statsmodels
logit_model = sm.Logit(y_train, X_train_sm)
result = logit_model.fit()
print("\nLogistic Regression Analysis Summary (StatsModels):")
print(result.summary())
Output:
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
Dataset Head:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
Classification Report:
precision recall f1-score support
weighted avg 0.74 0.74 0.74 231
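The figures in a classification report can be derived by hand from the confusion matrix. The sketch below uses a made-up 2 x 2 matrix (not the values from this run) in the layout that scikit-learn's confusion_matrix returns: rows are actual classes, columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: [[TN, FP], [FN, TP]]
conf_matrix = np.array([[50, 10],
                        [5, 35]])

# ravel() flattens row by row, giving TN, FP, FN, TP for the binary case
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()  # correct predictions / all predictions
precision = tp / (tp + fp)                # of predicted positives, how many are right
recall = tp / (tp + fn)                   # of actual positives, how many are found

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```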
Result
Thus, the Python program to apply bivariate analysis on the diabetes data set from the UCI data set is
executed successfully.
Ex.no: 7
Date:
Statistical and Probability measures on the Iris data set (This program
requires iris.csv file)
Aim
To write a python program to apply statistical and probability measures on any data set
Program
import pandas as pd
import matplotlib.pyplot as plt
# Load the Iris dataset from a text file, Excel file, or from the web
# 1. Reading data from a text file (CSV format); this is the source used below
df_web = pd.read_csv('/content/iris.csv')
# 2. Reading data from an Excel file
# Uncomment if you have iris.xlsx locally (the path is a placeholder):
# df_excel = pd.read_excel('path_to_your_file/iris.xlsx')
# 3. Reading data directly from a URL (web)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
# Uncomment to load from the URL instead of the local file.
# Note: the raw UCI file stores species as text, so corr() below would need
# numeric_only=True or an encoded species column.
# df_web = pd.read_csv(url, header=None, names=column_names)
# Displaying the first few rows to verify data load
print("First five rows of the Iris dataset:")
print(df_web.head())
# Descriptive Analytics on the Iris dataset
# 1. Basic information about the dataset
print("\nDataset Information:")
print(df_web.info())
# 2. Summary statistics
print("\nSummary Statistics:")
print(df_web.describe())
# 3. Checking for unique species
print("\nUnique Species in the dataset:")
print(df_web['species'].unique())
# 4. Count of each species
print("\nCount of each species:")
print(df_web['species'].value_counts())
# 5. Mean, median, and standard deviation of Sepal Length
print("\nMean Sepal Length:", df_web['sepal_length'].mean())
print("Median Sepal Length:", df_web['sepal_length'].median())
print("Standard Deviation of Sepal Length:", df_web['sepal_length'].std())
# 6. Correlation matrix to see relationships between variables
print("\nCorrelation Matrix:")
print(df_web.corr())
# 7. Grouping data by species and calculating mean values
print("\nMean values by species:")
print(df_web.groupby('species').mean())
# 8. Plotting a pairplot for visual analysis
# (requires an environment with plotting capability)
import seaborn as sns
sns.pairplot(df_web, hue="species")
plt.show()
Output:
First five rows of the Iris dataset:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
None
Summary Statistics:
sepal_length sepal_width petal_length petal_width species
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333 1.000000
std 0.828066 0.435866 1.765298 0.762238 0.819232
min 4.300000 2.000000 1.000000 0.100000 0.000000
25% 5.100000 2.800000 1.600000 0.300000 0.000000
50% 5.800000 3.000000 4.350000 1.300000 1.000000
75% 6.400000 3.300000 5.100000 1.800000 2.000000
max 7.900000 4.400000 6.900000 2.500000 2.000000
Unique Species in the dataset:
[0 1 2]
Count of each species:
species
0 50
1 50
2 50
Name: count, dtype: int64
Mean Sepal Length: 5.843333333333334
Median Sepal Length: 5.8
Standard Deviation of Sepal Length: 0.8280661279778629
Correlation Matrix:
sepal_length sepal_width petal_length petal_width species
sepal_length 1.000000 -0.117570 0.871754 0.817941 0.782561
sepal_width -0.117570 1.000000 -0.428440 -0.366126 -0.426658
petal_length 0.871754 -0.428440 1.000000 0.962865 0.949035
petal_width 0.817941 -0.366126 0.962865 1.000000 0.956547
species 0.782561 -0.426658 0.949035 0.956547 1.000000
Mean values by species:
sepal_length sepal_width petal_length petal_width
species
0 5.006 3.428 1.462 0.246
1 5.936 2.770 4.260 1.326
2 6.588 2.974 5.552 2.026
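The program above covers descriptive statistics. A simple probability measure can be added by treating relative frequencies as empirical probabilities. The sketch below uses a tiny made-up sample with the same column names, and the threshold of 6.0 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical six-row sample, not taken from iris.csv
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    'species': [0, 0, 1, 1, 2, 2]
})
threshold = 6.0

# The mean of a boolean mask is the fraction of True values,
# that is, the empirical probability P(sepal_length > threshold)
p_long = (df['sepal_length'] > threshold).mean()

# Relative frequency of each species, P(species = k)
p_species = df['species'].value_counts(normalize=True)

# Conditional probability P(sepal_length > threshold | species = 2)
p_long_given_2 = (df.loc[df['species'] == 2, 'sepal_length'] > threshold).mean()

print("P(long sepal):", p_long)
print("P(long sepal | species 2):", p_long_given_2)
```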
Result
Thus, the python program to apply statistical and probability measures on any data set is
executed successfully.
Ex.no: 8
Date:
x_min, x_max = X_two_features[:, 0].min() - 1, X_two_features[:, 0].max() + 1
y_min, y_max = X_two_features[:, 1].min() - 1, X_two_features[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the labels for each point in the mesh grid
Z = knn_2D.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.figure(figsize=(8, 6))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
plt.contourf(xx, yy, Z, cmap=cmap_light)
# Plot the original data points
plt.scatter(X_two_features[:, 0], X_two_features[:, 1], c=y_encoded, cmap=cmap_bold,
edgecolor='k', s=20)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title("KNN Decision Boundary (2 features)")
plt.show()
output:
ii. Unsupervised learning Implementation of K-means Clustering Algorithm
Program:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
np.random.seed(0)
X = np.random.randn(200, 2) + np.array([2, 2])
X = np.vstack((X, np.random.randn(200, 2) + np.array([-2, -2])))
X = np.vstack((X, np.random.randn(200, 2) + np.array([2, -2])))
X = np.vstack((X, np.random.randn(200, 2) + np.array([-2, 2])))
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='*', s=200,
color='black')
plt.show()
OUTPUT
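K-means starts from random centroids, so cluster numbering can change between runs. The sketch below fixes random_state and n_init for repeatable results and then assigns two new points to the fitted clusters. The data generation mirrors the program above but uses NumPy's default_rng, and the new points are arbitrary examples.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four blobs of 200 points each around the same centres as the program above
rng = np.random.default_rng(0)
true_centers = np.array([[2, 2], [-2, -2], [2, -2], [-2, 2]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(200, 2)) for c in true_centers])

# Fixed n_init and random_state make the fit repeatable
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Assign previously unseen points to the nearest learned centroid
new_points = np.array([[2.5, 2.0], [-2.0, -1.5]])
predicted = kmeans.predict(new_points)

print("Cluster centres:")
print(kmeans.cluster_centers_)
print("Predicted clusters for new points:", predicted)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
```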
Result
Thus, the Python programs to implement supervised and unsupervised learning are executed
successfully.
Ex. No: 9
Date:
plt.ylabel("Density")
# 3. Three-Dimensional Plotting
# 3D plot of Age, BMI, and Glucose colored by Outcome
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
# Scatter plot
sc = ax.scatter(data['Age'], data['BMI'], data['Glucose'], c=data['Outcome'], cmap="viridis", s=50,
alpha=0.7)
ax.set_xlabel("Age")
ax.set_ylabel("BMI")
ax.set_zlabel("Glucose")
ax.set_title("3D Plot of Age, BMI, and Glucose")
plt.colorbar(sc, label="Outcome")
plt.show()
Output:
ii. Apply and explore various plotting functions on UCI data set for performing the following:
i. Correlation and scatter plots
ii. Histograms
iii. Three-dimensional plotting
Program:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
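import numpy as np
# Load the dataset (assumed to be the same diabetes.csv used in Ex. 5 and Ex. 6)
file_path = "/content/diabetes.csv" # Update with your actual path if needed
data = pd.read_csv(file_path)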
# ii. Histograms
# Plot histograms for continuous variables
data.hist(bins=15, figsize=(15, 10), color="skyblue", edgecolor="black")
plt.suptitle("Histograms of Diabetes Dataset Features", y=0.95)
plt.show()
# iii. Three-Dimensional Plotting
# 3D plot of Age, BMI, and Glucose colored by Outcome
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
# Scatter plot
sc = ax.scatter(data['Age'], data['BMI'], data['Glucose'], c=data['Outcome'], cmap="viridis", s=50,
alpha=0.7)
ax.set_xlabel("Age")
ax.set_ylabel("BMI")
ax.set_zlabel("Glucose")
ax.set_title("3D Plot of Age, BMI, and Glucose")
plt.colorbar(sc, label="Outcome")
plt.show()
output:
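Item i of this exercise asks for correlation, which the listing above does not show. A minimal sketch of that step is given below on a small made-up DataFrame that reuses three column names from the diabetes data; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample with diabetes-style column names
data = pd.DataFrame({
    'Glucose': [85, 89, 137, 148, 183],
    'BMI': [26.6, 28.1, 43.1, 33.6, 23.3],
    'Age': [31, 21, 33, 50, 32]
})

# Pairwise Pearson correlation between all numeric columns
corr = data.corr()
print(corr)

# A single pair can be read off the matrix by label
glucose_bmi = corr.loc['Glucose', 'BMI']
print("Glucose vs BMI:", glucose_bmi)
```

The resulting matrix can be passed to seaborn's heatmap function (for example sns.heatmap(corr, annot=True)) to visualise it, and a scatter plot of any two columns shows the relationship behind a single coefficient.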
Result
Thus, the Python program to apply and explore various plotting functions on any data set is
executed successfully.