
Write a python NumPy program to create a null vector of size 10 and update sixth value to 11

Ans:

import numpy as np

vector = np.zeros(10)

vector[5] = 11

print(vector)

output:

[ 0. 0. 0. 0. 0. 11. 0. 0. 0. 0.]

Write a NumPy program to convert an array to a float type

Ans:

import numpy as np

array = np.array([1, 2, 3, 4, 5])

float_array = array.astype(float)

print(float_array)

output:

[1. 2. 3. 4. 5.]

Write a NumPy program to create a 3 * 3 matrix with values ranging from 2 to 10

Ans:

import numpy as np

matrix = np.arange(2, 11).reshape(3, 3)

print(matrix)

output:

[[ 2 3 4]

[ 5 6 7]

[ 8 9 10]]
Write a NumPy program to convert a list and tuple into arrays

Ans:

import numpy as np

lst = [1, 2, 3, 4]

tpl = (5, 6, 7, 8)

array_from_list = np.array(lst)

array_from_tuple = np.array(tpl)

print(array_from_list)

print(array_from_tuple)

output:

[1 2 3 4]

[5 6 7 8]

Write a NumPy program to convert the values of Centigrade degrees into Fahrenheit degrees and
vice versa. Values have to be stored into a NumPy array.

Ans

import numpy as np

centigrade = np.array([0, 20, 37, 100])

fahrenheit = (centigrade * 9/5) + 32

print("Centigrade to Fahrenheit:", fahrenheit)

fahrenheit_to_centigrade = (fahrenheit - 32) * 5/9

print("Fahrenheit to Centigrade:", fahrenheit_to_centigrade)

output:

Centigrade to Fahrenheit: [ 32. 68. 98.6 212. ]

Fahrenheit to Centigrade: [ 0. 20. 37. 100.]


Write a NumPy program to perform the basic arithmetic operations

Ans:

import numpy as np

array1 = np.array([10, 20, 30, 40])

array2 = np.array([1, 2, 3, 4])

addition = np.add(array1, array2)

subtraction = np.subtract(array1, array2)

multiplication = np.multiply(array1, array2)

division = np.divide(array1, array2)

print("Addition:", addition)

print("Subtraction:", subtraction)

print("Multiplication:", multiplication)

print("Division:", division)

Output:

Addition: [11 22 33 44]

Subtraction: [ 9 18 27 36]

Multiplication: [ 10 40 90 160]

Division: [10. 10. 10. 10.]

Write a NumPy program to transpose an array

Ans:

import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])

transpose_array = np.transpose(array)

print("Original array:")

print(array)

print("Transposed array:")

print(transpose_array)
Output:

Original array:

[[1 2 3]

[4 5 6]]

Transposed array:

[[1 4]

[2 5]

[3 6]]

Using NumPy, create an array with 5 dimensions and verify that it has 5 dimensions

Ans:

import numpy as np

array_5d = np.ones((2, 2, 2, 2, 2))

print("Number of dimensions:", array_5d.ndim)

Output:

Number of dimensions: 5
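An alternative sketch (not part of the original answer): np.array also accepts an ndmin argument that pads the shape with leading dimensions of size 1, which is another way to obtain a 5-dimensional array.

import numpy as np

# ndmin pads the shape with leading 1s, giving shape (1, 1, 1, 1, 4)
arr_5d = np.array([1, 2, 3, 4], ndmin=5)

print("Shape:", arr_5d.shape)

print("Number of dimensions:", arr_5d.ndim)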

Write a NumPy program to merge three given NumPy arrays of the same shape

Ans:

import numpy as np

array1 = np.array([1, 2, 3])

array2 = np.array([4, 5, 6])

array3 = np.array([7, 8, 9])

merged_array = np.concatenate((array1, array2, array3))

print("Merged array:", merged_array)

output:

Merged array: [1 2 3 4 5 6 7 8 9]
Create two arrays of six elements and write a NumPy program to count the number of instances of a
value occurring in one array on the condition of another array.

Ans:

import numpy as np

array1 = np.array([1, 2, 3, 2, 4, 2])

array2 = np.array([5, 6, 7, 6, 8, 6])

value_to_count = 2

condition_value = 6

count = np.sum((array1 == value_to_count) & (array2 == condition_value))

print("Number of instances:", count)

output:

Number of instances: 3

Write a NumPy program to convert a python dictionary to a NumPy ndarray.

Sample output:

Original dictionary:

{'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},

'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},

'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},

'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}}

Type:

ndarray:

[[1. 0. 0. 2.]

[3. 1. 0. -1.]

[4. 1. 5. -1.]

[3. -1. -1. -1.]]

Type: <class 'numpy.ndarray'>
Ans:

import numpy as np

# Original dictionary

data_dict = {

'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},

'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},

'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},

'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}
}

# Convert the dictionary to a NumPy ndarray (each outer key becomes one row)

ndarray = np.array([list(col.values()) for col in data_dict.values()])

print("Original dictionary:")

print(data_dict)

print("Type:")

print("ndarray:")

print(ndarray)

print("Type:", type(ndarray))

output:

Original dictionary:

{'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0}, 'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0}, 'column2': {'a':
4, 'b': 1, 'c': 5.0, 'd': -1.0}, 'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}}

Type:

ndarray:

[[ 1. 0. 0. 2.]

[ 3. 1. 0. -1.]

[ 4. 1. 5. -1.]

[ 3. -1. -1. -1.]]

Type: <class 'numpy.ndarray'>
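A shorter variant (a sketch, assuming pandas is available alongside NumPy): a DataFrame aligns the inner keys automatically, and transposing it gives the same row-per-outer-key layout as the sample output.

import pandas as pd

data_dict = {
'column0': {'a': 1, 'b': 0.0, 'c': 0.0, 'd': 2.0},
'column1': {'a': 3.0, 'b': 1, 'c': 0.0, 'd': -1.0},
'column2': {'a': 4, 'b': 1, 'c': 5.0, 'd': -1.0},
'column3': {'a': 3.0, 'b': -1.0, 'c': -1.0, 'd': -1.0}
}

# DataFrame columns come from the outer keys; transposing makes each outer key a row
ndarray = pd.DataFrame(data_dict).T.to_numpy()

print(ndarray)

print(type(ndarray))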


Create your own simple Pandas DataFrame and print its values.

Ans:

import pandas as pd

# Creating a simple DataFrame

data = {

'Name': ['Alice', 'Bob', 'Charlie', 'David'],

'Age': [24, 27, 22, 32],

'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']

}

df = pd.DataFrame(data)

# Printing DataFrame values

print("DataFrame values:")

print(df.values)

output:

DataFrame values:

[['Alice' 24 'New York']

['Bob' 27 'Los Angeles']

['Charlie' 22 'Chicago']

['David' 32 'Houston']]

Perform appending, slicing, addition and deletion of rows with a pandas dataframe.

Ans:

import pandas as pd

# Initial DataFrame

data = {

'Name': ['Alice', 'Bob', 'Charlie', 'David'],

'Age': [24, 27, 22, 32],

'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']

}
df = pd.DataFrame(data)

# 1. Append a new row

new_row = pd.DataFrame([{'Name': 'Eve', 'Age': 29, 'City': 'San Francisco'}])

df = pd.concat([df, new_row], ignore_index=True)

# 2. Slice rows (e.g., select rows 1 to 3)

sliced_df = df.iloc[1:4]

print("Sliced DataFrame (rows 1 to 3):")

print(sliced_df)

# 3. Add rows (concatenate with another DataFrame)

additional_data = pd.DataFrame({

'Name': ['Frank', 'Grace'],

'Age': [30, 25],

'City': ['Seattle', 'Austin']

})

df = pd.concat([df, additional_data], ignore_index=True)

# 4. Delete a row by index (e.g., delete row with index 2)

df = df.drop(index=2)

print("\nDataFrame after appending, adding, and deleting rows:")

print(df)

Output:

Sliced DataFrame (rows 1 to 3):

Name Age City

1 Bob 27 Los Angeles

2 Charlie 22 Chicago

3 David 32 Houston
DataFrame after appending, adding, and deleting rows:

Name Age City

0 Alice 24 New York

1 Bob 27 Los Angeles

3 David 32 Houston

4 Eve 29 San Francisco

5 Frank 30 Seattle

6 Grace 25 Austin

Using Pandas, create a DataFrame with a list of dictionaries, row indices, and column indices

Ans:

import pandas as pd

# List of dictionaries

data = [

{'Name': 'Alice', 'Age': 24, 'City': 'New York'},

{'Name': 'Bob', 'Age': 27, 'City': 'Los Angeles'},

{'Name': 'Charlie', 'Age': 22, 'City': 'Chicago'},

{'Name': 'David', 'Age': 32, 'City': 'Houston'}

]

# Specifying row indices and column order

df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4'], columns=['Name', 'Age', 'City'])

print("DataFrame with specified row and column indices:")

print(df)

Output:

DataFrame with specified row and column indices:

Name Age City

row1 Alice 24 New York

row2 Bob 27 Los Angeles

row3 Charlie 22 Chicago

row4 David 32 Houston


Write a Pandas program to get the powers of array values element-wise.

Note: First array elements raised to powers from second array

Sample data:

{'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]}

Expected Output:

   X   Y   Z
0  78  84  86
1  85  94  97
2  96  89  96
3  80  83  72
4  86  86  83

Ans:

import pandas as pd

import numpy as np

# Sample data as a dictionary

data = {'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]}

df = pd.DataFrame(data)

# Element-wise power: X raised to the power of Y

df['Power_X_Y'] = np.power(df['X'], df['Y'])

print("Original DataFrame:")

print(df[['X', 'Y', 'Z']])

print("\nDataFrame with element-wise power of X^Y:")

print(df[['X', 'Y', 'Z', 'Power_X_Y']])


Output:

Original DataFrame:

X Y Z

0 78 84 86

1 85 94 97

2 96 89 96

3 80 83 72

4 86 86 83

DataFrame with element-wise power of X^Y:

X Y Z Power_X_Y

0 78 84 86 0

1 85 94 97 4551265826121030281

2 96 89 96 0

3 80 83 72 0

4 86 86 83 0
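Note on the values above: np.power keeps the int64 dtype, and numbers such as 78**84 do not fit in 64 bits, so the results wrap around (the even bases here wrap to 0 because they carry more than 64 factors of two). A minimal sketch, not from the original answer, that computes the exact powers by falling back to Python's arbitrary-precision integers:

import pandas as pd

data = {'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]}

df = pd.DataFrame(data)

# Python ints never overflow, so each power is exact (at the cost of an object-dtype column)
df['Power_X_Y'] = [x ** y for x, y in zip(df['X'], df['Y'])]

print(df[['X', 'Y', 'Power_X_Y']])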

Write a Pandas Program to get the numeric representation of an array by identifying distinct values
of a given column of a DataFrame

Sample output:

Original DataFrame:

Name Date_Of_Birth Age

0 Alberto Franco 17/05/2002 18.5

1 Gino Mcnell 16/02/1999 21.2

2 Ryan Parkes 25/09/1998 22.5

3 Eesha Hinton 11/05/2002 22.0

4 Gino Mcnell 15/09/1997 23.0

Numeric representation of an array by identifying distinct values:

[0 1 2 3 1]

Index(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton'], dtype='object')


Ans:

import pandas as pd

# Sample DataFrame

data = {

'Name': ['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton', 'Gino Mcnell'],

'Date_Of_Birth': ['17/05/2002', '16/02/1999', '25/09/1998', '11/05/2002', '15/09/1997'],

'Age': [18.5, 21.2, 22.5, 22.0, 23.0]

}

df = pd.DataFrame(data)

# Getting the numeric representation of 'Name' column by identifying distinct values

df['Name_numeric'] = pd.factorize(df['Name'])[0]

print("Original DataFrame:")

print(df[['Name', 'Date_Of_Birth', 'Age']])

print("\nNumeric representation of an array by identifying distinct values:")

print(df['Name_numeric'].values)

print("\nUnique names with their numeric index mapping:")

print(pd.Index(df['Name'].unique()))

Output:

Original DataFrame:

Name Date_Of_Birth Age

0 Alberto Franco 17/05/2002 18.5

1 Gino Mcnell 16/02/1999 21.2

2 Ryan Parkes 25/09/1998 22.5

3 Eesha Hinton 11/05/2002 22.0

4 Gino Mcnell 15/09/1997 23.0

Numeric representation of an array by identifying distinct values:

[0 1 2 3 1]

Unique names with their numeric index mapping:

Index(['Alberto Franco', 'Gino Mcnell', 'Ryan Parkes', 'Eesha Hinton'], dtype='object')


Write a Pandas program to count the number of rows and columns of a DataFrame.

Sample python dictionary data and list labels:

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],

'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],

'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],

'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Expected Output:

Number of Rows: 10

Number of Columns: 4

Ans:

import pandas as pd

import numpy as np

exam_data = {

'name': ['BarathKumar', 'TamilSelvan', 'Dharshan', 'Saravanan', 'SudhanKumar', 'EsaiVani',


'KalaiVani', 'Rupriya', 'Abirami', 'Murugan'],

'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],

'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],

'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']

}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# Creating the DataFrame with row labels

df = pd.DataFrame(exam_data, index=labels)

# Counting rows and columns

num_rows = df.shape[0]

num_columns = df.shape[1]

print("Number of Rows:", num_rows)

print("Number of Columns:", num_columns)

Output:

Number of Rows: 10

Number of Columns: 4
Write a Pandas program to check whether a given column is present in a DataFrame or not

Sample data:

Original DataFrame

   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11

Col4 is not present in DataFrame.

Col1 is present in DataFrame.

Ans:

import pandas as pd

data = {

'col1': [1, 2, 3, 4, 7],

'col2': [4, 5, 6, 9, 5],

'col3': [7, 8, 12, 1, 11]

}

df = pd.DataFrame(data)

def check_column_presence(df, column_name):

    if column_name in df.columns:

        print(f"{column_name} is present in DataFrame.")

    else:

        print(f"{column_name} is not present in DataFrame.")

check_column_presence(df, 'col4')

check_column_presence(df, 'col1')

Output:

col4 is not present in DataFrame.

col1 is present in DataFrame.


Using the ‘concrete strength’ dataset, explore relationships between two continuous variables with
Scatterplots

Ans:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Set a random seed for reproducibility

np.random.seed(42)

# Build a small synthetic concrete-strength sample as a stand-in for the real dataset
# (replace this block with pd.read_csv on your copy of the concrete strength file if you have it)

n = 100

df = pd.DataFrame({
'Cement': np.random.uniform(100, 500, n),
'Water': np.random.uniform(120, 250, n)
})

df['Strength'] = 15 + 0.08 * df['Cement'] - 0.05 * df['Water'] + np.random.normal(0, 5, n)

# Save the DataFrame as a CSV file

file_path = '/content/concrete_strength_parabolic.csv'

df.to_csv(file_path, index=False)

print(f"Concrete strength dataset saved to {file_path}")

# Plotting relationships between continuous variables

# Scatterplot between 'Cement' and 'Strength'

plt.figure(figsize=(8, 6))

sns.scatterplot(data=df, x='Cement', y='Strength', color='blue')

plt.title('Relationship between Cement and Concrete Strength')

plt.xlabel('Cement (kg/m³)')

plt.ylabel('Concrete Strength (MPa)')

plt.show()

# Scatterplot between 'Water' and 'Strength'

plt.figure(figsize=(8, 6))

sns.scatterplot(data=df, x='Water', y='Strength', color='green')

plt.title('Relationship between Water and Concrete Strength')

plt.xlabel('Water (kg/m³)')

plt.ylabel('Concrete Strength (MPa)')

plt.show()
Output:
Draw a Scatter Plot for the following Pandas DataFrame with Team name and Rank Points as x and
y axis,

[‘Australia’, 2500], [‘Bangladesh’, 1000], [‘England’, 2000], [‘India’, 3000], [‘Srilanka’, 1500]

Ans:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

# Create the DataFrame with team names and rank points

data = {

'Team': ['Australia', 'Bangladesh', 'England', 'India', 'Srilanka'],

'Rank Points': [2500, 1000, 2000, 3000, 1500]

}

df_teams = pd.DataFrame(data)

# Display the DataFrame

print("DataFrame:")

print(df_teams)

# Plotting the scatter plot

plt.figure(figsize=(8, 6))

sns.scatterplot(data=df_teams, x='Team', y='Rank Points', color='Pink', s=100)

# Adding labels and title

plt.title('Scatter Plot of Team Rank Points')

plt.xlabel('Team')

plt.ylabel('Rank Points')

plt.show()

Output:
24. Perform reading data from text files, Excel and the web, and explore various commands for
doing descriptive analytics on the Iris data set (this program requires the iris.csv file)

Ans:

import pandas as pd

import matplotlib.pyplot as plt

# Load the Iris dataset from a text file, Excel file, or from the web

# 1. Reading data from a text file (CSV format) -- this DataFrame is used for the analysis below

df_web = pd.read_csv('/content/iris.csv')

# 2. Reading data from an Excel file

# Uncomment if you have iris.xlsx locally:

# df_excel = pd.read_excel('path_to_your_file/iris.xlsx')

# 3. Reading data directly from a URL (web)

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# df_web = pd.read_csv(url, header=None, names=column_names)


# Displaying the first few rows to verify data load

print("First five rows of the Iris dataset:")

print(df_web.head())

# Descriptive Analytics on the Iris dataset

# 1. Basic information about the dataset

print("\nDataset Information:")

print(df_web.info())

# 2. Summary statistics

print("\nSummary Statistics:")

print(df_web.describe())

# 3. Checking for unique species

print("\nUnique Species in the dataset:")

print(df_web['species'].unique())

# 4. Count of each species

print("\nCount of each species:")

print(df_web['species'].value_counts())

# 5. Mean, median, and standard deviation of Sepal Length

print("\nMean Sepal Length:", df_web['sepal_length'].mean())

print("Median Sepal Length:", df_web['sepal_length'].median())

print("Standard Deviation of Sepal Length:", df_web['sepal_length'].std())

# 6. Correlation matrix to see relationships between variables

print("\nCorrelation Matrix:")

print(df_web.corr())

# 7. Grouping data by species and calculating mean values

print("\nMean values by species:")

print(df_web.groupby('species').mean())

# 8. Plotting a pairplot for visual analysis (requires an environment with plotting capability)

import seaborn as sns

sns.pairplot(df_web, hue="species")

plt.show()
Output:

First five rows of the Iris dataset:

sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

Dataset Information:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 150 entries, 0 to 149

Data columns (total 5 columns):

# Column Non-Null Count Dtype

0 sepal_length 150 non-null float64

1 sepal_width 150 non-null float64

2 petal_length 150 non-null float64

3 petal_width 150 non-null float64

4 species 150 non-null int64

dtypes: float64(4), int64(1)

memory usage: 6.0 KB

None

Summary Statistics:

sepal_length sepal_width petal_length petal_width species

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.057333 3.758000 1.199333 1.000000

std 0.828066 0.435866 1.765298 0.762238 0.819232

min 4.300000 2.000000 1.000000 0.100000 0.000000

25% 5.100000 2.800000 1.600000 0.300000 0.000000

50% 5.800000 3.000000 4.350000 1.300000 1.000000

75% 6.400000 3.300000 5.100000 1.800000 2.000000

max 7.900000 4.400000 6.900000 2.500000 2.000000


Unique Species in the dataset:

[0 1 2]

Count of each species:

species

0 50

1 50

2 50

Name: count, dtype: int64

Mean Sepal Length: 5.843333333333334

Median Sepal Length: 5.8

Standard Deviation of Sepal Length: 0.8280661279778629

Correlation Matrix:

sepal_length sepal_width petal_length petal_width species

sepal_length 1.000000 -0.117570 0.871754 0.817941 0.782561

sepal_width -0.117570 1.000000 -0.428440 -0.366126 -0.426658

petal_length 0.871754 -0.428440 1.000000 0.962865 0.949035

petal_width 0.817941 -0.366126 0.962865 1.000000 0.956547

species 0.782561 -0.426658 0.949035 0.956547 1.000000

Mean values by species:

sepal_length sepal_width petal_length petal_width

species

0 5.006 3.428 1.462 0.246

1 5.936 2.770 4.260 1.326

2 6.588 2.974 5.552 2.026


25. Make a three-dimensional plot with 50 randomly generated data points for x, y, and z. Set the
point colour to red and the point size to 50.

Ans:

import matplotlib.pyplot as plt

import numpy as np

from mpl_toolkits.mplot3d import Axes3D

# Generating random data for x, y, and z axes

np.random.seed(42)

x = np.random.rand(50)

y = np.random.rand(50)

z = np.random.rand(50)
# Creating a 3D plot

fig = plt.figure(figsize=(8, 6))

ax = fig.add_subplot(111, projection='3d')

# Plotting the points with specified color and size

ax.scatter(x, y, z, color='red', s=50)

# Adding labels for clarity

ax.set_xlabel('X Axis')

ax.set_ylabel('Y Axis')

ax.set_zlabel('Z Axis')

ax.set_title('3D Scatter Plot with Random Data Points')

plt.show()

Output:
27. Use the Pima Indians Diabetes data set for performing the following:

Apply Univariate analysis:

a. Frequency
b. Mean
c. Median
d. Mode
e. Variance
f. Standard Deviation
g. Skewness and Kurtosis

Ans:


import pandas as pd

import numpy as np

from scipy import stats

# Load the dataset

file_path = "/content/diabetes.csv" # Replace with your actual file path if different

data = pd.read_csv(file_path)

# Filter data for Outcome = 0 and Outcome = 1

data_0 = data[data['Outcome'] == 0]

data_1 = data[data['Outcome'] == 1]

# Dictionary to store the results

analysis_results = {

"Outcome = 0": {

"Pregnancies Frequency": data_0["Pregnancies"].value_counts(),

"Glucose Mean": np.mean(data_0["Glucose"]),

"BloodPressure Median": np.median(data_0["BloodPressure"]),

"SkinThickness Mode": stats.mode(data_0["SkinThickness"])[0],

"Insulin Variance": np.var(data_0["Insulin"]),

"BMI Standard Deviation": np.std(data_0["BMI"]),

"DiabetesPedigreeFunction Skewness": stats.skew(data_0["DiabetesPedigreeFunction"]),

"Age Kurtosis": stats.kurtosis(data_0["Age"])

},
"Outcome = 1": {

"Pregnancies Frequency": data_1["Pregnancies"].value_counts(),

"Glucose Mean": np.mean(data_1["Glucose"]),

"BloodPressure Median": np.median(data_1["BloodPressure"]),

"SkinThickness Mode": stats.mode(data_1["SkinThickness"])[0],

"Insulin Variance": np.var(data_1["Insulin"]),

"BMI Standard Deviation": np.std(data_1["BMI"]),

"DiabetesPedigreeFunction Skewness": stats.skew(data_1["DiabetesPedigreeFunction"]),

"Age Kurtosis": stats.kurtosis(data_1["Age"])

}
}

# Display the analysis for both outcomes

for outcome, stats_dict in analysis_results.items():

    print(f"\nStatistical Analysis for {outcome}:")

    for stat_name, value in stats_dict.items():

        print(f"{stat_name}: {value}")

output:

Statistical Analysis for Outcome = 0:

Pregnancies Frequency: Pregnancies

1 106

2 84

0 73

3 48

4 45

5 36

6 34

7 20

8 16

10 14

9 10
13 5

12 5

11 4

Name: count, dtype: int64

Glucose Mean: 109.98

BloodPressure Median: 70.0

SkinThickness Mode: 0

Insulin Variance: 9754.796735999955

BMI Standard Deviation: 7.682161307861215

DiabetesPedigreeFunction Skewness: 2.00021791479704

Age Kurtosis: 1.9318725201269862

Statistical Analysis for Outcome = 1:

Pregnancies Frequency: Pregnancies

0 38

1 29

3 27

7 25

4 23

8 22

5 21

2 19

9 18

6 16

10 10

11 7

13 5

12 4

14 2

15 1

17 1
Name: count, dtype: int64

Glucose Mean: 141.25746268656715

BloodPressure Median: 74.0

SkinThickness Mode: 0

Insulin Variance: 19162.902149699297

BMI Standard Deviation: 7.249404266473003

DiabetesPedigreeFunction Skewness: 1.7127179440927176

Age Kurtosis: -0.36378456012609117
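A more compact variant (a sketch, assuming the same /content/diabetes.csv layout) that uses pandas' built-in descriptive methods instead of separate NumPy/SciPy calls; note that pandas computes the sample (ddof=1) variance and standard deviation, so those two figures differ slightly from the np.var/np.std values above.

import pandas as pd

data = pd.read_csv("/content/diabetes.csv")

# One univariate summary table per column: mean, median, variance, std, skewness, kurtosis
summary = pd.DataFrame({
'mean': data.mean(),
'median': data.median(),
'variance': data.var(),
'std': data.std(),
'skewness': data.skew(),
'kurtosis': data.kurt()
})

print(summary)

# Frequency table for a single column, e.g. Pregnancies
print(data['Pregnancies'].value_counts())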

28. Use the Pima Indians Diabetes data set for performing the following:

Apply Bivariate analysis and Multiple Regression analysis

Ans:

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

import statsmodels.api as sm

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset

file_path = "/content/diabetes.csv" # Replace with your actual file path if different

data = pd.read_csv(file_path)

# Bivariate Analysis

# Plot pairwise relationships using Seaborn pairplot for a subset of variables

selected_columns = ["Pregnancies", "Glucose", "BloodPressure", "BMI", "Age", "Outcome"]

sns.pairplot(data[selected_columns], hue="Outcome", diag_kind="kde")

plt.suptitle("Bivariate Analysis - Pairwise Relationships", y=1.02)

plt.show()

# Correlation Heatmap to show correlations between all variables

plt.figure(figsize=(10, 8))

sns.heatmap(data.corr(), annot=True, cmap="coolwarm", fmt=".2f")


plt.title("Correlation Matrix for Diabetes Dataset")

plt.show()

# Multiple Regression Analysis

# Define features (X) and target (y)

X = data[["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI",


"DiabetesPedigreeFunction", "Age"]]

y = data["Outcome"]

# Add a constant term for intercept

X = sm.add_constant(X)

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the regression model using statsmodels for detailed statistics

model = sm.OLS(y_train, X_train).fit()

print("Multiple Regression Analysis Summary:")

print(model.summary())

# Predict on the test set

y_pred = model.predict(X_test)

# Evaluate model performance

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

print(f"\nModel Evaluation Metrics:\nMean Squared Error: {mse:.3f}\nR-squared: {r2:.3f}")

Output:
Multiple Regression Analysis Summary:

OLS Regression Results

==============================================================================

Dep. Variable: Outcome R-squared: 0.321

Model: OLS Adj. R-squared: 0.310

Method: Least Squares F-statistic: 31.15

Date: Sat, 26 Oct 2024 Prob (F-statistic): 5.04e-40

Time: 15:40:12 Log-Likelihood: -260.66

No. Observations: 537 AIC: 539.3

Df Residuals: 528 BIC: 577.9

Df Model: 8

Covariance Type: nonrobust

============================================================================================

coef std err t P>|t| [0.025 0.975]

--------------------------------------------------------------------------------------------

const -1.0014 0.105 -9.564 0.000 -1.207 -0.796

Pregnancies 0.0090 0.006 1.431 0.153 -0.003 0.021

Glucose 0.0057 0.001 9.352 0.000 0.005 0.007

BloodPressure -0.0017 0.001 -1.673 0.095 -0.004 0.000

SkinThickness -0.0003 0.001 -0.185 0.854 -0.003 0.002

Insulin -0.0001 0.000 -0.710 0.478 -0.000 0.000

BMI 0.0162 0.003 6.350 0.000 0.011 0.021

DiabetesPedigreeFunction 0.0729 0.052 1.407 0.160 -0.029 0.175

Age 0.0063 0.002 3.368 0.001 0.003 0.010

==============================================================================

Omnibus: 24.912 Durbin-Watson: 1.957

Prob(Omnibus): 0.000 Jarque-Bera (JB): 19.537

Skew: 0.376 Prob(JB): 5.72e-05

Kurtosis: 2.444 Cond. No. 1.14e+03

==============================================================================
Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 1.14e+03. This might indicate that there are

strong multicollinearity or other numerical problems.

Model Evaluation Metrics:

Mean Squared Error: 0.176

R-squared: 0.222

30. Use the diabetes data set from the UCI repository for performing the following:

Apply Bivariate analysis with Linear and Logistic regression modelling

Ans:

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

import statsmodels.api as sm

from sklearn.linear_model import LinearRegression, LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report

# Load the dataset

file_path = "/content/diabetes.csv" # Replace with your actual file path if different

data = pd.read_csv(file_path)

# Bivariate Analysis

# Correlation Heatmap

plt.figure(figsize=(10, 8))

sns.heatmap(data.corr(), annot=True, cmap="coolwarm", fmt=".2f")

plt.title("Bivariate Analysis - Correlation Matrix for Diabetes Dataset")


plt.show()

# Pairplot for relationships between features and outcome

sns.pairplot(data, hue="Outcome", diag_kind="kde")

plt.suptitle("Bivariate Analysis - Pairwise Relationships", y=1.02)

plt.show()

# Linear Regression - Using "Glucose" to predict "BMI" as an example

X_linear = data[["Glucose"]] # Independent variable

y_linear = data["BMI"] # Dependent variable

# Split data into training and testing sets

X_train_linear, X_test_linear, y_train_linear, y_test_linear = train_test_split(X_linear, y_linear,


test_size=0.3, random_state=42)

# Fit the linear regression model

linear_model = LinearRegression()

linear_model.fit(X_train_linear, y_train_linear)

# Predict on the test set

y_pred_linear = linear_model.predict(X_test_linear)

# Evaluate Linear Regression model

mse_linear = mean_squared_error(y_test_linear, y_pred_linear)

r2_linear = r2_score(y_test_linear, y_pred_linear)

print("Linear Regression Model:")

print(f"Mean Squared Error: {mse_linear:.3f}")

print(f"R-squared: {r2_linear:.3f}\n")

# Logistic Regression - Predicting "Outcome"


X_logistic = data[["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI",
"DiabetesPedigreeFunction", "Age"]] # Independent variables

y_logistic = data["Outcome"] # Dependent variable

# Split data into training and testing sets

X_train_logistic, X_test_logistic, y_train_logistic, y_test_logistic = train_test_split(X_logistic,


y_logistic, test_size=0.3, random_state=42)

# Fit the logistic regression model

logistic_model = LogisticRegression(max_iter=200)

logistic_model.fit(X_train_logistic, y_train_logistic)

# Predict on the test set

y_pred_logistic = logistic_model.predict(X_test_logistic)

# Evaluate Logistic Regression model

accuracy_logistic = accuracy_score(y_test_logistic, y_pred_logistic)

conf_matrix_logistic = confusion_matrix(y_test_logistic, y_pred_logistic)

class_report_logistic = classification_report(y_test_logistic, y_pred_logistic)

print("Logistic Regression Model:")

print(f"Accuracy: {accuracy_logistic:.3f}")

print("Confusion Matrix:")

print(conf_matrix_logistic)

print("\nClassification Report:")

print(class_report_logistic)
output:
Linear Regression Model:

Mean Squared Error: 67.161

R-squared: 0.061

Logistic Regression Model:

Accuracy: 0.736

Confusion Matrix:

[[120 31]

[ 30 50]]
Classification Report:

precision recall f1-score support

0 0.80 0.79 0.80 151

1 0.62 0.62 0.62 80

accuracy 0.74 231

macro avg 0.71 0.71 0.71 231

weighted avg 0.74 0.74 0.74 231

31. Use the diabetes data set from the UCI repository for performing the following:

Apply Bivariate analysis and Multiple Regression analysis

Ans:

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

import statsmodels.api as sm

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the dataset

file_path = "/content/diabetes.csv" # Update with your actual path if needed

data = pd.read_csv(file_path)

# Display dataset info

print("Dataset Info:")

print(data.info())

print("\nDataset Head:")

print(data.head())
# Multiple Regression Analysis - Logistic Regression for 'Outcome' Prediction

# Define predictors and target variable

X = data.drop(columns=["Outcome"]) # Independent variables

y = data["Outcome"] # Dependent variable

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit logistic regression model

logistic_model = LogisticRegression(max_iter=200)

logistic_model.fit(X_train, y_train)

# Predict on the test set

y_pred = logistic_model.predict(X_test)

# Model Evaluation

accuracy = accuracy_score(y_test, y_pred)

conf_matrix = confusion_matrix(y_test, y_pred)

class_report = classification_report(y_test, y_pred)

print("Logistic Regression Model Evaluation:")

print(f"Accuracy: {accuracy:.3f}")

print("Confusion Matrix:")

print(conf_matrix)

print("\nClassification Report:")

print(class_report)

# Logistic Regression Summary using StatsModels for detailed statistics

X_train_sm = sm.add_constant(X_train) # Adding constant for intercept in statsmodels

logit_model = sm.Logit(y_train, X_train_sm)


result = logit_model.fit()

print("\nLogistic Regression Analysis Summary (StatsModels):")

print(result.summary())

Output:

Dataset Info:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 768 entries, 0 to 767

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pregnancies 768 non-null int64

1 Glucose 768 non-null int64

2 BloodPressure 768 non-null int64

3 SkinThickness 768 non-null int64

4 Insulin 768 non-null int64

5 BMI 768 non-null float64

6 DiabetesPedigreeFunction 768 non-null float64

7 Age 768 non-null int64

8 Outcome 768 non-null int64

dtypes: float64(2), int64(7)

memory usage: 54.1 KB

None

Dataset Head:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

0 6 148 72 35 0 33.6

1 1 85 66 29 0 26.6

2 8 183 64 0 0 23.3

3 1 89 66 23 94 28.1

4 0 137 40 35 168 43.1


DiabetesPedigreeFunction Age Outcome

0 0.627 50 1

1 0.351 31 0

2 0.672 32 1

3 0.167 21 0

4 2.288 33 1

Logistic Regression Model Evaluation:

Accuracy: 0.736

Confusion Matrix:

[[120 31]

[ 30 50]]

Classification Report:

precision recall f1-score support

0 0.80 0.79 0.80 151

1 0.62 0.62 0.62 80

accuracy 0.74 231

macro avg 0.71 0.71 0.71 231

weighted avg 0.74 0.74 0.74 231

Optimization terminated successfully.

Current function value: 0.459388

Iterations 6
Logistic Regression Analysis Summary (StatsModels):

Logit Regression Results

==============================================================================

Dep. Variable: Outcome No. Observations: 537

Model: Logit Df Residuals: 528

Method: MLE Df Model: 8

Date: Sat, 26 Oct 2024 Pseudo R-squ.: 0.2905

Time: 17:33:50 Log-Likelihood: -246.69

converged: True LL-Null: -347.71

Covariance Type: nonrobust LLR p-value: 2.378e-39

============================================================================================

coef std err z P>|z| [0.025 0.975]

--------------------------------------------------------------------------------------------

const -9.4451 0.915 -10.321 0.000 -11.239 -7.651

Pregnancies 0.0580 0.039 1.477 0.140 -0.019 0.135

Glucose 0.0359 0.005 7.714 0.000 0.027 0.045

BloodPressure -0.0108 0.007 -1.584 0.113 -0.024 0.003

SkinThickness -0.0015 0.008 -0.179 0.858 -0.018 0.015

Insulin -0.0010 0.001 -0.884 0.377 -0.003 0.001

BMI 0.1090 0.019 5.740 0.000 0.072 0.146

DiabetesPedigreeFunction 0.4215 0.357 1.182 0.237 -0.278 1.120

Age 0.0359 0.012 3.106 0.002 0.013 0.059


32. Apply and explore various plotting functions on UCI data set for performing the following:

i. Normal Value
ii. Density and contour plots
iii. Three-dimensional plotting

Ans:

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from scipy.stats import norm

# Load the dataset

file_path = "/content/diabetes.csv" # Update path as needed

data = pd.read_csv(file_path)

# 1. Normal Value Distribution Plot

# Let's take 'Glucose' and 'BMI' as examples

plt.figure(figsize=(14, 6))

# Plot for Glucose

plt.subplot(1, 2, 1)

sns.histplot(data['Glucose'], kde=True, stat="density", line_kws={'linestyle':'--'}, color="skyblue")

x_vals = np.linspace(data['Glucose'].min(), data['Glucose'].max(), 100)

plt.plot(x_vals, norm.pdf(x_vals, data['Glucose'].mean(), data['Glucose'].std()), color="red",


linestyle="--")

plt.title("Normal Distribution of Glucose")

plt.xlabel("Glucose")

plt.ylabel("Density")

# Plot for BMI

plt.subplot(1, 2, 2)
sns.histplot(data['BMI'], kde=True, stat="density", line_kws={'linestyle':'--'}, color="orange")

x_vals = np.linspace(data['BMI'].min(), data['BMI'].max(), 100)

plt.plot(x_vals, norm.pdf(x_vals, data['BMI'].mean(), data['BMI'].std()), color="red", linestyle="--")

plt.title("Normal Distribution of BMI")

plt.xlabel("BMI")

plt.ylabel("Density")

plt.show()

# 2. Density and Contour Plots

# Using Glucose vs. Insulin for example

plt.figure(figsize=(8, 6))

sns.kdeplot(x=data['Glucose'], y=data['Insulin'], cmap="coolwarm", fill=True, thresh=0.05)

plt.title("Density and Contour Plot of Glucose vs Insulin")

plt.xlabel("Glucose")

plt.ylabel("Insulin")

plt.show()

# 3. Three-Dimensional Plotting

# 3D plot of Age, BMI, and Glucose colored by Outcome

fig = plt.figure(figsize=(10, 7))

ax = fig.add_subplot(111, projection='3d')

# Scatter plot

sc = ax.scatter(data['Age'], data['BMI'], data['Glucose'], c=data['Outcome'], cmap="viridis", s=50,


alpha=0.7)

ax.set_xlabel("Age")

ax.set_ylabel("BMI")

ax.set_zlabel("Glucose")

ax.set_title("3D Plot of Age, BMI, and Glucose")

plt.colorbar(sc, label="Outcome")

plt.show()
Output:
33. Apply and explore various plotting functions on UCI data set for performing the following:

i. Correlation and scatter plots


ii. Histograms
iii. Three-dimensional plotting

Ans:

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

import numpy as np

# Load the dataset

file_path = "/content/diabetes.csv"
data = pd.read_csv(file_path)

# i. Correlation and Scatter Plots

# Correlation Heatmap

plt.figure(figsize=(10, 8))

sns.heatmap(data.corr(), annot=True, cmap="coolwarm", fmt=".2f")

plt.title("Correlation Matrix")

plt.show()

# Pairplot for Scatter Plots (pairwise relationships between variables)

sns.pairplot(data, hue="Outcome", diag_kind="kde")

plt.suptitle("Scatter Plots for Pairwise Relationships", y=1.02)

plt.show()

# ii. Histograms

# Plot histograms for continuous variables

data.hist(bins=15, figsize=(15, 10), color="skyblue", edgecolor="black")

plt.suptitle("Histograms of Diabetes Dataset Features", y=0.95)

plt.show()

# iii. Three-Dimensional Plotting

# 3D plot of Age, BMI, and Glucose colored by Outcome

fig = plt.figure(figsize=(10, 7))

ax = fig.add_subplot(111, projection='3d')

# Scatter plot

sc = ax.scatter(data['Age'], data['BMI'], data['Glucose'], c=data['Outcome'], cmap="viridis", s=50,


alpha=0.7)

ax.set_xlabel("Age")

ax.set_ylabel("BMI")

ax.set_zlabel("Glucose")
ax.set_title("3D Plot of Age, BMI, and Glucose")

plt.colorbar(sc, label="Outcome")

plt.show()

output:
34. Apply and explore various plotting functions on Pima Indians Diabetes data set for performing
the following:

i. Normal Value
ii. Density and contour plots
iii. Three-dimensional plotting

Ans:

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from scipy.stats import norm

# Load the dataset

file_path = "/content/diabetes.csv"

data = pd.read_csv(file_path)
# i. Normal Value Distribution Plot

# Plot normal distribution for 'Glucose' and 'BMI' as examples

plt.figure(figsize=(14, 6))

# Plot for Glucose

plt.subplot(1, 2, 1)

sns.histplot(data['Glucose'], kde=True, stat="density", line_kws={'linestyle':'--'}, color="skyblue")

x_vals = np.linspace(data['Glucose'].min(), data['Glucose'].max(), 100)

plt.plot(x_vals, norm.pdf(x_vals, data['Glucose'].mean(), data['Glucose'].std()), color="red",


linestyle="--")

plt.title("Normal Distribution of Glucose")

plt.xlabel("Glucose")

plt.ylabel("Density")

# Plot for BMI

plt.subplot(1, 2, 2)

sns.histplot(data['BMI'], kde=True, stat="density", line_kws={'linestyle':'--'}, color="orange")

x_vals = np.linspace(data['BMI'].min(), data['BMI'].max(), 100)

plt.plot(x_vals, norm.pdf(x_vals, data['BMI'].mean(), data['BMI'].std()), color="red", linestyle="--")

plt.title("Normal Distribution of BMI")

plt.xlabel("BMI")

plt.ylabel("Density")

plt.show()

# ii. Density and Contour Plots

# Using Glucose vs. Insulin for example

plt.figure(figsize=(8, 6))

sns.kdeplot(x=data['Glucose'], y=data['Insulin'], cmap="coolwarm", fill=True, thresh=0.05)

plt.title("Density and Contour Plot of Glucose vs Insulin")

plt.xlabel("Glucose")

plt.ylabel("Insulin")
plt.show()

# iii. Three-Dimensional Plotting

# 3D plot of Age, BMI, and Glucose colored by Outcome

fig = plt.figure(figsize=(10, 7))

ax = fig.add_subplot(111, projection='3d')

# Scatter plot

sc = ax.scatter(data['Age'], data['BMI'], data['Glucose'], c=data['Outcome'], cmap="viridis", s=50,


alpha=0.7)

ax.set_xlabel("Age")

ax.set_ylabel("BMI")

ax.set_zlabel("Glucose")

ax.set_title("3D Plot of Age, BMI, and Glucose")

plt.colorbar(sc, label="Outcome")

plt.show()

output:
35. Apply and explore various plotting functions on Pima Indians Diabetes data set for performing
the following:

i. Correlation and scatter plots


ii. Histograms
iii. Three-dimensional plotting

36. Apply and explore various plotting functions on UCI data sets

37. Compare the results of the Univariate and Bivariate analysis for the UCI diabetes data set

38. Use the diabetes data set from UCI, perform Univariate analysis

39. Use the diabetes data set from Pima Indians Diabetes, Perform Bivariate analysis

40. Perform Multiple Regression analysis on your own dataset (For example, Car dataset with
Information Company Name, Model, Volume, Weight, CO2) with more than one independent value
to predict a value based on two or more variables.
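Exercises 35 to 39 can be answered by reusing the plotting and analysis code from the answers to questions 27, 28 and 32 to 34 above. For exercise 40, a minimal sketch of multiple regression with two independent variables; the car figures below are illustrative placeholders, not measured data.

import pandas as pd

from sklearn.linear_model import LinearRegression

# Hypothetical car dataset: Company, Model, Volume (cm3), Weight (kg), CO2 (g/km)
cars = pd.DataFrame({
'Company': ['Toyota', 'Ford', 'BMW', 'Honda', 'Audi', 'Skoda'],
'Model': ['Aygo', 'Fiesta', 'Mini', 'Civic', 'A4', 'Octavia'],
'Volume': [1000, 1100, 1500, 1600, 2000, 1600],
'Weight': [790, 1100, 1140, 1250, 1400, 1300],
'CO2': [99, 99, 105, 104, 115, 109]
})

# More than one independent variable predicting a single dependent value
X = cars[['Volume', 'Weight']]

y = cars['CO2']

model = LinearRegression()

model.fit(X, y)

print("Coefficients:", model.coef_)

print("Intercept:", model.intercept_)

# Predict CO2 for a car with a 1300 cm3 engine weighing 1200 kg
print("Predicted CO2:", model.predict(pd.DataFrame({'Volume': [1300], 'Weight': [1200]}))[0])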
