Machine Intelligence
Semester 2, Part 1
Year 2023-2024
By
Mr. RAJIGARE PRATHAMESH ARUN
Practical 8: Decision Tree
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

# Generate a synthetic dataset of n points with two features and a linear target
def generate_dataset(n):
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)

x, y = generate_dataset(200)

# Plot the generated points in 3D
mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5)
ax.legend()
ax.view_init(45, 0)
plt.show()
Output:
# Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Output:
Confusion Matrix:
[[36 13]
 [11 29]]

Classification Report:
              precision    recall  f1-score   support

    accuracy                           0.73        89
   macro avg       0.73      0.73      0.73        89
weighted avg       0.73      0.73      0.73        89
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

irisData = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    irisData.data, irisData.target, test_size=0.2, random_state=42)

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Fit a k-NN classifier for each value of k and record train/test accuracy
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.plot(neighbors, test_accuracy, label='Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset Accuracy')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()
Example of how bootstrap samples are created and used to estimate a statistic
of interest.
Let's say we have a small dataset of 5 observations:
Original Data: [3, 4, 5, 6, 7]
Create bootstrap samples by resampling with replacement:
We'll create 3 bootstrap samples of size 5 by randomly drawing observations from
the original data with replacement.
Each bootstrap sample will have the same size as the original dataset.
Bootstrap Sample 1: [5, 6, 3, 4, 7]
Bootstrap Sample 2: [4, 3, 6, 4, 6]
Bootstrap Sample 3: [7, 5, 7, 3, 4]
Calculate the statistic of interest (median) for each bootstrap sample:
Bootstrap Sample 1 median: 5
Bootstrap Sample 2 median: 4
Bootstrap Sample 3 median: 5
Repeat steps 1 and 2 many times (e.g., 10,000 times):
By repeating the process of creating bootstrap samples and calculating the median,
we can build an empirical sampling distribution of the median.
Use the empirical sampling distribution to calculate confidence intervals or perform
hypothesis tests:
For example, if we want to construct a 95% confidence interval for the median, we
can find the 2.5th and 97.5th percentiles of the empirical sampling distribution of the
median.
Let's say the 2.5th percentile is 4, and the 97.5th percentile is 6.
Then, the 95% confidence interval for the median would be [4, 6].
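The same procedure can be written directly in Python; the following is a minimal sketch with NumPy, using the small dataset above. Because resampling is random, the exact bootstrap samples and percentiles will vary from run to run.

import numpy as np

data = np.array([3, 4, 5, 6, 7])
n_bootstrap = 10000

# Draw bootstrap samples (resampling with replacement) and record each median
medians = np.empty(n_bootstrap)
for b in range(n_bootstrap):
    sample = np.random.choice(data, size=len(data), replace=True)
    medians[b] = np.median(sample)

# 95% confidence interval from the 2.5th and 97.5th percentiles
lower, upper = np.percentile(medians, [2.5, 97.5])
print("95% CI for the median: [{}, {}]".format(lower, upper))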
Let's say we have a small sample of data representing the heights (in inches) of 10
individuals:
Heights = [65.2, 67.1, 68.5, 69.3, 70.0, 71.2, 72.4, 73.1, 74.5, 75.8]
We want to estimate the 95% confidence interval for the mean height in the
population using bootstrapping.
Here are the steps we would follow:
Calculate the sample mean from the original data:
Sample mean = (65.2 + 67.1 + 68.5 + 69.3 + 70.0 + 71.2 + 72.4 + 73.1 + 74.5 +
75.8) / 10 = 70.71 inches
Create a large number of bootstrap samples from the original data by resampling
with replacement. For example, let's create 10,000 bootstrap samples, each of size
10.
For each bootstrap sample, calculate the mean height.
After computing the means for all 10,000 bootstrap samples, we now have an
empirical bootstrap sampling distribution of the mean.
From this empirical bootstrap sampling distribution, we can determine the 95%
confidence interval by finding the 2.5th and 97.5th percentiles of the distribution.
Let's say the 2.5th percentile is 69.8 inches, and the 97.5th percentile is 71.6 inches.
Then, the 95% confidence interval for the mean height is [69.8, 71.6] inches.
This confidence interval means that if we were to repeat the process of taking a
sample of size 10 and constructing a bootstrap confidence interval many times, 95%
of those intervals would contain the true population mean height.
The key advantage of bootstrapping in this example is that it does not require any
assumptions about the underlying distribution of heights in the population. It relies
solely on the information contained in the original sample data.
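A corresponding sketch for the height example, again assuming 10,000 bootstrap samples; the computed percentiles will differ slightly from run to run and need not match the illustrative values above exactly.

import numpy as np

heights = np.array([65.2, 67.1, 68.5, 69.3, 70.0, 71.2, 72.4, 73.1, 74.5, 75.8])
print("Sample mean:", heights.mean())  # 70.71 inches

# Bootstrap sampling distribution of the mean
n_bootstrap = 10000
boot_means = np.array([
    np.random.choice(heights, size=len(heights), replace=True).mean()
    for _ in range(n_bootstrap)
])

# 95% percentile confidence interval for the mean height
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("95% CI for the mean height: [{:.1f}, {:.1f}] inches".format(lower, upper))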
Practical 6
For a given data set, split the data into training and testing sets and fit the following on the training set: (i) Linear model using least squares (ii) Ridge regression model (iii) Lasso model (iv) PCR model (v) PLS model
i) Linear model using least squares
from sklearn.linear_model import LinearRegression

# Model
lr = LinearRegression()
# Fit model on the training set
lr.fit(X_train, y_train)
# Predict on the test set
prediction = lr.predict(X_test)
# Actual test values
actual = y_test
ii) Ridge regression model
from sklearn.linear_model import Ridge
ridgeReg = Ridge(alpha=0.05)  # alpha chosen for illustration
ridgeReg.fit(X_train, y_train)
train_score_ridge = ridgeReg.score(X_train, y_train)
test_score_ridge = ridgeReg.score(X_test, y_test)
print("\nRidge Model............................................\n")
print("The train score for ridge model is {}".format(train_score_ridge))
print("The test score for ridge model is {}".format(test_score_ridge))
Output:
iii) Lasso model
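The Lasso fit itself is not reproduced in this extract; below is a minimal sketch, assuming the same X_train/y_train split used above and an illustrative alpha value.

from sklearn.linear_model import Lasso

lassoReg = Lasso(alpha=0.05)  # alpha chosen for illustration
lassoReg.fit(X_train, y_train)

print("\nLasso Model............................................\n")
print("The train score for lasso model is {}".format(lassoReg.score(X_train, y_train)))
print("The test score for lasso model is {}".format(lassoReg.score(X_test, y_test)))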
Output:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression, PLSSVD
from sklearn.metrics import mean_squared_error
df = pd.read_csv('Hitters.csv').dropna().drop('Player', axis=1)
df.info()
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df.Salary
# Drop the response variable (Salary) and the columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
pca = PCA()
X_reduced = pca.fit_transform(scale(X))
pd.DataFrame(pca.components_.T).loc[:4,:5]
# 10-fold CV, with shuffle
n = len(X_reduced)
kf_10 = model_selection.KFold( n_splits=10, shuffle=True, random_state=1)
regr = LinearRegression()
mse = []
# Calculate MSE with only the intercept (no principal components in regression)
score = -1*model_selection.cross_val_score(regr, np.ones((n,1)), y.ravel(), cv=kf_10,
                                            scoring='neg_mean_squared_error').mean()
mse.append(score)

# Calculate MSE using CV for the 19 principal components, adding one component at a time
for i in np.arange(1, 20):
    score = -1*model_selection.cross_val_score(regr, X_reduced[:,:i], y.ravel(),
                                               cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(score)
# Plot results
plt.plot(mse, '-v')
plt.xlabel('Number of principal components in regression')
plt.ylabel('MSE')
plt.title('Salary')
plt.xlim(xmin=-1);
pca2 = PCA()

# Split into training and test sets (50/50 split assumed) and compute the principal
# components of the scaled training data, since the variables below are not defined elsewhere
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=1)
X_reduced_train = pca2.fit_transform(scale(X_train))
n = len(X_reduced_train)

mse = []

# Calculate MSE with only the intercept (no principal components in regression)
score = -1*model_selection.cross_val_score(regr, np.ones((n,1)), y_train.ravel(),
                                           cv=kf_10, scoring='neg_mean_squared_error').mean()
mse.append(score)

# Calculate MSE using CV for the 19 principal components, adding one component at a time
for i in np.arange(1, 20):
    score = -1*model_selection.cross_val_score(regr, X_reduced_train[:,:i],
                                               y_train.ravel(), cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(score)
plt.plot(np.array(mse), '-v')
plt.xlabel('Number of principal components in regression')
plt.ylabel('MSE')
plt.title('Salary')
plt.xlim(xmin=-1);
Output:
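The PLS model (v) is not shown above even though PLSRegression is imported; the following is a minimal sketch, assuming the same feature matrix X, response y, and 10-fold CV object kf_10 defined earlier.

# Cross-validated MSE for PLS with 1 to 19 components
mse_pls = []
for i in np.arange(1, 20):
    pls = PLSRegression(n_components=i)
    score = -1*model_selection.cross_val_score(pls, scale(X), y, cv=kf_10,
                                               scoring='neg_mean_squared_error').mean()
    mse_pls.append(score)

plt.plot(np.arange(1, 20), np.array(mse_pls), '-v')
plt.xlabel('Number of PLS components in regression')
plt.ylabel('MSE')
plt.title('Salary')
plt.show()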
Practical 7:
For a given data set, perform polynomial regression and make a plot of the resulting polynomial fit to the data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
datas = pd.read_csv('data.csv')
datas
# Features and the target variable
X = datas.iloc[:, 1:2].values
y = datas.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, y)
lin2 = LinearRegression()
lin2.fit(X_poly, y)
# Visualising the Linear Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin.predict(X), color='red')
plt.title('Linear Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()
Output
Output:
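The practical also asks for a plot of the polynomial fit itself; below is a minimal sketch that reuses lin2 and poly from the code above.

# Visualising the Polynomial Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin2.predict(poly.fit_transform(X)), color='red')
plt.title('Polynomial Regression')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.show()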
Practical 9
For a given data set, split the dataset into training and testing sets. Fit the following models on the training set and evaluate their performance on the test set: (i) Boosting and Bagging (ii) Random Forest
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

# Load the Pima Indians Diabetes dataset and inspect it
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
df.head()
df.info()
df.isnull().sum()
pd.set_option('display.float_format', '{:.2f}'.format)
df.describe()
categorical_val = []
continous_val = []
for column in df.columns:
    # print('==============================')
    # print(f"{column} : {df[column].unique()}")
    if len(df[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)
df.columns
feature_columns = [col for col in df.columns if col != 'Outcome']  # assumed: every column except the target
X = df[feature_columns]
y = df.Outcome
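The boosting and bagging fits for part (i) are not reproduced in this extract; below is a minimal sketch, assuming the X and y defined above.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Boosting: AdaBoost over decision stumps (the default base estimator)
boost = AdaBoostClassifier(n_estimators=100, random_state=42)
boost.fit(X_train, y_train)
print("Boosting test accuracy:", accuracy_score(y_test, boost.predict(X_test)))

# Bagging: bootstrap aggregation of decision trees (the default base estimator)
bag = BaggingClassifier(n_estimators=100, random_state=42)
bag.fit(X_train, y_train)
print("Bagging test accuracy:", accuracy_score(y_test, bag.predict(X_test)))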
ii) Random Forest
# Data Processing
import pandas as pd
import numpy as np
# Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint
# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz
# Load the bank-marketing dataset (file name assumed) and encode the binary columns
bank_data = pd.read_csv("bank.csv")
bank_data['default'] = bank_data['default'].map({'no':0, 'yes':1, 'unknown':0})
bank_data['y'] = bank_data['y'].map({'no':0, 'yes':1})
# Split the data into features (X) and target (y)
X = bank_data.drop('y', axis=1)
y = bank_data['y']
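The split and random-forest fit are not shown in this extract, although the tree-visualisation loop below relies on them; the following is a minimal sketch using the imports above, assuming any remaining categorical columns in X have already been numerically encoded.

# Split, fit a random forest, and evaluate on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))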
# Export and display the first three trees from the forest (limited to depth 2)
for i in range(3):
    tree = rf.estimators_[i]
    dot_data = export_graphviz(tree,
                               feature_names=X_train.columns,
                               filled=True,
                               max_depth=2,
                               impurity=False,
                               proportion=True)
    graph = graphviz.Source(dot_data)
    display(graph)
Output: