AAM PR QB

The document contains a series of Python programming tasks focused on data analysis and machine learning using various datasets, including Iris, IMDb Movies, Titanic, Wine Quality, Mobile Price, SMS Spam, and Mushroom datasets. Each task involves loading the dataset, performing exploratory data analysis, data cleaning, feature selection, and applying different machine learning algorithms such as Naive Bayes, Decision Trees, KNN, and Neural Networks. The document serves as a comprehensive guide for implementing these algorithms and techniques in Python.


#1 Write a Python program to study the Iris dataset.

# Import necessary libraries
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame with named feature columns
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Display first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

# Basic information
print("\n Dataset Information:")
print(df.info())

# Statistical summary
print("\n Statistical Summary:")
print(df.describe())

# Check class distribution
print("\nClass Distribution:")
print(df['species'].value_counts())

# Pairplot to visualize relationships
sns.pairplot(df, hue='species')
plt.suptitle('Iris Dataset Pair Plot', y=1.02)
plt.show()
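
A correlation heatmap is a common companion to the pair plot in EDA; here is a minimal sketch reusing the DataFrame built above (the 'species' label column is dropped first):

plt.figure(figsize=(8, 6))
sns.heatmap(df.drop(columns='species').corr(), annot=True, cmap='coolwarm')
plt.title('Iris Feature Correlations')
plt.show()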

#2 Write a Python program to study the IMDb Movies dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the IMDb dataset
df = pd.read_csv("imdb_top_1000.csv")

# Show first 5 rows
print("🔹 First 5 rows:")
print(df.head())

# Show last 5 rows
print("🔹 Last 5 rows:")
print(df.tail())

# Dataset information
print("\n🔹 Dataset Info:")
print(df.info())

# Summary statistics
print("\n🔹 Statistical Summary:")
print(df.describe(include='all'))

# Check missing values
print("\n🔹 Missing Values:")
print(df.isnull().sum())

# Visualization: Top 10 Movies by Rating
top_rated = df.sort_values('IMDB_Rating', ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x='IMDB_Rating', y='Series_Title', data=top_rated)
plt.title('Top 10 IMDb Rated Movies')
plt.xlabel('IMDb Rating')
plt.ylabel('Movie Title')
plt.tight_layout()
plt.show()

#3 Write a Python program to study the Titanic dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
df = pd.read_csv("titanic.csv")

# Show first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

# Dataset structure and data types
print("\n Dataset Info:")
print(df.info())

# Summary statistics
print("\n Summary Statistics:")
print(df.describe())
# Missing values
print("\n Missing Values in Each Column:")
print(df.isnull().sum())

# Survival counts (this CSV names the survival column '2urvived')
print("\nSurvival Counts:")
print(df['2urvived'].value_counts())

# Plot 1: Survival Count
sns.countplot(data=df, x='2urvived')
plt.title('Survival Count')
plt.xlabel('Survived')
plt.ylabel('Number of Passengers')
plt.show()

# Plot 2: Survival by Passenger Class
sns.countplot(data=df, x='Pclass', hue='2urvived')
plt.title('Survival by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.show()

"""4. Perform followings operations on Wine quality


dataset.
a) Write program to read dataset (Text, CSV, JSON, XML)
b) Which of the attributes are numeric and which are
categorical?
c) Performing Data Cleaning, Handling Missing Data, and
Removing Null data.
d) Rescaling Data v. Encoding Data
e) Feature Selection"""
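
Part (a) asks for several file formats, but the program below reads only CSV. A minimal sketch of the other readers, assuming hypothetical files 'wine.json', 'wine.xml', and 'wine.txt' with the same columns (pd.read_xml needs pandas 1.3+ and lxml):

import pandas as pd

df_csv = pd.read_csv('WineQT.csv')          # CSV
df_json = pd.read_json('wine.json')         # JSON (hypothetical file)
df_xml = pd.read_xml('wine.xml')            # XML (hypothetical file)
df_txt = pd.read_csv('wine.txt', sep='\t')  # tab-separated text (hypothetical file)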

import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectFromModel

# Read the dataset
data = pd.read_csv('WineQT.csv')

# Identify numeric and categorical attributes
numeric_columns = data.select_dtypes(include='number').columns
categorical_columns = data.select_dtypes(exclude='number').columns

# Handle missing data
data = data.dropna()  # Drop rows with missing values

# Rescale numeric data
scaler = StandardScaler()
data[numeric_columns] = scaler.fit_transform(data[numeric_columns])

# Encode categorical data (if any)
encoder = LabelEncoder()
for column in categorical_columns:
    data[column] = encoder.fit_transform(data[column])

# Feature selection using Linear Regression
X = data.drop('quality', axis=1)
y = data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Feature selection using model coefficients
selector = SelectFromModel(lr, threshold="mean", max_features=5)
selector.fit(X_train, y_train)

# Print selected features
selected_features = X.columns[selector.get_support()]
print("Selected Features using Linear Regression:", selected_features)

"""5 Perform followings operations on Mobile Price dataset.


f) Write program to read dataset (Text, CSV, JSON, XML)
g) Which of the attributes are numeric and which are
categorical?
h) Performing Data Cleaning, Handling Missing Data, and
Removing Null data.
i) Rescaling Data v. Encoding Data
j) Feature Selection"""

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectFromModel

# Read the dataset
data = pd.read_csv('mobile.csv')
print(data.head())

# Identify numeric and categorical attributes
numeric_columns = data.select_dtypes(include='number').columns
categorical_columns = data.select_dtypes(exclude='number').columns

# Handle missing data
data = data.dropna()  # Drop rows with missing values

# Rescale numeric data
scaler = StandardScaler()
data[numeric_columns] = scaler.fit_transform(data[numeric_columns])

# Encode categorical data (if any)
encoder = LabelEncoder()
for column in categorical_columns:
    data[column] = encoder.fit_transform(data[column])

# Feature selection using Linear Regression
X = data.drop('Price ($)', axis=1)
y = data['Price ($)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

# Feature selection using model coefficients
selector = SelectFromModel(lr, threshold="mean", max_features=5)
selector.fit(X_train, y_train)

# Print selected features
selected_features = X.columns[selector.get_support()]
print("Selected Features using Linear Regression:", selected_features)

#6 Write a Python program to apply the Naive Bayesian algorithm for SMS spam classification.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Load dataset
data = pd.read_csv("spam.csv", encoding='ISO-8859-1')
print(data.head())

# Step 2: Separate the message text (X, column 'v2') and the labels (y, column 'v1')
X = data['v2']
y = data['v1']

# Step 3: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Convert text data to numerical features using CountVectorizer (Bag of Words)
vectorizer = CountVectorizer(stop_words='english')
X_train_counts = vectorizer.fit_transform(X_train).toarray()  # dense array, as GaussianNB requires
X_test_counts = vectorizer.transform(X_test).toarray()

# Step 5: Train the Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train_counts, y_train)

# Step 6: Make predictions
y_pred = model.predict(X_test_counts)

# Step 7: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

#7 Write a Python program to apply the Naive Bayesian algorithm for classification of the Iris dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data    # Features
y = iris.target  # Labels

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = gnb.predict(X_test)
# Step 5: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

#8 Write a Python program to apply the Naive Bayesian algorithm for classification on the Mushroom dataset (to check whether a mushroom is poisonous or not).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

# Step 1: Load the Mushroom dataset
df = pd.read_csv("mushrooms.csv")

# Step 2: Preprocessing - Convert categorical data to numeric
# The dataset contains only categorical features, so each column is encoded with LabelEncoder
label_encoder = LabelEncoder()

# Apply LabelEncoder to each column
for column in df.columns:
    df[column] = label_encoder.fit_transform(df[column])

# Step 3: Split the data into features (X) and target (y)
X = df.drop(columns=['class']) # Features (all columns except 'class')
y = df['class'] # Target (the 'class' column contains poisonous or edible)

# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Initialize and train the Naive Bayes classifier (GaussianNB)
naive_bayes_model = GaussianNB()
naive_bayes_model.fit(X_train, y_train)

# Step 6: Make predictions on the test set
y_pred = naive_bayes_model.predict(X_test)

# Step 7: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy of Naive Bayes Classifier:", accuracy)
print("Confusion Matrix:")
print(conf_matrix)
#9 Write a Python program to implement a decision tree for classification using the Titanic dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Titanic dataset
df = pd.read_csv("titanic.csv")
print(df.head())

print(df.describe())

df = df.dropna()
df.rename(columns={'2urvived': 'survived'}, inplace=True)

# Features (X) and target (y)
X = df.drop('survived', axis=1)
y = df['survived']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Decision Tree Classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#10 Write a Python program to implement a decision tree for classification using the Wine Quality dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Wine Quality dataset
df = pd.read_csv("WineQT.csv")
print(df.head())

# Step 2: Preprocess the data
# In this dataset, all features are numerical, so we only need to split into X and y

# Features (X) and target (y)
X = df.drop('quality', axis=1) # All columns except 'quality' are features
y = df['quality'] # 'quality' is the target column

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Initialize and train the Decision Tree classifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 5: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 6: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Decision Tree Classifier:", accuracy)

#11 Implement a supervised machine learning algorithm (KNN) in Python for mobile price prediction.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

# Load dataset from CSV
df = pd.read_csv("Mobile phone price.csv")
print(df.head())
df['Price ($)'] = pd.to_numeric(df['Price ($)'], errors='coerce')
df.dropna(subset=['Price ($)'], inplace=True)

# Label Encoding for 'Brand' and 'Model'
le = LabelEncoder()
df['Brand'] = le.fit_transform(df['Brand'])
df['Model'] = le.fit_transform(df['Model'])

# Features and Target
X = df[['Brand', 'Model', 'Battery Capacity (mAh)']]
y = df['Price ($)']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the KNN model (KNeighborsRegressor for regression)
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict the prices on the test set
y_pred = knn.predict(X_test)

# Calculate Mean Absolute Error for accuracy
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# Plotting the actual vs predicted prices
plt.scatter(y_test, y_pred, color='blue')
plt.xlabel("Actual Price ($)")
plt.ylabel("Predicted Price ($)")
plt.title("Actual vs Predicted Mobile Price")
plt.show()
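
The choice n_neighbors=3 is arbitrary; a quick sketch for comparing a few values of k on the same split (reuses X_train, X_test, y_train, y_test from above):

for k in [1, 3, 5, 7, 9]:
    model_k = KNeighborsRegressor(n_neighbors=k)
    model_k.fit(X_train, y_train)
    print(f"k={k}  MAE={mean_absolute_error(y_test, model_k.predict(X_test)):.2f}")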

#12 Implement/simulate backpropagation and a feed-forward neural network. (Create your own dataset.)

import numpy as np

# Sigmoid activation and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# Input dataset (4 samples, 2 features)
X = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1]
])
# Expected output (4 samples, 1 output): the XOR of the two inputs
y = np.array([[0], [1], [1], [0]])
# Seed for consistent results
np.random.seed(1)
# Initialize weights and biases
input_layer_neurons = 2
hidden_layer_neurons = 2
output_neurons = 1

# Weights
wh = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons)) # 2x2
bh = np.random.uniform(size=(1, hidden_layer_neurons)) # 1x2
wo = np.random.uniform(size=(hidden_layer_neurons, output_neurons)) # 2x1
bo = np.random.uniform(size=(1, output_neurons)) # 1x1
# Training loop
for epoch in range(10000):
    # ---- Feedforward ----
    hidden_input = np.dot(X, wh) + bh
    hidden_output = sigmoid(hidden_input)

    final_input = np.dot(hidden_output, wo) + bo
    predicted_output = sigmoid(final_input)

    # ---- Backpropagation ----
    error = y - predicted_output
    d_output = error * sigmoid_derivative(predicted_output)

    error_hidden = d_output.dot(wo.T)
    d_hidden = error_hidden * sigmoid_derivative(hidden_output)

    # ---- Updating weights and biases (learning rate 0.1) ----
    wo += hidden_output.T.dot(d_output) * 0.1
    bo += np.sum(d_output, axis=0, keepdims=True) * 0.1
    wh += X.T.dot(d_hidden) * 0.1
    bh += np.sum(d_hidden, axis=0, keepdims=True) * 0.1

# Final Output
print("Final Output after training:")
print(predicted_output)
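
The targets encode the XOR function, so the trained outputs should approach [0, 1, 1, 0]. To read them as class labels, threshold at 0.5:

print("Predicted classes:", np.round(predicted_output).ravel())
print("Expected classes: ", y.ravel())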

#13 Implement an unsupervised machine learning algorithm (Clustering: K-Means) in Python on a Spotify song dataset to cluster the data.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Load dataset
data = pd.read_csv("spotify-2023.csv", encoding='ISO-8859-1')
print(data.head())

# Use a few features for clustering (copy to avoid a pandas SettingWithCopyWarning later)
data_small = data[['bpm', 'energy_%', 'danceability_%']].copy()

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_small)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(data_scaled)

# Add cluster to data
data_small['cluster'] = clusters

# Reduce to 2D for plotting
pca = PCA(n_components=2)
pca_data = pca.fit_transform(data_scaled)

# Plot
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=clusters)
plt.title("Song Clusters")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()
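
The value n_clusters=3 is a guess; a common way to pick k is the elbow method. A short sketch on the same scaled features:

inertias = []
for k in range(1, 10):
    inertias.append(KMeans(n_clusters=k, random_state=42).fit(data_scaled).inertia_)

plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow Method")
plt.show()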

#14 Write a Python program to apply the Naive Bayesian algorithm by converting the original dataset to a frequency table and a likelihood table for the following dataset.

# Step 1: Data
weather = ['Rainy', 'Sunny', 'Overcast', 'Overcast', 'Sunny', 'Rainy']
temp = ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']

# Step 2: Frequency count
total_yes = temp.count('Yes')
total_no = temp.count('No')
total = len(temp)

# Prior probabilities
p_yes = total_yes / total
p_no = total_no / total

# Likelihoods: P(Weather = Sunny | class), counted from the paired lists
sunny_yes = sum(1 for w, t in zip(weather, temp) if w == 'Sunny' and t == 'Yes')  # 1 Sunny with Yes
sunny_no = sum(1 for w, t in zip(weather, temp) if w == 'Sunny' and t == 'No')    # 1 Sunny with No
p_sunny_given_yes = sunny_yes / total_yes
p_sunny_given_no = sunny_no / total_no

# Step 3: Apply Naive Bayes formula
prob_yes = p_sunny_given_yes * p_yes
prob_no = p_sunny_given_no * p_no

# Step 4: Compare probabilities
print("Probability of Yes given Sunny:", prob_yes)
print("Probability of No given Sunny:", prob_no)

if prob_yes > prob_no:
    print("Prediction: Temperature = Yes")
else:
    print("Prediction: Temperature = No")
