
COURSE NAME: DATA ANALYTICS LAB COURSE CODE: 22ML607PC

Write Python programs for the following:

1. Data Preprocessing

a. Handling missing values

b. Noise detection removal

c. Identifying data redundancy and elimination

import numpy as np

import pandas as pd

from sklearn.impute import SimpleImputer

from scipy import stats

from sklearn.preprocessing import StandardScaler

# Sample dataset with missing values, noise, and redundant data

data = {
    'A': [10, 15, np.nan, 20, 25, 30, np.nan, 35, 1000],  # Contains missing values & an outlier (1000)
    'B': [5, 7, 8, 5, 10, 12, 8, 5, 15],  # No missing values
    'C': [10, 15, 20, 25, 30, 35, 40, 45, 50],  # Highly correlated with A (redundant)
    'D': ['yes', 'no', np.nan, 'yes', 'no', 'yes', 'no', 'yes', 'no']  # Categorical with missing values
}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# ---------- Handling Missing Values ----------


imputer_numeric = SimpleImputer(strategy='mean') # Using mean imputation for numerical columns

df[['A']] = imputer_numeric.fit_transform(df[['A']]) # Fill missing values in column 'A'

imputer_categorical = SimpleImputer(strategy='most_frequent') # Using mode imputation for categorical data

df[['D']] = imputer_categorical.fit_transform(df[['D']]) # Fill missing values in column 'D'

print("\nData After Handling Missing Values:\n", df)

# ---------- Noise Detection & Removal (Z-score method) ----------

z_scores = np.abs(stats.zscore(df[['A', 'B']])) # Compute Z-scores for numerical columns

df_no_noise = df[(z_scores < 3).all(axis=1)] # Remove rows where Z-score > 3

print("\nData After Removing Noise:\n", df_no_noise)

# ---------- Identifying & Removing Redundant Data ----------

correlation_matrix = df_no_noise.corr(numeric_only=True) # Correlation matrix over the numeric columns only

high_correlation_features = [col for col in correlation_matrix.columns
                             if (correlation_matrix[col].drop(col) > 0.95).any() and col != 'A'] # Ignore self-correlation; keep 'A'

df_final = df_no_noise.drop(columns=high_correlation_features) # Drop highly correlated (redundant) features

print("\nFinal Processed Data (Redundant Features Removed):\n", df_final)


OUTPUT:

Original Data:

A B C D

0 10.0 5 10 yes

1 15.0 7 15 no

2 NaN 8 20 NaN

3 20.0 5 25 yes

4 25.0 10 30 no

5 30.0 12 35 yes

6 NaN 8 40 no

7 35.0 5 45 yes

8 1000.0 15 50 no

Data After Handling Missing Values:

A B C D

0 10.000000 5 10 yes

1 15.000000 7 15 no

2 162.142857 8 20 no

3 20.000000 5 25 yes

4 25.000000 10 30 no

5 30.000000 12 35 yes

6 162.142857 8 40 no

7 35.000000 5 45 yes

8 1000.000000 15 50 no

Data After Removing Noise:

A B C D

0 10.000000 5 10 yes

1 15.000000 7 15 no

2 162.142857 8 20 no

3 20.000000 5 25 yes

4 25.000000 10 30 no

5 30.000000 12 35 yes

6 162.142857 8 40 no

7 35.000000 5 45 yes

8 1000.000000 15 50 no
2. Implement any one imputation model
import pandas as pd

import numpy as np

def mean_imputation(data):
    """Imputes missing values with the mean of each column."""
    return data.fillna(data.mean())

# Example dataset with missing values

data = pd.DataFrame({

'A': [1, 2, np.nan, 4, 5],

'B': [3, np.nan, 7, 8, 9],

'C': [10, 11, 12, np.nan, 14]

})

print("Original Data:")

print(data)

# Apply mean imputation

imputed_data = mean_imputation(data)

print("\nData after Mean Imputation:")

print(imputed_data)

OUTPUT:
Original Data:

A B C

0 1.0 3.0 10.0

1 2.0 NaN 11.0

2 NaN 7.0 12.0

3 4.0 8.0 NaN

4 5.0 9.0 14.0

Data after Mean Imputation:

A B C

0 1.0 3.00 10.00

1 2.0 6.75 11.00

2 3.0 7.00 12.00

3 4.0 8.00 11.75

4 5.0 9.00 14.00

What is an imputer?

The imputer is an estimator used to fill in missing values in a dataset. For numerical columns it can use the mean, the median, or a constant; for categorical columns it can use the most frequent value or a constant. You can also train a model to predict the missing values.
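A minimal sketch of these strategies with SimpleImputer (the toy columns 'num' and 'cat' are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'num': [1.0, np.nan, 3.0], 'cat': ['a', np.nan, 'a']})

# Numerical strategies: 'mean', 'median', or 'constant' (with fill_value)
df[['num']] = SimpleImputer(strategy='median').fit_transform(df[['num']])

# Categorical strategies: 'most_frequent' or 'constant'
df[['cat']] = SimpleImputer(strategy='most_frequent').fit_transform(df[['cat']])

print(df)   # missing entries are now filled in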

What are NumPy and Pandas?

NumPy and Pandas are two popular Python libraries widely used in data analytics. NumPy is used for working with arrays and provides functions for linear algebra, Fourier transforms, and matrices; it excels at creating N-dimensional data objects and performing mathematical operations on them efficiently. Pandas is known for data wrangling and its ability to handle large tabular datasets.
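A small illustrative contrast between the two libraries (the array values and column names are made up):

import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])      # NumPy: N-dimensional numeric array
print(arr.mean(axis=0))               # column-wise mean -> [2. 3.]

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})   # Pandas: labelled tabular data
print(df.describe())                  # per-column summary statistics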

What are the uses of sklearn?


It is one of the most useful libraries for machine learning in Python. The sklearn library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. Generated sklearn datasets are synthetic datasets created with the sklearn library; they are used for testing, benchmarking, and developing machine learning algorithms and models.
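As one sketch of a generated sklearn dataset, make_classification builds a synthetic classification problem (the parameter values below are arbitrary choices):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
print(X.shape, y.shape)   # (200, 4) (200,)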

What is a dataframe?

A DataFrame is a data structure organized into rows and columns, similar to a database table or an Excel/Calc spreadsheet. A Pandas DataFrame is a two-dimensional, size-mutable structure that holds heterogeneous tabular data.
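A brief sketch of building a DataFrame and selecting rows and columns (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [85, 92]})
print(df['score'])    # select a column by label
print(df.iloc[0])     # select a row by position
print(df.shape)       # (2, 2) -> rows, columns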

What is a dict{} in Python?

A dictionary can be created by placing a sequence of key-value pairs within curly braces {}, separated by commas. Python dictionaries are ordered (since Python 3.7). Dictionary keys are case sensitive: the same name with different cases is treated as a distinct key. With dictionaries you access values via their keys. Keys can be of any immutable type (int, float, string, or even tuple). A dictionary may contain duplicate values, but the keys must be unique (so it is not possible to access different values via the same key).
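A short sketch of this behaviour (the values are chosen only for illustration):

d = {'a': 1, 'A': 2, (1, 2): 'tuple key'}   # keys are case sensitive and may be tuples
print(d['a'], d['A'])   # 1 2 -- values are accessed via their keys
d['a'] = 10             # updating a value through its (unique) key
print(d)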
3. Implement Linear Regression
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

# Generate synthetic data

np.random.seed(42)

X = 2 * np.random.rand(100, 1)

y = 4 + 3 * X + np.random.randn(100, 1)

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression using sklearn

model = LinearRegression()

model.fit(X_train, y_train)

# Predict on test set

y_pred = model.predict(X_test)

# Calculate mean squared error

mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error: {mse}")


# Plot results

plt.scatter(X_test, y_test, color='blue', label='Actual')

plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')

plt.xlabel("X")

plt.ylabel("y")

plt.legend()

plt.show()

# Manual Implementation of Linear Regression using Normal Equation

X_b = np.c_[np.ones((100, 1)), X] # Add bias term

theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

print(f"Calculated coefficients: {theta_best.ravel()}")

What is Linear Regression?

Linear Regression is a supervised learning algorithm used for predicting a continuous dependent
variable based on one or more independent variables. It models the relationship between
variables by fitting a linear equation:

y= β0+β1x1+β2x2+...+βnxn+ϵ

where:

 y is the dependent variable (target),


 xi are the independent variables (features),
 βi are the coefficients (weights),
 ϵ is the error term.
The goal of Linear Regression is to find the best-fitting line (or hyperplane in higher dimensions)
that minimizes the difference between predicted and actual values, often using methods like
Ordinary Least Squares (OLS).
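For reference, the closed-form OLS solution computed by the normal-equation code in the program above is:

β̂ = (XᵀX)⁻¹Xᵀy

where X is the feature matrix with a leading column of ones (the bias term) and y is the vector of target values.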

Linear Regression

 Predicts continuous values.


 The model fits a straight line to the data.
4. Implement Logistic Regression
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, confusion_matrix

# Generate synthetic data

np.random.seed(42)

X = 2 * np.random.rand(100, 1)

y = (X > 1).astype(int).ravel() # Binary classification based on threshold

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression using sklearn

model = LogisticRegression()

model.fit(X_train, y_train)

# Predict on test set

y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")

# Confusion matrix

conf_matrix = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")

print(conf_matrix)

# Plot decision boundary

X_values = np.linspace(0, 2, 100).reshape(-1, 1)

y_proba = model.predict_proba(X_values)[:, 1]

plt.scatter(X_test, y_test, color='blue', label='Actual')

plt.plot(X_values, y_proba, color='red', linewidth=2, label='Predicted Probability')

plt.xlabel("X")

plt.ylabel("Probability")

plt.legend()

plt.show()

What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for classification problems.


Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts
probabilities and assigns data points to discrete classes (e.g., 0 or 1, spam or not spam, disease
or no disease).

Mathematical Formulation

Instead of a direct linear equation like in Linear Regression, Logistic Regression uses the
sigmoid (logistic) function to map outputs between 0 and 1:

P(y=1) = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + ... + βnxn)))

where:

 P(y=1) is the probability that the output belongs to class 1.
 β0, β1, ..., βn are the model coefficients.
 x1, x2, ..., xn are the input features.
 The sigmoid function transforms the linear output into a probability in the range (0, 1).

Classification Decision

Once the probability is computed, the decision boundary is set (commonly at 0.5):

 If P(y=1) ≥ 0.5, classify as 1.
 If P(y=1) < 0.5, classify as 0.
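A minimal sketch of the sigmoid and the 0.5 decision boundary (the coefficient values below are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta0, beta1 = -4.0, 4.0            # illustrative coefficients
x = np.array([0.5, 1.0, 1.5])
p = sigmoid(beta0 + beta1 * x)      # predicted probabilities P(y=1)
labels = (p >= 0.5).astype(int)     # apply the 0.5 threshold
print(p, labels)                    # ~[0.12 0.5 0.88] [0 1 1]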

Types of Logistic Regression

1. Binary Logistic Regression (Two classes, e.g., spam vs. not spam).
2. Multinomial Logistic Regression (More than two classes, e.g., cat, dog, horse).
3. Ordinal Logistic Regression (Ordered classes, e.g., low, medium, high risk).

Loss Function in Logistic Regression

Instead of Mean Squared Error (MSE) used in Linear Regression, Logistic Regression optimizes
the Log Loss (Cross-Entropy Loss):
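For N training examples with true labels yᵢ and predicted probabilities ŷᵢ, the binary cross-entropy loss has the standard form:

Log Loss = −(1/N) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]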

It ensures the model penalizes wrong classifications more strongly.

Logistic Regression

 Predicts binary class probabilities.


 The model fits an S-shaped sigmoid curve.
***Linear Regression produces continuous outputs, while Logistic Regression produces probabilities
mapped to class labels

When to Use Logistic Regression?

✅ When you need classification (yes/no, pass/fail, fraud/not fraud).


✅ When the relationship between independent variables and output is non-linear but can be
mapped using probabilities.
✅ When the dataset is small to medium-sized, as logistic regression is computationally
efficient.

Logistic Regression is suitable for classification tasks, whereas Linear Regression is for regression tasks.

Real-Life Examples of Linear and Logistic Regression

📌 Linear Regression Examples (Predicting Continuous Values)

1. House Price Prediction


o Predicting house prices based on factors like size, location, number of bedrooms, and
age.

2. Stock Market Forecasting


o Predicting stock prices based on past trends, economic indicators, and company
performance.

3. Salary Prediction
o Estimating an employee's salary based on experience, education, and skills.

4. Temperature Prediction
o Forecasting the temperature of a city based on historical weather data, humidity, and
wind speed.

Logistic Regression Examples (Predicting Classifications)

1. Spam Detection 📩
o Classifying emails as spam (1) or not spam (0) based on word frequency and metadata.

2. Disease Diagnosis 🏥
o Predicting whether a patient has diabetes (1) or not (0) based on glucose levels, age, and
BMI.

3. Credit Card Fraud Detection 💳


o Identifying fraudulent transactions based on transaction patterns, location, and
frequency.

4. Customer Churn Prediction 📊


o Predicting whether a customer will continue (0) or cancel (1) a subscription based on
usage and complaints.

5. Implement Decision Tree Classifier
from sklearn.datasets import load_iris

from sklearn.tree import DecisionTreeClassifier, export_text

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

# Load dataset

data = load_iris()

X, y = data.data, data.target

# Split into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree classifier

clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)

clf.fit(X_train, y_train)

# Make predictions

y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')

# Display the decision tree rules

print(export_text(clf, feature_names=data.feature_names))

NOTES:

Decision Tree Classifier:

A Decision Tree Classifier is a supervised learning algorithm used for classification tasks. It works by splitting the data into subsets based on feature values, forming a tree-like structure where:

Each internal node represents a decision based on a feature.

Each branch represents an outcome of that decision.

Each leaf node represents a class label (final prediction).

How It Works:

The algorithm selects the best feature to split the dataset using criteria like:

Gini Impurity (default in sklearn)

Entropy (Information Gain)

It recursively splits the data, forming a tree structure.

The process stops when:

A predefined depth is reached.

All samples in a node belong to the same class.

Further splits don’t improve accuracy.
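As a sketch of the first splitting criterion above, the Gini impurity of a node is 1 − Σ pᵢ², where pᵢ is the proportion of each class in that node:

import numpy as np

def gini_impurity(labels):
    """Gini impurity = 1 - sum(p_i^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))   # 0.5 -> maximally mixed two-class node
print(gini_impurity([0, 0, 0, 0]))   # 0.0 -> pure node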


Advantages:

Easy to understand and interpret

Requires little data preprocessing (no need for feature scaling)

Handles both numerical and categorical data

Disadvantages:

Prone to overfitting (solved using pruning or ensemble methods like Random Forest)

Can be unstable with small data changes

What is an Iris dataset?

The Iris dataset is a well-known dataset in machine learning and statistics, used primarily for
classification tasks. It consists of 150 samples of iris flowers, categorized into three species:

Setosa

Versicolor

Virginica

Each sample has four features (measured in centimeters):

Sepal length

Sepal width

Petal length

Petal width

It is often used in educational contexts and for testing classification algorithms because:

Simplicity – It is small, well-structured, and easy to understand and visualize.

Availability – It is built into scikit-learn, making it easy to access.

Balanced Classes – The dataset has three classes with roughly equal representation.

Benchmarking – Many algorithms have been tested on it, making it a good reference.

Well-Defined Features – The four numerical features (sepal length, sepal width, petal length,
petal width) provide clear distinctions between classes. The classes are well-separated, making it
a good dataset for testing classification algorithms.

However, you can use other datasets like:

Wine Dataset (sklearn.datasets.load_wine) – Good for multi-class classification.

Breast Cancer Dataset (sklearn.datasets.load_breast_cancer) – Used for binary classification.

Custom Data – You can use real-world datasets from CSV files or databases.
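For instance, swapping in one of these datasets is a small change to the program above (the CSV path and label column below are hypothetical placeholders):

from sklearn.datasets import load_wine
import pandas as pd

data = load_wine()                 # drop-in replacement for load_iris()
X, y = data.data, data.target

# Custom data from a CSV file (hypothetical file and label column)
# df = pd.read_csv('my_data.csv')
# X, y = df.drop(columns=['label']), df['label']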

6. Implement Random Forest Classifier

from sklearn.datasets import load_iris

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

# Load dataset

data = load_iris()

X, y = data.data, data.target

# Split into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest classifier

clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=3, random_state=42)

clf.fit(X_train, y_train)

# Make predictions

y_pred = clf.predict(X_test)
# Evaluate accuracy

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')

NOTE: This program modifies the previous code to use a Random Forest classifier instead of a Decision Tree classifier.

NOTES:

A Random Forest Classifier is an ensemble learning method that builds multiple decision trees and
combines their predictions to improve accuracy and reduce overfitting. Here's how it works:

Bootstrap Sampling – The dataset is randomly sampled with replacement to create multiple training
subsets.

Multiple Decision Trees – A decision tree is trained on each subset.

Random Feature Selection – Each tree considers a random subset of features at each split, increasing
diversity among trees.

Voting/Averaging – For classification, the majority vote from all trees determines the final prediction.

Advantages of Random Forest

Reduces overfitting compared to a single decision tree

Handles missing values and large datasets well

Works for both classification and regression tasks

Can measure feature importance
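As a brief sketch of the last point, a fitted RandomForestClassifier exposes per-feature importances; this assumes the clf and data objects from the program above:

for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f'{name}: {importance:.3f}')   # higher values = more influential features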
