DA Programs Up to 6
1. Data Preprocessing
import numpy as np
import pandas as pd

data = {
    'A': [10, 15, np.nan, 20, 25, 30, np.nan, 35, 1000],  # Contains missing values & an outlier (1000)
    'B': [5, 7, 8, 5, 10, 12, 8, 5, 15],                  # Numeric feature (values taken from the output below)
    'C': [10, 15, 20, 25, 30, 35, 40, 45, 50],            # Highly correlated with A (redundant)
    'D': ['yes', 'no', np.nan, 'yes', 'no', 'yes', 'no', 'yes', 'no']  # Categorical with missing values
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
Original Data:
        A   B   C    D
0    10.0   5  10  yes
1    15.0   7  15   no
2     NaN   8  20  NaN
3    20.0   5  25  yes
4    25.0  10  30   no
5    30.0  12  35  yes
6     NaN   8  40   no
7    35.0   5  45  yes
8  1000.0  15  50   no
Data After Handling Missing Values:
             A   B   C    D
0    10.000000   5  10  yes
1    15.000000   7  15   no
2   162.142857   8  20   no
3    20.000000   5  25  yes
4    25.000000  10  30   no
5    30.000000  12  35  yes
6   162.142857   8  40   no
7    35.000000   5  45  yes
8  1000.000000  15  50   no
Data After Removing Noise:
             A   B   C    D
0    10.000000   5  10  yes
1    15.000000   7  15   no
2   162.142857   8  20   no
3    20.000000   5  25  yes
4    25.000000  10  30   no
5    30.000000  12  35  yes
6   162.142857   8  40   no
7    35.000000   5  45  yes
8  1000.000000  15  50   no
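A minimal sketch of preprocessing steps consistent with the tables above: the imputation lines reproduce the printed values (column mean for A, most frequent value for D), while the IQR capping rule and the drop of the redundant column C are assumptions, not taken from the original program.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [10, 15, np.nan, 20, 25, 30, np.nan, 35, 1000],
    'B': [5, 7, 8, 5, 10, 12, 8, 5, 15],
    'C': [10, 15, 20, 25, 30, 35, 40, 45, 50],
    'D': ['yes', 'no', np.nan, 'yes', 'no', 'yes', 'no', 'yes', 'no'],
})

# Fill numeric NaNs with the column mean (A -> 162.142857, as in the table above)
df['A'] = df['A'].fillna(df['A'].mean())

# Fill categorical NaNs with the most frequent value (D -> 'no')
df['D'] = df['D'].fillna(df['D'].mode()[0])

# Noise removal via IQR capping (an assumed rule; binning or z-scores also work)
q1, q3 = df['A'].quantile([0.25, 0.75])
df['A'] = df['A'].clip(q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))

# Drop C, which is highly correlated with A (redundant feature)
df = df.drop(columns=['C'])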
2. Implement any one imputation model
import pandas as pd
import numpy as np

def mean_imputation(data):
    # Replace every NaN with its column's mean (numeric columns only)
    return data.fillna(data.mean())

# The original values were lost; these rows are illustrative stand-ins
data = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': [np.nan, 2.0, 3.0, np.nan],
    'C': [5.0, 6.0, np.nan, 8.0]
})

print("Original Data:")
print(data)

imputed_data = mean_imputation(data)
print("\nImputed Data:")
print(imputed_data)
OUTPUT (for the illustrative data above):
Original Data:
     A    B    C
0  1.0  NaN  5.0
1  NaN  2.0  6.0
2  3.0  3.0  NaN
3  4.0  NaN  8.0

Imputed Data:
          A    B         C
0  1.000000  2.5  5.000000
1  2.666667  2.0  6.000000
2  3.000000  3.0  6.333333
3  4.000000  2.5  8.000000
What is an imputer?
An imputer is an estimator used to fill the missing values in a dataset. For numerical values, it can use the mean, the median, or a constant; for categorical values, it can use the most frequent value or a constant. You can also train a model to predict the missing values.
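scikit-learn's SimpleImputer is one concrete implementation of these strategies; a minimal sketch:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))
# [[1.  5. ]
#  [2.  4. ]
#  [1.5 6. ]]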
NumPy and Pandas are two popular Python libraries often used in data analytics. NumPy is used for working with arrays and also provides functions for linear algebra, Fourier transforms, and matrices. NumPy excels at creating N-dimensional data objects and performing mathematical operations on them efficiently, while Pandas is renowned for data wrangling and its ability to handle large datasets.
What is a dataframe?
A dataframe is a data structure organized into rows and columns, similar to a database table or the spreadsheets you work with in Excel or Calc. A Pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure.
A dictionary can be created by placing a sequence of key-value pairs within curly braces {}, separated by commas. Python dictionaries are ordered (insertion order is preserved). Dictionary keys are case sensitive: the same name with different casing is treated as a distinct key. With dictionaries you access values via their keys, which can be of any hashable datatype (int, float, string, even tuple). A dictionary may contain duplicate values, but the keys MUST be unique (so it isn't possible to access different values via the same key).
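A short illustration of both points, using made-up keys and values:

import pandas as pd

# Keys of different hashable types; values may repeat, keys may not
d = {'name': 'iris', 1: 'one', (2, 3): 'one'}
print(d['name'])   # access by key -> 'iris'
print(d[(2, 3)])   # tuple key -> 'one' (a duplicate value is fine)

# A dict of equal-length lists is the usual way to build a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'D': ['yes', 'no', 'yes']})
print(df)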
3. Implement Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)  # true relationship: y = 4 + 3x + Gaussian noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

plt.scatter(X_test, y_test, label="Actual")
plt.scatter(X_test, y_pred, color="red", label="Predicted")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

# Closed-form fit via the normal equation: theta = (X_b^T X_b)^-1 X_b^T y
X_b = np.c_[np.ones((100, 1)), X]  # prepend a bias column of ones
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)  # close to [[4], [3]]
Linear Regression is a supervised learning algorithm used for predicting a continuous dependent
variable based on one or more independent variables. It models the relationship between
variables by fitting a linear equation:
y = β0 + β1x1 + β2x2 + ... + βnxn + ε
where β0 is the intercept, β1 ... βn are the coefficients of the independent variables x1 ... xn, and ε is the error term.
4. Implement Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (X.ravel() > 1).astype(int)  # binary labels (an assumed rule: class 1 when X > 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Plot the predicted probability of class 1 across the input range
X_values = np.linspace(0, 2, 100).reshape(-1, 1)
y_proba = model.predict_proba(X_values)[:, 1]
plt.plot(X_values, y_proba, label="P(y = 1)")
plt.xlabel("X")
plt.ylabel("Probability")
plt.legend()
plt.show()
Mathematical Formulation
Instead of a direct linear equation as in Linear Regression, Logistic Regression passes the linear combination of the features through the sigmoid (logistic) function to map outputs between 0 and 1:
P(y=1 | x) = 1 / (1 + e^(-z)), where z = β0 + β1x1 + ... + βnxn
where z is the linear combination of the inputs, β0 ... βn are the model parameters, and e is Euler's number.
Classification Decision
Once the probability is computed, the decision boundary is set (commonly at 0.5): predict class 1 if P(y=1 | x) ≥ 0.5, otherwise predict class 0.
Types of Logistic Regression:
1. Binary Logistic Regression (Two classes, e.g., spam vs. not spam).
2. Multinomial Logistic Regression (More than two classes, e.g., cat, dog, horse).
3. Ordinal Logistic Regression (Ordered classes, e.g., low, medium, high risk).
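Of these, the multinomial case is easy to sketch with scikit-learn, which switches to multiclass handling automatically when the target has more than two classes. A minimal sketch on the Iris data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three iris species -> a multinomial (multiclass) problem
X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)  # multiclass handled automatically
clf.fit(X, y)
print(clf.predict(X[:5]))                 # predicted species for the first 5 samples
print(clf.predict_proba(X[:1]).round(3))  # class probabilities for one sample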
Cost Function
Instead of the Mean Squared Error (MSE) used in Linear Regression, Logistic Regression optimizes the Log Loss (Cross-Entropy Loss):
J(β) = -(1/m) Σ [ yi log(ŷi) + (1 - yi) log(1 - ŷi) ]
where m is the number of samples, yi is the true label, and ŷi is the predicted probability for sample i.
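A quick numeric check of this formula, with made-up labels and probabilities:

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1]
p_hat = [0.9, 0.2, 0.7]  # predicted P(y=1)

# Manual cross-entropy: -(1/m) * sum(y*log(p) + (1-y)*log(1-p))
manual = -np.mean([y * np.log(p) + (1 - y) * np.log(1 - p)
                   for y, p in zip(y_true, p_hat)])
print(manual)                   # ~0.2284
print(log_loss(y_true, p_hat))  # matches the manual computation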
Linear Regression vs. Logistic Regression
Logistic Regression is suited to classification tasks, whereas Linear Regression is suited to regression tasks (continuous outputs).
Applications of Linear Regression:
3. Salary Prediction
o Estimating an employee's salary based on experience, education, and skills.
4. Temperature Prediction
o Forecasting the temperature of a city based on historical weather data, humidity, and wind speed.
Applications of Logistic Regression:
1. Spam Detection 📩
o Classifying emails as spam (1) or not spam (0) based on word frequency and metadata.
2. Disease Diagnosis 🏥
o Predicting whether a patient has diabetes (1) or not (0) based on glucose levels, age, and
BMI.
5. Implement Decision Tree Classifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Print the learned tree as text
print(export_text(clf, feature_names=data.feature_names))
NOTES:
How It Works:
The algorithm selects the best feature to split the dataset using criteria like Gini impurity or information gain (entropy), then recursively splits each subset until a stopping condition is reached.
Disadvantages: decision trees are prone to overfitting and are sensitive to small changes in the training data.
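To make the split criterion concrete, here is a minimal sketch of the Gini impurity computation (the label array is made up):

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

labels = np.array([0, 0, 1, 1, 1, 2])
print(gini(labels))  # ~0.611 for this mixed three-class node; 0.0 would mean a pure node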
The Iris dataset is a well-known dataset in machine learning and statistics, used primarily for classification tasks. It consists of 150 samples of iris flowers, categorized into three species:
Setosa
Versicolor
Virginica
Each sample is described by four features:
Sepal length
Sepal width
Petal length
Petal width
Why the Iris dataset is widely used:
Balanced Classes – The dataset has three classes with roughly equal representation.
Benchmarking – Many algorithms have been tested on it, making it a good reference.
Well-Defined Features – The four numerical features (sepal length, sepal width, petal length,
petal width) provide clear distinctions between classes. The classes are well-separated, making it
a good dataset for testing classification algorithms.
Custom Data – You can use real-world datasets from CSV files or databases.
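For example, a custom dataset could be swapped in like this (the file name and the 'label' column are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_dataset.csv")     # hypothetical CSV file
X = df.drop(columns=["label"]).values  # hypothetical target column name
y = df["label"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)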
6. Implement Random Forest Classifier
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest of 100 trees
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
NOTE: This is the same pipeline as the Decision Tree program, with a Random Forest classifier used in place of the Decision Tree classifier.
NOTES:
A Random Forest Classifier is an ensemble learning method that builds multiple decision trees and
combines their predictions to improve accuracy and reduce overfitting. Here's how it works:
Bootstrap Sampling – The dataset is randomly sampled with replacement to create multiple training
subsets.
Random Feature Selection – Each tree considers a random subset of features at each split, increasing
diversity among trees.
Voting/Averaging – For classification, the majority vote from all trees determines the final prediction.
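A minimal sketch of the bootstrap-sampling and majority-vote steps in isolation (the per-tree predictions are made up):

import numpy as np

rng = np.random.default_rng(42)

# Bootstrap sampling: draw row indices with replacement
n_samples = 150
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Hypothetical class predictions from 3 trees for 5 samples
tree_preds = np.array([
    [0, 1, 2, 1, 0],   # tree 1
    [0, 1, 1, 1, 0],   # tree 2
    [0, 2, 2, 1, 1],   # tree 3
])

# Majority vote per sample decides the forest's prediction
final = np.array([np.bincount(col).argmax() for col in tree_preds.T])
print(final)  # [0 1 2 1 0]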