Machine Learning Unit 2
Handling missing values in a dataset (one of the most common issues with any dataset)
import pandas as pd
df = pd.read_csv(".../Train.csv")

# Count of records in the data
df.shape
>> (8068, 11)

# Check which variables have missing values and how many
df.isnull().sum()
ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64
• Removal: you can remove either rows or columns containing missing values, as in the sketch below.
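A minimal sketch of both removal options with pandas dropna (df is the DataFrame loaded above; the result names are illustrative):

# Drop rows that contain any missing value
df_rows_dropped = df.dropna(axis=0)

# Drop columns that contain any missing value
df_cols_dropped = df.dropna(axis=1)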
• Also, you can replace the NULL values with the column’s median
# Impute Work_Experience feature by its median in our dataset
df['Work_Experience'] = df['Work_Experience'].fillna(df['Work_Experience'].median())
Filling the missing values of Gender with the string “No Gender”:
• df["Gender"] = df["Gender"].fillna("No Gender")
➢ But many machine learning models require all input and output variables to be numeric.
➢ If your data contains categorical variables, you must encode them as numbers before you can fit and evaluate a model (a one-hot encoding sketch for nominal variables follows below).
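For nominal (unordered) categories, a common option is one-hot encoding into dummy variables; a minimal sketch with pandas (the Profession values here are illustrative):

# One-hot encode a nominal column into 0/1 dummy variables
nominal = pd.DataFrame({'Profession': ['Artist', 'Doctor', 'Engineer', 'Artist']})
dummies = pd.get_dummies(nominal, columns=['Profession'])
print(dummies)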
• Ordinal data: this type of categorical data has a natural order or scale among its categories. For example, the level of sugar in a patient's body can be divided into low, medium and high classes.
from sklearn.preprocessing import OrdinalEncoder

# Create a dataset
data = {'cost': ['50', '35', '75', '42', '54', '71'],
        'size': ['large', 'small', 'extra large', 'medium', 'large', 'extra large']}
df = pd.DataFrame(data)

# Initiate OrdinalEncoder with the explicit category order for 'size'
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large', 'extra large']])

# Fit and transform the ordinal column (0 = small, ..., 3 = extra large)
df['size_encoded'] = encoder.fit_transform(df[['size']])

# Original data with the encoded column
print(df)
Python code to delete the outlier and copy the rest of the elements to another array.
import numpy as np

# Assumed example data containing the outlier 101
sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])
sample_outliers = [101]

# Trimming: delete each outlier; the remaining elements are copied to array 'a'
for i in sample_outliers:
    a = np.delete(sample, np.where(sample == i)[0])
print(a)
# print(len(sample), len(a))
The outlier ‘101’ is deleted and the rest of the data points are copied to another array ‘a’.
• Categorical Imputation: when dealing with categorical columns, replacing missing values with the most frequent value (the mode) of the column is a smart solution.
• # Mode fill for categorical columns
• data['column_name'] = data['column_name'].fillna(data['column_name'].value_counts().idxmax())
• STANDARDIZATION: Standardization is the process of scaling the data values so that they gain the properties of a standard normal distribution, i.e. the data is rescaled so that the mean becomes zero and the standard deviation becomes one:
• x_scaled = (x – mean) / std
• Standardized values do not have a fixed bounded range, unlike normalised values; see the sketch below.
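A minimal sketch with scikit-learn's StandardScaler (the values are the illustrative Income/Age figures used further below):

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[15000, 25], [1800, 18], [120000, 42], [10000, 51]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Each column of X_std now has mean 0 and unit standard deviation
print(X_std.mean(axis=0), X_std.std(axis=0))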
• Our dependent variable y is also a categorical variable. However, in this case we can simply assign 0 and 1 to the two categories ‘No’ and ‘Yes’. We do not require dummy variables to encode it, because it is the target rather than an input feature of the model.
• To code this, we are going to need the LabelEncoder class.
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• y = le.fit_transform(y)
df = pd.DataFrame({
'Income': [15000, 1800, 120000, 10000],
'Age': [25, 18, 42, 51],
'Department': ['HR','Legal','Marketing','Management']
})
We first create a copy of our DataFrame and store the numerical feature names in a list, along with their values, as in the sketch below:
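A minimal sketch of that step, assuming the df defined just above (the names df_scaled, numeric_cols and numeric_values are illustrative):

df_scaled = df.copy()                             # work on a copy of the DataFrame
numeric_cols = ['Income', 'Age']                  # numerical feature names
numeric_values = df_scaled[numeric_cols].values   # their values as a NumPy array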
MinMax Scaler
• The MinMax scaler is one of the simplest scalers to understand. It just scales all the data between 0 and
1. The formula for calculating the scaled value is-
• x_scaled = (x – x_min)/(x_max – x_min)
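Continuing the illustrative snippet above, the same scaling with scikit-learn's MinMaxScaler (df_scaled and numeric_cols come from the earlier sketch):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Rescale each numerical column to the [0, 1] range
df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])
print(df_scaled)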
These are fast methods, similar to the filter method, but more accurate than it.
• Random Forest Importance - Various tree-based feature-selection methods provide feature importances as a way of selecting features. Here, feature importance indicates which features matter most in model building, i.e. have the greatest impact on the target variable. Random Forest is one such tree-based method: a bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e. by the decrease in impurity (Gini impurity) they achieve over all the trees. Nodes are arranged according to their impurity values, which allows the trees to be pruned below a specific node. The remaining nodes form a subset of the most important features; a sketch follows below.
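A minimal sketch of ranking features by Random Forest importance, using scikit-learn on an illustrative synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Illustrative synthetic data: 6 features, of which 3 are informative
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Mean decrease in Gini impurity per feature, sorted from most to least important
importances = pd.Series(rf.feature_importances_,
                        index=[f'feature_{i}' for i in range(6)])
print(importances.sort_values(ascending=False))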