Data Preprocessing

This document outlines the steps for data preprocessing, including cleaning, feature scaling, encoding categorical variables, and feature engineering, using popular Python libraries like pandas and sklearn. It also provides guidance on setting up Visual Studio Code for Python development and on using Kaggle for dataset exercises, with a specific example of a pipeline for the Titanic dataset. The example demonstrates importing libraries, loading data, preprocessing, training a model, and evaluating its accuracy.


Data Preprocessing:

- Cleaning the data: handling missing values (e.g., by imputation or by dropping rows), outliers, and duplicates.
- Feature scaling: standardization or normalization (especially important for models like KNN, SVM, and neural networks).
- Encoding categorical variables: converting categorical data to numerical format using techniques like one-hot encoding or label encoding.
- Feature engineering: creating new features or selecting the most relevant ones to improve model performance.

Popular Python libraries for this (a short combined sketch follows the list):

- pandas for data manipulation
- sklearn.preprocessing for scaling and encoding
- numpy for numerical operations
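
Below is a minimal sketch of how these libraries can work together; the DataFrame and column names are hypothetical, and bundling imputation, scaling, and encoding in a ColumnTransformer is just one common approach, not the only one.

# Minimal preprocessing sketch (hypothetical DataFrame and columns)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age':  [22, 38, None, 35],                    # numeric, with a missing value
    'fare': [7.25, 71.28, 7.92, 53.10],            # numeric
    'sex':  ['male', 'female', 'female', 'male'],  # categorical
})

num_cols = ['age', 'fare']
cat_cols = ['sex']

preprocess = ColumnTransformer([
    # Numeric columns: impute missing values with the mean, then standardize
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), num_cols),
    # Categorical columns: one-hot encode
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, 4): age, fare, sex_female, sex_male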

Working with Visual Studio Code:

- Install Python extensions in VS Code for better functionality, such as Python, Jupyter, and Pylance.
- Set up a virtual environment to manage dependencies; you can use venv or conda for this.
- Use Jupyter notebooks within VS Code for interactive data exploration and for testing out models.

Kaggle Dataset Exercises:

- Kaggle is a goldmine for learning. You can explore competitions, kernels (notebooks), and datasets for practice.
- Download the datasets and load them into your Python environment. After preprocessing the data, you can experiment with different models (e.g., Decision Trees, Random Forest, XGBoost, or even neural networks if you're feeling adventurous); a short comparison sketch follows this list.
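
As a quick, hedged sketch of comparing a few sklearn models with cross-validation (the synthetic dataset here is only a stand-in so the snippet runs on its own; in practice X and y would come from your preprocessed Kaggle data):

# Model-comparison sketch with cross-validation (synthetic stand-in data)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')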

Getting Started with a Kaggle Exercise:

- Download a dataset from Kaggle, say the Titanic dataset (for classification) or House Prices (for regression).
- Start by exploring the data (using pandas and matplotlib/seaborn for visualization); a small exploration sketch follows this list.
- Preprocess the data: handle missing values, encode categories, and scale the features.
- Train a basic model (Logistic Regression for Titanic, Linear Regression for House Prices) using sklearn and evaluate it.
- Gradually improve your model by experimenting with different algorithms, hyperparameters, and feature engineering (see the tuning sketch after the example pipeline).
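
A small, hedged sketch of that exploration step, assuming the Titanic CSV is saved as 'titanic.csv' with the usual Survived, Sex, and Age columns (the same file the pipeline below uses):

# Quick exploratory look at the Titanic data (adjust filename/columns to your download)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('titanic.csv')

print(data.shape)            # rows, columns
print(data.isnull().sum())   # missing values per column
print(data.describe())       # summary statistics for numeric columns

# Survival rate by sex, then the age distribution
sns.barplot(data=data, x='Sex', y='Survived')
plt.show()

sns.histplot(data=data, x='Age', bins=30)
plt.show()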

Example Pipeline in Python (Titanic Dataset):

# 1. Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. Load Data
data = pd.read_csv('titanic.csv')

# 3. Data Preprocessing
# Fill missing values (assignment avoids pandas' deprecated chained inplace fill)
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# Encode categorical columns as one-hot indicator columns
data = pd.get_dummies(data, columns=['Sex', 'Embarked'])

# Select features and target
X = data[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male',
          'Embarked_C', 'Embarked_Q', 'Embarked_S']]
y = data['Survived']

# 4. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6. Train Model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# 7. Evaluate Model
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
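
As an optional follow-up for the "gradually improve" step, here is a hedged tuning sketch that reuses X_train_scaled, y_train, and the test split from the pipeline above; the parameter grid is only an illustration.

# 8. (Optional) Hyperparameter tuning sketch - reuses the variables from the pipeline above
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],   # inverse regularization strength
    'penalty': ['l2'],
}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)

print('Best params:', grid.best_params_)
print('Best CV accuracy:', round(grid.best_score_, 3))
print('Test accuracy:', round(grid.score(X_test_scaled, y_test), 3))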
