Data Preprocessing
- Cleaning the data: handle missing values (e.g., using imputation or dropping rows), outliers, and duplicates.
- Feature scaling: standardization or normalization (especially important for models like KNN, SVM, and neural networks).
- Encoding categorical variables: convert categorical data to numerical format using techniques like one-hot encoding or label encoding.
- Feature engineering: create new features or select the most relevant ones to improve model performance.
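The preprocessing steps above can be sketched with pandas alone; the toy DataFrame and its column names here are made up purely for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset with missing values and a categorical column
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 28.0],
    "fare": [7.25, 71.83, 8.05, np.nan],
    "port": ["S", "C", None, "S"],
})

# 1. Cleaning: impute numeric columns with the mean, categoricals with the mode
df["age"] = df["age"].fillna(df["age"].mean())
df["fare"] = df["fare"].fillna(df["fare"].mean())
df["port"] = df["port"].fillna(df["port"].mode()[0])
df = df.drop_duplicates()

# 2. Feature scaling: standardization (zero mean, unit variance)
for col in ["age", "fare"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

# 3. Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["port"])
```

In practice you would use scikit-learn's StandardScaler and OneHotEncoder so the same transformation fitted on training data can be reapplied to test data.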
Install VS Code extensions such as Python, Jupyter, and Pylance for better functionality.
Make sure to set up a virtual environment to manage dependencies. You can use venv
or conda for this.
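With venv, the setup looks like this (the package list is just an example of what these notes use):

```shell
# Create an isolated environment in the project folder
python3 -m venv .venv

# Activate it (macOS/Linux; on Windows use .venv\Scripts\activate)
source .venv/bin/activate

# Install dependencies inside the environment, e.g.:
pip install pandas numpy scikit-learn jupyter
```

With conda, the equivalent is `conda create -n myenv python` followed by `conda activate myenv`.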
Use Jupyter notebooks within VS Code for interactive data exploration and testing
out models.
Kaggle is a goldmine for learning. You can explore competitions, kernels (notebooks),
and datasets for practice.
Download the datasets and load them into your Python environment. After
preprocessing the data, you can experiment with different models (e.g., Decision Trees,
Random Forest, XGBoost, or even neural networks if you’re feeling adventurous).
# 1. Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. Load Data
data = pd.read_csv('titanic.csv')

# 3. Data Preprocessing
# Impute missing values (assignment avoids pandas' chained inplace warning)
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
# Encode a categorical column and pick a few example features
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
X = data[['Pclass', 'Sex', 'Age', 'Fare']]
y = data['Survived']

# 4. Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6. Train Model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# 7. Evaluate Model
y_pred = model.predict(X_test_scaled)
print('Accuracy:', accuracy_score(y_test, y_pred))
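As suggested above, once the data is preprocessed you can swap in other models and compare them. A minimal sketch using synthetic data as a stand-in for a preprocessed dataset (the sample sizes and model settings are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a cleaned, encoded, scaled dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, clf in models.items():
    # 5-fold cross-validated accuracy for each model
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Cross-validation gives a more stable comparison than a single train/test split, which matters when deciding between models on small datasets like Titanic.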