W03 - AI Data Handling
DATA HANDLING IN AI
By: SEK SOCHEAT
Lecturer Artificial Intelligence
2023 – 2024
Mobile: 017 879 967 MSIT – AEU
Email: [email protected]
TABLE OF CONTENTS
Data Handling
• Data Handling
2. Data Exploration
3. Data Cleaning
4. Data Manipulation
5. Data Transformation
6. Exporting Data
• Practical Example
INTRODUCTION TO PYTHON DATA HANDLING
INTRODUCTION TO PYTHON FOR DATA HANDLING
Introduction
SETTING UP A VIRTUAL ENVIRONMENT (VENV) IN PYTHON
Step-by-Step Instructions
Setting up a virtual environment (venv) in Python helps you create an isolated environment for your
projects, allowing you to manage dependencies and avoid conflicts with other projects.
Here are the instructions to set up a virtual environment with Python:
python --version
5. Verify Activation
- After activation, your command prompt should change to indicate that the virtual environment is active. It
will typically show the environment name in parentheses.
(myAIvenv) $
6. Install Packages
- With the virtual environment activated, you can now install packages using pip. For example, to install
requests:
pip --version
pip install requests
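# List the packages installed in the active environment
pip list
# Leave the virtual environment when you are done
deactivate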
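The same setup can also be driven from Python through the standard-library venv module. The sketch below is only an illustration, not a replacement for the command-line steps: the helper names create_env and in_virtualenv are hypothetical, and the environment is created in a temporary folder so nothing is left behind.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_env(parent_dir: str, name: str = "myAIvenv") -> Path:
    """Create a virtual environment called `name` inside `parent_dir`."""
    env_dir = Path(parent_dir) / name
    # with_pip=False skips bootstrapping pip to keep the sketch quick;
    # pass with_pip=True for an environment you intend to install packages into
    venv.create(env_dir, with_pip=False)
    return env_dir


# Build the environment in a throwaway folder and confirm it exists
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    created = (env_dir / "pyvenv.cfg").exists()
    print("environment created:", created)

print("running inside a venv:", in_virtualenv())
```

Every virtual environment contains a pyvenv.cfg file, which is why the sketch checks for it. Comparing sys.prefix with sys.base_prefix is one common way to tell whether the current interpreter is running inside an activated environment.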
Summary
What is Anaconda?
5. Jupyter Notebooks:
- Integrated support for Jupyter Notebooks, an open-source web
application that allows you to create and share documents containing
live code, equations, visualizations, and narrative text.
Following:
1. Download:
- Visit the Anaconda website and download the installer for your operating
system.
2. Run the Installer:
- Follow the instructions in the installer to complete the installation process.
3. Verify Installation:
- Open a terminal (or Anaconda Prompt on Windows) and type:
conda --version
- You should see the version of conda that is installed.
SETTING UP ANACONDA
Using Conda
Following:
1. Creating a Virtual Environment
# Create a new environment named myAIvenv with Python 3.12
conda create --name myAIvenv python=3.12
2. Activating a Virtual Environment
# Activate the environment named myAIvenv
conda activate myAIvenv
3. Deactivating a Virtual Environment
# Deactivate the current environment
conda deactivate
4. Installing Packages
# Install a package, for example, numpy
conda install numpy
5. Listing Installed Packages
# List all packages installed in the current environment
conda list
6. Removing a Package
# Remove a package, for example, numpy
conda remove numpy
Benefits:
• Simplified Package Management: Conda makes it easy to install, update, and manage packages and
dependencies.
• Isolated Environments: Virtual environments help prevent conflicts between different projects.
• Comprehensive Toolset: Includes many tools and libraries commonly used in data science and machine
learning.
• Cross-Platform Support: Works on Windows, macOS, and Linux.
Summary:
• Anaconda is a powerful distribution that simplifies the management of Python and R packages, making
it easier to develop, test, and deploy data science and machine learning projects.
• With tools like conda for package management and Jupyter Notebook for interactive computing,
Anaconda provides a comprehensive environment for scientific computing.
Summary: Creating a virtual environment from the Anaconda Prompt or PowerShell
# 2. Read data from Excel file
df = pd.read_excel('file.xlsx')
# 4. Read data from JSON file
df = pd.read_json('file.json')
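The read functions above can be tried without any files on disk. The following is a minimal sketch that assumes made-up sample data held in memory; the column names and values are illustrative only.

```python
import io

import pandas as pd

# Made-up sample data, used only so the sketch runs without files on disk
csv_text = "id,name,price\n1,apple,0.5\n2,banana,0.25\n"
json_text = '[{"id": 1, "name": "apple"}, {"id": 2, "name": "banana"}]'

# read_csv and read_json accept file paths or file-like objects
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)
print(list(df_json.columns))
```

With real data, replace the io.StringIO(...) arguments with file paths such as 'file.csv' or 'file.json'.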
DATA HANDLING
2. Data Exploration
# Example:
df.info()
# Example:
df.isnull().sum()
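A small sketch of these exploration calls on a made-up DataFrame (the columns and values are assumptions for illustration). It also calls df.describe(), which is not on the slide but is a standard pandas method for summary statistics.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in "rooms" and one in "city"
df = pd.DataFrame({
    "rooms": [3, 4, np.nan, 5],
    "city": ["A", "B", "B", None],
    "price": [100.0, 150.0, 120.0, 200.0],
})

# Column names, non-null counts, and dtypes
df.info()

# Number of missing values per column
missing = df.isnull().sum()
print(missing)

# Summary statistics for the numeric columns
print(df.describe())
```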
3. Data Cleaning
df.drop_duplicates(inplace=True)
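A minimal cleaning sketch on made-up data. Unlike the slide's inplace=True form, it returns new DataFrames so the raw data stays available for comparison; that is a style choice, not a requirement.

```python
import numpy as np
import pandas as pd

# Made-up data: one duplicated row (id 2) and one missing score (id 3)
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "score": [0.9, 0.7, 0.7, np.nan, 0.4],
})

# Drop rows that contain missing values
no_missing = raw.dropna()

# Drop exact duplicate rows and renumber the index
cleaned = no_missing.drop_duplicates().reset_index(drop=True)

print(len(raw), "rows before cleaning,", len(cleaned), "rows after")
```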
4. Data Manipulation
6. Exporting Data
Saving Data to Various Formats
• Excel file
# Example:
df.to_json('cleaned_data.json')
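The export calls can be checked with a quick round trip. This sketch assumes a made-up two-row DataFrame and writes to in-memory strings instead of files: when no path is given, to_csv and to_json return the text they would have written.

```python
import io
import json

import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"id": [1, 2], "name": ["apple", "banana"]})

# With no path argument, the export methods return a string
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Read the CSV text back to confirm nothing was lost
restored = pd.read_csv(io.StringIO(csv_text))
records = json.loads(json_text)

print(restored)
print(records)
```

Passing index=False keeps the row index out of the CSV, which avoids an extra unnamed column when the file is read back.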
IMPORTING DATA AND DATA EXPLORATION
Practical Example
End-to-End Data Handling
• Real-world Dataset: Importing data, cleaning, manipulating, and exporting.
# Example:
# Importing data and read data from csv file
import pandas as pd
df = pd.read_csv('dataset.csv')
# Cleaning data
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
# Manipulating data
df['new_column'] = df['id'].apply(lambda x: x * 2)
sorted_df = df.sort_values(by='new_column', ascending=False)
# Exporting data
sorted_df.to_csv('cleaned_and_sorted_data.csv', index=False)
Solution:
df.drop_duplicates(inplace=True) output_file_path
EXERCISE: PREDICTING HOUSING PRICES
Objective:
Build a machine learning model to predict housing prices based on various features such
as the number of rooms, location, size, etc.
Steps:
1. Data Collection:
• Use the Boston Housing Dataset, which contains data about housing prices in Boston.
Note: scikit-learn removed its load_boston loader in version 1.2, so with current
releases the data must be loaded from another source.
2. Data Preprocessing:
• Handle missing values if any.
• Encode categorical variables.
• Scale numerical features.
3. Data Exploration:
• Perform exploratory data analysis (EDA) to understand the data.
• Visualize the relationships between features and the target variable.
4. Model Building:
• Split the data into training and testing sets.
• Train a linear regression model.
• Evaluate the model performance using appropriate metrics.
5. Model Evaluation:
• Interpret the results.
• Suggest improvements.
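Two of the preprocessing tasks in step 2, handling missing values and encoding categorical variables, can be sketched as follows. The DataFrame, its "rooms" and "location" columns, and the choice of median imputation are assumptions for illustration, not part of the exercise dataset.

```python
import numpy as np
import pandas as pd

# Made-up housing data with one missing value and one categorical column
homes = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "location": ["north", "south", "north", "east"],
    "PRICE": [20.0, 18.5, 31.0, 24.0],
})

# Handle missing values: fill the gap with the column median
homes["rooms"] = homes["rooms"].fillna(homes["rooms"].median())

# Encode the categorical column as one indicator column per category
encoded = pd.get_dummies(homes, columns=["location"], dtype=int)

print(encoded)
```

Median imputation is only one option; dropping the affected rows with dropna() is another, and which is appropriate depends on how much data is missing.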
Questions:
3. What is the correlation between the features and the target variable?
Answer:
PYTHON: PREDICTING HOUSING PRICES
Explain:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score
• pandas: Library for data manipulation and analysis. It provides data structures like DataFrame.
• matplotlib.pyplot: Library for creating static, animated, and interactive visualizations
in Python.
• seaborn: Data visualization library based on matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics.
• sklearn.metrics.mean_squared_error: Function to calculate mean squared error
for regression models.
X = data.drop('PRICE', axis=1)
y = data['PRICE']
• data.drop('PRICE', axis=1): Drop the target variable 'PRICE' from
the DataFrame, resulting in the features DataFrame X.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
• StandardScaler(): Instantiate the StandardScaler object.
• scaler.fit_transform(X): Fit to the data and then transform it to scale the
features to have zero mean and unit variance. The result is X_scaled.
model = LinearRegression()
model.fit(X_train, y_train)
• LinearRegression(): Instantiate the LinearRegression model.
• model.fit(X_train, y_train): Train the linear regression model using
the training data.
y_pred = model.predict(X_test)
• model.predict(X_test): Use the trained model to make predictions on
the testing data.
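The scaling, training, and prediction calls above can be combined into one runnable sketch. It assumes synthetic data generated with NumPy rather than the housing dataset, and the 80/20 split and random_state values are arbitrary choices. One deliberate difference from the slide code: the scaler is fitted on the training split only and then applied to the test split, so that test data does not influence the scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features and a target that depends on them linearly
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model and predict on the held-out rows
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R^2: {r2:.3f}")
```

Because the synthetic target is almost exactly linear in the features, the fit here is far better than it would be on real housing data.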
2. R-Squared Score (R²)
• R-Squared (R²) Score: This is a statistical measure that represents the proportion of the variance for the
dependent variable (target) that's explained by the independent variables (features) in the model.
r2 = r2_score(y_test, y_pred)
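Both evaluation metrics can be checked by hand on a few numbers. The sketch below uses made-up values and hypothetical helper names (mse_by_hand, r2_by_hand); it follows the usual definitions, MSE as the mean of the squared errors and R² as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of target variance the model explains."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)
print(f"MSE: {mse:.3f}")  # MSE: 0.125
print(f"R^2: {r2:.3f}")   # R^2: 0.975
```

Here the squared errors sum to 0.5 over four points, giving an MSE of 0.125, and the total sum of squares around the mean of 6 is 20, giving R² = 1 - 0.5/20 = 0.975.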
3. Correlation Matrix
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
Interpretation:
• Positive correlation indicates that as one feature increases, the target variable also
tends to increase.
• Negative correlation indicates that as one feature increases, the target variable
tends to decrease.
• Strong correlations (close to -1 or 1) suggest a strong relationship, while weak
correlations (close to 0) suggest a weak relationship.
• For instance, 'RM' (average number of rooms per dwelling) might show a strong
positive correlation with 'PRICE', indicating that houses with more rooms tend to be
more expensive.
• 'LSTAT' (percentage of lower status of the population) might show a strong negative
correlation with 'PRICE', indicating that neighborhoods with higher percentages of
lower-status residents tend to have lower house prices.
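The correlations behind the heatmap can also be read as plain numbers. This sketch uses five made-up rows that reuse the column names 'RM', 'LSTAT', and 'PRICE'; the values are invented to show one positive and one negative relationship and are not taken from the real dataset.

```python
import pandas as pd

# Made-up values: PRICE rises with RM and falls with LSTAT
data = pd.DataFrame({
    "RM": [4, 5, 6, 7, 8],
    "LSTAT": [30, 22, 15, 9, 4],
    "PRICE": [15, 20, 25, 33, 45],
})

# Correlation of every feature with the target, strongest positive first
corr_with_price = (
    data.corr()["PRICE"].drop("PRICE").sort_values(ascending=False)
)
print(corr_with_price)
```

Sorting the 'PRICE' column of the correlation matrix is a quick way to rank features before looking at the full heatmap.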
4. Pair Plots
plt.show()
• Pair Plots: These scatter plots show the relationship between selected features ('RM', 'LSTAT',
'PTRATIO') and the target variable 'PRICE'.
Interpretation:
• 'RM' vs. 'PRICE': A positive slope indicates that as the number of rooms increases, the house
price also tends to increase.
• 'LSTAT' vs. 'PRICE': A negative slope indicates that as the percentage of lower status of the
population increases, the house price tends to decrease.
• 'PTRATIO' (pupil-teacher ratio by town) vs. 'PRICE': The relationship might show that
higher pupil-teacher ratios are associated with lower house prices.
Model Performance Summary
• Mean Squared Error (MSE): Lower is better. An MSE of about 20.252 corresponds to a root mean
squared error of about 4.5 in the units of 'PRICE', which suggests that the model's predictions
are reasonably close to the actual values.
• R-Squared Score (R²): An R² value of around 0.734 means that the model explains about 73.4% of the
variance in the house prices. It captures much of the pattern in the data, but roughly a quarter
of the variance remains unexplained.
• Correlation Insights:
• Positive correlation between 'RM' and 'PRICE' indicates that houses with more rooms are
generally more expensive.
• Negative correlation between 'LSTAT' and 'PRICE' suggests that houses in neighborhoods with a
higher percentage of lower-status individuals are generally less expensive.
• The pair plots visually confirm these relationships, providing further insights into how these
features influence house prices.
Limitations and Improvements
1. Limitations:
• The linear regression model assumes a linear relationship between the features and the target variable,
which may not always be true.
• The model might be sensitive to outliers, which can significantly affect the predictions.
• The model may not capture complex patterns in the data due to its simplicity.
2. Improvements:
• Experiment with more complex models like Decision Trees, Random Forests, or Gradient Boosting
Machines (GBMs) to capture non-linear relationships.
• Perform feature engineering to create new features that might better capture the underlying patterns in the
data.
• Tune hyperparameters using GridSearchCV or RandomizedSearchCV to optimize model performance.
• Use cross-validation to obtain a more reliable estimate of the model's performance and ensure it
generalizes well to unseen data.
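Two of these improvements, trying a more complex model and using cross-validation, can be sketched together. The example assumes synthetic non-linear data from scikit-learn's make_friedman1 generator rather than the housing dataset, and the settings (300 samples, 50 trees, 5 folds) are arbitrary choices to keep it quick.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data with a non-linear relationship
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validation: each model is scored on five held-out folds
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing the mean and spread of the fold scores gives a steadier picture than a single train/test split. Which model scores higher depends on the data, so the comparison should be rerun on the actual dataset.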
Thank You!
If you have any questions, please contact me!