
W03 - AI Data Handling


INTRODUCTION TO

DATA HANDLING IN AI
By: SEK SOCHEAT
Lecturer Artificial Intelligence
2023 – 2024
Mobile: 017 879 967 MSIT – AEU
Email: [email protected]
TABLE OF CONTENTS
Data Handling

• Introduction to Python for Data Handling

• Setting up a virtual environment (venv) in Python

• Data Handling

1. Reading Data from Various Sources

2. Data Exploration

3. Data Cleaning

4. Data Manipulation

5. Data Transformation

6. Exporting Data

• Practical Example
INTRODUCTION TO PYTHON DATA HANDLING
INTRODUCTION TO PYTHON FOR DATA HANDLING
Introduction

Overview of Data Handling


Importance in AI and ML:
• Data is the foundation of AI and ML models.
• Quality data leads to better models and insights.
Key Concepts:
• Data Cleaning: Removing or fixing incorrect, corrupted, or incomplete data.
• Data Manipulation: Changing data to make it suitable for analysis.
• Data Transformation: Converting data into a different format or structure.

SETTING UP A VIRTUAL ENVIRONMENT (VENV) IN PYTHON
Setting Up the Environment

• Python Installation
  • Anaconda Distribution:
    • Easiest way to get Python and packages.
    • Includes Jupyter Notebook, useful for interactive coding.
  • Jupyter Notebook:
    • Browser-based interface for running Python code.
    • Ideal for data analysis and visualization.

• Essential Libraries
  • Pandas:
    • Primary tool for data handling.
    • Provides data structures like DataFrame.
  • NumPy:
    • Supports large, multi-dimensional arrays and matrices.
    • Provides mathematical functions to operate on arrays.


Step-by-Step Instructions

Setting up a virtual environment (venv) in Python helps you create an isolated environment for your
projects, allowing you to manage dependencies and avoid conflicts with other projects.
Here are the instructions to set up a virtual environment with Python:

1. Ensure Python is Installed


- Make sure Python is installed on your system. You can check the installation by running:

python --version

2. Install venv Module


- The ‘venv’ module is part of the standard library in Python 3.3 and later, so any currently supported
Python 3 release already includes it. On some Linux distributions (for example Debian and Ubuntu) it is
packaged separately as python3-venv.


3. Create a Virtual Environment
- Navigate to your project directory and create a virtual environment. Replace myAIvenv with
your desired environment name.

python -m venv myAIvenv

4. Activate the Virtual Environment


- On Windows: myAIvenv\Scripts\activate
- On macOS and Linux: source myAIvenv/bin/activate

5. Verify Activation
- After activation, your command prompt should change to indicate that the virtual environment is active. It
will typically show the environment name in parentheses.

(myAIvenv) $

6. Install Packages
- With the virtual environment activated, you can now install packages using pip. For example, to confirm
that pip comes from the environment and then install requests:

pip --version
pip install requests

7. List Installed Packages


- To see a list of installed packages in the virtual environment:

pip list

8. Deactivate the Virtual Environment


- When you're done working in the virtual environment, you can deactivate it:

deactivate

Summary

• Create: python -m venv myAIvenv


• Activate: source myAIvenv/bin/activate (macOS/Linux) or
myAIvenv\Scripts\activate (Windows)
• Install Packages: pip install <package>
• Deactivate: deactivate

Using virtual environments is a best practice for Python development, as it
keeps dependencies isolated and manageable.
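
The same create-and-verify steps can also be driven from Python. The following is a minimal sketch, assuming only the standard-library venv module; the helper names in_virtual_env and create_env and the temporary directory are illustrative and not part of the slides, and pip is deliberately left out of the new environment to keep the example fast.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_env() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def create_env(target: Path) -> Path:
    """Create a virtual environment at `target` (hypothetical helper).

    Returns the path of the pyvenv.cfg file that marks the environment.
    """
    # with_pip=False skips bootstrapping pip; the command-line equivalent
    # is `python -m venv --without-pip <target>`.
    venv.EnvBuilder(with_pip=False).create(target)
    return target / "pyvenv.cfg"


# Build a throwaway environment in a temporary directory and check it exists.
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "myAIvenv")
    cfg_exists = cfg.exists()

print("Running inside a virtual environment:", in_virtual_env())
print("pyvenv.cfg created:", cfg_exists)
```

Activation itself is a shell feature, so it still has to be done with the activate script shown in the steps.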
SETTING UP ANACONDA

What is Anaconda?

• Anaconda is a popular open-source distribution of the Python and R
programming languages for scientific computing, data science, machine
learning, and analytics.
• It simplifies package management and deployment, and it includes many
popular data science and machine learning libraries.
Key Features of Anaconda
1. Comprehensive Distribution:
- Anaconda comes with a few hundred packages pre-installed and gives access to
thousands more Python and R packages through conda, making it a powerful
toolkit for data science and machine learning.

2. Conda Package Manager:


- Anaconda includes conda, a versatile package management system that allows you
to install, update, and remove packages and manage virtual environments.
3. Virtual Environments:
- Create isolated environments to manage dependencies and avoid conflicts between
projects.
4. Cross-Platform:
- Anaconda is available for Windows, macOS, and Linux.

5. Jupyter Notebooks:
- Integrated support for Jupyter Notebooks, an open-source web
application that allows you to create and share documents containing
live code, equations, visualizations, and narrative text.

6. Data Science Tools:


- Includes tools like JupyterLab, Spyder, and RStudio, which are essential
for data science and analysis.
Installing Anaconda

Follow these steps:
1. Download:
- Visit the Anaconda website and download the installer for your operating
system.
2. Run the Installer:
- Follow the instructions in the installer to complete the installation process.
3. Verify Installation:
- Open a terminal (or Anaconda Prompt on Windows) and type:
conda --version
- You should see the version of conda that is installed.

Using Conda

Common commands:
1. Creating a Virtual Environment
# Create a new environment named myAIvenv with Python 3.12
conda create --name myAIvenv python=3.12
2. Activating a Virtual Environment
# Activate the environment named myAIvenv
conda activate myAIvenv
3. Deactivating a Virtual Environment
# Deactivate the current environment
conda deactivate
4. Installing Packages
# Install a package, for example, numpy
conda install numpy
5. Listing Installed Packages
# List all packages installed in the current environment
conda list
6. Removing a Package
# Remove a package, for example, numpy
conda remove numpy

Benefits and Summary of Using Anaconda

Benefits:
• Simplified Package Management: Conda makes it easy to install, update, and manage packages and
dependencies.
• Isolated Environments: Virtual environments help prevent conflicts between different projects.
• Comprehensive Toolset: Includes many tools and libraries commonly used in data science and machine
learning.
• Cross-Platform Support: Works on Windows, macOS, and Linux.

Summary:
• Anaconda is a powerful distribution that simplifies the management of Python and R packages, making
it easier to develop, test, and deploy data science and machine learning projects.
• With tools like conda for package management and Jupyter Notebook for interactive computing,
Anaconda provides a comprehensive environment for scientific computing.
Summary: Creating a Virtual Environment from the Anaconda Prompt or PowerShell

1. Create a conda environment
conda create --name myenv
2. Activate the conda environment
conda activate myenv
3. Install packages
conda install numpy pandas
4. Deactivate the conda environment
conda deactivate

Using venv within Anaconda is possible and straightforward, but using conda is often more
convenient and powerful when working within the Anaconda ecosystem.
DATA HANDLING IN AI AND ML
DATA HANDLING
1. Reading Data from Various Sources

# 1. Read data from a CSV file
import pandas as pd
df = pd.read_csv('file.csv')

# 2. Read data from an Excel file (needs an Excel engine such as openpyxl)
df = pd.read_excel('file.xlsx')

# 3. Read data from a SQLite3 database
import sqlite3
conn = sqlite3.connect('database.db')
# Replace table_name with your table; the bare word "table" is reserved in SQL
df = pd.read_sql_query('SELECT * FROM table_name', conn)
conn.close()

# 4. Read data from a JSON file
df = pd.read_json('file.json')
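
To try these readers without any existing files, the sketch below first writes a tiny sample table to a temporary directory and then reads it back. The sample data, the file names, and the table name people are assumptions made up for illustration; Excel is left out because it needs an extra engine such as openpyxl.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only to make the example self-contained.
sample = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # CSV: write the sample, then read it back.
    sample.to_csv(tmp_path / "file.csv", index=False)
    df_csv = pd.read_csv(tmp_path / "file.csv")

    # JSON: same round trip.
    sample.to_json(tmp_path / "file.json")
    df_json = pd.read_json(tmp_path / "file.json")

    # SQLite: store the sample as a table named "people", then query it.
    conn = sqlite3.connect(tmp_path / "database.db")
    try:
        sample.to_sql("people", conn, index=False)
        df_sql = pd.read_sql_query("SELECT * FROM people", conn)
    finally:
        conn.close()  # always release the database file

print(df_csv.shape, df_json.shape, df_sql.shape)
```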

2. Data Exploration

1. Basic Data Inspection
• Viewing Data:
  • head( ): Displays the first few rows (five by default).
  • tail( ): Displays the last few rows (five by default).
# Example:
df.head()
df.tail()

2. Data Summary
• info( ): Overview of the DataFrame (column names, data types, non-null counts).
• describe( ): Summary statistics for the numeric columns.
# Example:
df.info()
df.describe()

3. Handling Missing Data
• Checking for missing values.
# Example:
df.isnull().sum()
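
A small runnable illustration of these inspection calls follows. The DataFrame contents are invented for the example; only the method calls come from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 41, 29, 32],
        "city": ["Phnom Penh", "Siem Reap", "Battambang", None, "Phnom Penh", "Siem Reap"],
    }
)

first_rows = df.head(3)      # first 3 rows instead of the default 5
last_rows = df.tail(2)       # last 2 rows
df.info()                    # prints dtypes and non-null counts per column
summary = df.describe()      # statistics for numeric columns only
missing = df.isnull().sum()  # number of missing values per column

print(summary)
print(missing)
```

Note that describe() skips the text column here, and its count row reports non-missing values only.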
3. Data Cleaning

1. Identifying and Handling Missing Values
• Removing Missing Values:
  • Dropping rows/columns with missing values.
# Example:
df.dropna(inplace=True)

2. Imputing Missing Values
• Filling missing values with a specific value or method.
# Example (forward fill; fillna(method='ffill') is deprecated in recent pandas versions):
df.ffill(inplace=True)

3. Handling Duplicate Data
• Removing Duplicates
# Example:
df.drop_duplicates(inplace=True)
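
The effect of each cleaning call is easier to see on a tiny table. This sketch uses invented data and returns new DataFrames instead of using inplace=True, so the original stays available for comparison; that choice is a stylistic assumption, not something the slides prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: three missing scores and a repeated id.
raw = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 4],
        "score": [10.0, np.nan, np.nan, 30.0, np.nan],
    }
)

dropped = raw.dropna()               # rows with any missing value removed
filled = raw.ffill()                 # each gap takes the previous valid value
deduped = filled.drop_duplicates()   # rows that are exact copies removed

print(len(raw), len(dropped), len(deduped))
```

Order matters: the two rows with id 2 only become exact duplicates after the forward fill gives them the same score.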
4. Data Manipulation

1. Filtering and Selecting Data
• Using ‘loc’ and ‘iloc’:
  • loc: Label-based indexing.
  • iloc: Position-based indexing.
# Example:
df.loc[df['column'] > value]
df.iloc[0:5, 1:3]
• Conditional Selection
# Example:
df[df['column'] == 'value']

2. Sorting and Grouping Data
• Sorting: Sorting data by one or more columns.
# Example:
df.sort_values(by='column', ascending=False)
• Grouping: Grouping data and calculating aggregate statistics.
# Example (numeric_only=True skips text columns, which recent pandas versions no longer drop silently):
df.groupby('column').mean(numeric_only=True)
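
The selection, sorting, and grouping calls can be checked on a small table. The column names team, points, and assists are made up for this sketch.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "team": ["A", "B", "A", "B", "C"],
        "points": [10, 25, 30, 5, 15],
        "assists": [3, 7, 4, 1, 6],
    }
)

high = df.loc[df["points"] > 12]        # label-based, with a boolean mask
first_two = df.iloc[0:2, 1:3]           # rows 0-1, columns at positions 1-2
team_a = df[df["team"] == "A"]          # conditional selection
ranked = df.sort_values(by="points", ascending=False)
team_means = df.groupby("team").mean(numeric_only=True)

print(ranked)
print(team_means)
```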
5. Data Transformation

1. Applying Functions
• Using apply()
  • Applying a function along an axis of the DataFrame, or to each value of a single column.
# Example:
df['column'] = df['column'].apply(lambda x: x + 1)

• Lambda Functions: Small anonymous functions defined with lambda.
# Example:
df['new_column'] = df['column'].apply(lambda x: x * 2)
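
As a hedged illustration of the two forms of apply(), the sketch below applies a lambda to one column, shows the equivalent vectorized arithmetic, and then applies a function row by row with axis=1. The column names other than 'column' are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 3]})

# Element-wise apply on a single column (a Series).
df["doubled_apply"] = df["column"].apply(lambda x: x * 2)

# Equivalent vectorized arithmetic; usually faster on large columns.
df["doubled_vectorized"] = df["column"] * 2

# DataFrame.apply with axis=1 passes each row to the function.
df["row_total"] = df[["column", "doubled_apply"]].apply(lambda row: row.sum(), axis=1)

print(df)
```

For simple arithmetic the vectorized form is generally preferable; apply() is more useful when the per-value logic cannot be expressed as column operations.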

6. Exporting Data

Saving Data to Various Formats

• CSV file
# Example:
df.to_csv('cleaned_data.csv', index=False)

• Excel file
# Example:
df.to_excel('cleaned_data.xlsx', index=False)

• JSON file
# Example:
df.to_json('cleaned_data.json')
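
The following sketch writes a small DataFrame to CSV and JSON in a temporary directory and reads both back, which also shows what index=False changes in the CSV header. The data and the orient="records" choice are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": [0.5, 1.5]})

# With no path, to_csv returns the CSV text, which makes the header easy to inspect.
header_with_index = df.to_csv(index=True).splitlines()[0]
header_without_index = df.to_csv(index=False).splitlines()[0]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "cleaned_data.csv"
    json_path = Path(tmp) / "cleaned_data.json"

    df.to_csv(csv_path, index=False)
    df.to_json(json_path, orient="records")  # one JSON object per row

    back_from_csv = pd.read_csv(csv_path)
    back_from_json = pd.read_json(json_path)

print(repr(header_with_index), repr(header_without_index))
```

Leaving the index in adds an unnamed first column to the file, which then comes back as an extra column on the next read.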
IMPORTING DATA AND DATA EXPLORATION
Practical Example

End-to-End Data Handling
• Real-world Dataset:
  • Importing data, cleaning, manipulating, and exporting.

# Example:
# Importing data: read it from a CSV file
import pandas as pd
df = pd.read_csv('dataset.csv')
# Cleaning data
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
# Manipulating data
df['new_column'] = df['id'].apply(lambda x: x * 2)
sorted_df = df.sort_values(by='new_column', ascending=False)

# Exporting data
sorted_df.to_csv('cleaned_and_sorted_data.csv', index=False)

Solution:

import pandas as pd

# Importing data
file_path = 'dataset.csv'
df = pd.read_csv(file_path)

# Cleaning data
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

# Manipulating data
# Using the 'id' column for manipulation
df['new_column'] = df['id'].apply(lambda x: x * 2)
sorted_df = df.sort_values(by='new_column', ascending=False)

# Exporting data
output_file_path = 'cleaned_and_sorted_data.csv'
sorted_df.to_csv(output_file_path, index=False)
output_file_path
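
Because dataset.csv is not included with the slides, the same steps can be tried on an in-memory stand-in. In this sketch the function name clean_and_sort and the CSV text are hypothetical; the steps mirror the solution, except that the doubling uses vectorized arithmetic and the result is returned instead of written to disk.

```python
import io

import pandas as pd


def clean_and_sort(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing rows and duplicates, add new_column, sort descending."""
    cleaned = df.dropna().drop_duplicates().copy()
    cleaned["new_column"] = cleaned["id"] * 2
    return cleaned.sort_values(by="new_column", ascending=False)


# Hypothetical stand-in for dataset.csv: one duplicate row and one missing id.
csv_text = "id,name\n1,a\n2,b\n2,b\n,c\n3,d\n"
result = clean_and_sort(pd.read_csv(io.StringIO(csv_text)))

print(result)
```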
EXERCISE: PREDICTING HOUSING PRICES
Objective:
Build a machine learning model to predict housing prices based on various features such
as the number of rooms, location, size, etc.

Steps:
1. Data Collection:
• Use the Boston Housing Dataset, which contains data about housing prices in Boston.
It was originally distributed with scikit-learn, but the load_boston loader was removed in
scikit-learn 1.2, so the walkthrough reads the data from a boston_housing.csv file.
2. Data Preprocessing:
• Handle missing values if any.
• Encode categorical variables.
• Scale numerical features.
3. Data Exploration:
• Perform exploratory data analysis (EDA) to understand the data.
• Visualize the relationships between features and the target variable.
4. Model Building:
• Split the data into training and testing sets.
• Train a linear regression model.
• Evaluate the model performance using appropriate metrics.
5. Model Evaluation:
• Interpret the results.
• Suggest improvements.
EXERCISE: PREDICTING HOUSING PRICES
Questions:

1. What is the shape of the dataset?


2. Are there any missing values in the dataset?
3. What is the correlation between the features and the target
variable?
4. How well does the linear regression model perform? What are
its limitations?
5. What improvements can be made to enhance the model's
performance?
Answers:
1. What is the shape of the dataset?

Answer: The dataset contains 506 rows and 14 columns.

2. Are there any missing values in the dataset?

Answer: There are no missing values in the dataset.

3. What is the correlation between the features and the target variable?

Answer:

• Some key correlations with the target variable (PRICE) include:

• RM (number of rooms): Strong positive correlation

• LSTAT (percentage of lower status of the population): Strong negative correlation

• PTRATIO (pupil-teacher ratio by town): Moderate negative correlation
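
The first three answers come from three short pandas calls. The sketch below runs them on synthetic stand-in data, since boston_housing.csv is not bundled here; the generated values are NOT the Boston figures, and only the column names RM, LSTAT, PTRATIO, and PRICE are taken from the exercise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (assumption): PRICE rises with RM and falls with
# LSTAT and PTRATIO, plus random noise.
rm = rng.normal(6.3, 0.7, n)
lstat = rng.uniform(2, 35, n)
ptratio = rng.uniform(13, 22, n)
price = 5 * rm - 0.6 * lstat - 0.5 * ptratio + rng.normal(0, 2, n)
data = pd.DataFrame({"RM": rm, "LSTAT": lstat, "PTRATIO": ptratio, "PRICE": price})

shape = data.shape                                   # Question 1
total_missing = int(data.isnull().sum().sum())       # Question 2
target_corr = (                                      # Question 3
    data.corr()["PRICE"].drop("PRICE").sort_values()
)

print(shape, total_missing)
print(target_corr)
```

Sorting the PRICE column of the correlation matrix puts the strongest negative relationships first and the strongest positive ones last.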


4. How well does the linear regression model perform? What are its limitations?
Answer:
The linear regression model's performance:
• Mean Squared Error (MSE): A measure of the average squared difference between the observed
actual outcomes and the predicted outcomes.
• R^2 Score: A statistical measure that represents the proportion of the variance for the dependent
variable that's explained by the independent variables in the model.
Limitations:
• Assumes a linear relationship between features and the target.
• Sensitive to outliers.
• May not capture complex patterns in the data.
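
Both metrics can be computed by hand, which makes the definitions concrete: MSE is the mean of the squared residuals, and R² is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses a handful of illustrative numbers (not model output) and checks the manual results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual and predicted values, chosen only for the example.
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 20.6, 30.7, 33.4, 38.2])

residuals = y_true - y_pred

# MSE: average of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

The single residual of 4 contributes 16 of the 22 total squared error, which illustrates why the metric, and the model fitted by minimizing it, is sensitive to outliers.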

5. What improvements can be made to enhance the model's performance?


Answer:
• Try different models (e.g., Decision Tree, Random Forest, Gradient
Boosting).
• Perform feature engineering (e.g., interaction terms, polynomial features).
• Tune hyperparameters using GridSearchCV or RandomizedSearchCV.
• Use cross-validation for better evaluation.
• Address multicollinearity and scale features appropriately.
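
Two of these suggestions, trying a different model and using cross-validation, can be combined in a few lines. This is a sketch on synthetic data with a deliberately non-linear target, so the gap between the models is an artifact of how the data was generated, not a claim about the housing dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 3))
# Synthetic target (assumption) with a squared term that a plain linear
# model cannot represent.
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 folds for each model.
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in cv_r2.items():
    print(f"{name}: mean cross-validated R^2 = {score:.3f}")
```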
Explain:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

• pandas: Library for data manipulation and analysis. It provides data structures like DataFrame.
• numpy: Library for numerical operations in Python.
• matplotlib.pyplot: Library for creating static, animated, and interactive visualizations in Python.
• seaborn: Data visualization library based on matplotlib. It provides a high-level interface for
drawing attractive and informative statistical graphics.
• sklearn.model_selection.train_test_split: Function to split the dataset into training and testing sets.
• sklearn.preprocessing.StandardScaler: Class to standardize features by removing the mean and
scaling to unit variance.
• sklearn.linear_model.LinearRegression: Class to perform linear regression.
• sklearn.metrics.mean_squared_error: Function to calculate mean squared error for regression models.
• sklearn.metrics.r2_score: Function to calculate the R-squared score, which indicates the proportion
of the variance in the dependent variable that is predictable from the independent variables.

data = pd.read_csv('boston_housing.csv')

• Load the dataset from the boston_housing.csv file into a pandas DataFrame.

if data.isnull().sum().any():
    data = data.fillna(data.median())

• data.isnull().sum(): Check for missing values in each column.
• .any(): Check if there are any missing values in the DataFrame.
• data.fillna(data.median()): If there are missing values, fill them with the median of each column.

X = data.drop('PRICE', axis=1)
y = data['PRICE']

• data.drop('PRICE', axis=1): Drop the target variable 'PRICE' from the DataFrame, resulting in the
features DataFrame X.
• data['PRICE']: Extract the target variable 'PRICE' into the Series y.


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

• StandardScaler(): Instantiate the StandardScaler object.
• scaler.fit_transform(X): Fit to the data and then transform it to scale the features to have zero
mean and unit variance. The result is X_scaled.

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

• train_test_split(X_scaled, y, test_size=0.2, random_state=42): Split the scaled features X_scaled
and target y into training and testing sets. 80% of the data is used for training, and 20% for
testing. random_state=42 ensures reproducibility.

plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()

• plt.figure(figsize=(12, 8)): Set the figure size for the plot.
• sns.heatmap(data.corr(), annot=True, cmap='coolwarm'): Create a heatmap of the correlation
matrix of the dataset. annot=True shows the correlation values on the heatmap.
cmap='coolwarm' sets the color map.
• plt.show(): Display the plot.
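
One caveat about the scaling step: fitting StandardScaler on all rows before splitting lets test-set statistics influence the training features. A common alternative, sketched here on synthetic data as an assumption and not as the slides' method, is to split first and put the scaler and the model in one scikit-learn pipeline, so the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic features and target (assumption), three columns on different scales.
X = rng.normal(loc=[6.0, 12.0, 18.0], scale=[0.7, 7.0, 2.0], size=(200, 3))
y = 5 * X[:, 0] - 0.6 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 2, 200)

# Split the raw features first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() scales using training statistics only; score() reuses them on the test set.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# The fitted scaler's means match the training data, not the full data.
scaler = model.named_steps["standardscaler"]
print(scaler.mean_, X_train.mean(axis=0))
print(f"Test R^2: {test_r2:.3f}")
```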



sns.pairplot(data, x_vars=['RM', 'LSTAT', 'PTRATIO'], y_vars='PRICE', height=7, aspect=0.7)
plt.show()

• sns.pairplot(data, x_vars=['RM', 'LSTAT', 'PTRATIO'], y_vars='PRICE', height=7, aspect=0.7):
Create pair plots to visualize the relationships between the specified features ('RM', 'LSTAT',
'PTRATIO') and the target variable 'PRICE'. height=7 sets the height of the plots, and aspect=0.7
sets the aspect ratio.
• plt.show(): Display the plots.

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

• LinearRegression(): Instantiate the LinearRegression model.
• model.fit(X_train, y_train): Train the linear regression model using the training data.
• model.predict(X_test): Use the trained model to make predictions on the testing data.

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

• mean_squared_error(y_test, y_pred): Calculate the mean squared error between the actual target
values (y_test) and the predicted values (y_pred).
• r2_score(y_test, y_pred): Calculate the R-squared score, which indicates the proportion of the
variance in the dependent variable that is predictable from the independent variables.
• print(f'Mean Squared Error: {mse}'): Print the mean squared error.
• print(f'R^2 Score: {r2}'): Print the R-squared score.
1. Mean Squared Error (MSE)

• Mean Squared Error (MSE): This is a measure of the average squared difference between
the observed actual outcomes and the outcomes predicted by the model.

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Example Output:

Mean Squared Error: 20.252

2. R-Squared Score (R²)

• R-Squared (R²) Score: This is a statistical measure that represents the proportion of the variance for the
dependent variable (target) that's explained by the independent variables (features) in the model.

r2 = r2_score(y_test, y_pred)

print(f'R^2 Score: {r2}')

Output:

R^2 Score: 0.734

3. Correlation Matrix

• Correlation Matrix: This visual representation shows the correlation coefficients
between pairs of features in the dataset, including the target variable 'PRICE'.

plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()

Interpretation:
• Positive correlation indicates that as one feature increases, the target variable also
tends to increase.
• Negative correlation indicates that as one feature increases, the target variable
tends to decrease.
• Strong correlations (close to -1 or 1) suggest a strong linear relationship, while weak
correlations (close to 0) suggest a weak one.
• For instance, 'RM' (average number of rooms per dwelling) might show a strong
positive correlation with 'PRICE', indicating that houses with more rooms tend to be
more expensive.
• 'LSTAT' (percentage of lower status of the population) might show a strong negative
correlation with 'PRICE', indicating that neighborhoods with higher percentages of
lower-status residents tend to have lower house prices.
4. Pair Plots

sns.pairplot(data, x_vars=['RM', 'LSTAT', 'PTRATIO'], y_vars='PRICE', height=7, aspect=0.7)

plt.show()
• Pair Plots: These scatter plots show the relationship between selected features ('RM', 'LSTAT',
'PTRATIO') and the target variable 'PRICE'.
Interpretation:
• 'RM' vs. 'PRICE': A positive slope indicates that as the number of rooms increases, the house
price also tends to increase.
• 'LSTAT' vs. 'PRICE': A negative slope indicates that as the percentage of lower status of the
population increases, the house price tends to decrease.
• 'PTRATIO' (pupil-teacher ratio by town) vs. 'PRICE': The relationship might show that
higher pupil-teacher ratios are associated with lower house prices.

Model Performance Summary

• Mean Squared Error (MSE): Lower is better. An MSE of 20.252 corresponds to a root mean squared
error of about 4.5, so predictions miss the actual value by roughly 4.5 units of PRICE on average,
which indicates decent but not precise performance.

• R-Squared Score (R²): An R² value of around 0.734 means that the model explains about 73.4% of the
variance in house prices on the test set. This is a reasonable fit, but roughly a quarter of the
variance remains unexplained.

• Correlation Insights:
• Positive correlation between 'RM' and 'PRICE' indicates that houses with more rooms are
generally more expensive.
• Negative correlation between 'LSTAT' and 'PRICE' suggests that houses in neighborhoods with a
higher percentage of lower-status individuals are generally less expensive.
• The pair plots visually confirm these relationships, providing further insights into how these
features influence house prices.
Limitations and Improvements
1. Limitations:
• The linear regression model assumes a linear relationship between the features and the target variable,
which may not always be true.
• The model might be sensitive to outliers, which can significantly affect the predictions.
• The model may not capture complex patterns in the data due to its simplicity.

2. Improvements:
• Experiment with more complex models like Decision Trees, Random Forests, or Gradient Boosting
Machines (GBMs) to capture non-linear relationships.
• Perform feature engineering to create new features that might better capture the underlying patterns in the
data.
• Tune hyperparameters using GridSearchCV or RandomizedSearchCV to optimize model performance.
• Use cross-validation to obtain a more reliable estimate of the model's performance and ensure it
generalizes well to unseen data.
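
Hyperparameter tuning with GridSearchCV can be sketched as follows. The data is synthetic, and the model choice (GradientBoostingRegressor) and the small parameter grid are assumptions picked to keep the example quick; a real search would use the housing features and a wider grid.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(300, 3))
# Synthetic non-linear target (assumption), for illustration only.
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every combination in the grid is scored with 5-fold cross-validation
# on the training data only.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_mse = -search.best_score_  # the scorer is negated MSE, so flip the sign
test_r2 = r2_score(y_test, search.predict(X_test))

print(best_params)
print(f"Cross-validated MSE: {cv_mse:.3f}, test R^2: {test_r2:.3f}")
```

The test set is touched only once, after the search has finished, so the final score stays an honest estimate of performance on unseen data.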
Thank You!
If you have any questions, please reach me!
