W03 - AI Data Handling

INTRODUCTION TO DATA HANDLING IN AI
By: SEK SOCHEAT
Lecturer Artificial Intelligence
MSIT – AEU
2023 – 2024
Mobile: 017 879 967
Email: [email protected]
TABLE OF CONTENTS
Data Handling

• Introduction to Python for Data Handling

• Setting up a virtual environment (venv) in Python

• Data Handling

1. Reading Data from Various Sources

2. Data Exploration

3. Data Cleaning

4. Data Manipulation

5. Data Transformation

6. Exporting Data

• Practical Example
INTRODUCTION TO PYTHON FOR DATA HANDLING
Introduction

Overview of Data Handling

Importance in AI and ML:
• Data is the foundation of AI and ML models.
• Quality data leads to better models and insights.
Key Concepts:
• Data Cleaning: Removing or fixing incorrect, corrupted, or incomplete data.
• Data Manipulation: Changing data to make it suitable for analysis.
• Data Transformation: Converting data into a different format or structure.

Setting Up the Environment

• Python Installation
  • Anaconda Distribution:
    • Easiest way to get Python and packages.
    • Includes Jupyter Notebook, useful for interactive coding.
  • Jupyter Notebook:
    • Browser-based interface for running Python code.
    • Ideal for data analysis and visualization.

• Essential Libraries
  • Pandas:
    • Primary tool for data handling.
    • Provides data structures like DataFrame.
  • NumPy:
    • Supports large, multi-dimensional arrays and matrices.
    • Provides mathematical functions to operate on arrays.

SETTING UP A VIRTUAL ENVIRONMENT (VENV) IN PYTHON

Step-by-Step Instructions

Setting up a virtual environment (venv) in Python helps you create an isolated environment for your
projects, allowing you to manage dependencies and avoid conflicts with other projects.
Here are the instructions to set up a virtual environment with Python:

1. Ensure Python is Installed

- Make sure Python is installed on your system. You can check the installation by running:

python --version

2. Install venv Module

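- The 'venv' module has been part of the Python standard library since Python 3.3, so a current Python 3 installation normally already includes it. Some Linux distributions package it separately (for example, python3-venv on Debian and Ubuntu).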
3. Create a Virtual Environment
- Navigate to your project directory and create a virtual environment. Replace myAIvenv with
your desired environment name.

python -m venv myAIvenv

4. Activate the Virtual Environment

- On Windows: myAIvenv\Scripts\activate
- On macOS and Linux: source myAIvenv/bin/activate

5. Verify Activation
- After activation, your command prompt should change to indicate that the virtual environment is active. It
will typically show the environment name in parentheses.

(myAIvenv) $

6. Install Packages
- With the virtual environment activated, you can now install packages using pip. For example, to confirm
which pip is in use and then install requests:

pip --version
pip install requests

7. List Installed Packages

- To see a list of installed packages in the virtual environment:

pip list

8. Deactivate the Virtual Environment

- When you're done working in the virtual environment, you can deactivate it:

deactivate

Summary

• Create: python -m venv myAIvenv

• Activate: source myAIvenv/bin/activate (macOS/Linux) or
myAIvenv\Scripts\activate (Windows)
• Install Packages: pip install <package>
• Deactivate: deactivate

Using virtual environments is a best practice for Python development, as it keeps dependencies isolated and manageable.
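
After activating the environment, it can be useful to confirm from inside Python that the interpreter really belongs to it. The following is a minimal sketch using only the standard library; the helper name in_virtual_env is made up for this illustration.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_env())
```

Run it once before and once after activation: the interpreter path should change to a location inside myAIvenv. This check targets environments created with venv; a conda environment is a separate installation prefix and is generally not detected this way.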
SETTING UP ANACONDA

What is Anaconda?

• Anaconda is a popular open-source distribution of the Python and R programming languages for
scientific computing, data science, machine learning, and analytics.
• It simplifies package management and deployment, and it includes many
popular data science and machine learning libraries.
Key Features of Anaconda
1. Comprehensive Distribution:
- Anaconda ships with hundreds of data science packages pre-installed, and thousands more
Python and R packages can be installed from its repository with conda, making it a powerful
toolkit for data science and machine learning.

2. Conda Package Manager:

- Anaconda includes conda, a versatile package management system that allows you
to install, update, and remove packages and manage virtual environments.
3. Virtual Environments:
- Create isolated environments to manage dependencies and avoid conflicts between
projects.
4. Cross-Platform:
- Anaconda is available for Windows, macOS, and Linux.

5. Jupyter Notebooks:
- Integrated support for Jupyter Notebooks, an open-source web
application that allows you to create and share documents containing
live code, equations, visualizations, and narrative text.

6. Data Science Tools:

- Includes tools like JupyterLab, Spyder, and RStudio, which are essential
for data science and analysis.
Installing Anaconda

Follow these steps:
1. Download:
- Visit the Anaconda website and download the installer for your operating
system.
2. Run the Installer:
- Follow the instructions in the installer to complete the installation process.
3. Verify Installation:
- Open a terminal (or Anaconda Prompt on Windows) and type:
conda --version
- You should see the version of conda that is installed.
Using Conda

Follow these steps:
1. Creating a Virtual Environment
# Create a new environment named myAIvenv with Python 3.12
conda create --name myAIvenv python=3.12
2. Activating a Virtual Environment
# Activate the environment named myAIvenv
conda activate myAIvenv
3. Deactivating a Virtual Environment
# Deactivate the current environment
conda deactivate
4. Installing Packages
# Install a package, for example, numpy
conda install numpy
5. Listing Installed Packages
# List all packages installed in the current environment
conda list
6. Removing a Package
# Remove a package, for example, numpy
conda remove numpy

Benefits and Summary of Using Anaconda

Benefits:
• Simplified Package Management: Conda makes it easy to install, update, and manage packages and
dependencies.
• Isolated Environments: Virtual environments help prevent conflicts between different projects.
• Comprehensive Toolset: Includes many tools and libraries commonly used in data science and machine
learning.
• Cross-Platform Support: Works on Windows, macOS, and Linux.

Summary:
• Anaconda is a powerful distribution that simplifies the management of Python and R packages, making
it easier to develop, test, and deploy data science and machine learning projects.
• With tools like conda for package management and Jupyter Notebook for interactive computing,
Anaconda provides a comprehensive environment for scientific computing.
Summary: Creating a Virtual Environment from the Conda or PowerShell Prompt

1. Create a conda environment
conda create --name myenv
2. Activate the conda environment
conda activate myenv
3. Install packages
conda install numpy pandas
4. Deactivate the conda environment
conda deactivate

Note: Using venv within Anaconda is possible and straightforward, but using conda is often more
convenient and powerful when working within the Anaconda ecosystem.
DATA HANDLING IN AI AND ML
DATA HANDLING
1. Reading Data from Various Sources

# 1. Read data from CSV file
import pandas as pd
df = pd.read_csv('file.csv')

# 2. Read data from Excel file
df = pd.read_excel('file.xlsx')

# 3. Read data from SQLite3 Database
import sqlite3
conn = sqlite3.connect('database.db')
# Replace my_table with the real table name ("table" itself is a reserved SQL keyword)
df = pd.read_sql_query('SELECT * FROM my_table', conn)
conn.close()

# 4. Read data from JSON file
df = pd.read_json('file.json')
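
The readers above expect files that already exist. As a self-contained sketch, the example below first writes a small sample table to a temporary folder and then reads it back with three of the readers. The sample data, the table name samples, and the temporary file names are assumptions made for this illustration; Excel is left out because it needs an extra engine such as openpyxl.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only to have something to read back.
sample = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Write the same table to CSV, JSON, and an SQLite database.
    sample.to_csv(folder / "file.csv", index=False)
    sample.to_json(folder / "file.json", orient="records")
    conn = sqlite3.connect(folder / "database.db")
    sample.to_sql("samples", conn, index=False)

    # Read each source back into its own DataFrame.
    from_csv = pd.read_csv(folder / "file.csv")
    from_json = pd.read_json(folder / "file.json")
    from_sql = pd.read_sql_query("SELECT * FROM samples", conn)
    conn.close()

print(from_csv.shape, from_json.shape, from_sql.shape)
```

All three DataFrames should have the same shape and columns as the sample, which is a quick sanity check that each reader parsed its source as intended.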

2. Data Exploration

1. Basic Data Inspection

• Viewing Data:
• head( ): Displays the first few rows.
• tail( ): Displays the last few rows.

# Example:
df.head()
df.tail()

2. Data Summary

• info( ): Overview of the DataFrame.
• describe( ): Summary statistics.

# Example:
df.info()
df.describe()

3. Handling Missing Data

• Checking for missing values.

# Example:
df.isnull().sum()
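
To see what these inspection calls return, here is a small sketch on a made-up DataFrame with a few deliberately missing entries. The column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three missing entries spread over three columns.
df = pd.DataFrame(
    {
        "rooms": [3, 4, np.nan, 5],
        "price": [100.0, 150.0, 120.0, np.nan],
        "city": ["A", "B", "B", None],
    }
)

print(df.head(2))     # first two rows
print(df.tail(2))     # last two rows
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for the numeric columns

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)
```

Here each column reports exactly one missing value, which is the kind of result that feeds directly into the cleaning step.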
3. Data Cleaning

1. Identifying and Handling Missing Values
• Removing Missing Values:
• Dropping rows/columns with missing values.

# Example:
df.dropna(inplace=True)

2. Imputing Missing Values
• Filling missing values with a specific value or method.

# Example (forward fill; fillna(method='ffill') is deprecated in recent pandas versions):
df.ffill(inplace=True)

3. Handling Duplicate Data
• Removing Duplicates

# Example:
df.drop_duplicates(inplace=True)
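
The three cleaning operations behave quite differently on the same data. The sketch below applies them to a tiny made-up DataFrame (the columns id and score are illustrative assumptions) and returns new DataFrames instead of using inplace=True, so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing scores and a repeated id.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 4],
        "score": [10.0, np.nan, np.nan, 30.0, 40.0],
    }
)

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: forward fill, copying the last valid value downward.
filled = df.ffill()

# After filling, the two rows with id 2 are identical, so one is removed.
deduped = filled.drop_duplicates()

print(len(df), len(dropped), len(deduped))
```

Dropping loses two of the five rows, while forward filling keeps them all but invents values, so which option is appropriate depends on how much data can be spared and how plausible the filled values are.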
4. Data Manipulation

1. Filtering and Selecting Data

• Using ‘loc’ and ‘iloc’:
• loc: Label-based indexing.
• iloc: Position-based indexing.

# Example:
df.loc[df['column'] > value]
df.iloc[0:5, 1:3]

• Conditional Selection

# Example:
df[df['column'] == 'value']

2. Sorting and Grouping Data

• Sorting: Sorting data by one or more columns.

# Example:
df.sort_values(by='column', ascending=False)

• Grouping: Grouping data and calculating aggregate statistics.

# Example (numeric_only=True skips text columns, which cannot be averaged):
df.groupby('column').mean(numeric_only=True)
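
The selection, sorting, and grouping calls above can be tried on a small made-up table. The column names city, rooms, and price are assumptions for this sketch.

```python
import pandas as pd

# Hypothetical housing-style data.
df = pd.DataFrame(
    {
        "city": ["A", "B", "A", "B"],
        "rooms": [3, 5, 4, 2],
        "price": [100, 200, 150, 80],
    }
)

# Label-based selection with a boolean condition: rows with more than 3 rooms.
big = df.loc[df["rooms"] > 3]

# Position-based selection: first two rows, second and third columns.
block = df.iloc[0:2, 1:3]

# Sort by price, most expensive first.
by_price = df.sort_values(by="price", ascending=False)

# Average price per city.
mean_price = df.groupby("city")["price"].mean()

print(big, block, by_price, mean_price, sep="\n\n")
```

Selecting the price column before calling mean() is an alternative to numeric_only=True: it makes explicit which column is being aggregated.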
5. Data Transformation

1. Applying Functions

• Using apply()
• Applying a function along an axis of the DataFrame.

# Example:
df['column'] = df['column'].apply(lambda x: x + 1)

• Lambda Functions: Small anonymous functions defined with lambda.

# Example:
df['new_column'] = df['column'].apply(lambda x: x * 2)
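
For simple arithmetic like this, the same result can usually be obtained without apply() by operating on the whole column at once. The sketch below compares the two on a made-up column; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 3]})

# Element-by-element transformation with apply() and a lambda.
df["doubled_apply"] = df["column"].apply(lambda x: x * 2)

# Equivalent vectorized form: the multiplication is applied to the whole column.
df["doubled_vectorized"] = df["column"] * 2

print(df)
```

Both columns hold the same values. The vectorized form is typically faster on large data, while apply() remains useful when the transformation is a custom function that has no vectorized equivalent.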

6. Exporting Data

Saving Data to Various Formats

• CSV file
# Example:
df.to_csv('cleaned_data.csv', index=False)

• Excel file
# Example:
df.to_excel('cleaned_data.xlsx', index=False)

• JSON file
# Example:
df.to_json('cleaned_data.json')
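
The index=False argument in these examples controls whether the DataFrame's row index is written as an extra column. A small sketch with made-up data shows the difference; calling to_csv() without a file path returns the CSV text instead of writing a file.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": [0.5, 1.5]})

# With no path, to_csv() returns the CSV content as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Compare the header rows of the two outputs.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the default settings the header starts with an empty, unnamed index column, which then reappears as an "Unnamed: 0" column when the file is read back. Passing index=False avoids that.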
IMPORTING DATA AND DATA EXPLORATION
Practical Example

End-to-End Data Handling
• Real-world Dataset:
• Importing data, cleaning, manipulating, and exporting.

# Example:
# Importing data and read data from csv file
import pandas as pd
df = pd.read_csv('dataset.csv')

# Cleaning data
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

# Manipulating data
df['new_column'] = df['id'].apply(lambda x: x * 2)
sorted_df = df.sort_values(by='new_column', ascending=False)

# Exporting data
sorted_df.to_csv('cleaned_and_sorted_data.csv', index=False)

Solution:

import pandas as pd

# Importing data
file_path = 'dataset.csv'
df = pd.read_csv(file_path)

# Cleaning data
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

# Manipulating data
# Using the 'id' column for manipulation
df['new_column'] = df['id'].apply(lambda x: x * 2)
sorted_df = df.sort_values(by='new_column', ascending=False)

# Exporting data
output_file_path = 'cleaned_and_sorted_data.csv'
sorted_df.to_csv(output_file_path, index=False)
output_file_path
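
The solution depends on a dataset.csv file being present. To try the same cleaning and sorting steps without any file, they can be wrapped in a function and run on an in-memory DataFrame. This is a sketch only: the function name clean_and_sort and the sample rows are assumptions, and the doubling is written in vectorized form rather than with apply().

```python
import pandas as pd


def clean_and_sort(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing rows and duplicates, add new_column = id * 2, sort descending."""
    cleaned = df.dropna().drop_duplicates()
    cleaned = cleaned.assign(new_column=cleaned["id"] * 2)
    return cleaned.sort_values(by="new_column", ascending=False)


# Hypothetical input: one duplicated row and one row with a missing id.
raw = pd.DataFrame(
    {
        "id": [1, 2, 2, None, 3],
        "label": ["a", "b", "b", "c", "d"],
    }
)

result = clean_and_sort(raw)
print(result)
```

Because the function returns a new DataFrame instead of modifying its input, the raw data stays available for comparison, and the result can still be exported with to_csv() exactly as in the solution.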
EXERCISE: PREDICTING HOUSING PRICES
Objective:
Build a machine learning model to predict housing prices based on various features such
as the number of rooms, location, size, etc.

Steps:
1. Data Collection:
• Use the Boston Housing Dataset, which contains data about housing prices in Boston.
(The load_boston loader was removed from scikit-learn in version 1.2, so the walkthrough
in this deck reads the data from a boston_housing.csv file.)
2. Data Preprocessing:
• Handle missing values if any.
• Encode categorical variables.
• Scale numerical features.
3. Data Exploration:
• Perform exploratory data analysis (EDA) to understand the data.
• Visualize the relationships between features and the target variable.
4. Model Building:
• Split the data into training and testing sets.
• Train a linear regression model.
• Evaluate the model performance using appropriate metrics.
5. Model Evaluation:
• Interpret the results.
• Suggest improvements.
Questions:

1. What is the shape of the dataset?
2. Are there any missing values in the dataset?
3. What is the correlation between the features and the target variable?
4. How well does the linear regression model perform? What are its limitations?
5. What improvements can be made to enhance the model's performance?

Answers:
1. What is the shape of the dataset?

Answer: The dataset contains 506 rows and 14 columns.

2. Are there any missing values in the dataset?

Answer: There are no missing values in the dataset.

3. What is the correlation between the features and the target variable?

Answer:

• Some key correlations with the target variable (PRICE) include:

• RM (number of rooms): Strong positive correlation

• LSTAT (percentage of lower status of the population): Strong negative correlation

• PTRATIO (pupil-teacher ratio by town): Moderate negative correlation

4. How well does the linear regression model perform? What are its limitations?
Answer:
The linear regression model's performance:
• Mean Squared Error (MSE): A measure of the average squared difference between the observed
actual outcomes and the predicted outcomes.
• R^2 Score: A statistical measure that represents the proportion of the variance for the dependent
variable that's explained by the independent variables in the model.
Limitations:
• Assumes a linear relationship between features and the target.
• Sensitive to outliers.
• May not capture complex patterns in the data.

5. What improvements can be made to enhance the model's performance?

Answer:
• Try different models (e.g., Decision Tree, Random Forest, Gradient
Boosting).
• Perform feature engineering (e.g., interaction terms, polynomial features).
• Tune hyperparameters using GridSearchCV or RandomizedSearchCV.
• Use cross-validation for better evaluation.
• Address multicollinearity and scale features appropriately.
Explain:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

• pandas: Library for data manipulation and analysis. It provides data structures like DataFrame.
• numpy: Library for numerical operations in Python.
• matplotlib.pyplot: Library for creating static, animated, and interactive visualizations in Python.
• seaborn: Data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
• sklearn.model_selection.train_test_split: Function to split the dataset into training and testing sets.
• sklearn.preprocessing.StandardScaler: Class to standardize features by removing the mean and scaling to unit variance.
• sklearn.linear_model.LinearRegression: Class to perform linear regression.
• sklearn.metrics.mean_squared_error: Function to calculate mean squared error for regression models.
• sklearn.metrics.r2_score: Function to calculate the R-squared score, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.

data = pd.read_csv('boston_housing.csv')

if data.isnull().sum().any():
    data = data.fillna(data.median())

X = data.drop('PRICE', axis=1)
y = data['PRICE']

• pd.read_csv('boston_housing.csv'): Load the dataset from the boston_housing.csv file into a pandas DataFrame.
• data.isnull().sum(): Check for missing values in each column.
• .any(): Check if there are any missing values in the DataFrame.
• data.fillna(data.median()): If there are missing values, fill them with the median of each column.
• data.drop('PRICE', axis=1): Drop the target variable 'PRICE' from the DataFrame, resulting in the features DataFrame X.
• data['PRICE']: Extract the target variable 'PRICE' into the Series y.
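
The median fill works as written when every column is numeric. If a dataset also contains text columns, recent pandas versions raise an error when asked for their median, so a slightly more defensive variant restricts the median to numeric columns. The sketch below uses made-up values, and the TOWN column is a hypothetical addition for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns with gaps and one text column.
data = pd.DataFrame(
    {
        "RM": [6.0, np.nan, 7.0],
        "PRICE": [20.0, 30.0, np.nan],
        "TOWN": ["a", "b", "c"],
    }
)

# Median of each numeric column; text columns are skipped.
medians = data.median(numeric_only=True)

# fillna with a Series fills each column using the value under the matching label.
filled = data.fillna(medians)

print(medians)
print(filled)
```

The missing RM value becomes 6.5 and the missing PRICE becomes 25.0, while the text column is left untouched.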


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()

• StandardScaler(): Instantiate the StandardScaler object.
• scaler.fit_transform(X): Fit to the data and then transform it to scale the features to have zero mean and unit variance. The result is X_scaled.
• train_test_split(X_scaled, y, test_size=0.2, random_state=42): Split the scaled features X_scaled and target y into training and testing sets. 80% of the data is used for training, and 20% for testing. random_state=42 ensures reproducibility.
• plt.figure(figsize=(12, 8)): Set the figure size for the plot.
• sns.heatmap(data.corr(), annot=True, cmap='coolwarm'): Create a heatmap of the correlation matrix of the dataset. annot=True shows the correlation values on the heatmap. cmap='coolwarm' sets the color map.
• plt.show(): Display the plot.


sns.pairplot(data, x_vars=['RM', 'LSTAT', 'PTRATIO'], y_vars='PRICE', height=7, aspect=0.7)
plt.show()

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

• sns.pairplot(data, x_vars=['RM', 'LSTAT', 'PTRATIO'], y_vars='PRICE', height=7, aspect=0.7): Create pair plots to visualize the relationships between the specified features ('RM', 'LSTAT', 'PTRATIO') and the target variable 'PRICE'. height=7 sets the height of the plots, and aspect=0.7 sets the aspect ratio.
• plt.show(): Display the plots.
• LinearRegression(): Instantiate the LinearRegression model.
• model.fit(X_train, y_train): Train the linear regression model using the training data.
• model.predict(X_test): Use the trained model to make predictions on the testing data.
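
Since boston_housing.csv may not be at hand, the scale, split, fit, and predict steps can be rehearsed on synthetic data. The sketch below is an illustration under stated assumptions: the data is randomly generated with a known linear relationship, and, unlike the slides, it splits first and fits the scaler on the training portion only, a common variant that keeps test-set statistics out of the preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features and a target that is linear in them plus small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model and predict on the held-out data.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
```

Because the synthetic target really is linear, the model fits it almost perfectly. Real housing data will score lower, which is where the limitations listed in the answers come in.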

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

• mean_squared_error(y_test, y_pred): Calculate the mean squared error between the actual target values (y_test) and the predicted values (y_pred).
• r2_score(y_test, y_pred): Calculate the R-squared score, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.
• print(f'Mean Squared Error: {mse}'): Print the mean squared error.
• print(f'R^2 Score: {r2}'): Print the R-squared score.
1. Mean Squared Error (MSE)

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Example Output:
Mean Squared Error: 20.252

• Mean Squared Error (MSE): This is a measure of the average squared difference between
the observed actual outcomes and the predicted outcomes by the model.

2. R-Squared Score (R²)

• R-Squared (R²) Score: This is a statistical measure that represents the proportion of the variance for the
dependent variable (target) that's explained by the independent variables (features) in the model.

r2 = r2_score(y_test, y_pred)

print(f'R^2 Score: {r2}')

Output:

R^2 Score: 0.734
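
Both metrics can be computed by hand from their definitions, which helps make the verbal descriptions concrete. The sketch below uses four made-up actual and predicted values and compares the manual results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: the average of the squared differences.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

The squared errors are 0.25, 0, 1, and 0.25, giving an MSE of 0.375. The total variation around the mean of 6 is 20, so R² is 1 - 1.5/20 = 0.925. The library functions return the same numbers.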

3. Correlation Matrix

plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()

• Correlation Matrix: This visual representation shows the correlation coefficients
between pairs of features in the dataset, including the target variable 'PRICE'.

Interpretation:
• Positive correlation indicates that as one feature increases, the target variable also
increases.
• Negative correlation indicates that as one feature increases, the target variable
decreases.
• Strong correlations (close to -1 or 1) suggest a strong relationship, while weak
correlations (close to 0) suggest a weak relationship.
• For instance, 'RM' (average number of rooms per dwelling) might show a strong
positive correlation with 'PRICE', indicating that houses with more rooms tend to be
more expensive.
• 'LSTAT' (percentage of lower status of the population) might show a strong negative
correlation with 'PRICE', indicating that neighborhoods with higher percentages of
lower-status residents tend to have lower house prices.
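
Besides reading the heatmap, the correlations with the target can be pulled out as a sorted list. The sketch below uses five made-up rows that only imitate the direction of the relationships described above; they are not values from the Boston dataset.

```python
import pandas as pd

# Hypothetical rows: RM rises with PRICE, LSTAT falls as PRICE rises.
data = pd.DataFrame(
    {
        "RM": [4, 5, 6, 7, 8],
        "LSTAT": [30, 25, 15, 10, 5],
        "PRICE": [10, 15, 22, 30, 40],
    }
)

# Correlation of every feature with the target, strongest positive first.
target_corr = data.corr()["PRICE"].drop("PRICE").sort_values(ascending=False)
print(target_corr)
```

In this toy data RM comes out strongly positive and LSTAT strongly negative, mirroring the interpretation given for the heatmap.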
4. Pair Plots

sns.pairplot(data, x_vars=['RM', 'LSTAT', 'PTRATIO'], y_vars='PRICE', height=7, aspect=0.7)

plt.show()
• Pair Plots: These scatter plots show the relationship between selected features ('RM', 'LSTAT',
'PTRATIO') and the target variable 'PRICE'.
Interpretation:
• 'RM' vs. 'PRICE': A positive slope indicates that as the number of rooms increases, the house
price also tends to increase.
• 'LSTAT' vs. 'PRICE': A negative slope indicates that as the percentage of lower status of the
population increases, the house price tends to decrease.
• 'PTRATIO' (pupil-teacher ratio by town) vs. 'PRICE': The relationship might show that
higher pupil-teacher ratios are associated with lower house prices.

Model Performance Summary

• Mean Squared Error (MSE): Lower is better. MSE is expressed in squared units of the target, so an
MSE of 20.252 corresponds to a root mean squared error of about 4.5 in the units of PRICE. Whether
that counts as decent performance depends on the scale and spread of the prices.

• R-Squared Score (R²): An R² value of around 0.734 means the model explains about 73.4% of the
variance in house prices on the test set, leaving roughly a quarter of the variance unexplained.

• Correlation Insights:
• Positive correlation between 'RM' and 'PRICE' indicates that houses with more rooms are
generally more expensive.
• Negative correlation between 'LSTAT' and 'PRICE' suggests that houses in neighborhoods with a
higher percentage of lower-status individuals are generally less expensive.
• The pair plots visually confirm these relationships, providing further insights into how these
features influence house prices.
Limitations and Improvements
1. Limitations:
• The linear regression model assumes a linear relationship between the features and the target variable,
which may not always be true.
• The model might be sensitive to outliers, which can significantly affect the predictions.
• The model may not capture complex patterns in the data due to its simplicity.

2. Improvements:
• Experiment with more complex models like Decision Trees, Random Forests, or Gradient Boosting
Machines (GBMs) to capture non-linear relationships.
• Perform feature engineering to create new features that might better capture the underlying patterns in the
data.
• Tune hyperparameters using GridSearchCV or RandomizedSearchCV to optimize model performance.
• Use cross-validation to obtain a more reliable estimate of the model's performance and ensure it
generalizes well to unseen data.
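
Two of these suggestions, trying a non-linear model and using cross-validation, can be combined in a few lines. The sketch below is an illustration on synthetic data with a deliberately non-linear target; the data, the choice of 50 trees, and the 5 folds are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends on the square of the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=300)

# 5-fold cross-validated R^2 for a linear model and a random forest.
linear_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
forest_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0), X, y, cv=5, scoring="r2"
)

print("Linear regression mean R^2:", linear_scores.mean())
print("Random forest mean R^2:", forest_scores.mean())
```

On this data the linear model explains almost none of the variance while the random forest captures the curve, which illustrates the first limitation listed above. On real data the gap may be much smaller, so the comparison is worth running rather than assuming.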
Thank You!
If you have any questions, please reach me!

Pranic Feng Shui Technique by Elshaddai Pranic Healing Center
50% (6)
Pranic Feng Shui Technique by Elshaddai Pranic Healing Center
10 pages
SIC - C - P - Chapter 1. Programing Basic Concept and Starting Python - v1
100% (1)
SIC - C - P - Chapter 1. Programing Basic Concept and Starting Python - v1
545 pages
Natural Language Processing in Python
No ratings yet
Natural Language Processing in Python
214 pages
Learning Pandas Library
100% (2)
Learning Pandas Library
271 pages
Statistics Machine Learning Python Draft
No ratings yet
Statistics Machine Learning Python Draft
329 pages
PyTorch - Advanced Deep Learning
No ratings yet
PyTorch - Advanced Deep Learning
237 pages
StatisticsMachineLearningPythonDraft PDF
100% (1)
StatisticsMachineLearningPythonDraft PDF
313 pages
Learning The Pandas Library Python Tools For Data Munging Analysis and Visual PDF
100% (18)
Learning The Pandas Library Python Tools For Data Munging Analysis and Visual PDF
208 pages
Statistics Machine Learning Python Draft
100% (1)
Statistics Machine Learning Python Draft
333 pages
Statistics and Machine Learning in Python
No ratings yet
Statistics and Machine Learning in Python
218 pages
Statistics Machine Learning Python Draft
No ratings yet
Statistics Machine Learning Python Draft
319 pages
StatisticsMachineLearningPythonDraft PDF
100% (1)
StatisticsMachineLearningPythonDraft PDF
219 pages
Jacky Bai - Pandas Hands-On - Data Analysis Crash Course (2020)
No ratings yet
Jacky Bai - Pandas Hands-On - Data Analysis Crash Course (2020)
139 pages
Statistics and Machine Learning in Python
100% (1)
Statistics and Machine Learning in Python
166 pages
StatisticsMachineLearningPythonDraft PDF
100% (1)
StatisticsMachineLearningPythonDraft PDF
323 pages
ML LAB Record
No ratings yet
ML LAB Record
54 pages
Tuning Disk IO Operations in Oracle Database
No ratings yet
Tuning Disk IO Operations in Oracle Database
28 pages
Statistics and Machine Learning in Python
No ratings yet
Statistics and Machine Learning in Python
300 pages
Python Data Science - A Beginner's Guide To Mastering Analysis, Visualization, and Machine Learning by A. Eich Liana
No ratings yet
Python Data Science - A Beginner's Guide To Mastering Analysis, Visualization, and Machine Learning by A. Eich Liana
86 pages
SIC Python Course Material
No ratings yet
SIC Python Course Material
74 pages
Python - Follow Dr. AngShu (@drangshu) For More
100% (1)
Python - Follow Dr. AngShu (@drangshu) For More
300 pages
StatisticsMachineLearningPythonDraft PDF
100% (1)
StatisticsMachineLearningPythonDraft PDF
319 pages
AGNIVEER VAYU 01 2025 HaryanaJobs - in Air Force Agniveer 1 2025 Notification
No ratings yet
AGNIVEER VAYU 01 2025 HaryanaJobs - in Air Force Agniveer 1 2025 Notification
2 pages
MySQL InnoDB Cluster 8-0-1746623006
No ratings yet
MySQL InnoDB Cluster 8-0-1746623006
14 pages
Ejercicios de Ingles
No ratings yet
Ejercicios de Ingles
9 pages
NCP Schizophrenia
67% (3)
NCP Schizophrenia
2 pages
Complete The Conversations With The Correct Words in Parenthenses
No ratings yet
Complete The Conversations With The Correct Words in Parenthenses
3 pages
2024 CS224N Python Review Session Slides
No ratings yet
2024 CS224N Python Review Session Slides
66 pages
List of Biomaterial Fossil Papers
No ratings yet
List of Biomaterial Fossil Papers
43 pages
Instalation - Python
No ratings yet
Instalation - Python
11 pages
Stat and Machine Learning Python PDF
No ratings yet
Stat and Machine Learning Python PDF
300 pages
Ethics Chapter 3
No ratings yet
Ethics Chapter 3
17 pages
Quiz Sts
No ratings yet
Quiz Sts
54 pages
Session 9 - Body Language Basics (Decoding)
No ratings yet
Session 9 - Body Language Basics (Decoding)
16 pages
PYTHON
No ratings yet
PYTHON
43 pages
DM2324 Lab01
No ratings yet
DM2324 Lab01
66 pages
Python For AI Developers
No ratings yet
Python For AI Developers
45 pages
Setup Environment & Python Basics
No ratings yet
Setup Environment & Python Basics
62 pages
Gratisexam Com Oracle Testkings 1z0 064 v2019!01!16 by Ryan 53q
No ratings yet
Gratisexam Com Oracle Testkings 1z0 064 v2019!01!16 by Ryan 53q
43 pages
Treib, O. Et Al. (2007) Modes of Governance, Towards A Conceptual Clarification
No ratings yet
Treib, O. Et Al. (2007) Modes of Governance, Towards A Conceptual Clarification
22 pages
FODS Record
No ratings yet
FODS Record
66 pages
1 Introduction Python Programming For Data Science
No ratings yet
1 Introduction Python Programming For Data Science
11 pages
Saint Louis University Baguio City Principal'S Recommendation Form
No ratings yet
Saint Louis University Baguio City Principal'S Recommendation Form
1 page
TY FDS Workbook
No ratings yet
TY FDS Workbook
56 pages
1.1-1.4 - Introduction To Python
No ratings yet
1.1-1.4 - Introduction To Python
50 pages
01 Python Introduction
No ratings yet
01 Python Introduction
39 pages
Les 04
No ratings yet
Les 04
41 pages
Tuning The PGA
No ratings yet
Tuning The PGA
25 pages
Lecture 2
No ratings yet
Lecture 2
16 pages
Tuning The Redo Path
No ratings yet
Tuning The Redo Path
20 pages
Using Smart Flash Cache
No ratings yet
Using Smart Flash Cache
18 pages
Conda 24.5.1 Documentation - Combined
No ratings yet
Conda 24.5.1 Documentation - Combined
40 pages
Exorcising The Popular Seriously Luhmann S Concept of Semantics
No ratings yet
Exorcising The Popular Seriously Luhmann S Concept of Semantics
20 pages
Detecting CPU Bottlenecks
No ratings yet
Detecting CPU Bottlenecks
16 pages
Python Programming Development Environment Set-Up
No ratings yet
Python Programming Development Environment Set-Up
19 pages
Past Simple Questions Worksheet
No ratings yet
Past Simple Questions Worksheet
1 page
Harvard CS197 Lecture 2 Notes
No ratings yet
Harvard CS197 Lecture 2 Notes
37 pages
Introduction To Python
No ratings yet
Introduction To Python
26 pages
TP 01
No ratings yet
TP 01
22 pages
Python Intro
No ratings yet
Python Intro
13 pages
MySQL InnoDB Cluster With HAProxy For High Availability 1746622861
No ratings yet
MySQL InnoDB Cluster With HAProxy For High Availability 1746622861
14 pages
Oracle Prep4sure 1z0 064 Rapidshare 2020 Nov 30 by Harriet 83q Vce
No ratings yet
Oracle Prep4sure 1z0 064 Rapidshare 2020 Nov 30 by Harriet 83q Vce
14 pages
Laporan Praktikum MPS (Processing) - Teknisi 1200 2024
No ratings yet
Laporan Praktikum MPS (Processing) - Teknisi 1200 2024
10 pages
Wa0013.
No ratings yet
Wa0013.
12 pages


