Week 3

The document discusses the importance of Python libraries Pandas and Matplotlib for machine learning applications, highlighting their roles in data manipulation, visualization, and model evaluation. Pandas is essential for data cleaning, transformation, and exploration, while Matplotlib is used for creating various visualizations to understand data and evaluate model performance. The document provides examples of using both libraries in a complete machine learning workflow.

Week 3: Study of Python Libraries for ML Applications such as Pandas and Matplotlib

• Python is a popular language for machine learning (ML)
applications, and several libraries are frequently used for data
manipulation, visualization, and machine learning tasks. Among
the most widely used libraries for ML applications are Pandas
and Matplotlib.
• These libraries are essential for data preprocessing, exploration,
and visualization.

1. Pandas: Data Manipulation and Analysis

Pandas is a powerful library for data manipulation and analysis. It
provides two key data structures:

• Series: 1-dimensional labeled array.
• DataFrame: 2-dimensional labeled data structure, similar to a
table (rows and columns).
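
As a minimal sketch of these two structures, the snippet below builds a Series and a DataFrame by hand. The names (scores, students) and all values are made up for illustration and are not part of any dataset used in these notes.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels (illustrative data)
scores = pd.Series([82, 91, 77], index=["alice", "bob", "carol"], name="score")

# A DataFrame: a table whose columns are Series that share one index
students = pd.DataFrame(
    {"score": [82, 91, 77], "hours_studied": [5.0, 8.5, 4.0]},
    index=["alice", "bob", "carol"],
)

print(scores["bob"])              # look up a single value by its label
print(students.shape)             # (number of rows, number of columns)
print(students["score"].mean())   # selecting one column returns a Series
```

Selecting a single column of a DataFrame gives back a Series, which is why most Series methods (mean(), map(), and so on) can be applied column by column.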

Key Uses of Pandas in ML:

• Data cleaning and preprocessing: Handling missing data,
filtering, and transforming data.
• Data transformation: Applying functions to columns and rows.
• Data exploration: Summarizing and analyzing datasets.
• Merging and joining: Combining multiple datasets.


Example: Pandas for Data Loading, Cleaning, and Preprocessing

import pandas as pd

# Load dataset
df = pd.read_csv('data.csv')  # Read a CSV file into a DataFrame

# Show the first few rows of the dataset
print(df.head())

# Check for missing data
print(df.isnull().sum())  # Count the missing values in each column

# Drop rows with missing values
df_clean = df.dropna()

# Fill missing values in numeric columns with the column mean
# (numeric_only=True skips text columns, which have no mean)
df_filled = df.fillna(df.mean(numeric_only=True))

# Filter data based on a condition (e.g., values > 50 in a column)
filtered_data = df[df['column_name'] > 50]

# Group data by a category and average the numeric columns
grouped_data = df.groupby('category_column').mean(numeric_only=True)

# Basic statistics
print(df.describe())  # Summary statistics like mean, std, min, max, etc.

Key Features in Pandas for ML:

• Handling missing data: fillna(), dropna().
• Aggregation: groupby(), pivot_table().
• Merging and joining: merge(), concat().
• Data transformation: apply(), map(), applymap() (deprecated
since pandas 2.1 in favor of DataFrame.map()).
• Data visualization: Integrated plotting through DataFrame.plot(),
which uses Matplotlib.
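
The aggregation, merging, and transformation functions listed above can be sketched on two small tables. The tables, the column names, and the 10% tax rate are hypothetical and chosen only to keep the results easy to check by hand.

```python
import pandas as pd

# Illustrative tables; all names and values are hypothetical
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "region": ["north", "south", "north", "south"],
    "amount": [20.0, 35.0, 15.0, 50.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "wholesale", "retail"],
})

# merge(): SQL-style join on a shared key column
merged = orders.merge(customers, on="customer_id", how="left")

# groupby(): total amount per region
totals = merged.groupby("region")["amount"].sum()

# pivot_table(): regions as rows, segments as columns, summed amounts as cells
pivot = merged.pivot_table(index="region", columns="segment",
                           values="amount", aggfunc="sum")

# map(): element-wise lookup on a Series
merged["is_retail"] = merged["segment"].map({"retail": 1, "wholesale": 0})

# apply() with axis=1: call a function once per row (assumed 10% tax rate)
merged["amount_with_tax"] = merged.apply(lambda row: row["amount"] * 1.1, axis=1)

# concat(): stack tables that share the same columns
stacked = pd.concat([orders, orders], ignore_index=True)

print(totals)
print(pivot)
```

A combination with no matching rows (here, north and wholesale) shows up as NaN in the pivot table, which is one common way missing values enter a dataset before cleaning.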

2. Matplotlib: Data Visualization

Matplotlib is a popular plotting library for creating static, animated,
and interactive visualizations in Python. It is frequently used for
visualizing data and results in machine learning.

Key Uses of Matplotlib in ML:

• Visualizing datasets: Creating various charts such as bar charts,
line plots, scatter plots, histograms, etc.
• Evaluating model performance: Plotting results such as
confusion matrices, ROC curves, or loss and accuracy over
epochs during model training.
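
Loss and accuracy curves are not shown elsewhere in these notes, so here is a rough sketch of how they might be recorded and plotted. It assumes a tiny logistic regression trained with plain gradient descent in NumPy on synthetic data; the data, the learning rate, and the number of epochs are arbitrary choices for illustration, not a recommended training setup.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic two-class data: two Gaussian blobs (illustrative only)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(2)          # weights
b = 0.0                  # bias
learning_rate = 0.1      # assumed value
eps = 1e-12              # avoids log(0)
losses, accuracies = [], []

for epoch in range(50):
    # Forward pass: predicted probability of class 1
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Record binary cross-entropy loss and accuracy for this epoch
    losses.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
    accuracies.append(np.mean((p >= 0.5) == y))
    # Gradient descent step on the mean loss
    w -= learning_rate * (X.T @ (p - y)) / len(y)
    b -= learning_rate * np.mean(p - y)

# Plot both metrics against the epoch number
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(losses, color="tab:red")
ax_loss.set_title("Training Loss")
ax_loss.set_xlabel("Epoch")
ax_loss.set_ylabel("Binary cross-entropy")
ax_acc.plot(accuracies, color="tab:blue")
ax_acc.set_title("Training Accuracy")
ax_acc.set_xlabel("Epoch")
ax_acc.set_ylabel("Accuracy")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. In practice the same two lists would usually come from the training history of whatever ML framework is in use, and a validation curve would be plotted alongside the training curve to spot overfitting.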

Example: Basic Plotting with Matplotlib


import matplotlib.pyplot as plt
import numpy as np

# Sample data
x = np.linspace(0, 10, 100) # 100 points between 0 and 10
y = np.sin(x)

# Line plot
plt.plot(x, y, label='Sine wave', color='b')
plt.title('Sine Wave Example')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

# Scatter plot
x2 = np.random.rand(50)
y2 = np.random.rand(50)
plt.scatter(x2, y2, color='r', alpha=0.7)
plt.title('Random Scatter Plot')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# Histogram
data = np.random.randn(1000)
plt.hist(data, bins=30, color='green', edgecolor='black')
plt.title('Histogram of Random Data')
plt.show()

Key Features in Matplotlib for ML:

• Line plots: Great for showing trends or relationships.
• Scatter plots: Useful for visualizing relationships between two
variables.
• Bar charts: Ideal for categorical data comparison.
• Histograms: Useful for displaying the distribution of data.
• Subplots: Combine multiple plots in a single figure for
comparison.
• Customization: Control over colors, markers, lines, axes, and
titles.
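
Bar charts and subplots appear in the list above but not in the basic plotting example, so the following sketch places a bar chart and a histogram side by side in one figure. The class names, counts, and random feature values are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Illustrative data: samples per class and one normally distributed feature
class_names = ["cat", "dog", "bird"]
class_counts = [40, 35, 25]
feature_values = rng.normal(loc=0.0, scale=1.0, size=500)

# One figure with two axes arranged in a single row
fig, (ax_bar, ax_hist) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Left panel: bar chart for comparing categories
ax_bar.bar(class_names, class_counts, color="steelblue")
ax_bar.set_title("Samples per Class")
ax_bar.set_xlabel("Class")
ax_bar.set_ylabel("Count")

# Right panel: histogram for the distribution of a numeric feature
ax_hist.hist(feature_values, bins=30, color="green", edgecolor="black")
ax_hist.set_title("Feature Distribution")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")

fig.tight_layout()  # keep titles and labels from overlapping
```

Call plt.show() to display the figure. Working with the Axes objects returned by plt.subplots(), rather than the plt.* functions, makes it explicit which panel each call draws on.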

Example: Visualizing Model Performance (e.g., ROC Curve)

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_classes=2, random_state=42)

# Hold out 20% of the data for testing, then train a logistic regression model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

# Get predicted probabilities for the positive class
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Calculate the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve; the dashed diagonal is a random classifier
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

Combining Pandas and Matplotlib in Machine Learning Workflows

1. Data Exploration with Pandas:

o Load the dataset, clean it, and explore it to understand the
relationships between features.

2. Data Visualization with Matplotlib:

o Visualize the data using plots to better understand
distributions and relationships between features, and to
detect patterns or anomalies.

3. Model Training and Evaluation:

o Use Matplotlib to plot model performance metrics like
accuracy, loss curves, confusion matrices, and ROC
curves, and use Pandas to analyze prediction results (e.g.,
computing precision, recall, etc.).
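
For the last step, one possible way to analyze prediction results with Pandas alone is sketched below. The labels and predictions are hypothetical values for a binary task, and the names results, confusion, and errors are illustrative.

```python
import pandas as pd

# Hypothetical true labels and model predictions for a binary task
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
})

# Confusion matrix as a labeled table: rows = actual, columns = predicted
confusion = pd.crosstab(results["y_true"], results["y_pred"],
                        rownames=["actual"], colnames=["predicted"])

tp = confusion.loc[1, 1]  # true positives
fp = confusion.loc[0, 1]  # false positives
fn = confusion.loc[1, 0]  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

# Rows the model got wrong, for manual inspection
errors = results[results["y_true"] != results["y_pred"]]

print(confusion)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.75, recall=0.75
print(errors)
```

Keeping predictions in a DataFrame next to the original features makes it straightforward to filter or group the misclassified rows and look for patterns in the errors.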

Example: Complete Workflow with Pandas, Matplotlib, and an ML Model

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Load the Iris dataset from a CSV file (replace with your actual dataset file).
# The code below assumes the columns 'sepal_length', 'sepal_width', and
# 'species', with 'setosa' as one species value; adjust the names to your file.
df = pd.read_csv('Iris.csv')

# Basic data exploration
print(df.head())
print(df.describe())

# Visualize the relationship between two features,
# coloring setosa (0) differently from the other species (1)
colors = (df['species'] != 'setosa').astype(int)
plt.scatter(df['sepal_length'], df['sepal_width'], c=colors)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Sepal Length vs Sepal Width')
plt.show()

# Train a simple model (Logistic Regression)
X = df.drop(columns='species')  # Features (remaining columns must be numeric)
y = df['species']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Confusion Matrix (rows and columns ordered as in model.classes_)
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print("Confusion Matrix:\n", cm)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Visualizing Confusion Matrix as a heatmap (requires seaborn)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

Conclusion

• Pandas is essential for data manipulation, cleaning, and
preprocessing. It allows you to efficiently handle large datasets
and perform operations like filtering, grouping, and merging
data.
• Matplotlib is crucial for data visualization, making it easy to
plot data distributions, trends, and performance metrics, which
is important for both exploratory data analysis and model
evaluation.
