Chapter 2. Data Analysis and Processing - Full
plt.title('Correlation Heatmap')
plt.show()
DATA VISUALIZATION
• Data visualization is a powerful tool to understand and communicate
insights from data.
• Data visualization helps you see patterns, trends, and relationships
within your data that might be difficult to identify just by looking at
raw numbers. It's a great way to:
• Summarize large datasets
• Find correlations between variables
• Communicate complex ideas to others
• Python Libraries for Data Visualization: Matplotlib, Seaborn, …
DATA VISUALIZATION
• Install and load libraries.
• Pandas.
• SciPy and NumPy.
• Scikit-learn.
INSTALL AND LOAD LIBRARIES
• Install libraries in Python
py -m pip install [package_name]
Ex:
py -m pip install numpy
py -m pip install pandas
py -m pip install matplotlib
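After installing, it can help to confirm that Python can actually find each package. The following is a minimal sketch, not part of the original slides; the helper name `check_packages` is hypothetical and only uses the standard library.

```python
import importlib.util


def check_packages(names):
    """Return {package_name: True/False} depending on whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The three packages installed in the example above
status = check_packages(["numpy", "pandas", "matplotlib"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

A package reported as missing can be installed with the `py -m pip install` command shown above.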
INSTALL AND LOAD LIBRARIES
• Install libraries in PyCharm
Step 1. File\Settings… (Ctrl+Alt+S)
Step 2. Project: pythonProject…\Python Interpreter
Step 3. Install (+) (Alt+S)
DATA VISUALIZATION WITH PANDAS
• Pandas, while not a dedicated visualization library, offers built-in
plotting functionalities that are great for exploratory data analysis
(EDA).
DATA VISUALIZATION WITH PANDAS
• Scatter Plots: Show relationships between two numerical variables.
• Line Plots: Useful for visualizing trends over time or along a sequence.
• Histograms: Depict the distribution of a single numerical variable.
• Box Plots: Summarize the distribution of a numerical variable, highlighting
outliers.
• Bar Plots: Represent categorical data or compare values between categories.
• Area Plots: Similar to line plots, but emphasize the area between the line and the
axis.
• Pie Charts: Represent proportions of a whole using slices.
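Several of the plot types listed above can be drawn directly from a DataFrame or Series through the `kind` argument of `.plot()`. The sketch below is illustrative only; the exam-score data and its column names (`class`, `hours`, `score`) are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data: exam scores for six students in two classes
df = pd.DataFrame({
    "class": ["A", "A", "A", "B", "B", "B"],
    "hours": [1, 2, 3, 1, 2, 3],
    "score": [52, 61, 70, 58, 66, 79],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: distribution of a single numerical variable
df["score"].plot(kind="hist", bins=5, ax=axes[0], title="Histogram")

# Box plot: spread of the same variable
df["score"].plot(kind="box", ax=axes[1], title="Box Plot")

# Bar plot: compare the mean score between categories
mean_by_class = df.groupby("class")["score"].mean()
mean_by_class.plot(kind="bar", ax=axes[2], title="Bar Plot")

plt.tight_layout()
plt.show()
```

Passing `ax=` places each plot on a chosen subplot, which makes it easy to compare views of the same data side by side.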
DATA VISUALIZATION WITH PANDAS
• Ex:
import pandas as pd
import matplotlib.pyplot as plt
# Sample DataFrame (both columns numeric, since a scatter plot relates two numerical variables)
data = {'col1': [1, 2, 3, 4], 'col2': [10, 20, 15, 30]}
df = pd.DataFrame(data)
# Create a scatter plot
df.plot(kind='scatter', x='col1', y='col2')
plt.show()
DATA VISUALIZATION WITH NUMPY
• While NumPy is a fundamental library for scientific computing in
Python, data visualization isn't its primary focus.
• However, it can be a powerful foundation for creating custom
visualizations.
DATA VISUALIZATION WITH NUMPY
• Ex:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 5, 100) # Create x-axis data (100 points from 0 to 5)
y = x**2 # Create y-axis data (square the x values)
plt.plot(x, y) # Plot the line
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot (NumPy Data)')
plt.show()
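As a further illustration of NumPy serving as the foundation for a custom visualization, the sketch below lets `np.histogram` do the binning and then draws the bars with Matplotlib. The random sample and the choice of 20 bins are assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 1000 draws from a standard normal distribution (seeded for repeatability)
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# NumPy computes the counts per bin and the bin edges
counts, edges = np.histogram(samples, bins=20)
centers = (edges[:-1] + edges[1:]) / 2   # midpoint of each bin
widths = np.diff(edges)                  # width of each bin

# Matplotlib only draws the precomputed values
plt.bar(centers, counts, width=widths, edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Count")
plt.title("Histogram Built from np.histogram")
plt.show()
```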
DATA VISUALIZATION WITH SCIPY
• While SciPy is a powerful library for scientific computing in Python,
it's not primarily designed for data visualization.
• SciPy's Role in Data Visualization Workflow:
• Data Preparation: SciPy functions can help you clean, transform, and analyze
your data before visualization. For instance, you can use SciPy for outlier
detection, filtering, and data smoothing.
• Statistical Analysis: SciPy provides functions for statistical calculations like
finding correlations or fitting distributions. These results can be incorporated
into your visualizations to add context and insights.
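The two roles above can be sketched with `scipy.stats`: z-scores for a simple outlier check, then a Pearson correlation on the cleaned data. The measurements and the cutoff of 2 standard deviations are assumptions chosen for this small example, not a general recommendation.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements: study hours and exam scores (the last score looks suspicious)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 73, 79, 30], dtype=float)

# Data preparation: flag points more than 2 standard deviations from the mean
z = stats.zscore(scores)
outliers = np.abs(z) > 2
print("Outlier mask:", outliers)

# Statistical analysis: Pearson correlation on the remaining points
r, p_value = stats.pearsonr(hours[~outliers], scores[~outliers])
print(f"Correlation: {r:.3f} (p-value: {p_value:.4f})")
```

The correlation coefficient could then be added to a scatter plot title or annotation to give the chart more context.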
DATA VISUALIZATION WITH SCIPY
• Ex:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
# Generate noisy data
x = np.arange(100)
y = np.sin(x) + np.random.randn(100)
# Smooth the data using Savitzky-Golay filter
y_smooth = savgol_filter(y, 51, 3)
# Plot the original and smoothed data
plt.plot(x, y, label='Original Data')
plt.plot(x, y_smooth, label='Smoothed Data')
plt.legend()
plt.show()
DATA VISUALIZATION WITH MATPLOTLIB
• Line Charts: Ideal for showcasing trends or changes over time (e.g.,
temperature fluctuations, stock prices).
• Bar Charts: Effective for comparing categories or quantities (e.g.,
sales figures across regions, customer satisfaction ratings).
• Scatter Plots: Used to explore relationships between two variables
(e.g., correlation between weight and height, relationship between
study hours and exam scores).
• Pie Charts: Useful for representing proportions of a whole (e.g.,
budget allocation, market share distribution).
DATA VISUALIZATION WITH MATPLOTLIB
• Ex: Data visualization with Matplotlib that creates a line chart.
import matplotlib.pyplot as plt
# Sample temperature data for a week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temperatures = [18, 21, 24, 22, 20, 19, 17]
# Create the line chart
plt.plot(days, temperatures)
# Add labels and title
plt.xlabel('Days')
plt.ylabel('Temperature (°C)')
plt.title('Weekly Temperature Variation')
# Display the chart
plt.show()
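The other chart types listed earlier (bar, scatter, pie) follow the same pattern. The sketch below draws all three side by side; the sales figures and study-hour data are hypothetical values chosen for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration
regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]
study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [50, 55, 63, 70, 74, 82]

fig, (ax_bar, ax_scatter, ax_pie) = plt.subplots(1, 3, figsize=(14, 4))

# Bar chart: compare quantities across categories
ax_bar.bar(regions, sales)
ax_bar.set_title("Sales by Region")
ax_bar.set_ylabel("Sales")

# Scatter plot: relationship between two variables
ax_scatter.scatter(study_hours, exam_scores)
ax_scatter.set_title("Study Hours vs Exam Score")
ax_scatter.set_xlabel("Study Hours")
ax_scatter.set_ylabel("Exam Score")

# Pie chart: proportions of a whole
ax_pie.pie(sales, labels=regions, autopct="%1.1f%%")
ax_pie.set_title("Share of Total Sales")

plt.tight_layout()
plt.show()
```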
DATA ANALYSIS AND VISUALIZATION
• Data Loading: Reading data from CSV files.
• Data Cleaning: Checking for missing values and data types.
• Data Analysis: Grouping, sorting, and summarizing data.
• Data Visualization: Creating charts to explore trends.
DATA ANALYSIS AND VISUALIZATION
• Ex:
import pandas as pd
# Read the data from the CSV file
df = pd.read_csv("website_traffic.csv")
print(df.head())
print(df.info())
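The data cleaning step (checking for missing values and data types) can be sketched as follows. Because the contents of website_traffic.csv are not shown, this example reads a small in-memory stand-in; its rows are hypothetical, and only the `source` and `visits` column names are taken from the examples in this chapter.

```python
import io
import pandas as pd

# Hypothetical stand-in for website_traffic.csv (one 'visits' value is missing)
csv_text = """source,visits
google,120
direct,
social,45
google,80
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column
missing = df.isna().sum()
print(missing)

# Check data types: the missing entry forces 'visits' to float instead of int
print(df.dtypes)

# One simple option: drop incomplete rows, then restore an integer type
clean = df.dropna(subset=["visits"]).astype({"visits": "int64"})
print(clean)
```

Dropping rows is only one possible choice; filling the gap with a default or an estimate may suit other datasets better.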
DATA ANALYSIS AND VISUALIZATION
• Ex:
# Group data by source and sort by visits in descending order
source_visits = df.groupby('source')['visits'].sum().sort_values(ascending=False)
# Get the source with the most visits
top_source = source_visits.index[0]
# Get the number of visits from that source
top_visits = source_visits.iloc[0]
print(f"Top traffic source: {top_source} with {top_visits} visits")
DATA ANALYSIS AND VISUALIZATION
• Ex:
import matplotlib.pyplot as plt
# Plot a bar graph of source vs visits
source_visits.plot(kind='bar')
plt.xlabel("Traffic Source")
plt.ylabel("Visits")
plt.title("Website Traffic by Source")
plt.show()
SCIKIT-LEARN
• Free and open-source library for machine learning in Python
• User-friendly interface for various machine learning tasks
• Wide range of algorithms for classification, regression, clustering, and
more
SCIKIT-LEARN
• Classification: Categorizes data points (e.g., spam filtering)
• Regression: Predicts continuous values (e.g., stock price prediction)
• Clustering: Groups similar data points (e.g., customer segmentation)
• Dimensionality Reduction: Reduces features while preserving
information (e.g., image compression)
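As an illustration of the classification task, the sketch below trains a k-nearest-neighbors classifier on the built-in Iris dataset. The choice of model, the 25% test split, and `n_neighbors=5` are assumptions made for the example; many other scikit-learn classifiers follow the same `fit` / `predict` pattern.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features (X) and species labels (y)
X, y = load_iris(return_X_y=True)

# Hold out part of the data to evaluate the model on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train the classifier and predict the species of the held-out samples
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```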
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
#Import libraries:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from seaborn import pairplot  # pairplot is a Seaborn function, not part of sklearn.metrics
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
#Load data:
iris = load_iris()
X = iris.data # Features
y = iris.target # Target labels (species)
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
#Perform K-Means clustering:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
#Visualize clusters:
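import pandas as pd
# Put the features and cluster labels in a DataFrame so pairplot can color by cluster
df = pd.DataFrame(X, columns=iris.feature_names)
df['cluster'] = kmeans.labels_
pairplot(df, hue='cluster')
plt.suptitle("Iris Dataset - KMeans Clustering", y=1.02)
plt.show()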
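Dimensionality reduction offers another way to view the same clusters: projecting the four Iris features onto two principal components gives a single scatter plot. This is a minimal, self-contained sketch; using PCA here and setting `n_init=10` are assumptions made for the example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Cluster in the original 4-dimensional feature space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Reduce to 2 dimensions purely for plotting
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Iris Dataset - KMeans Clusters (PCA Projection)")
plt.show()
```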
SUMMARY