
Chapter 2. Data Analysis and Processing - Full

DATA ANALYSIS AND PROCESSING

DR. PHẠM MINH HOÀN – [email protected]


OBJECTIVES OF CHAPTER 2
• Understanding different types of data sources and how to access and manipulate
them.
• Data analysis is all about extracting meaningful insights from your data.
• Data exploration is about getting familiar with your data and identifying patterns
and trends.
• Data visualization is about creating visual representations of your data to
communicate insights effectively.
• Most data analysis tasks involve using specialized libraries that provide functions
and tools for working with data.
CONTENTS
2.1. Introduce and work with data sources
2.2. Data Analysis
2.3. Data Exploration
2.4. Data Visualization
2.5. Working with libraries
2.5.1. Pandas
2.5.2. SciPy and Numpy
2.5.3. Matplotlib
2.5.4. Scikit-learn
INTRODUCE AND WORK WITH DATA SOURCES
• Data sources: A data source is a location or system that stores and
manages data. This data can be anything from numbers and text to
images and audio files.
• Databases: These are structured collections of data that allow for easy access
and manipulation.
• Spreadsheets: Familiar programs like Microsoft Excel that store data in tables
with rows and columns.
• Cloud-based platforms: Services like Google Drive or Dropbox that store data
online and allow access from anywhere.
INTRODUCE AND WORK WITH DATA SOURCES
• Working with data sources:
• Identify the data source: Determine what kind of data you need and where it's
stored.
• Connect to the data source: This will involve using specific tools or software
depending on the data source type.
• Extract the data: Use tools or write queries to extract the specific data you
need for your project.
INTRODUCE AND WORK WITH DATA SOURCES
• Working with data sources:
• Clean and transform the data: Real-world data often has inconsistencies or
errors. This stage involves cleaning the data and transforming it into a usable
format for analysis.
• Analyze the data: After the data is prepared, you can use various techniques to
analyze it and extract insights.
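The five steps above can be sketched end to end in a few lines. This is only an illustration: the in-memory SQLite database, the `sales` table, and its values are hypothetical stand-ins for a real data source.

```python
import sqlite3

import pandas as pd

# Identify + connect: a hypothetical in-memory SQLite database stands in
# for a real data source (assumption made for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", None), ("north ", 80.0), ("South", 50.0)],
)
conn.commit()

# Extract: pull only the columns needed with a query.
raw = pd.read_sql_query("SELECT region, amount FROM sales", conn)
conn.close()

# Clean and transform: drop rows without an amount, normalize region names.
clean = raw.dropna(subset=["amount"]).copy()
clean["region"] = clean["region"].str.strip().str.title()

# Analyze: total amount per region.
totals = clean.groupby("region")["amount"].sum()
print(totals)
```

With a real database, only the connection step and the query would change; the cleaning and analysis steps stay in pandas.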
DATA ANALYSIS
• Data analysis is the process of extracting meaningful information
from data.
• Goals of data analysis:
• Uncover patterns and trends: Data analysis helps identify relationships
between different pieces of data. This can reveal trends or patterns.
• Make informed decisions: Data-driven insights can inform better choices in
various fields, from business strategy to scientific research.
• Solve problems: Data analysis is a powerful tool for identifying and solving
problems. By examining data, you can pinpoint root causes and develop solutions.
DATA ANALYSIS
• Main types of data analysis:
• Descriptive Analysis: This is the foundation for further analysis. It provides a
summary of the data, describing its central tendencies (like average or
median) and variability.
• Diagnostic Analysis: Explains why things are happening. It identifies the factors
influencing specific outcomes or behaviors and uses data to diagnose the root
cause of a problem.
• Predictive Analysis: Uses historical data to forecast future trends or events.
The goal is to make predictions about what might happen based on patterns
observed in the data.
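The three types of analysis can be illustrated on one small table. The sketch below uses made-up monthly figures (the column names `month`, `ad_spend`, and `sales` are assumptions), and a straight-line fit stands in for a predictive model; real forecasting usually needs more care.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data, invented for illustration only.
df = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "ad_spend": [10, 12, 14, 16, 18, 20],
    "sales": [105, 125, 145, 165, 185, 205],
})

# Descriptive: summarize central tendency and variability.
mean_sales = df["sales"].mean()
median_sales = df["sales"].median()
std_sales = df["sales"].std()

# Diagnostic: check whether ad spend moves together with sales.
corr = df["ad_spend"].corr(df["sales"])

# Predictive: fit a straight line to the history and extrapolate one month.
slope, intercept = np.polyfit(df["month"], df["sales"], deg=1)
forecast_month_7 = slope * 7 + intercept

print(mean_sales, median_sales, round(std_sales, 2))
print(round(corr, 3), round(forecast_month_7, 1))
```

Note that a high correlation only suggests a relationship; diagnosing a root cause still requires domain knowledge.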
DATA ANALYSIS
• Popular Python Libraries for Data Analysis:
• NumPy: The foundation for numerical computing in Python. It offers efficient
arrays, linear algebra operations, and mathematical functions for data
manipulation.
• Pandas: Builds on top of NumPy and provides high-performance data
structures like DataFrames (think spreadsheet on steroids) for handling tabular
data. It excels in data cleaning, transformation, and analysis.
• SciPy: Offers a collection of algorithms and functions for advanced scientific
computing and data analysis tasks like optimization, integration, and
statistical modeling.
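As a rough illustration of how the three libraries divide the work, the sketch below uses a NumPy array for arithmetic, a pandas DataFrame for tabular data, and two SciPy routines (`integrate.quad` and `optimize.minimize_scalar`) for integration and optimization. The numbers and functions are arbitrary examples.

```python
import numpy as np
import pandas as pd
from scipy import integrate, optimize

# NumPy: efficient arrays and linear algebra.
values = np.array([1.0, 2.0, 3.0, 4.0])
dot = values @ values  # sum of squares

# Pandas: labeled, tabular data built on top of NumPy arrays.
frame = pd.DataFrame({"x": values, "x_squared": values ** 2})

# SciPy: numerical integration of t^2 from 0 to 3.
area, _ = integrate.quad(lambda t: t ** 2, 0, 3)

# SciPy: find the minimum of (t - 2)^2.
best = optimize.minimize_scalar(lambda t: (t - 2) ** 2)

print(dot)
print(frame)
print(round(area, 3), round(best.x, 3))
```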
EXPLORATORY DATA ANALYSIS
EDA is a phase of data analysis used to gain a better
understanding of aspects of the data such as:
• the main features of the data
• the variables and the relationships that hold between them
• which variables are important for our problem
EXPLORATORY DATA ANALYSIS
EDA process:
• Reading dataset
• Analyzing the data
• Checking for the duplicates
• Missing Values Calculation
• Exploratory Data Analysis
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 0: Install Libraries
# install libraries (run in a terminal, not inside the Python interpreter)
python -m pip install pandas numpy matplotlib seaborn
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 1: Importing Required Libraries
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wr
wr.filterwarnings('ignore')
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 2: Reading Dataset
# loading and reading dataset
df = pd.read_csv("winequality-red.csv")
print(df.head())
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 3: Analyzing the Data
# shape of the data
df.shape
#data information
df.info()
# describing the data
df.describe()
#column to list
df.columns.tolist()
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 3: Analyzing the Data
# check for missing values:
df.isnull().sum()
# check for duplicate rows
df.duplicated().sum()
# number of distinct values in each column
df.nunique()
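The difference between counting missing values, counting duplicate rows, and counting distinct values is easier to see on a tiny table. The sketch below uses a hypothetical five-row DataFrame (`df_demo`) rather than the wine dataset, so the results can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 repeats row 1, and one alcohol value is missing.
df_demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, np.nan, 10.5],
    "quality": [5, 5, 5, 6, 7],
})

missing_per_column = df_demo.isnull().sum()   # missing cells per column
n_duplicates = df_demo.duplicated().sum()     # fully repeated rows
distinct_per_column = df_demo.nunique()       # distinct non-missing values

# One possible cleanup: drop repeated rows, then rows with missing values.
cleaned = df_demo.drop_duplicates().dropna()

print(missing_per_column)
print(n_duplicates)
print(distinct_per_column)
print(cleaned)
```

Dropping rows is only one option; depending on the problem, filling missing values (for example with the column median) may be more appropriate.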
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 4: Univariate Analysis
# Assuming 'df' is your DataFrame
quality_counts = df['quality'].value_counts()
# Using Matplotlib to create a count plot
plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 4: Univariate Analysis
# Set Seaborn style
sns.set_style("darkgrid")

# Identify numerical columns
numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns

# Plot distribution of each numerical feature
plt.figure(figsize=(14, len(numerical_columns) * 3))
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 4: Univariate Analysis
for idx, feature in enumerate(numerical_columns, 1):
    plt.subplot(len(numerical_columns), 2, idx)
    sns.histplot(df[feature], kde=True)
    plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}")

# Adjust layout and show plots


plt.tight_layout()
plt.show()
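The skewness value shown in each plot title can be read as follows: a value near 0 suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail. A minimal sketch with two made-up series:

```python
import pandas as pd

# Hypothetical samples chosen to make the contrast obvious.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])

skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(round(skew_symmetric, 2))  # close to 0 for the symmetric sample
print(round(skew_right, 2))      # positive: one large value stretches the right tail
```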
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 4: Univariate Analysis
# Assuming 'df' is your DataFrame
plt.figure(figsize=(10, 8))
# Using Seaborn to create a swarm plot
sns.swarmplot(x="quality", y="alcohol", data=df, palette='viridis')
plt.title('Swarm Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 5: Bivariate Analysis
# Set the color palette
sns.set_palette("Pastel1")
# Assuming 'df' is your DataFrame
# Using Seaborn to create a pair plot with the specified color palette
# (pairplot creates its own figure, so no plt.figure() call is needed)
sns.pairplot(df)
plt.suptitle('Pair Plot for DataFrame', y=1.02)
plt.show()
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 5: Bivariate Analysis
# Assuming 'df' is your DataFrame
# Work on a copy so that 'quality' stays numeric in 'df' for later steps
plot_df = df.assign(quality=df['quality'].astype(str)) # 'quality' as category labels
plt.figure(figsize=(10, 8))
# Using Seaborn to create a violin plot
sns.violinplot(x="quality", y="alcohol", data=plot_df, palette={
    '3': 'lightcoral', '4': 'lightblue', '5': 'lightgreen', '6': 'gold', '7': 'lightskyblue',
    '8': 'lightpink'}, alpha=0.7)
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 5: Bivariate Analysis
plt.title('Violin Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 5: Bivariate Analysis
#plotting box plot between alcohol and quality
sns.boxplot(x='quality', y='alcohol', data=df)
EXPLORATORY DATA ANALYSIS
EDA process:
• Step 6: Multivariate Analysis
# Assuming 'df' is your DataFrame
plt.figure(figsize=(15, 10))

# Using Seaborn to create a heatmap


# numeric_only=True skips any non-numeric columns (requires pandas 1.5 or later)
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='Pastel2', linewidths=2)

plt.title('Correlation Heatmap')
plt.show()
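A heatmap is convenient for scanning, but the same correlation matrix can also be ranked numerically to list the strongest pairs. The sketch below uses synthetic data (columns `a`, `b`, `c` are hypothetical, with `b` constructed to depend on `a`) so that the expected outcome is known in advance.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is built from 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once (no self-correlations).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well.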
DATA VISUALIZATION
• Data visualization is a powerful tool to understand and communicate
insights from data.
• Data visualization helps you see patterns, trends, and relationships
within your data that might be difficult to identify just by looking at
raw numbers. It's a great way to:
• Summarize large datasets
• Find correlations between variables
• Communicate complex ideas to others
• Python Libraries for Data Visualization: Matplotlib, Seaborn, …
DATA VISUALIZATION
• Install and load libraries.
• Pandas.
• SciPy and NumPy.
• Scikit-learn.
INSTALL AND LOAD LIBRARIES
• Install libraries in Python
py -m pip install [package_name]
Ex:
py -m pip install numpy
py -m pip install pandas
py -m pip install matplotlib
INSTALL AND LOAD LIBRARIES
• Install libraries in PyCharm
Step 1. File\Settings… (Ctrl+Alt+S)
Step 2. Project: pythonProject…\Python Interpreter
Step 3. Install (+) (Alt+S)
DATA VISUALIZATION WITH PANDAS
• Pandas, while not a dedicated visualization library, offers built-in
plotting functionalities that are great for exploratory data analysis
(EDA).
DATA VISUALIZATION WITH PANDAS
• Scatter Plots: Show relationships between two numerical variables.
• Line Plots: Useful for visualizing trends over time or along a sequence.
• Histograms: Depict the distribution of a single numerical variable.
• Box Plots: Summarize the distribution of a numerical variable, highlighting
outliers.
• Bar Plots: Represent categorical data or compare values between categories.
• Area Plots: Similar to line plots, but emphasize the area between the line and the
axis.
• Pie Charts: Represent proportions of a whole using slices.
DATA VISUALIZATION WITH PANDAS
• Ex:
import pandas as pd
import matplotlib.pyplot as plt
# Sample DataFrame with two numerical columns (a scatter plot needs numeric x and y)
data = {'col1': [1, 2, 3, 4], 'col2': [2, 4, 3, 6]}
df = pd.DataFrame(data)
# Create a scatter plot
df.plot(kind='scatter', x='col1', y='col2')
plt.show()
DATA VISUALIZATION WITH NUMPY
• While NumPy is a fundamental library for scientific computing in
Python, data visualization isn't its primary focus.
• However, it can be a powerful foundation for creating custom
visualizations.
DATA VISUALIZATION WITH NUMPY
• Ex:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 5, 100) # Create x-axis data (100 points from 0 to 5)
y = x**2 # Create y-axis data (square the x values)
plt.plot(x, y) # Plot the line
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot (NumPy Data)')
plt.show()
DATA VISUALIZATION WITH SCIPY
• While SciPy is a powerful library for scientific computing in Python,
it's not primarily designed for data visualization.
• SciPy's Role in Data Visualization Workflow:
• Data Preparation: SciPy functions can help you clean, transform, and analyze
your data before visualization. For instance, you can use SciPy for outlier
detection, filtering, and data smoothing.
• Statistical Analysis: SciPy provides functions for statistical calculations like
finding correlations or fitting distributions. These results can be incorporated
into your visualizations to add context and insights.
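The statistical steps mentioned above (correlation, distribution fitting, and simple outlier detection) might look like this with `scipy.stats`. The height and weight data are randomly generated for illustration, and the z-score threshold of 3 is a common convention rather than a fixed rule.

```python
import numpy as np
from scipy import stats

# Synthetic data: weight is constructed to depend on height plus noise.
rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=8, size=500)
weights = 0.9 * heights - 85 + rng.normal(scale=5, size=500)

# Correlation between the two variables.
r, p_value = stats.pearsonr(heights, weights)

# Fit a normal distribution to the heights (returns mean and standard deviation).
mu, sigma = stats.norm.fit(heights)

# Flag values more than 3 standard deviations from the mean.
z_scores = np.abs(stats.zscore(heights))
outliers = heights[z_scores > 3]

print(round(r, 2), round(mu, 1), round(sigma, 1), len(outliers))
```

Values such as `r`, `mu`, and `sigma` can then be placed in a plot title or annotation to give the chart context.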
DATA VISUALIZATION WITH SCIPY
• Ex:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
# Generate noisy data
x = np.arange(100)
y = np.sin(x) + np.random.randn(100)
# Smooth the data using Savitzky-Golay filter
y_smooth = savgol_filter(y, 51, 3)
# Plot the original and smoothed data
plt.plot(x, y, label='Original Data')
plt.plot(x, y_smooth, label='Smoothed Data')
plt.legend()
plt.show()
DATA VISUALIZATION WITH MATPLOTLIB
• Line Charts: Ideal for showcasing trends or changes over time (e.g.,
temperature fluctuations, stock prices).
• Bar Charts: Effective for comparing categories or quantities (e.g.,
sales figures across regions, customer satisfaction ratings).
• Scatter Plots: Used to explore relationships between two variables
(e.g., correlation between weight and height, relationship between
study hours and exam scores).
• Pie Charts: Useful for representing proportions of a whole (e.g.,
budget allocation, market share distribution).
DATA VISUALIZATION WITH MATPLOTLIB
• Ex: Data visualization with Matplotlib that creates a line chart.
import matplotlib.pyplot as plt
# Sample temperature data for a week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temperatures = [18, 21, 24, 22, 20, 19, 17]
# Create the line chart
plt.plot(days, temperatures)
# Add labels and title
plt.xlabel('Days')
plt.ylabel('Temperature (°C)')
plt.title('Weekly Temperature Variation')
# Display the chart
plt.show()
DATA ANALYSIS AND VISUALIZATION
• Data Loading: Reading data from CSV files.
• Data Cleaning: Checking for missing values and data types.
• Data Analysis: Grouping, sorting, and summarizing data.
• Data Visualization: Creating charts to explore trends.
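The loading, cleaning, and analysis steps can be tried without any file on disk. In the sketch below, a small CSV is embedded as a string; its columns (`date`, `source`, `visits`) and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content; one 'visits' value is deliberately missing.
csv_text = """date,source,visits
2024-01-01,search,120
2024-01-01,social,80
2024-01-02,search,150
2024-01-02,direct,
2024-01-03,social,95
"""

# Data loading
traffic = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Data cleaning: count missing values, then drop incomplete rows
missing_visits = traffic["visits"].isnull().sum()
traffic = traffic.dropna(subset=["visits"])

# Data analysis: total visits per source, largest first
by_source = traffic.groupby("source")["visits"].sum().sort_values(ascending=False)

print(missing_visits)
print(by_source)
```

Calling `by_source.plot(kind="bar")` on the result would cover the visualization step.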
DATA ANALYSIS AND VISUALIZATION
• Ex:
import pandas as pd
# Read the data from the CSV file
df = pd.read_csv("website_traffic.csv")
print(df.head())
df.info()  # info() prints its summary directly and returns None
DATA ANALYSIS AND VISUALIZATION
• Ex:
# Group data by source and sort by visits in descending order
source_visits = df.groupby('source')['visits'].sum().sort_values(ascending=False)
# Get the source with the most visits
top_source = source_visits.index[0]
# Get the number of visits from that source
top_visits = source_visits.iloc[0]
print(f"Top traffic source: {top_source} with {top_visits} visits")
DATA ANALYSIS AND VISUALIZATION
• Ex:
import matplotlib.pyplot as plt
# Plot a bar graph of source vs visits
source_visits.plot(kind='bar')
plt.xlabel("Traffic Source")
plt.ylabel("Visits")
plt.title("Website Traffic by Source")
plt.show()
SCIKIT-LEARN
• Free and open-source library for machine learning in Python
• User-friendly interface for various machine learning tasks
• Wide range of algorithms for classification, regression, clustering, and
more
SCIKIT-LEARN
• Classification: Categorizes data points (e.g., spam filtering)
• Regression: Predicts continuous values (e.g., stock price prediction)
• Clustering: Groups similar data points (e.g., customer segmentation)
• Dimensionality Reduction: Reduces features while preserving
information (e.g., image compression)
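Two of these tasks, classification and dimensionality reduction, can be sketched on the Iris dataset that ships with scikit-learn. The choice of `LogisticRegression`, `PCA`, and the 25% test split is an assumption made for this example; many other estimators would fit the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Classification: hold out a test set, fit on the rest, score on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Dimensionality reduction: project the 4 features onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

print(round(accuracy, 2), X_2d.shape)
```

Most scikit-learn estimators follow the same `fit` / `predict` (or `fit_transform`) pattern, which is what makes the interface easy to learn.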
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
#Import libraries:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import seaborn as sns  # pairplot is a seaborn function, not part of sklearn.metrics
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
#Load data:
iris = load_iris()
X = iris.data # Features
y = iris.target # Target labels (species)
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
#Perform K-Means clustering:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
DATA VISUALIZATION WITH SCIKIT-LEARN
• Ex: Visualizing K-Means Clustering with scikit-learn
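#Visualize clusters:
import pandas as pd
import seaborn as sns
# seaborn's pairplot expects a DataFrame, so wrap the features and cluster labels
clusters = pd.DataFrame(iris.data, columns=iris.feature_names)
clusters['cluster'] = kmeans.labels_
sns.pairplot(clusters, hue='cluster')
plt.suptitle("Iris Dataset - KMeans Clustering", y=1.02)
plt.show()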
SUMMARY
