Learning Python for Data Analysis and Visualization
Start
• Motivation: Understand why you want to learn Python for data analysis (e.g., career growth, solving
business problems, academic research).
3. Visualization Basics
• Learn to visualize data effectively:
o Install libraries: matplotlib, seaborn.
o Types of visualizations:
▪ Bar plots, line graphs, scatter plots.
▪ Pair plots, correlation heatmaps, histograms.
o Customize plots:
▪ Titles, axis labels, legends, color schemes.
Practice: Recreate visualizations from sample datasets.
End
• Continuously learn new techniques, tools, and best practices in data analysis and visualization.
Tutorial: Learn Python Basics
1. Variables, Data Types, and Operators
Concept: Variables store data, and data types define the kind of data you’re working with. Operators are used to
perform operations on variables and values.
Example:
# Variables and Data Types
name = "Alice" # String
age = 25 # Integer
height = 5.6 # Float
is_student = True # Boolean
# Operators
x = 10
y = 3
print(x + y) # Addition
print(x - y) # Subtraction
print(x * y) # Multiplication
print(x / y) # Division
print(x % y) # Modulus (remainder)
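Python has a few more operators that come up often in data work. The sketch below reuses the same x and y values to show floor division, exponentiation, and a comparison; the names quotient, power, and is_greater are introduced here only for illustration.

```python
x = 10
y = 3

quotient = x // y      # Floor division: discards the fractional part
power = x ** y         # Exponentiation: x raised to the power y
is_greater = x > y     # Comparison: evaluates to a Boolean

print(quotient, power, is_greater)
print(type(x / y))     # True division always returns a float
```

Note that x / y returns a float even when both operands are integers, while x // y keeps the result an integer.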
Practice:
1. Create variables to store your name, age, and a hobby.
2. Perform arithmetic operations on two numbers.
# while loop
counter = 0
while counter < 5:
    print("Counter:", counter)
    counter += 1
Example (Conditionals):
# if-else example
age = 20
if age >= 18:
    print("You are an adult.")
else:
    print("You are a minor.")
Practice:
1. Write a loop to print numbers from 1 to 10.
2. Create a script to check if a number is even or odd.
# Traversing a list
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)
while Loop
o Use When:
The number of iterations is not known in advance and depends on a condition that must be
checked continuously. This is useful for scenarios where the loop continues until a specific
condition is met.
o Examples:
▪ Waiting for user input to meet a certain condition.
▪ Running a program until a computation converges.
▪ Polling or monitoring a condition.
Example Code:
# Condition-based iteration
counter = 0
while counter < 5:
    print("Counter:", counter)
    counter += 1
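The "run until a computation converges" case can be sketched with Newton's method for square roots, where the number of iterations is not known in advance. The function name newton_sqrt and the tolerance value are illustrative choices, and the sketch assumes a positive input.

```python
def newton_sqrt(value, tolerance=1e-10):
    """Approximate the square root of a positive number with Newton's method."""
    guess = 1.0
    iterations = 0
    # Keep refining until guess squared is within the tolerance of value
    while abs(guess * guess - value) > tolerance:
        guess = (guess + value / guess) / 2
        iterations += 1
    return guess, iterations

root, steps = newton_sqrt(2.0)
print(f"sqrt(2) is roughly {root:.6f} after {steps} iterations")
```

The loop condition is the only thing that stops the loop, so it has to become False eventually; here each update moves the guess closer to the true root.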
Key Differences
Feature    | for Loop                       | while Loop
Iterations | Fixed or predictable.          | Unknown or condition-dependent.
Best For   | Iterating over sequences.      | Condition-driven loops.
Risk       | Hard to create infinite loops. | Can lead to infinite loops if the condition isn’t properly handled.
Guidelines:
o Use for loops when iterating through data structures or a predictable range.
o Use while loops when looping depends on external factors, such as user input or computational
states.
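As a rough side-by-side sketch of these guidelines, the snippet below sums a known list with a for loop and then uses a while loop to stop once a running total reaches a target. The daily_sales values and the target are made-up sample numbers.

```python
daily_sales = [120, 95, 140, 80, 160]  # Hypothetical sample values

# for loop: the sequence is known, so iterate over it directly
total = 0
for amount in daily_sales:
    total += amount

# while loop: stop as soon as the running total reaches the target
target = 300
running_total = 0
days_needed = 0
while running_total < target and days_needed < len(daily_sales):
    running_total += daily_sales[days_needed]
    days_needed += 1

print("Total sales:", total)
print("Days needed to reach target:", days_needed)
```

The second condition in the while loop guards against running past the end of the list if the target is never reached.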
1. When to Use if, elif, and else in Python
1. if Conditional
o Purpose: Use if to check a condition. If the condition evaluates to True, the code inside the if
block is executed.
o When to Use: Use if for the first condition or the primary condition you want to check.
Example:
temperature = 30
if temperature > 25:
    print("It's a hot day.")
2. elif Conditional
o Purpose: Use elif (short for "else if") to check additional conditions if the previous if or
elif conditions are False.
o When to Use: Use elif for multiple mutually exclusive conditions.
Example:
temperature = 15
if temperature > 25:
    print("It's a hot day.")
elif temperature > 15:
    print("It's a warm day.")
elif temperature > 5:
    print("It's a cool day.")
3. else Conditional
o Purpose: Use else as a catch-all when all the previous if and elif conditions are False.
o When to Use: Use else as the default case if no other conditions apply.
Example:
temperature = 5
if temperature > 25:
    print("It's a hot day.")
elif temperature > 15:
    print("It's a warm day.")
elif temperature > 5:
    print("It's a cool day.")
else:
    print("It's a cold day.")
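Python checks the conditions from top to bottom and runs only the first branch that matches. One way to see this is to wrap the same thresholds in a small function and try several values; the function name describe_temperature is an illustrative choice, not part of the original examples.

```python
def describe_temperature(temperature):
    """Return a label for a temperature using the thresholds from the examples above."""
    if temperature > 25:
        return "hot"
    elif temperature > 15:
        return "warm"
    elif temperature > 5:
        return "cool"
    else:
        return "cold"

for reading in [30, 20, 10, 5]:
    print(reading, "->", describe_temperature(reading))
```

A value of exactly 15 is labeled "cool" because the comparison is strictly greater than 15; boundary values are worth testing whenever you write a chain like this.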
# Check voting eligibility
age = 18
if age >= 18:
    print("You are eligible to vote.")
# Check whether a number is even or odd
number = 5
if number % 2 == 0:
    print("The number is even.")
else:
    print("The number is odd.")
# Assign a letter grade
grade = 85
if grade >= 90:
    print("You got an A!")
elif grade >= 80:
    print("You got a B!")
elif grade >= 70:
    print("You got a C!")
else:
    print("You need to work harder.")
Practice Exercises
2. Code Organization
o When: If a section of code performs a specific task, consider wrapping it in a function.
o Why: This improves readability by breaking code into smaller, understandable pieces.
Example:
# Without function
age = 20
if age >= 18:
    print("You are an adult.")
else:
    print("You are a minor.")
# With function
def check_adulthood(age):
    if age >= 18:
        print("You are an adult.")
    else:
        print("You are a minor.")
check_adulthood(20)
3. Generalizing Logic
o When: If the logic can be applied to multiple inputs or scenarios, make it a function.
o Why: This avoids hardcoding and makes your code flexible.
Example:
# Without function: fixed logic
print(5 ** 2) # Square of 5
print(10 ** 2) # Square of 10
# With function: reusable logic
def calculate_square(number):
    return number ** 2
print(calculate_square(5))
print(calculate_square(10))
4. Large Scripts
o When: If your script is growing too large or has many interdependent sections, use functions to
separate them logically.
o Why: Functions make your code modular and easier to debug.
Example:
# Large script
def get_user_input():
    return int(input("Enter a number: "))
def check_even_or_odd(number):
    if number % 2 == 0:
        return "even"
    else:
        return "odd"
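Once the pieces are separate functions, a script can be assembled from them and each piece tested on its own. The sketch below shows one possible arrangement; summarize and main are hypothetical helpers, and a fixed list stands in for user input so the example runs without prompting.

```python
def check_even_or_odd(number):
    """Return "even" or "odd" for an integer."""
    return "even" if number % 2 == 0 else "odd"

def summarize(numbers):
    """Count how many values in a list are even and how many are odd."""
    counts = {"even": 0, "odd": 0}
    for number in numbers:
        counts[check_even_or_odd(number)] += 1
    return counts

def main():
    sample = [3, 8, 15, 22, 42]  # Stand-in for values a user might enter
    print(summarize(sample))

main()
```

If the counts look wrong, check_even_or_odd can be called directly with a single number to narrow down the problem.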
Best Practices
o Keep functions small and focused (each function should do one thing well).
o Use descriptive names that reflect the function's purpose.
o Add docstrings to explain what the function does.
Example with Best Practices:
def greet_user(name):
    """
    Greets the user with their name.
    :param name: The user's name (string).
    :return: A greeting string.
    """
    return f"Hello, {name}!"
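A docstring is more than a comment: Python stores it on the function object, so tools such as help() can display it. A minimal sketch, using a shortened version of the same function:

```python
def greet_user(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

print(greet_user("Alice"))
print(greet_user.__doc__)  # The docstring is stored on the function object
```

Calling help(greet_user) in an interactive session shows the same text along with the function signature.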
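# Helper that returns the arithmetic mean of a list (definition assumed; it is not shown in this section)
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
# Input list
numbers = [10, 20, 30, 40, 50]
print("Average:", calculate_average(numbers))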
Next Steps
Once you’ve mastered these basics, move on to working with Python libraries like pandas and matplotlib for
more advanced data analysis and visualization.
import pandas as pd
# Create a Series
s = pd.Series([1, 2, 3, 4, 5])
print(s)
# Create a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
print(df)
2. Importing Datasets (e.g., CSV files):
# Read a CSV file into a DataFrame
df = pd.read_csv("data.csv")
print(df.head()) # Display the first 5 rows
3. Data Cleaning Techniques:
o Drop rows/columns:
# Drop rows with missing values
df = df.dropna()
# Drop a specific column
df = df.drop("Column_Name", axis=1)
o Handle missing values:
# Fill NaN values with a default value
df = df.fillna(0)
4. Manipulating DataFrames:
o Filter rows:
filtered_df = df[df["Age"] > 30]
print(filtered_df)
o Sort data:
sorted_df = df.sort_values(by="Age", ascending=False)
print(sorted_df)
o Group by column criteria:
# Average the numeric columns within each city (text columns are skipped)
grouped = df.groupby("City").mean(numeric_only=True)
print(grouped)
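The cleaning and manipulation steps above are usually chained together. The sketch below runs them on a small made-up DataFrame; the names and values are sample data, and filling the missing age with the column mean is one possible choice rather than a rule.

```python
import pandas as pd

# Hypothetical sample data with one missing age
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, None, 35, 40],
    "City": ["New York", "Chicago", "Chicago", "New York"],
})

# Fill the missing age with the column mean instead of a fixed default
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Filter, then sort the remaining rows from oldest to youngest
over_30 = df[df["Age"] > 30].sort_values(by="Age", ascending=False)

# Average age per city, selecting the numeric column explicitly
mean_age_by_city = df.groupby("City")["Age"].mean()

print(over_30)
print(mean_age_by_city)
```

Selecting the "Age" column before calling mean() avoids trying to average text columns such as "Name".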
Practice Exercises
1. Create a DataFrame from scratch with columns for "Product", "Price", and "Quantity". Perform the
following:
o Filter products priced above $20.
o Calculate the total value for each product (Price × Quantity).
2. Load a sample CSV file, clean any missing data, and sort it by one of the columns.
Next Steps
Once you’re familiar with these libraries, move on to data visualization with matplotlib and seaborn to create
insightful plots and charts.
# Perform filtering
filtered_data = df[df["Age"] > 30]
import numpy as np
# Create an array
array = np.array([1, 2, 3, 4, 5])
# Generate a dataset of 1,000 samples from a standard normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)
Key Differences
Feature        | Pandas                    | NumPy                  | SciPy
Primary Use    | Tabular data manipulation | Numerical computations | Advanced scientific tasks
Data Structure | DataFrames, Series        | Arrays                 | Builds on NumPy arrays
Examples       | Filtering, grouping       | Linear algebra, stats  | Optimization, signal processing
How to Choose
1. Start with Pandas for data analysis if your data is tabular (rows and columns).
2. Use NumPy for high-performance numerical computations or when dealing with multi-dimensional arrays.
3. Incorporate SciPy when you need advanced mathematical computations, such as solving differential
equations or performing statistical tests.
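In practice the three libraries are often used together. The sketch below is one possible division of labor: NumPy generates two made-up samples, Pandas holds and summarizes them, and SciPy runs a two-sample t-test. The group names, means, and seed are arbitrary illustrative values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# NumPy: generate two hypothetical samples with a fixed seed for repeatability
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=52, scale=5, size=100)

# Pandas: hold the samples in a table and summarize them
df = pd.DataFrame({"group_a": group_a, "group_b": group_b})
print(df.describe().loc[["mean", "std"]])

# SciPy: run an independent two-sample t-test on the two columns
result = stats.ttest_ind(df["group_a"], df["group_b"])
print("t statistic:", result.statistic)
print("p-value:", result.pvalue)
```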
7. Visualization Basics
Install Libraries for Visualization
To create visualizations, install the following libraries:
• matplotlib: For basic visualizations like line graphs, bar plots, and scatter plots.
• seaborn: For advanced visualizations with improved aesthetics and functionality.
Installation Command:
pip install matplotlib seaborn
Types of Visualizations
1. Bar Plots and Line Graphs:
o Use bar plots to show comparisons between categories.
o Use line graphs to show trends over time.
Example:
import matplotlib.pyplot as plt
# Bar Plot
categories = ["A", "B", "C"]
values = [10, 20, 15]
plt.bar(categories, values)
plt.title("Bar Plot Example")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()
# Line Graph
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y, marker="o")
plt.title("Line Graph Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
2. Scatter Plots:
o Use scatter plots to examine the relationship between two variables.
Example:
# Scatter Plot
plt.scatter(x, y, color="blue", label="Data points")
plt.title("Scatter Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.show()
3. Pair Plots and Correlation Heatmaps (Seaborn):
o Pair plots show pairwise relationships between variables in a dataset.
o Heatmaps visualize the correlation between numerical columns.
Example:
import seaborn as sns
import pandas as pd
# Example DataFrame
data = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [5, 3, 4, 7]
}
df = pd.DataFrame(data)
# Pair Plot
sns.pairplot(df)
plt.show()
# Correlation Heatmap
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
4. Histograms:
o Use histograms to understand the distribution of a dataset.
Example:
# Histogram
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5]
plt.hist(data, bins=5, color="green", alpha=0.7)
plt.title("Histogram Example")
plt.xlabel("Bins")
plt.ylabel("Frequency")
plt.show()
Customizing Visualizations
• Add titles, axis labels, legends, and customize color schemes for better readability and aesthetics.
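A minimal sketch of these customizations, using Matplotlib's Axes methods on made-up monthly figures (the channel names and values are sample data only):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]  # Hypothetical sample data
online = [12, 15, 14, 18]
in_store = [10, 9, 11, 12]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, online, color="tab:blue", marker="o", label="Online")
ax.plot(months, in_store, color="tab:orange", linestyle="--", marker="s", label="In store")

ax.set_title("Monthly Sales by Channel")  # Title
ax.set_xlabel("Month")                     # Axis labels
ax.set_ylabel("Sales (units)")
ax.legend(loc="upper left")                # Legend built from the label arguments
ax.grid(True, alpha=0.3)                   # Light grid for readability

fig.savefig("customized_plot.png")
```

The figure is written to a file here; call plt.show() instead to display it in a window or notebook.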
Practice Exercises
1. Create a bar plot showing sales of 5 products.
2. Plot a line graph showing monthly revenue for a year.
3. Use a dataset to create a pair plot and a correlation heatmap to analyze relationships between variables.
Next Steps
Once you master visualization basics, explore advanced techniques such as interactive dashboards with libraries like
Plotly and Dash.
2. Seaborn
o Best for: High-level, aesthetically pleasing visualizations.
o Features:
▪ Built on top of Matplotlib, with more attractive default styles.
▪ Excellent for statistical visualizations like pair plots, heatmaps, and violin plots.
o When to Use:
▪ When working with dataframes (e.g., from Pandas).
▪ For quick visualizations of data relationships and distributions.
▪ When you need visualizations tailored for statistical data.
Example Use Case:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
data = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
df = pd.DataFrame(data)
sns.pairplot(df)
plt.show()
3. Plotly
o Best for: Interactive and dynamic visualizations.
o Features:
▪ Offers interactive features (hovering, zooming, etc.).
▪ Supports dashboards and web integration.
▪ Handles 3D plots and advanced visualizations like maps.
o When to Use:
▪ For web-based or interactive dashboards.
▪ To create visually engaging presentations.
▪ For data exploration where interactivity adds value.
Example Use Case:
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()
4. Other Libraries
o Bokeh:
▪ Focused on creating interactive visualizations for the web.
▪ Great for building dashboards.
o Altair:
▪ Declarative library for statistical visualizations.
▪ Excellent for smaller datasets and quick visualizations.
o GGPlot (Plotnine):
▪ Inspired by R's ggplot2 library.
▪ Ideal for users familiar with ggplot2.
Decision Guidelines
Requirement                              | Library
Static and simple plots                  | Matplotlib
Aesthetic and statistical visualizations | Seaborn
Interactive or web-based visualizations  | Plotly, Bokeh
Declarative syntax                       | Altair, Plotnine
High customizability                     | Matplotlib
Each library has its strengths and caters to different visualization needs. Often, you’ll use a combination of
them for various tasks depending on the requirements of your project.
Example Projects
1. Analyze Air Quality Datasets:
o Download an air quality dataset.
o Perform EDA to identify patterns such as seasonal trends or anomalies in pollutant levels.
o Visualize data using line plots for trends and heatmaps for correlations.
Example Code:
import pandas as pd
import matplotlib.pyplot as plt
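# Load dataset (assumes the file has "Date" and "PM2.5" columns)
df = pd.read_csv("air_quality.csv", parse_dates=["Date"])
# Perform EDA
df.info() # Prints column types and non-null counts directly
print(df.describe())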
# Visualize trends
plt.plot(df["Date"], df["PM2.5"], label="PM2.5 Levels")
plt.xlabel("Date")
plt.ylabel("PM2.5")
plt.title("Air Quality Trends")
plt.legend()
plt.show()
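To look for seasonal trends and anomalies without a downloaded file, the same steps can be tried on synthetic data. Everything in the sketch below is an assumption made for illustration: the readings are generated, the seasonal shape is invented, and the three-standard-deviation rule is just one simple way to flag unusual days.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an air quality file: one year of daily PM2.5 readings
rng = np.random.default_rng(seed=1)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
seasonal = 20 + 10 * np.cos(2 * np.pi * (day_of_year - 1) / 365)  # Higher in winter
df = pd.DataFrame({"Date": dates, "PM2.5": seasonal + rng.normal(0, 3, len(dates))})

# Average by calendar month to expose the seasonal trend
monthly_mean = df.groupby(df["Date"].dt.month)["PM2.5"].mean()

# Flag days that sit more than three standard deviations from the yearly mean
z_scores = (df["PM2.5"] - df["PM2.5"].mean()) / df["PM2.5"].std()
anomalies = df[z_scores.abs() > 3]

print(monthly_mean.round(1))
print("Anomalous days:", len(anomalies))
```

With a real dataset, replace the synthetic DataFrame with the one loaded from the CSV file and keep the grouping and flagging steps.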
2. Sales Trends and Performance Dashboards:
o Use a sales dataset to analyze monthly or yearly sales trends.
o Create dashboards with bar plots for product performance and line graphs for revenue trends.
Example Code:
# Load sales data
df = pd.read_csv("sales_data.csv")
scikit-learn is a powerful library for machine learning. It provides easy-to-use tools for classification, regression, and
clustering.
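import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data
X = np.array([[1], [2], [3], [4], [5]]) # Independent variable
y = np.array([2, 4, 5, 4, 5]) # Dependent variable
# Fit a model (linear regression is assumed here; the original fitting step is not shown)
model = LinearRegression()
model.fit(X, y)
# Predict
y_pred = model.predict(X)
print("Predicted values:", y_pred)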
b) statsmodels: Statistical Modeling
import statsmodels.api as sm
Plotly and Altair enable interactive visualizations that enhance data exploration.
import plotly.express as px
import pandas as pd
# Sample data
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})
fig = px.line(df, x="x", y="y", title="Interactive Line Plot")
fig.show()
2. Learning Automation
Automation helps streamline workflows and eliminate repetitive tasks.
import pandas as pd
# Sample data
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, None, 30]}
df = pd.DataFrame(data)
# Generate a PDF report (requires the fpdf package)
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
pdf.cell(200, 10, txt="Automated Report", ln=True, align='C')
pdf.output("report.pdf")
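The cleaning and reporting steps can be wrapped in a function so the same routine runs on any dataset with one call. The sketch below is a simplified stand-in that writes a plain-text summary rather than a PDF; build_report and the report.txt file name are hypothetical.

```python
import pandas as pd

def build_report(df, column):
    """Fill missing values in one numeric column and return a short text summary."""
    cleaned = df.copy()
    cleaned[column] = cleaned[column].fillna(cleaned[column].mean())
    lines = [
        "Automated Report",
        f"Rows: {len(cleaned)}",
        f"Mean {column}: {cleaned[column].mean():.1f}",
    ]
    return cleaned, "\n".join(lines)

data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, None, 30]}
cleaned, report = build_report(pd.DataFrame(data), "Age")

# Write the summary to a plain-text file; the file name is arbitrary
with open("report.txt", "w", encoding="utf-8") as handle:
    handle.write(report)

print(report)
```

The function works on a copy, so the original DataFrame is left unchanged.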
3. Practicing with Python Notebooks
Jupyter Notebook is an excellent tool for dynamic exploration of Python code.
By practicing with these tools, you can significantly enhance your Python skills and improve your efficiency in data
science and analytics.
4. Interpreting Results and Sharing Insights
a) Learn Storytelling
Effective data storytelling involves interpreting visualizations and analysis within the context of your goals.
Consider:
b) Share Findings
import matplotlib.pyplot as plt
# Save a plot
plt.plot([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
plt.title("Simple Plot")
plt.savefig("plot.png")
Use Platforms like Tableau (Optional)
Tableau is a powerful tool for creating dashboards and interactive reports. Consider using it to present findings
effectively.
By mastering these skills, you can effectively interpret results and communicate your findings in a compelling way.