Unit 5
SciPy
SciPy is a scientific computation library; the name stands for Scientific Python.
It builds on NumPy and provides additional functionality for various tasks (a brief
sketch follows this list), including:
Optimization:
Functions for finding minima and maxima of objective functions.
Integration:
Tools for numerical integration of functions and ordinary differential equations.
Interpolation:
Techniques for estimating values between known data points.
Linear Algebra:
Routines for solving linear systems and performing matrix operations.
Statistics:
A wide range of statistical tests and probability distributions.
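The snippet below is a minimal sketch of a few of these features using scipy.optimize, scipy.integrate, scipy.linalg, and scipy.stats; the sample data is made up for illustration.
import numpy as np
from scipy import optimize, integrate, linalg, stats
# Optimization: find the minimum of f(x) = (x - 3)^2
res = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print("Minimum at x =", res.x)
# Integration: integrate sin(x) from 0 to pi (exact value is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print("Integral:", value)
# Linear Algebra: solve the system Ax = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print("Solution:", linalg.solve(A, b))
# Statistics: two-sample t-test on randomly generated samples
t_stat, p_value = stats.ttest_ind(np.random.normal(0, 1, 50), np.random.normal(0.5, 1, 50))
print("t statistic:", t_stat, "p-value:", p_value)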
Jupyter
Jupyter is an open-source project that provides a web-based interactive computing
environment, allowing users to create and share documents that contain live code,
equations, visualizations, and narrative text.
Jupyter key features
• Jupyter Notebooks:
A web-based interface to programming environments for Python, Julia, R, and many
others.
• Interactive Widgets:
Jupyter supports interactive elements like sliders and buttons, making it easier to create
dynamic visualizations and user interfaces.
• Kernel Support:
Jupyter can run code in various programming languages through different kernels, with
the most popular being the IPython kernel for Python.
• Export Options:
Notebooks can be exported to various formats, including HTML, PDF, and Markdown,
making it easy to share results and documentation.
• Collaboration:
Jupyter Notebooks can be shared and collaborated on through platforms like JupyterHub
or cloud services like Google Colab.
STATSMODELS
● Statsmodels is a Python library built specifically for statistics.
● Statsmodels is built on top of NumPy, SciPy, and matplotlib, but it contains more
advanced functions for statistical testing and modeling that are not found in numerical
libraries like NumPy or SciPy.
● It complements SciPy for statistical computations, including descriptive statistics and
estimation and inference for statistical models.
Key features
● Regression Analysis: Supports ordinary least squares (OLS) and generalized linear
models (GLMs).
● Time Series: Offers methods for ARIMA and seasonal decomposition.
● Statistical Tests: Includes t-tests, ANOVA, and other hypothesis tests.
● Model Diagnostics: Tools for checking assumptions and validating models.
● Formula Interface: Allows R-like formula syntax for model specification (see the sketch after this list).
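A minimal sketch of the formula interface, fitting an OLS model to randomly generated data (the variable names x and y are made up for illustration):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Generate a small synthetic dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 2.0 * df['x'] + rng.normal(scale=0.5, size=100)
# Fit y = a + b*x using R-like formula syntax
model = smf.ols('y ~ x', data=df).fit()
print(model.summary())  # coefficients, standard errors, R-squared, t-tests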
Pandas
● Pandas is a Python library for working with data sets; it is a data manipulation
package for tabular data.
● That is, data in the form of rows and columns, also known as DataFrames.
● Pandas has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas key features
Data Structures:
● Series: One-dimensional labeled array capable of holding any data type.
● DataFrame: Two-dimensional labeled data structure, similar to a table or
spreadsheet, with columns of potentially different types.
Data Manipulation:
● Indexing and Selection: Easily access rows and columns using labels, conditions,
and various slicing methods.
● Filtering and Subsetting: Select specific data based on conditions (see the sketch below).
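A short sketch of indexing, selection, and filtering on a small hypothetical DataFrame:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
})
print(df['Name'])            # select a single column (returns a Series)
print(df.loc[0])             # select a row by label
print(df.iloc[1:3])          # select rows by position (slicing)
print(df[df['Age'] > 28])    # filter rows by a condition
print(df[['Name', 'City']])  # select a subset of columns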
What is pandas used for?
● Import datasets from databases, spreadsheets, comma-separated values (CSV)
files, and more.
● Clean datasets, for example, by dealing with missing values.
● Tidy datasets by reshaping their structure into a suitable format for analysis.
● Aggregate data by calculating summary statistics such as the mean of columns,
correlation between them, and more.
● Visualize datasets and uncover insights.
AGGREGATE FUNCTION
● The aggregate function in Pandas allows you to compute summary statistics for
groups of data within a DataFrame.
● An aggregate is a function where the values of multiple rows are grouped to form a
single summary value.
● Common aggregate functions and their descriptions:
Function Description
count() Counts the number of non-NA/null entries.
sum() Computes the sum of values.
mean() Calculates the average of values.
median() Computes the median of values.
min() Returns the minimum value.
max() Returns the maximum value.
std() Computes the standard deviation.
var() Calculates the variance.
quantile(q) Returns the value at the given quantile (0-1).
first() Returns the first value in the group.
last() Returns the last value in the group.
unique() Returns unique values in a Series.
value_counts() Returns a Series containing counts of unique values.
Example Program
import pandas as pd
data = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70],
'Quantity': [1, 2, 3, 4, 5, 6, 7]
}
df = pd.DataFrame(data)
grouped = df.groupby('Category')
result = grouped.agg(
count=('Value', 'count'),
size=('Value', 'size'),
total_sum=('Value', 'sum'),
mean_value=('Value', 'mean'),
std_dev=('Value', 'std'),
variance=('Value', 'var'),
min_value=('Value', 'min'),
max_value=('Value', 'max'),
first_value=('Value', 'first'),
last_value=('Value', 'last'),
nth_value=('Value', lambda x: x.iloc[1] if len(x) > 1 else None)
)
description = grouped['Value'].describe()
# Display results
print("Aggregate Functions Result:")
print(result)
print("\nDescription Statistics:")
print(description)
DATA OBJECTS IN PANDAS
1. Series
Definition: A one-dimensional labeled array capable of holding any data type (integers,
strings, floating-point numbers, Python objects, etc.).
Indexing: Each element in a Series has an associated index label, which can be either a
default integer index or a custom index.
Uses: Useful for representing a single column of data or a single variable.
Example
import pandas as pd
# Creating a Series with default integer index
s1 = pd.Series([10, 20, 30, 40])
print(s1)
# Creating a Series with a custom index
s2 = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])
print(s2)
2. DataFrame
Description: A two-dimensional labeled data structure with columns that can hold
different types of data (similar to a table).
Index: Contains both row and column indices.
Uses: Suitable for storing and manipulating datasets with multiple variables, akin to a
spreadsheet.
Example
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Write a Python program using Pandas to analyze a dataset. Perform the following
tasks:
Load a dataset from a CSV file into a Pandas DataFrame.
Calculate and display the mean and standard deviation of a specific numeric column.
Group the data by a categorical column and compute the average of the numeric column for each
group.
pip install pandas
import pandas as pd
# Load the dataset from a CSV file
df = pd.read_csv('data.csv')
# Display the first few rows of the DataFrame
print("Dataset:")
print(df)
# Calculate and display the mean and standard deviation of the 'Salary' column
mean_salary = df['Salary'].mean()
std_salary = df['Salary'].std()
print(f"\nMean Salary: {mean_salary}")
print(f"Standard Deviation of Salary: {std_salary}")
# Group the data by 'City' and compute the average 'Salary' for each group
average_salary_by_city = df.groupby('City')['Salary'].mean()
print("\nAverage Salary by City:")
print(average_salary_by_city)
Data Munging: Introduction to Data Munging
Data munging, sometimes called data wrangling or data cleaning, is the process of
converting and mapping unprocessed data into a different format to improve its
suitability and value for various downstream uses, including analytics. It entails
preparing raw data for analysis by cleaning, organizing, and enriching it into a readable
format.
Why is Data Munging important?
Data munging holds immense significance in the field of data analysis, playing a crucial
role in ensuring the quality and reliability of the data used for making informed decisions.
Several key aspects highlight the significance of data munging in the data
analysis process:
• Accuracy and Precision: Data munging addresses discrepancies and errors in raw
data, leading to more accurate and precise analyses. Cleaning and organizing data
ensure that the insights derived are trustworthy and dependable.
• Quality Improvement: By cleaning and preprocessing data, errors and inconsistencies
are reduced, improving overall data quality.
• Compatibility: Data munging ensures that data from different sources can be
integrated and used together effectively.
• Analysis Readiness: Properly formatted data is essential for accurate analysis and
modeling, enabling organizations to make data-driven decisions.
• Facilitation of Decision-Making: Clean and well-structured data, resulting from
effective data munging, facilitates the decision-making process. Decision-makers can
rely on accurate insights derived from trustworthy data.
Techniques and Steps in Data Munging
Data munging involves a series of techniques and steps to transform raw data into a
usable form. Here are some common techniques (a brief pandas sketch follows the list):
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in
the data. Common tasks include handling missing values, correcting data formats
(e.g., dates, numeric values), and removing duplicates.
2. Data Transformation: Data often needs to be transformed to fit the analytical
requirements. This may include converting categorical data into numerical format
(encoding), normalizing or scaling numeric data, and aggregating or summarizing
data.
3. Handling Missing Data: Techniques such as imputation (replacing missing values
with estimated ones) or deletion (removing rows or columns with missing data) are
used to handle missing data appropriately.
4. Data Integration: Combining data from multiple sources involves aligning
schemas, resolving inconsistencies, and merging datasets to create a unified view.
5. Feature Engineering: Creating new features or variables from existing data that
can enhance the predictive power of machine learning models.
6. Data Validation: Checking data integrity to ensure it meets expected standards and
business rules.
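A brief pandas sketch of the cleaning, transformation, imputation, and validation steps above, applied to a small made-up DataFrame:
import pandas as pd
raw = pd.DataFrame({
'City': ['Paris', 'Paris', 'London', None],
'Temp': ['21.5', '21.5', '18.0', '19.2'],   # numbers stored as strings
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-03']
})
# Data Cleaning: remove duplicates, correct data formats, handle missing values
clean = raw.drop_duplicates().copy()
clean['Temp'] = clean['Temp'].astype(float)      # fix the numeric format
clean['Date'] = pd.to_datetime(clean['Date'])    # fix the date format
clean['City'] = clean['City'].fillna('Unknown')  # impute the missing category
# Data Transformation: encode the categorical column
encoded = pd.get_dummies(clean, columns=['City'])
# Data Validation: a simple business-rule check
assert clean['Temp'].between(-60, 60).all(), "temperature out of range"
print(encoded)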
Data Pipeline in Machine Learning
A data pipeline in machine learning is a series of data processing steps that are
designed to transform raw data into a format suitable for training machine learning
models. It automates the workflow, ensuring that data is collected, processed, and fed
into models in a consistent and repeatable manner.
Benefits of a Data Pipeline
• Automation: Reduces manual intervention, making the workflow efficient and
reproducible.
• Scalability: Easily adapts to larger datasets or more complex data processing steps.
• Consistency: Ensures that data is processed in the same way every time, which is
crucial for model performance.
• Collaboration: Makes it easier for teams to work together, as the pipeline provides a
clear structure.
Example of a Simple Data Pipeline in Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Step 1: Data Ingestion
data = pd.read_csv('data.csv')
# Step 2: Data Cleaning (handling missing values)
data = data.ffill()  # forward-fill missing values (fillna(method='ffill') is deprecated in newer pandas)
# Step 3: Splitting the Data
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Creating a Pipeline
pipeline = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', RandomForestClassifier())])
# Step 5: Training the Model
pipeline.fit(X_train, y_train)
# Step 6: Evaluating the Model
accuracy = pipeline.score(X_test, y_test)
print(f'Model Accuracy: {accuracy:.2f}')
Write a program that loads the Iris dataset, splits it into train and test sets, and
computes the accuracy score of a pipeline on the test data.
pip install pandas scikit-learn
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Step 1: Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Create a pipeline
pipeline = Pipeline(steps=[
('scaler', StandardScaler()), # Data scaling
('classifier', RandomForestClassifier(random_state=42)) # Model training
])
# Step 4: Train the model
pipeline.fit(X_train, y_train)
# Step 5: Make predictions on the test set
y_pred = pipeline.predict(X_test)
# Step 6: Compute the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')
Output:
Model Accuracy: 1.00
7. Model Evaluation
• Description: Assess the model's performance using appropriate metrics (accuracy,
precision, recall, etc.).
• Tools:
o Scikit-learn: Use accuracy_score, classification_report, or
confusion_matrix (a brief sketch follows).
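A brief sketch of these metrics, reusing the y_test, y_pred, and iris objects from the Iris pipeline example above:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))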
Types of Matplotlib Charts
1. Line Chart
• A line chart is represented by a series of data points connected with a straight line.
• It is used to represent a relationship between two variables, X and Y, plotted on
different axes.
• Generally, line charts are used to display trends over time.
• A line chart or line graph can be created using the plot() function available in the
pyplot module.
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
# Adding title to the plot
plt.title("Line Chart")
# Adding label on the y-axis
plt.ylabel('Y-Axis')
# Adding label on the x-axis
plt.xlabel('X-Axis')
plt.show()
2. Bar Chart
• A bar chart is a graph that represents categories of data with rectangular bars whose
lengths or heights are proportional to the values they represent.
• The bar plots can be plotted horizontally or vertically.
• A bar chart describes the comparisons between the discrete categories.
• It can be created using the bar() method.
import matplotlib.pyplot as plt
import pandas as pd
# Reading the tips.csv file
data = pd.read_csv('tips.csv')
# initializing the data
x = data['day']
y = data['total_bill']
# plotting the data
plt.bar(x, y)
# Adding title to the plot
plt.title("Tips Dataset")
# Adding label on the y-axis
plt.ylabel('Total Bill')
# Adding label on the x-axis
plt.xlabel('Day')
plt.show()
3. Histogram
• A histogram is used to represent data that has been grouped into ranges (bins).
• It is a type of bar plot where the X-axis represents the bin ranges while the Y-axis
gives information about frequency.
• The hist() function is used to compute and create a histogram of x.
import matplotlib.pyplot as plt
import pandas as pd
# Reading the tips.csv file
data = pd.read_csv('tips.csv')
# initializing the data
x = data['total_bill']
# plotting the data
plt.hist(x)
# Adding title to the plot
plt.title("Tips Dataset")
# Adding label on the y-axis
plt.ylabel('Frequency')
# Adding label on the x-axis
plt.xlabel('Total Bill')
plt.show()
4. Scatter Plot
• Scatter plots are ideal for visualizing the relationship between two continuous
variables.
• A scatter plot uses dots to represent values for two different numeric variables.
• The position of each dot on the horizontal and vertical axis indicates values for an
individual data point.
import matplotlib.pyplot as plt
# data to display on plots
x = [3, 1, 3, 12, 2, 4, 4]
y = [3, 2, 1, 4, 5, 6, 7]
# This will plot a simple scatter chart
plt.scatter(x, y, label="A")
# Adding legend to the plot
plt.legend()
# Title to the plot
plt.title("Scatter chart")
plt.show()
5. Box Plot
• A box plot, also known as a box-and-whisker plot, provides a visual summary of the
distribution of a dataset.
• It represents key statistical measures such as the median, quartiles, and potential
outliers in a concise and intuitive manner.
• Box plots are particularly useful for comparing distributions across different groups or
identifying anomalies in the data.
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize =(10, 7))
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
6. Pie Chart
• A Pie Chart is a circular statistical plot that can display only one series of data.
• The area of the chart is the total percentage of the given data.
• The area of slices of the pie represents the percentage of the parts of the data. The
slices of pie are called wedges.
import matplotlib.pyplot as plt
# data to display on plots
x = [1, 2, 3, 4]
# this will explode the 1st wedge
# i.e. will separate the 1st wedge
# from the chart
e = (0.1, 0, 0, 0)
# This will plot a simple pie chart
plt.pie(x, explode=e)
# Title to the plot
plt.title("Pie chart")
plt.show()
2. Line Plot
A line plot isn't typically used for this dataset, but we can visualize trends by plotting
petal length against sepal length for each species. The examples in this part assume the
Iris dataset has been loaded into a DataFrame named iris_data, as in the snippet below.
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Iris dataset (columns: sepal_length, sepal_width, petal_length,
# petal_width, species)
iris_data = sns.load_dataset('iris')
plt.figure(figsize=(10, 6))
for species in iris_data['species'].unique():
    subset = iris_data[iris_data['species'] == species]
    plt.plot(subset['sepal_length'], subset['petal_length'], marker='o',
             linestyle='', label=species)
plt.title('Petal Length vs. Sepal Length')
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.legend()
plt.grid()
plt.show()
3. Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=iris_data, x='sepal_length', y='sepal_width', hue='species',
style='species', markers=["o", "s", "D"])
plt.title('Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.grid()
plt.show()
4. Bar Chart
plt.figure(figsize=(10, 6))
mean_petal_length = iris_data.groupby('species')['petal_length'].mean()
mean_petal_length.plot(kind='bar', color='orange')
plt.title('Average Petal Length by Species')
plt.xlabel('Species')
plt.ylabel('Average Petal Length')
plt.grid()
plt.show()
5. Histogram
plt.figure(figsize=(10, 6))
plt.hist(iris_data['petal_width'], bins=15, color='purple', alpha=0.7)
plt.title('Distribution of Petal Width')
plt.xlabel('Petal Width')
plt.ylabel('Frequency')
plt.grid()
plt.show()
6. Box Plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='species', y='petal_length', data=iris_data)
plt.title('Box Plot of Petal Length by Species')
plt.xlabel('Species')
plt.ylabel('Petal Length')
plt.grid()
plt.show()
Bokeh is a versatile and powerful Python library for creating interactive visualizations.
It’s designed to enable users to create elegant and informative graphics that can be
embedded in web applications.
Key Features of Bokeh
• Interactivity: Easily add widgets like sliders, dropdowns, and buttons to create
interactive plots.
• Web-Based: Bokeh visualizations can be embedded in web applications or exported
as standalone HTML files.
• Rich Output: Supports various types of visualizations including line plots, scatter
plots, bar charts, heatmaps, and more.
• Customizable: Offers extensive customization options for aesthetics and
functionality.
Interactive Scatter Plot with Bokeh
pip install bokeh
from bokeh.plotting import figure, show, output_file
from bokeh.models import ColumnDataSource, Slider, HoverTool
from bokeh.layouts import column
import numpy as np
# Step 1: Generate sample data
N = 100
x = np.random.rand(N) * 10
y = np.random.rand(N) * 10
sizes = np.random.randint(5, 50, size=N)
# Create a ColumnDataSource
source = ColumnDataSource(data=dict(x=x, y=y, size=sizes))
# Step 2: Prepare output file
output_file("interactive_scatter_plot.html")
# Step 3: Create a figure
p = figure(title="Interactive Scatter Plot", tools="")
# Add scatter renderer
scatter = p.scatter('x', 'y', source=source, size='size', alpha=0.6)
# Add a hover tool
hover = HoverTool()
hover.tooltips = [("X", "@x"), ("Y", "@y")]
p.add_tools(hover)
# Step 4: Create a slider for point size
size_slider = Slider(start=5, end=50, value=10, step=1, title="Point Size")
# Update function for the slider (note: a Python callback like this only runs
# under a Bokeh server, e.g. `bokeh serve script.py`; in the standalone HTML
# output the slider renders but the callback will not fire)
def update_size(attr, old, new):
    source.data['size'] = np.random.randint(5, size_slider.value + 1, size=N)
# Link the slider to the update function
size_slider.on_change('value', update_size)
# Step 5: Layout and show the plot
layout = column(size_slider, p)
show(layout)
Visualizations - Visual data analysis techniques
Visual data analysis techniques involve the use of graphical representations to explore,
understand, and communicate data insights. These techniques leverage visual elements to
highlight patterns, trends, relationships, and outliers within datasets.
1. Charts and Graphs
• Bar Charts: Display categorical data with rectangular bars. Useful for comparing
quantities across different categories.
• Histograms: Show the distribution of numerical data by grouping values into bins.
Ideal for visualizing frequency distributions.
• Line Charts: Connect data points with lines to illustrate trends over time. Commonly
used in time series analysis.
• Pie Charts: Represent parts of a whole. Useful for displaying proportions of
categories.
2. Statistical Visualizations
• Box Plots: Summarize data through quartiles, highlighting the median and potential
outliers. Useful for comparing distributions.
• Scatter Plots: Display relationships between two continuous variables, helping to
identify correlations and trends.
• Heatmaps: Use color gradients to represent values in a matrix format. Effective for
visualizing correlation matrices or frequency counts (see the sketch below).
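A minimal heatmap sketch, drawing the correlation matrix of the numeric Iris columns with seaborn (assumes the iris_data DataFrame loaded earlier):
import matplotlib.pyplot as plt
import seaborn as sns
corr = iris_data.drop(columns='species').corr()  # correlations between numeric columns
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Iris Features')
plt.show()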
3. Multivariate Analysis
• Pair Plots: Show scatter plots for multiple pairs of variables, allowing for a
comprehensive view of relationships (see the sketch after this list).
• 3D Scatter Plots: Visualize relationships involving three variables. Useful for
exploring complex datasets.
• Faceted Plots: Create a grid of plots based on the values of another variable, enabling
comparisons across groups.
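A short pair-plot sketch over the Iris dataset (again assuming the iris_data DataFrame from earlier):
import matplotlib.pyplot as plt
import seaborn as sns
# One scatter plot per pair of numeric variables, coloured by species
sns.pairplot(iris_data, hue='species')
plt.show()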
4. Geospatial Visualization
• Choropleth Maps: Use color shading to represent data values across geographical
regions. Effective for showing demographic or economic data.
• Point Maps: Display individual data points on a map, often used for location-based
analysis.
• Heat Maps: Show density of events over a geographical area, highlighting hotspots.
5. Interactive Visualizations
• Dashboards: Combine multiple visualizations into one interface for dynamic
analysis. Useful for monitoring key performance indicators (KPIs).
• Tooltips and Annotations: Provide additional context when hovering over data
points or elements, enhancing understanding.
• Filters and Sliders: Allow users to manipulate data views dynamically, facilitating
exploratory analysis.
6. Advanced Visualizations
• Tree Maps: Visualize hierarchical data using nested rectangles, representing
proportions within a category.
• Sankey Diagrams: Show flow and relationships between categories using arrows
whose width indicates the flow quantity.
• Network Graphs: Represent relationships between entities using nodes and edges,
useful for social network analysis.
7. Data Storytelling
• Narrative Visualizations: Combine visual elements with narrative techniques to tell
a data-driven story, enhancing engagement and understanding.
Interaction techniques
Interaction techniques in visual data analysis are designed to enhance user engagement
and facilitate a deeper exploration of data. These techniques allow users to interact with
visualizations in various ways, making it easier to analyze and interpret complex datasets.
1. Tooltips
Tooltips are small pop-up boxes that display additional information when a user hovers
over a data point or visual element. They provide context-specific details without
cluttering the visualization and can show values, labels, or descriptions that enhance
understanding. Tooltips are typically implemented using mouseover events in
visualization libraries, allowing for dynamic content display.
2. Hover Effects
Hover effects involve changing the appearance of data points or elements when the
mouse hovers over them, such as changing color, size, or opacity. They improve
visibility and focus attention on specific data elements, making it clear which data point
is being examined. Hover effects are often achieved with CSS styling or JavaScript
event listeners that trigger visual changes on hover.
3. Filters and Selection
Filters allow users to narrow down the data displayed based on specific criteria (e.g.,
categories, date ranges, or numeric values). They enable targeted analysis, helping users
focus on relevant data subsets and reducing information overload.
4. Sliders
Sliders are graphical controls that allow users to adjust numerical values or date ranges
dynamically. They provide a way to explore how changes in parameters affect the data
visualization in real time (a minimal sketch follows).
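A minimal slider sketch using ipywidgets inside a Jupyter notebook (assumes the iris_data DataFrame loaded earlier; the helper name plot_hist is made up for illustration):
import matplotlib.pyplot as plt
from ipywidgets import interact
def plot_hist(bins=10):
    # Redraw the petal-length histogram whenever the slider moves
    plt.hist(iris_data['petal_length'], bins=bins, color='teal')
    plt.xlabel('Petal Length')
    plt.ylabel('Frequency')
    plt.show()
# interact() builds an integer slider from the (min, max) tuple
interact(plot_hist, bins=(5, 50))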