DAV Notes
1. NumPy: Provides support for large, multi-dimensional arrays and matrices, and includes a
vast collection of high-level mathematical functions to efficiently manipulate and process
numerical data. NumPy serves as the foundation of most scientific computing in Python.
2. SciPy: Offers a comprehensive suite of functions for scientific and engineering applications,
including signal processing, linear algebra, optimization, statistics, and more. SciPy provides
modules for tasks such as image processing, sparse matrix operations, and Fourier
transforms, making it an indispensable tool for scientists and engineers.
3. Pandas: Provides high-performance, easy-to-use data structures and data analysis tools,
including Series and DataFrames. Pandas supports operations such as filtering, sorting,
grouping, merging, reshaping, and pivoting, making it a powerful tool for efficiently handling
and analyzing large datasets.
4. Statsmodels: Includes a wide range of statistical techniques, such as hypothesis testing,
confidence intervals, and regression analysis. Statsmodels provides models for linear
regression, generalized linear models, discrete models, robust linear models, and time series
analysis, making it a comprehensive tool for statistical analysis and modeling.
Interactive Environment
1. Jupyter: Provides an interactive environment for working with code, data, and visualizations,
supporting over 40 programming languages. Jupyter includes tools for creating and sharing
documents that contain live code, equations, visualizations, and narrative text, making it an
ideal platform for data science, scientific computing, and education.
DATA SCIENCE:
Data science is a multidisciplinary field that extracts insights and knowledge from structured and
unstructured data using various techniques, tools, and technologies. It combines aspects of
mathematics, statistics, computer science, domain knowledge, and data analysis to solve complex
problems and drive decision-making.
Business Benefits
Customer Benefits
Social Benefits
USES
1. Predictive analytics - Data science can help businesses predict demand and make better
business decisions.
2. Machine learning - Data science can help businesses understand customer behavior and
operational functions.
3. Recommendation systems - Data science can help businesses recommend products to
customers based on their preferences.
4. Fraud detection - Data science can help businesses identify and mitigate fraud.
5. Sentiment analysis - Data science can help businesses understand customer feedback and
sentiment.
FACETS OF DATA:
1. Structured
2. Semi-structured
3. Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
STRUCTURED DATA refers to highly organized and formatted data that can be easily stored,
queried, and analyzed in databases, accounting for only 5-10% of all informatics data. It is neatly
organized into rows, columns, and tables, making it easily searchable and manageable. This type of
data is well-defined, formatted, and efficiently stored, allowing for efficient analysis and decision-
making.
Employee ID | Name       | Department | Job Title     | Salary
101         | John Smith | Sales      | Sales Manager | 50000
UNSTRUCTURED DATA makes up approximately 80% of all data and refers to information that
lacks a predefined format or structure. It is typically more difficult to process, analyze, and store
compared to structured data.
The following list shows a few examples of machine-generated unstructured data:
Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture.
Photographs and video: This includes security, surveillance, and traffic video.
Radar or sonar data: This includes vehicular, meteorological, and seismic or oceanographic data.
The following list shows a few examples of human-generated unstructured data:
Social media data: This data is generated from social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
Mobile data: This includes data such as text messages and location information.
Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
i)Natural Language:
Natural language is a special type of unstructured data; it’s challenging to process because it
requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
ii)Graph based or Network Data:
In graph theory, a graph is a mathematical structure to model pair-wise relationships between
objects.
Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to represent social networks.
iii)Audio, Image & Video:
Audio, image, and video are data types that pose specific challenges to a data scientist.
MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video
capture to approximately 7 TB per game for the purpose of live, in-game analytics. High-speed
cameras at stadiums will capture ball and athlete movements to calculate in real time, for example,
the path taken by a defender relative to two baselines.
iv) Streaming Data:
Streaming data is data that is generated continuously by thousands of data sources, which typically
send in the data records simultaneously, and in small sizes (order of Kilobytes).
Examples are the-Log files generated by customers using your mobile or web applications, online
game activity, “What’s trending” on Twitter, live sporting or music events, and the stock market.
1. Business: Helps businesses make better decisions by understanding customer behavior and
market trends.
2. Healthcare: Used to predict diseases, optimize treatments, and improve patient care.
3. Finance: Helps detect fraud, assess risks, and make financial decisions.
4. Marketing: Improves marketing strategies by understanding customer preferences and
campaign performance.
5. Supply Chain: Optimizes inventory, deliveries, and product demand.
6. Sports: Analyzes player performance and improves strategies.
7. Retail: Enhances customer experience and boosts sales through personalized
recommendations.
8. Government: Improves public services, traffic management, and crime prevention.
9. Energy: Helps in energy forecasting, maintenance, and cost savings.
10. Education: Tracks student performance and improves learning outcomes.
The data science process is a systematic approach to extracting insights and knowledge from data. It
involves a series of steps that help data scientists identify problems, collect and analyze data, and
develop predictive models.
Step 1: Define the Problem and Create a Project Charter - Clearly defining the research goals
is the first step in the Data Science Process. A project charter outlines the objectives, resources,
deliverables, and timeline, ensuring that all stakeholders are aligned.
Step 2: Retrieve Data - Data can be stored in databases, data warehouses, or data lakes within an
organization. Accessing this data often involves navigating company policies and requesting
permissions.
Step 3: Data Cleansing, Integration, and Transformation - Data cleaning ensures that errors,
inconsistencies, and outliers are removed. Data integration combines datasets from different sources,
while data transformation prepares the data for modeling by reshaping variables or creating new features.
Step 4: Exploratory Data Analysis (EDA)- During EDA, various graphical techniques like scatter plots,
histograms, and box plots are used to visualize data and identify trends. This phase helps in selecting the
right modeling techniques.
Step 5: Build Models - In this step, machine learning or deep learning models are built to make
predictions or classifications based on the data. The choice of algorithm depends on the complexity of the
problem and the type of data.
Step 6: Present Findings and Deploy Models - Once the analysis is complete, results are
presented to stakeholders. Models are deployed into production systems to automate decision-
making or support ongoing analysis.
To understand the project, three key concepts must be explored: What, Why, and How.
The goal of the first phase is to answer these three questions. During this phase, the data science
team must investigate the problem, gain context, and understand the necessary data sources.
1. Learning the Business Domain - Understanding the domain area of the problem is essential for
data scientists to apply their computational and quantitative knowledge.
2. Resources - Assessing available resources, including technology, tools, systems, data, and people,
is crucial during the discovery phase.
3. Frame the Problem - Framing involves clearly stating the analytics problem to be solved and
sharing it with key stakeholders.
4. Identifying Key Stakeholders - Identifying key stakeholders, including those who will benefit
from or be impacted by the project, is vital for project success.
5. Interviewing the Analytics Sponsor - Collaborating with the analytics sponsor helps clarify and
frame the analytics problem, ensuring alignment with project goals.
6. Developing Initial Hypotheses - Developing initial hypotheses involves forming ideas to test
with data, providing a foundation for analytical tests in later phases.
7. Identifying Potential Data Sources - Identifying potential data sources requires considering the
volume, type, and time span of required data to support hypothesis testing.
RETRIEVING DATA
Retrieving required data is the second phase of a data science project, which may involve designing a
data collection process or obtaining data from existing sources. For instance, a company like
Amazon may collect customer purchase data to analyze buying patterns.
Types of Data Repositories - Data repositories can be classified into data warehouses, data lakes,
data marts, metadata repositories, and data cubes, each serving distinct purposes. For instance, a data
warehouse like Amazon Redshift can store and analyze large amounts of data from various sources.
Advantages and Disadvantages of Data Repositories - Data repositories offer advantages such as
data preservation, easier data reporting, and simplified problem tracking, but also pose disadvantages
like system slowdowns, data breaches, and unauthorized access. For example, a company like
Equifax experienced a massive data breach in 2017, compromising sensitive customer information.
Working with Internal Data - Data scientists start by verifying internal data stored within the
company, assessing its relevance and quality, and utilizing official data repositories such as
databases, data marts, data warehouses, and data lakes. For example, a retail company like Walmart
uses internal data to track sales, inventory, and customer behavior. Example: Loading Internal
Data.
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('internal_data.csv')
External Data Sources - When internal data is insufficient, data scientists can explore external
sources like other companies, social media platforms, and government organizations, which may
provide high-quality data for free or at a cost. For instance, the US Census Bureau provides free
demographic data that can be used for market research. Example: Loading External Data.
import pandas as pd
# Load data from an API
data = pd.read_json('https://api.example.com/data')
Data Quality Checks - Data scientists perform data quality checks to ensure accuracy,
completeness, and consistency.
NumPy: Performs statistical analysis and data quality checks.
Pandas: Performs data quality checks and handles missing data.
Real-Time Example: Predicting Customer Churn - A telecom company wants to predict customer
churn using data on call usage, billing, and customer complaints. The data scientist collects internal
data from the company's database, supplements it with external data from social media and market
research reports, and performs data quality checks to ensure accuracy and completeness. The cleaned
data is then used to train a machine learning model that predicts customer churn with high accuracy.
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv('customer_data.csv')
# Perform data quality checks
print(data.isnull().sum()) # Check for missing values
print(data.duplicated().sum()) # Check for duplicate rows
DATA PREPARATION - Data preparation involves data cleansing, integrating, and transforming
data.
1. Handling missing values: Replacing missing values with mean, median, or mode.
2. Smoothing noisy data: Removing outliers and handling inconsistencies.
3. Correcting errors: Fixing data entry errors, whitespace errors, and capital letter mismatches.
Outlier Detection
Outlier detection involves identifying data points that deviate drastically from the norm.
Dealing with Missing Values and Transforming Data
1. Reducing the number of variables: Using techniques like PCA or feature selection.
2. Turning variables into dummies: Converting categorical variables into binary variables (see the sketch after the outlier-removal code below).
import pandas as pd
import numpy as np
# Load data
data = pd.read_csv('customer_data.csv')
# Remove outliers: keep rows whose numeric values lie within 3 standard deviations of the mean
numeric = data.select_dtypes(include=[np.number])
data = data[(np.abs((numeric - numeric.mean()) / numeric.std()) < 3).all(axis=1)]
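As a sketch of the two transformation steps above (the column names and values here are made up for illustration):
import pandas as pd
import numpy as np
# Hypothetical data with a missing numeric value and a categorical column
df = pd.DataFrame({'age': [25, 30, np.nan, 40],
                   'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']})
# Replace missing 'age' values with the column mean
df['age'] = df['age'].fillna(df['age'].mean())
# Turn the categorical 'city' column into dummy (binary) variables
df = pd.get_dummies(df, columns=['city'])
print(df)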
1. EDA is an approach to exploring datasets using summary statistics and visualizations to gain
a deeper understanding of the data.
2. EDA helps determine how best to manipulate data sources to get the answers needed.
3. EDA makes it easier for data scientists to discover patterns, spot anomalies, test a hypothesis,
or check assumptions.
Methods of EDA
1. Univariate analysis provides summary statistics for each field in the raw data set, such as
central tendency and variability.
Example: Analyzing the distribution of exam scores using a histogram.
2. Bivariate analysis is performed to find the relationship between each variable in the dataset
and the target variable of interest.
Example: Analyzing the relationship between exam scores and study hours using a scatter
plot.
3. Multivariate analysis is performed to understand interactions between different fields in the
dataset.
Example: Analyzing the relationship between exam scores, study hours, and attendance using
a multiple regression analysis.
Box Plots
1. A box plot is a type of chart used in EDA to visually show the distribution of numerical data
and skewness.
2. A box plot displays the data quartiles or percentile and averages.
3. A box plot is useful for detecting and illustrating location and variation changes between
different groups of data.
Example:
Suppose we have a dataset of exam scores for four groups of students. We can use a box plot to
compare the scores for each group.
import pandas as pd
import matplotlib.pyplot as plt
# Univariate analysis
for column in data.columns:
    print(f"Univariate analysis for {column}:")
    print(data[column].describe())
    plt.hist(data[column], bins=10)
    plt.title(f"Histogram for {column}")
    plt.show()
# Bivariate analysis
for column1 in data.columns:
    for column2 in data.columns:
        if column1 != column2:
            print(f"Bivariate analysis for {column1} and {column2}:")
            plt.scatter(data[column1], data[column2])
            plt.xlabel(column1)
            plt.ylabel(column2)
            plt.title(f"Scatter plot for {column1} and {column2}")
            plt.show()
# Multivariate analysis
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(10, 8))
plt.show()
# Box plots
data.boxplot(figsize=(10, 8))
plt.show()
BUILDING MODELS
1. Model building involves selecting the right model and variables, executing the model, and
evaluating its performance.
2. The three components of model building are:
a. Selection of model and variable
b. Execution of model
c. Model diagnostics and comparison
Model Execution
1. Model diagnostics and comparison involve evaluating the performance of the model and
comparing it to other models.
2. Techniques include:
a. Holdout method
b. Cross-validation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# X (features) and y (target) are assumed to come from the prepared dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model on the training split
model = LinearRegression().fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
A data science team at a telecom company has been tasked with predicting customer churn. After
collecting and analyzing the data, the team has developed a predictive model that identifies
customers who are likely to churn.
Presenting Findings
The team presents their findings to the stakeholders in a clear and concise manner:
1. "Our analysis shows that customers who have been with the company for less than 6 months
are more likely to churn."
2. "We have developed a predictive model that can identify customers who are likely to churn
with an accuracy of 85%."
Building Applications
The team then builds an application that uses the predictive model to identify customers who are
likely to churn. The application provides recommendations to the customer service team on how to
retain these customers.
Key Benefits
1. Improved customer retention: The application helps the company to identify and retain
customers who are likely to churn.
2. Increased revenue: By retaining more customers, the company can increase its revenue.
3. Better customer service: The application provides recommendations to the customer service
team on how to improve customer service and retain customers.
Data warehousing is a crucial concept in data management that involves collecting and storing data
from various sources in a centralized repository. This enables businesses to make informed decisions
and gain valuable insights through data analysis.
Key Contributors
Bill Inmon: Known as the "Father of Data Warehousing," Inmon introduced the concept of data
warehousing in the 1980s.
Ralph Kimball: A pioneer in data warehousing, Kimball developed foundational methods and ideas
that shaped modern data warehousing and business intelligence.
Data Sources - Data warehouses collect data from various operational systems and platforms across the organization.
ETL Process
The ETL (Extract, Transform, Load) process is used to validate and format data before loading it into
a Staging Area.
ETL Steps - Extract data from the source systems, Transform it into a consistent, validated format, and Load it into the target storage (here, the staging area).
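A rough pandas sketch of the idea (the file names and the specific transformations are assumptions for illustration, not part of any particular tool):
import pandas as pd
# Extract: read raw data from a source file
raw = pd.read_csv('sales_raw.csv')
# Transform: validate and format the records
raw = raw.dropna(subset=['order_id'])                  # drop records missing a key field
raw['order_date'] = pd.to_datetime(raw['order_date'])  # standardize the date format
# Load: write the prepared data into the staging area
raw.to_csv('sales_staged.csv', index=False)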
Staging Area (Landing Zone) - After the ETL process, the data is stored in a staging area, a temporary storage zone where the extracted data is validated, cleaned, and formatted before moving further down the pipeline.
Operational Data Store (ODS) - After initial processing in the staging area, data moves to the ODS
(Operational Data Store). The ODS supports OLTP (Online Transaction Processing) for real-time
operational tasks, enabling instantaneous processing of transactions. This makes it ideal for tasks
such as handling current customer orders, tracking real-time inventory levels, managing ongoing
sales transactions, and updating customer information in real-time.
Data Warehouse (DWH) - After the Operational Data Store (ODS), data flows into the Data
Warehouse (DWH), a centralized, historical repository. It supports OLAP (Online Analytical
Processing), enabling deeper analysis and decision-making. Popular databases for DWH include
Oracle, SQL Server, MongoDB, Snowflake, and MySQL. Designed for scalability and efficiency,
the DWH allows for seamless querying of large datasets and can store multiple data marts, each
focused on a specific business area, such as sales, finance, or human resources.
Data Marts - Data in the warehouse is further segmented into Data Marts, which are specialized
subsets tailored to specific business areas, including:
Sales: Focuses on sales metrics like revenue, units sold, and more.
HR: Focuses on employee data, payroll, performance, and other HR-related metrics.
Finance: Tracks financial records, budgets, and other financial data.
Reporting and Visualization - After data is segmented into Data Marts, it is utilized by
reporting tools such as Power BI and Tableau to create informative outputs. These tools generate
dashboards for quick insights, reports with filtering capabilities, and visualizations to identify
trends, patterns, and correlations, ultimately supporting informed decision-making.
Introduction
Data mining is a powerful technique used to extract valuable insights from large datasets. A data
mining system architecture consists of several key components that work together to facilitate this
process.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it
can't be used directly for the data mining procedure because the data may not be complete and
accurate. So, the first data requires to be cleaned and unified. More information than needed will be
collected from various data sources, and only the data of interest will have to be selected and passed
to the server. These procedures are not as easy as we think. Several methods may be performed on
the data as part of selection, integration, and cleaning.
Data Mining Engine: The data mining engine is the core of the data mining system. It contains
multiple modules that perform various mining tasks, such as characterization, association and correlation analysis, classification, prediction, clustering, and outlier analysis. (What are the common methods used in Data Mining?)
The data mining engine uses algorithms and models to perform these tasks and extract meaningful
patterns from the data. It is the heart of the system where the actual data mining processes occur.
For data preprocessing to be successful, it's essential to have an overall picture of the data, and basic
statistical descriptions play a crucial role in this process. These descriptions help identify properties
of the data and highlight noise or outliers. Measures of central tendency include mean, median,
mode, and midrange. Measures of data dispersion include quartiles, interquartile range (IQR), and
variance. By leveraging these descriptive statistics, we can gain a deeper understanding of the data
distribution, ultimately informing data preprocessing tasks and driving more informed decision-
making.
These measures show where the "center" of a dataset is, or the typical value.
a) Mean (Average): The sum of all data points divided by the number of data points.
b) Median: The middle value when the data is sorted. If there's an even number of data points, it's the average of the two middle numbers.
Dataset: [2, 3, 5, 7, 11] → Median = 5 (since it's the middle value)
c) Mode: The value that occurs most frequently in the dataset.
d) Midrange: The average of the smallest and largest values in the dataset.
2. Measures of Dispersion
a) Range: The difference between the largest and smallest values in the dataset.
b) Variance: The average squared difference from the mean. It tells us how spread out the data
is.
c) Standard Deviation: The square root of variance, showing how much individual data points
differ from the mean.
3. Outliers
Outliers are values that are much larger or smaller than most of the data points.
a) Interquartile Range (IQR): IQR measures the spread of the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 − Q1.
b) Outlier Detection:
Outliers are data points that fall outside the range defined by [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR].
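A small NumPy sketch of these descriptive measures and the IQR outlier rule (the sample values are illustrative):
import numpy as np
data = np.array([2, 3, 5, 7, 11, 13, 40])   # 40 is deliberately extreme
# Central tendency
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Midrange:", (data.min() + data.max()) / 2)
# Dispersion
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print("Variance:", np.var(data))
print("Standard deviation:", np.std(data))
print("IQR:", iqr)
# Outlier rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("Outliers:", data[(data < lower) | (data > upper)])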
UNIT II: DESCRIBING DATA
Basics of Numpy Arrays
Aggregations and computations on arrays
Comparisons, masks, and boolean logic
Fancy indexing and structured arrays
Data Manipulation with Pandas
Types of data
Types of variables
Describing data with tables and graphs
Describing data with averages
Describing variability
Normal distributions and standard (z) scores
Describing Relationships
Correlation:
o Scatter plots
o Correlation coefficient for quantitative data
o Computational formula for correlation coefficient
Regression:
o Regression line
o Least squares regression line
o Standard error of estimate
o Interpretation of r²
o Multiple regression equations
o Regression towards the mean
WHAT IS NUMPY?
NumPy is a Python library used for working with arrays. It also has functions for working in
domain of linear algebra, fourier transform, and matrices. NumPy was created in 2005 by Travis
Oliphant. It is an open source project and you can use it freely. NumPy stands for Numerical
Python.
1. Checking NumPy Version
NumPy provides a way to check its installed version using np.__version__. This helps ensure
compatibility with other libraries and features.
import numpy as np
# Checking NumPy version
print("NumPy Version:", np.__version__)
Output:
NumPy Version: 1.23.0
2. Creating Arrays
NumPy arrays can be created from Python lists using np.array(), or with helper functions such as np.ones().
# 0D array (a single value wrapped in an array)
zero_d = np.array(42)
print("0D Array:", zero_d)
# 1D array
one_d = np.array([1, 2, 3, 4, 5])
print("1D Array:", one_d)
# 2D array
two_d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", two_d)
# Ones array
ones_arr = np.ones((2, 3)) # 2x3 matrix filled with ones
print("Ones Array:\n", ones_arr)
Output:
0D Array: 42
1D Array: [1 2 3 4 5]
2D Array:
[[1 2 3]
[4 5 6]]
Ones Array:
[[1. 1. 1.]
[1. 1. 1.]]
3. Indexing and Slicing
Array elements can be accessed by index (including negative indices) and extracted in ranges using slicing.
arr = np.array([10, 20, 30, 40, 50])
# Indexing
print("First Element:", arr[0])
print("Last Element:", arr[-1])
# Slicing
print("First Three Elements:", arr[:3])
print("Elements from Index 2 to End:", arr[2:])
print("Alternate Elements:", arr[::2])
Output:
First Element: 10
Last Element: 50
First Three Elements: [10 20 30]
Elements from Index 2 to End: [30 40 50]
Alternate Elements: [10 30 50]
4. Element-wise Operations
NumPy supports operations like addition, subtraction, multiplication, and division between
arrays and scalars.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Element-wise operations
print("Addition:", arr1 + arr2)
print("Subtraction:", arr1 - arr2)
print("Multiplication:", arr1 * arr2)
print("Division:", arr1 / arr2)
# Scalar operation
print("Multiply by 2:", arr1 * 2)
Output:
Addition: [5 7 9]
Subtraction: [-3 -3 -3]
Multiplication: [ 4 10 18]
Division: [0.25 0.4 0.5 ]
Multiply by 2: [2 4 6]
5. Aggregation Functions
NumPy provides aggregation functions like sum, mean, and standard deviation.
arr = np.array([10, 20, 30, 40, 50])
# Aggregation operations
print("Sum:", np.sum(arr))
print("Mean:", np.mean(arr))
print("Standard Deviation:", np.std(arr))
Output:
Sum: 150
Mean: 30.0
Standard Deviation: 14.142135623730951
6. Boolean Masking
Boolean masking filters elements based on conditions.
arr = np.array([10, 25, 30, 45, 50])
# Boolean masking
mask = arr > 30
print("Mask:", mask)
print("Filtered Elements:", arr[mask])
Output:
Mask: [False False False True True]
Filtered Elements: [45 50]
7. Fancy Indexing
Fancy indexing selects elements based on an array of indices.
arr = np.array([10, 20, 30, 40, 50])
# Fancy Indexing
indices = [0, 2, 4]
print("Selected Elements:", arr[indices])
Output:
Selected Elements: [10 30 50]
8. Reshaping Arrays
Reshaping changes the shape of an array without altering data.
arr = np.array([1, 2, 3, 4, 5, 6])
# Reshape into 2D array
reshaped_arr = arr.reshape(2, 3)
print("Reshaped Array:\n", reshaped_arr)
Output:
Reshaped Array:
[[1 2 3]
[4 5 6]]
9. Structured Arrays
Structured arrays allow storing different data types within the same array.
# Creating a structured array
data_type = [('age', int), ('score', float)]
structured_arr = np.array([(25, 89.5), (30, 95.2)], dtype=data_type)
# Accessing elements
print("Structured Array:\n", structured_arr)
print("Ages:", structured_arr['age'])
print("Scores:", structured_arr['score'])
Output:
Structured Array:
[(25, 89.5) (30, 95.2)]
Ages: [25 30]
Scores: [89.5 95.2]
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Select the 'Name' and 'Salary' columns
print(df[['Name', 'Salary']])
Output:
Name  Salary
0 John 50000
1 Alice 60000
2 Bob 70000
3 Charlie 80000
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Select a row by label using loc
print(df.loc[2])
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Select the 'Salary' of the row where 'Name' is 'Bob' using loc
print(df.loc[2, 'Salary'])
70000
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Select rows where Age is greater than 30 and Salary is greater than 60000
print(df.query('Age > 30 and Salary > 60000'))
Output:
Name Age Salary
2 Bob 45 70000
3 Charlie 40 80000
7. Selecting Specific Rows and Columns Using .loc[] and .iloc[]
You can use .loc[] and .iloc[] for selecting specific rows and columns. This is useful when you
want to slice both rows and columns.
Syntax:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Select rows 0-2 and the 'Name' and 'Salary' columns by label
print(df.loc[0:2, ['Name', 'Salary']])
# Select the same rows and columns by integer position
print(df.iloc[0:3, [0, 2]])
4. Renaming Columns
Columns can be renamed using the rename() method.
Syntax:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Rename 'Salary' column to 'Income'
df.rename(columns={'Salary': 'Income'}, inplace=True)
print(df)
Output:
Name Age Income
0 John 28 50000
1 Alice 34 60000
2 Bob 45 70000
3 Charlie 40 80000
5. Dropping Columns
Columns can be removed using the drop() method.
Syntax:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Drop the 'Age' column
df = df.drop(columns=['Age'])
print(df)
6. Dropping Rows
Rows can be removed based on their index using the drop() method.
Syntax:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Drop the row with index 1
df = df.drop(index=1)
print(df)
3. Sorting by Index
In addition to sorting by columns, data can also be sorted based on the DataFrame's index using
sort_index(). This is particularly useful when the index is meaningful, such as dates or custom
identifiers.
Syntax:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Sort rows by index (descending order here to make the effect visible)
print(df.sort_index(ascending=False))
For comparison, sorting by a column instead of the index:
# Sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie'],
'Age': [28, 34, 45, 40],
'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
# Sort rows by the 'Age' column
print(df.sort_values(by='Age'))
Output:
Name Age Salary
0 John 28 50000
1 Alice 34 60000
3 Charlie 40 80000
2 Bob 45 70000
E) Grouping Data in Pandas
Grouping data in Pandas allows efficient aggregation, summarization, and analysis. The
groupby() function is used to group data based on one or more columns.
1. Grouping by a Single Column
The groupby() function can group a DataFrame by a single column and apply aggregate
functions like sum(), mean(), count(), etc.
Syntax:
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'IT', 'IT', 'HR', 'Finance', 'Finance'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Salary': [50000, 60000, 70000, 55000, 80000, 90000]
}
df = pd.DataFrame(data)
# Iterate over the groups produced by groupby()
for dept, group in df.groupby('Department'):
    print(f"Department: {dept}")
    print(group)
Output:
Department: HR
Department Employee Salary
0 HR Alice 50000
3 HR David 55000
Department: IT
Department Employee Salary
1 IT Bob 60000
2 IT Charlie 70000
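The same grouping can also feed aggregate functions; a brief sketch using the Department/Salary data above:
import pandas as pd
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance', 'Finance'],
    'Salary': [50000, 60000, 70000, 55000, 80000, 90000]
})
# Mean salary per department
print(df.groupby('Department')['Salary'].mean())
# Several aggregates at once
print(df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count']))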
# Sample DataFrames
df1 = pd.DataFrame({
'Employee_ID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'IT', 'Finance', 'IT']
})
df2 = pd.DataFrame({
'Employee_ID': [101, 102, 103, 105],
'Salary': [50000, 60000, 70000, 80000]
})
The how parameter controls the join type:
inner: Returns only the rows with matching keys in both DataFrames (the default).
left: Returns all rows from the left DataFrame, matching where possible.
right: Returns all rows from the right DataFrame, matching where possible.
outer: Returns all rows from both DataFrames, filling missing matches with NaN.
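A short sketch of merging df1 and df2 (defined above) with different join types:
# Inner join: keeps only Employee_IDs present in both frames (101-103)
inner = pd.merge(df1, df2, on='Employee_ID', how='inner')
print(inner)
# Left join: keeps all rows of df1; Salary is NaN where no match exists (Employee_ID 104)
left = pd.merge(df1, df2, on='Employee_ID', how='left')
print(left)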
# Create index-based versions of the DataFrames (df5 and df6 are assumed to be df1 and df2 with Employee_ID as the index)
df5 = df1.set_index('Employee_ID')
df6 = df2.set_index('Employee_ID')
# Merging on indexes
merged_df = pd.merge(df5, df6, left_index=True, right_index=True)
print(merged_df)
Output:
Name Department Salary
Employee_ID
101 Alice HR 50000
102 Bob IT 60000
103 Charlie Finance 70000
# Sample DataFrame
data = {
'Age': [23, 28, 34, 45, 28, 23],
'Name': ['John', 'Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Salary': [50000, 60000, 70000, 80000, 55000, 45000],
'Department': ['HR', 'IT', 'IT', 'HR', 'HR', 'IT']
}
df = pd.DataFrame(data)
1. Types of Data
Data can be classified into various types based on its structure, nature, and measurement levels.
Structured Data: Well-organized data stored in tabular formats like databases and
spreadsheets (e.g., sales records, student databases).
Unstructured Data: Data without a predefined format (e.g., images, videos, social media
posts).
Semi-Structured Data: A mix of both structured and unstructured data, such as JSON
and XML files.
2. Types of Variables
Categorical Variables
Nominal Variables: Categories without a meaningful order (e.g., blood groups, colors).
Ordinal Variables: Categories with a meaningful order but unknown differences
between values (e.g., education levels, customer ratings).
Numerical Variables
Discrete Variables: Countable values, usually whole numbers (e.g., number of students).
Continuous Variables: Values measured on a continuous scale (e.g., height, temperature).
import pandas as pd
import matplotlib.pyplot as plt
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000],
'Department': ['HR', 'IT', 'Finance', 'IT']}
df = pd.DataFrame(data)
# Displaying summary statistics
print(df.describe())
# Displaying frequency count for categorical data
print(df['Department'].value_counts())
# Displaying the distribution of ages as a histogram
df['Age'].hist(bins=5)
plt.show()
Output
A histogram displaying the distribution of ages.
A frequency table showing the count of employees in each department.
mean_age = df['Age'].mean()
print(f"Mean Age: {mean_age}")
median_salary = df['Salary'].median()
print(f"Median Salary: {median_salary}")
mode_department = df['Department'].mode()[0]
print(f"Mode Department: {mode_department}")
5. Describing Variability
Range: The difference between the largest and smallest values.
range_age = df['Age'].max() - df['Age'].min()
print(f"Range of Age: {range_age}")
Variance
variance_salary = df['Salary'].var()
print(f"Variance in Salary: {variance_salary}")
Standard Deviation
std_salary = df['Salary'].std()
print(f"Standard Deviation of Salary: {std_salary}")
Normal Distribution
A normal distribution follows a bell curve, where most values are concentrated around the mean.
from scipy.stats import zscore
df['Salary_Zscore'] = zscore(df['Salary'])
print(df[['Salary', 'Salary_Zscore']])
Interpreting Z-Scores
A z-score indicates how many standard deviations a value lies above (positive) or below (negative) the mean.
Problem Statement:
A university conducted an entrance exam where scores follow a normal distribution with a mean
(𝜇) of 70 and a standard deviation (𝜎) of 10. Answer the following questions based on this
information:
(a) What proportion of students scored below 60?
For X = 60: Z = (X − μ) / σ = (60 − 70) / 10 = −1.0
From the z-table, the probability corresponding to Z = -1.0 is 0.1587.
Interpretation: 15.87% of students scored below 60.
(b) What is the minimum score required to be in the top 10%?
The top 10% corresponds to the 90th percentile, which has a Z-score of 1.28 (from the z-table).
Using the formula X = μ + Zσ: X = 70 + 1.28 × 10 = 82.8, so a score of about 83 is required.
(c) Is a score of 40 unusual?
For X = 40: Z = (40 − 70) / 10 = −3.0
Interpretation: Since Z = -3.0, this score could be an outlier as it falls three standard deviations below the mean.
2. Scatter Plots
Scatter plots visually represent the relationship between two variables.
Example Problem 1:
A researcher collects data on temperature (°C) and ice cream sales (units sold per month):
Month | Temperature (°C) | Ice Cream Sales (units)
Jan   | 5  | 50
Feb   | 7  | 60
Mar   | 12 | 85
Apr   | 18 | 120
May   | 24 | 200
Jun   | 30 | 300
Jul   | 35 | 400
Aug   | 33 | 380
Sep   | 28 | 290
Oct   | 20 | 180
Nov   | 12 | 90
Dec   | 6  | 55
data = {
"Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov",
"Dec"],
"Temperature": [5, 7, 12, 18, 24, 30, 35, 33, 28, 20, 12, 6],
"Ice_Cream_Sales": [50, 60, 85, 120, 200, 300, 400, 380, 290, 180, 90, 55]
}
df = pd.DataFrame(data)
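A short sketch of the scatter plot for this dataset (using the DataFrame built above):
import matplotlib.pyplot as plt
plt.scatter(df["Temperature"], df["Ice_Cream_Sales"], color='blue')
plt.title("Temperature vs Ice Cream Sales")
plt.xlabel("Temperature (°C)")
plt.ylabel("Ice Cream Sales (units)")
plt.show()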
Steps to compute the correlation coefficient manually:
1. Compute the means of X and Y.
2. Compute the deviations of each value from its mean.
3. Compute the sum of squares for X and Y.
4. Apply the correlation formula: r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²)
Solution: Using NumPy
import numpy as np
X = df["Temperature"]
Y = df["Ice_Cream_Sales"]
# Steps 1-2: Deviations from the means
x_dev = X - X.mean()
y_dev = Y - Y.mean()
# Steps 3-4: Sum of cross-products and sums of squares
numerator = np.sum(x_dev * y_dev)
denominator = np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))
# Step 5: Compute r
r = numerator / denominator
print(f"Correlation coefficient (computed manually): {r:.2f}")
Expected Output:
Correlation coefficient (computed manually): 0.98
This confirms a strong positive correlation between temperature and ice cream sales.
5. Interpretation of Correlation Results
Example Problem 4:
A store manager wants to decide whether to stock more ice creams based on temperature data.
How should they interpret the correlation result?
Solution:
Since r = 0.98, there is a very strong positive correlation between
temperature and ice cream sales.
As temperature increases, ice cream sales increase.
REGRESSION ANALYSIS AND MODELING CONCEPTS
1. Regression
Regression is a statistical technique used to model relationships between a dependent variable
and one or more independent variables. It helps in prediction and understanding the impact of
variables on an outcome.
Regression Line
A regression line is a straight line that best represents the relationship between the dependent
and independent variables in a regression model. It is used in statistical modeling to predict the
value of the dependent variable based on the independent variable(s).
Least Squares Regression Line
The least squares regression line is the line that minimizes the sum of the squared vertical distances (residuals) between the observed data points and the line.
The Standard Error of Estimate (SE or SEE) measures the accuracy of predictions made by a
regression model. It quantifies the dispersion of observed values around the regression line.
Coefficient of determination
The coefficient of determination measures the proportion of variance in the dependent variable
that is explained by the independent variable(s) in a regression model.
Multiple Regression Equations
A multiple regression equation predicts the dependent variable from two or more independent variables, each with its own coefficient.
Regression Towards the Mean
This concept states that extreme values tend to move closer to the mean in subsequent
observations.
Example: If a student scores exceptionally high in one test, their next score is likely to be closer to the mean.
Logistic Regression
Logistic regression is a statistical method used for classification when the dependent variable is categorical, estimating the probability of an event occurring using the logistic function:
P(Y = 1) = 1 / (1 + e^-(a + bx))
where P(Y = 1) represents the probability of the event, a is the intercept, and b is the coefficient of the independent variable x. It is widely used in applications such as disease prediction, credit risk assessment, and customer behavior analysis.
Problem Statement
A company wants to analyze the relationship between advertising expenditure (in $1000s) and
sales revenue (in $1000s). The following data is collected:
Advertising Expenditure ($1000s) | Sales Revenue ($1000s)
1 | 2
2 | 2.5
3 | 3.2
4 | 4.1
5 | 4.8
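A minimal sketch of fitting the least squares regression line to this data with NumPy (np.polyfit is one of several ways to do this):
import numpy as np
x = np.array([1, 2, 3, 4, 5])            # advertising expenditure ($1000s)
y = np.array([2, 2.5, 3.2, 4.1, 4.8])    # sales revenue ($1000s)
# Fit a first-degree polynomial: y = a + b*x
b, a = np.polyfit(x, y, 1)
print(f"Regression line: y = {a:.2f} + {b:.2f}x")
# Coefficient of determination (r squared) from the correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(f"r^2 = {r**2:.3f}")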
Matplotlib
Matplotlib is a powerful Python library used for creating static, animated, and interactive
visualizations in Python. It is widely used for data visualization and plotting because of its
flexibility and ease of use.
1. Basic Structure
Plotting: You can use matplotlib.pyplot (commonly imported as plt) to plot data. This
module provides functions to create simple line plots, bar charts, histograms, and
scatter plots.
Figure and Axes: A plot in Matplotlib is made up of a Figure (the whole image or
canvas) and Axes (the actual plotting area, where data is visualized). You can create
multiple Axes in a Figure to make subplots.
2. Core Components
Figure: Represents the overall window or image where everything will be drawn.
You can create a figure using plt.figure().
Axes: The area of the figure where the data is displayed (the plot itself). You can add
axes using plt.subplot() or fig.add_subplot().
Axis: The x and y axes that represent the data’s coordinates.
Lines and Markers: These represent the data points and the lines connecting them
(for line plots). You can modify them with various styles, colors, and markers.
3. Creating Plots
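A minimal sketch of creating a Figure, adding an Axes, and drawing a simple plot (the data values are illustrative):
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
fig = plt.figure()               # the Figure is the overall canvas
ax = fig.add_subplot(1, 1, 1)    # an Axes is the actual plotting area
ax.plot(x, y)                    # draw a line plot on the Axes
plt.show()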
4. Customizing Plots
plt.title("Sample Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.legend()
plt.grid(True)
plt.show()
5. Subplots
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
# Create a figure with one row of two subplots
fig, axes = plt.subplots(1, 2)
axes[0].plot(x, y)
axes[0].set_title("First Plot")
axes[1].bar(x, y)
axes[1].set_title("Second Plot")
plt.show()
6. Saving Plots
Once you create a plot, you can save it to a file (e.g., PNG, PDF, etc.) using
plt.savefig('filename.png').
8. Interactive Plots
Matplotlib also allows for interactive plotting (with plt.ion() to enable interactive
mode).
You can zoom, pan, or even create dynamic plots with the help of libraries like
matplotlib.animation.
Example
Matplotlib can be installed with:
pip install matplotlib
To import Matplotlib in Python, you typically use the following line of code:
import matplotlib.pyplot as plt
This imports the pyplot module of Matplotlib, which is commonly used for plotting graphs and charts.
x = [1, 2, 3, 4, 5]   # sample data
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.title("Simple Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
Line Plot
A line plot is a type of graph that displays data points along a continuous line. It's commonly
used to visualize trends, relationships, or patterns over a period of time or across ordered
categories. In a line plot, individual data points are connected by straight lines, allowing you
to easily identify changes or trends.
1. X-axis: Represents the independent variable, often time, categories, or ordered data.
2. Y-axis: Represents the dependent variable, or the values corresponding to the data
points on the x-axis.
3. Data Points: Represent individual values (x, y) on the graph.
4. Line: Connects the data points, which helps to highlight the trend or pattern.
Trends Over Time: Line plots are particularly useful for showing how data changes
over time, like stock prices, temperature variations, or sales over months.
Patterns: They help in identifying patterns, such as increases, decreases, or cycles in
the data.
Comparing Multiple Data Sets: You can plot multiple lines on the same graph to
compare different data sets.
1. Data Points: Represented as dots or markers at specific coordinates (x, y). Each point
corresponds to a value from your dataset.
2. Line: A continuous line connecting the data points, which helps to visualize trends.
3. Axes:
o X-axis: Usually represents categories or a continuous variable (like time).
o Y-axis: Represents the numerical values associated with the data points.
Basic Line Plot Example:
import matplotlib.pyplot as plt
days = [1, 2, 3, 4, 5, 6, 7]              # sample data: days of the week
temps = [22, 24, 23, 26, 27, 25, 24]      # recorded temperatures
plt.plot(days, temps, label="City A")
plt.title("Weekly Temperature")
plt.ylabel("Temperature (°C)")
plt.legend()
plt.show()
When to use a line plot:
1. Time Series Data: When you want to track changes in a variable over time (e.g.,
temperature, stock market prices, or website traffic).
2. Comparing Multiple Data Sets: Multiple lines can be drawn on the same plot to
compare different datasets (for example, comparing the temperatures of two cities
over the same time period).
3. Finding Trends: A line plot helps in recognizing if the data is increasing, decreasing,
or following a cyclical pattern.
plt.plot(x, y, color='green', linestyle='--', marker='o') # green dashed line with circle markers
You can plot multiple lines on the same graph by calling plt.plot() multiple times:
x = [1, 2, 3, 4, 5]        # sample data
y = [1, 4, 9, 16, 25]      # x squared
y2 = [1, 2, 3, 4, 5]
y3 = [1, 8, 27, 64, 125]
plt.plot(x, y, label="x^2")
plt.plot(x, y2, label="x")
plt.plot(x, y3, label="x^3")
plt.title("Multiple Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend() # Display legend for the lines
plt.show()
3. Adding Grid:
plt.plot(x, y)
plt.grid(True) # Show gridlines
plt.show()
Scatter plots
A scatter plot is used to display the relationship between two variables by plotting data points
on a Cartesian plane. In Matplotlib, you can create scatter plots using the scatter() function
from the pyplot module.
Explanation:
plt.scatter(x, y, color='red', s=100) # Red color and size of 100 for the points
color: Changes the color of the points. You can use color names like 'red', 'blue',
'green', or hex color codes like '#FF5733'.
s: Controls the size of the points. Larger values mean bigger dots.
If you have a third variable, you can color the points based on it by passing it to the c parameter together with a colormap (cmap); see the colormap scatter example later in this unit.
You can also label individual points by placing text next to them with plt.text():
for i in range(len(x)):
    plt.text(x[i], y[i], f'({x[i]}, {y[i]})', fontsize=9)
You can plot multiple scatter plots on the same graph by calling plt.scatter() multiple times:
x = [1, 2, 3, 4, 5]        # sample data
y = [2, 4, 6, 8, 10]
y2 = [1, 2, 3, 4, 5]
y3 = [5, 10, 15, 20, 25]
plt.scatter(x, y, color='blue', label='First set')
plt.scatter(x, y2, color='red', label='Second set')
plt.scatter(x, y3, color='green', label='Third set')
plt.legend() # Display legend
plt.show()
Visualizing errors
Visualizing errors is a key part of data analysis and model evaluation. It helps you
understand how far off your predictions or measurements are from the true values and
provides insight into the performance of a model or the reliability of data.
There are several ways to visualize errors in data, and Matplotlib, along with other libraries
like Seaborn and NumPy, can be used to plot and analyze them.
1. Prediction Errors: The difference between predicted and actual values (e.g., in
machine learning or regression tasks).
2. Measurement Errors: The difference between observed measurements and the true
values in experimental data.
3. Residuals: In regression analysis, residuals are the differences between the observed
and predicted values.
1. Error Bars: Error bars represent the uncertainty or variability of a data point. They
show the range within which the true value is expected to lie. Error bars can be added
to plots to visually represent how much variation there is in each data point.
o How to use in Matplotlib: The plt.errorbar() function allows you to add
vertical and/or horizontal error bars to a plot.
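A minimal sketch (the data and error values are illustrative):
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
y_err = [0.3, 0.4, 0.2, 0.5, 0.3]   # uncertainty in each y value
# yerr adds vertical error bars; fmt='o' draws the points as circles
plt.errorbar(x, y, yerr=y_err, fmt='o', capsize=4)
plt.title("Data with Error Bars")
plt.show()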
In this example, the error bars indicate the uncertainty in the y values. The parameter
yerr defines how much error is associated with each y data point. The fmt='o'
argument ensures that the data points are marked as circles.
2. Residual Plots: Residuals are the differences between the predicted values and the
actual values. A residual plot shows the residuals on the y-axis and the predicted
values or input features on the x-axis.
Residual plots help assess whether a regression model is appropriate for the data.
Ideally, residuals should be randomly scattered around 0, which indicates a good fit.
Patterns in residuals can indicate problems such as non-linearity or heteroscedasticity.
import numpy as np
import matplotlib.pyplot as plt
# Example data: y = 2x + 1
x = np.array([1, 2, 3, 4, 5])
y_actual = 2 * x + 1
y_predicted = 2 * x + 0.5 # Slightly incorrect predictions
# Residuals (error)
residuals = y_actual - y_predicted
# Create a residual plot
plt.scatter(x, residuals, color='red')
plt.axhline(0, color='black', linestyle='--') # Horizontal line at y=0 for reference
plt.title('Residual Plot')
plt.xlabel('X')
plt.ylabel('Residuals (Actual - Predicted)')
plt.show()
The red points show the residuals for each data point. The dashed black line at y = 0
represents where the residuals should ideally be if the predictions were perfect.
3. Absolute Error Plot: In some cases, it's useful to visualize the absolute error (the
absolute difference between the predicted and actual values) to better understand how
much error is present at each data point. This removes the sign of the error and
focuses only on the magnitude.
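A brief sketch, reusing the data from the residual example above:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4, 5])
y_actual = 2 * x + 1
y_predicted = 2 * x + 0.5
# Absolute error: magnitude of the difference, ignoring its sign
abs_error = np.abs(y_actual - y_predicted)
plt.bar(x, abs_error, color='orange')
plt.title('Absolute Error per Data Point')
plt.xlabel('X')
plt.ylabel('|Actual - Predicted|')
plt.show()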
Here, the absolute error removes any negative signs, showing only the magnitude of
the error.
Key Takeaways:
Density and contour plots are two common types of visualizations used to represent the
distribution and relationships of data, particularly when dealing with continuous variables.
Both of these plots provide valuable insights into the structure, density, and patterns in
multivariate data.
1. Density Plots
A density plot is a smooth curve that represents the distribution of data. It’s an alternative to
a histogram and is often used to visualize the probability density function (PDF) of a
continuous random variable. Essentially, a density plot shows where data is concentrated and
the overall distribution pattern.
Key Features of a Density Plot:
In Matplotlib, you can create density plots using plt.hist() with the density=True argument or
by using the seaborn library, which has built-in support for density plots via sns.kdeplot().
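A short sketch of a density-normalized histogram (random data for illustration; the seaborn variant is shown as a comment and assumes seaborn is installed):
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)   # 1000 samples from a normal distribution
# density=True scales the bars so the total area is 1
plt.hist(data, bins=30, density=True, alpha=0.6, color='skyblue')
plt.title("Density Plot (normalized histogram)")
plt.show()
# Alternatively, a smooth KDE curve with seaborn:
# import seaborn as sns
# sns.kdeplot(data)
# plt.show()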
2. Contour Plots
A contour plot is a graphical representation of data where lines are drawn to connect points
of equal value. Contour plots are often used to represent 3D data on a 2D surface, where the
third dimension is represented as contour lines or filled contours on the plot.
To create contour plots in Matplotlib, you can use the plt.contour() or plt.contourf()
functions (the latter fills the regions between contour lines).
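A minimal sketch matching the description below:
import numpy as np
import matplotlib.pyplot as plt
# Build a grid of (X, Y) points
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
# Third variable, visualized through contour lines
Z = np.sin(np.sqrt(X**2 + Y**2))
cs = plt.contour(X, Y, Z, levels=10)
plt.colorbar(cs)
plt.title("Contour Plot")
plt.show()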
In this example:
X and Y are the grid points, representing the two variables.
Z is the function (e.g., sin(sqrt(X² + Y²))), representing the third variable that will be
visualized through contours.
The contour lines represent different levels of the function Z.
If you want to fill the regions between the contour lines, you can use plt.contourf() instead of
plt.contour(). This creates a filled contour plot.
A histogram is a type of graph that represents the distribution of a dataset by grouping data
points into bins and showing how many data points fall into each bin. This makes it useful for
understanding the frequency distribution of a dataset, such as how many values are in a
specific range.
Histograms are primarily used for continuous data, but they can also be applied to discrete
data. They provide insights into the data’s central tendency, spread, and potential outliers.
1. Bins (Intervals):
o The range of data is divided into intervals, also known as bins. Each bin
represents a specific range of values.
o The width of the bins determines the granularity of the data visualization.
2. Frequency:
o The height of each bar represents the frequency or count of data points within
that bin's range.
3. X-axis:
o Represents the range of values or data points (the variable being analyzed).
4. Y-axis:
o Represents the frequency or count of data points in each bin.
5. Bars:
o Each bar represents a bin, and its height corresponds to the number of data
points that fall within the interval for that bin.
Matplotlib provides a simple function, plt.hist(), to create histograms. Here’s how you can
create one:
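A short sketch consistent with the explanation below (random data for illustration):
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)   # sample data
plt.hist(data, bins=30, color='blue', edgecolor='black', alpha=0.7)
plt.title("Histogram of Sample Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()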
Explanation:
plt.hist(data): This function creates a histogram for the data provided. By default, it
divides the data into 10 bins.
bins=30: Specifies the number of bins (intervals) for the histogram. More bins give
finer details, while fewer bins offer a more general view.
color='blue': Specifies the color of the bars.
edgecolor='black': Specifies the color of the borders around the bars.
alpha=0.7: Sets the transparency level of the bars (a value between 0 and 1).
Customizing a Histogram:
Changing Bin Size: You can adjust the number of bins based on how detailed you
want the plot to be.
Logarithmic Scale: In case of highly skewed data, you may want to plot the
histogram using a logarithmic scale for better visualization of differences in
frequencies.
Multiple Histograms: You can plot multiple histograms on the same graph for
comparison.
You can also plot multiple datasets on the same histogram for comparison. For example,
comparing two different distributions:
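A brief sketch comparing two randomly generated distributions:
import numpy as np
import matplotlib.pyplot as plt
data1 = np.random.normal(0, 1, 1000)     # mean 0, std 1
data2 = np.random.normal(2, 1.5, 1000)   # mean 2, std 1.5
plt.hist(data1, bins=30, alpha=0.5, label='Distribution 1')
plt.hist(data2, bins=30, alpha=0.5, label='Distribution 2')
plt.legend()
plt.title("Comparing Two Distributions")
plt.show()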
In this example, the transparency (alpha) keeps overlapping bars visible so the two distributions can be compared.
Interpreting a Histogram:
2. Normalized Histograms: In some cases, you may want to normalize the histogram so
that the total area under the bars sums to 1, representing a probability distribution.
A legend in data visualization is a key component that helps viewers understand the meaning
of various graphical elements (such as lines, colors, markers, or bars) in a plot. Legends
provide labels for different components in a plot, clarifying which color or symbol
corresponds to which data series, category, or variable. They are essential for making the plot
interpretable and accessible to the audience.
In Matplotlib, legends are easily added using the plt.legend() function. This function uses
labels defined in the plot’s elements (e.g., lines, bars, etc.) to create the legend.
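A minimal sketch matching the explanation below:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
plt.plot(x, [v**2 for v in x], color='blue', label='y = x^2')
plt.plot(x, x, color='red', label='y = x')
plt.legend()
plt.title("Plot with a Legend")
plt.show()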
Explanation:
label='y = x^2' and label='y = x': These labels define what the legend will show.
They correspond to the blue and red lines, respectively.
plt.legend(): This adds the legend to the plot, which will display the labels and colors
that match the respective lines.
1. Positioning the Legend: By default, plt.legend() places the legend in the "best"
location (where it doesn’t overlap with the data). However, you can manually position
the legend using the loc parameter.
2. Adding Multiple Legends: If you have multiple series or elements, you can add more
than one legend. You can also use the bbox_to_anchor argument to place the legend
in specific coordinates.
3. Changing Legend Font Size and Style: You can customize the appearance of the
legend, such as changing the font size, font family, and color.
4. Using Custom Markers: In some cases, you might want to specify a marker style for
your legend, which can differ from the one used in the plot.
1. Legend with a Bar Plot: In a bar plot, each bar can have a label, and the legend will
display the label for each series.
2. Legend with a Scatter Plot: In scatter plots, each set of points can be represented by
different colors, and the legend will explain what those colors represent.
1. Custom Legend Markers: You can specify custom markers for the legend using the
Line2D objects from matplotlib.lines for more control.
2. Legend Outside the Plot: Legends can be placed outside the plot to avoid cluttering
the graph, using bbox_to_anchor and loc parameters.
Colors play a crucial role in data visualization, helping to differentiate various elements,
convey meaning, and make plots more aesthetically appealing. In Matplotlib, you have a
wide range of options for customizing the colors of different plot elements, such as lines,
bars, markers, text, and backgrounds.
1. Named Colors: Matplotlib supports a variety of named colors. These are predefined
color names such as 'red', 'blue', 'green', etc. You can directly use these names in your
plots.
plt.plot(x, y, color='blue')
2. Hexadecimal Colors: You can specify colors using hexadecimal codes (often used
in web design). A hex code starts with a #, followed by six digits or letters
representing red, green, and blue (RGB) components.
3. RGB Tuples: Colors can also be specified as a tuple of RGB values, where each
value is between 0 and 1.
4. RGBA: If you want to set the opacity (alpha channel), you can use RGBA tuples.
The fourth value represents the alpha (transparency), where 1 is fully opaque, and 0 is
fully transparent.
5. Grayscale: You can specify a color in grayscale by passing a single float value
between 0 and 1. 0 represents black, and 1 represents white.
plt.plot(x, y, color='0.5') # Mid-gray color
6. Colormap (Cyclic Colors): You can use colormaps (gradient color scales) for
continuous data. Colormaps are often used in heatmaps, surface plots, and contour
plots to represent the intensity or value of data points.
Matplotlib allows you to customize various elements of your plot with colors:
In a line plot, you can set the color of the line using the color parameter.
For bar plots, you can assign different colors to each bar or all bars by setting the color
parameter.
In a scatter plot, you can assign a specific color to each point, or color them according to
another variable (e.g., z values for a colormap).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
# Scatter plot with specific color for each point
plt.scatter(x, y, color='purple')
# Scatter plot with colormap based on values
z = [10, 20, 30, 40, 50] # Some values to map to color
plt.scatter(x, y, c=z, cmap='plasma')
plt.colorbar() # Show color scale
plt.title("Scatter Plot with Colors")
plt.show()
For pie charts, you can specify custom colors for each slice.
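A small sketch (the category names, values, and colors are illustrative):
import matplotlib.pyplot as plt
sizes = [40, 30, 20, 10]
labels = ['A', 'B', 'C', 'D']
colors = ['gold', 'skyblue', 'lightgreen', 'salmon']
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%')
plt.title("Pie Chart with Custom Colors")
plt.show()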
A colormap is a range of colors used to represent data values. You can apply colormaps to
continuous data like heatmaps, 2D histograms, and surface plots.
Matplotlib offers several colormaps like 'viridis', 'plasma', 'inferno', 'cividis', etc.
import numpy as np
import matplotlib.pyplot as plt
# Create a 2D dataset (for example, a random dataset)
data = np.random.rand(10, 10)
# Create a heatmap using a colormap
plt.imshow(data, cmap='viridis', interpolation='nearest')
plt.colorbar() # Add color bar to show the mapping
plt.title("Heatmap with Colormap")
plt.show()
Matplotlib offers a variety of colormaps for different types of data visualization. Here are a
few popular ones:
Sequential Colormaps (used for data that goes from low to high values, e.g.,
temperature): 'Blues', 'Greens', 'Purples', 'Oranges'
Diverging Colormaps (used for data with both low and high extremes, such as
temperature deviations): 'coolwarm', 'PiYG', 'Spectral'
Qualitative Colormaps (used for categorical data): 'Set1', 'Set2', 'Paired'
Cyclic Colormaps (used for cyclic data like angles): 'hsv', 'twilight'
import numpy as np
import matplotlib.pyplot as plt
# Create a 2D dataset with both negative and positive values
data = np.random.randn(10, 10)
# Create a heatmap using a diverging colormap
plt.imshow(data, cmap='coolwarm', interpolation='nearest')
plt.colorbar()
plt.title("Heatmap with Diverging Colormap")
plt.show()
Sometimes, you may want to define your own set of colors for elements (such as lines) to
cycle through in your plot.
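A minimal sketch of such a custom color cycle (the three hex colors are arbitrary choices):
import matplotlib.pyplot as plt
from cycler import cycler
# Set a custom color cycle for all subsequent axes
plt.rc('axes', prop_cycle=cycler(color=['#e41a1c', '#377eb8', '#4daf4a']))
plt.plot([1, 2, 3], [1, 2, 3]) # drawn with the first color in the cycle
plt.plot([1, 2, 3], [2, 3, 4]) # drawn with the second color
plt.show()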
Subplots in Matplotlib
In data visualization, subplots allow you to arrange multiple plots in a grid, within a single
figure. This is useful when you want to compare different datasets or visualize multiple
aspects of a dataset simultaneously, without creating separate figures. Matplotlib provides
the plt.subplots() function to create subplots.
# Data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 2, 3, 4, 5]
y3 = [25, 20, 15, 10, 5]
y4 = [10, 15, 20, 25, 30]
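A minimal sketch of a 2x2 grid using this data (assuming matplotlib.pyplot is imported as plt; the figure size is an arbitrary choice):
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y1)
axes[0, 1].plot(x, y2)
axes[1, 0].plot(x, y3)
axes[1, 1].plot(x, y4)
plt.tight_layout()
plt.show()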
The next example arranges two plots side by side in a 1x2 grid and adds a legend to each subplot:
# Data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 2, 3, 4, 5]
# Creating a 1x2 grid of subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].plot(x, y1, label='y1')
axes[1].plot(x, y2, label='y2')
# Adding legends
axes[0].legend()
axes[1].legend()
# Adjusting layout
plt.tight_layout()
plt.show()
Explanation:
figsize=(12, 6) controls the overall size of the figure (width, height in inches).
plt.tight_layout() ensures proper spacing of elements within the figure.
You can share the x-axis or y-axis between subplots to make comparisons easier.
# Data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 2, 3, 4, 5]
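# Sketch: two subplots sharing the y-axis (the figure size is an arbitrary choice)
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
axes[0].plot(x, y1)
axes[1].plot(x, y2)
plt.tight_layout()
plt.show()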
Explanation:
sharey=True ensures that both subplots share the same y-axis, making it easier to
compare data across them.
You can also share the x-axis (sharex=True) in a similar way.
You can use different plot types (e.g., line plot, bar plot, scatter plot) in different subplots.
# Data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 2, 3, 4, 5]
y3 = [25, 20, 15, 10, 5]
y4 = [10, 15, 20, 25, 30]
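# Sketch: a different plot type in each cell of a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y1) # line plot
axes[0, 1].bar(x, y2) # bar plot
axes[1, 0].scatter(x, y3) # scatter plot
axes[1, 1].hist(y4, bins=5) # histogram
plt.tight_layout()
plt.show()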
Explanation:
Each subplot can display a different plot type (line, bar, scatter, histogram) on a 2x2
grid.
We use tight_layout() to optimize the space and avoid overlapping labels.
If you want to customize specific subplots (e.g., adjusting their ticks, labels, etc.), you can
access each subplot individually using the axes object.
# Data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 2, 3, 4, 5]
# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2)
# Plotting
axes[0, 0].plot(x, y1)
axes[0, 1].plot(x, y2)
# Add titles
axes[0, 0].set_title('Plot 1')
axes[0, 1].set_title('Plot 2')
plt.tight_layout()
plt.show()
Explanation:
You can customize individual axes like adjusting axis limits (set_xlim(), set_ylim())
or adding axis labels (set_xlabel(), set_ylabel()).
plt.tight_layout() is used to automatically adjust spacing.
Adding text and annotations to your plots can make them much more informative by
labeling points, adding explanations, and highlighting important areas. Matplotlib provides
several functions for adding text and annotations, which are useful for highlighting key
features in your visualizations.
You can place text anywhere on the plot with plt.text(). It allows you to specify the x and y
coordinates where the text should appear.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
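# Sketch: place text at data coordinates (2, 10), as described below
plt.plot(x, y)
plt.text(2, 10, 'This is a point', fontsize=12, color='red')
plt.show()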
Explanation:
plt.text(x, y, 'text') places the text 'This is a point' at position (2, 10) on the plot.
The fontsize parameter controls the size of the text, and color controls the text color.
Annotations are used to highlight specific points or areas of interest within the plot. The
function plt.annotate() not only places text but can also draw arrows pointing to specific
points in the plot.
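A minimal sketch of an annotation with an arrow (the annotated point and text position are arbitrary examples):
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
# Annotate the point (3, 9) and draw an arrow to it from the text position
plt.annotate('Point of interest', xy=(3, 9), xytext=(3.5, 5),
             arrowprops=dict(arrowstyle='->'))
plt.show()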
You can add multiple annotations to a plot to highlight different points or areas.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Explanation:
Two annotations are added: one for the "Start" point and one for the "End" point.
Each annotation has a different arrow and text style.
You can also use the Axes object (ax) to add text to a specific subplot when using subplots.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Matplotlib allows you to include mathematical expressions in text by using LaTeX syntax.
import matplotlib.pyplot as plt
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
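# Sketch: render a LaTeX-style formula in the plot title (note the raw string)
plt.plot(x, y)
plt.title(r'Plot of $\frac{1}{2}x^2$')
plt.show()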
Explanation:
The r before the string indicates a raw string, which is required for LaTeX formatting.
The LaTeX syntax ($\frac{1}{2}x^2$) is used to display mathematical formulas.
You can customize the text's appearance by adjusting its properties, such as font size, color,
font weight, and style.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Explanation:
fontsize, color, fontweight, and style control the appearance of the text.
You can annotate by adding lines (horizontal or vertical) to highlight regions of interest.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
You can make the text stand out by adding a background box around it.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Explanation:
The bbox parameter adds a background box behind the text. You can adjust the box’s
appearance with parameters like facecolor (background color) and alpha (opacity).
Customizing your plots helps in making them clearer, more aesthetically pleasing, and more
informative. Matplotlib provides a wide range of options to customize every aspect of your
plot. Below are some common customization techniques you can apply.
You can control the appearance of lines, markers, and colors using various parameters.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
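# Sketch: customize line color, style, marker, and width (values are illustrative)
plt.plot(x, y, color='green', linestyle='--', marker='o', linewidth=2, markersize=8)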
# Show plot
plt.show()
You can easily customize titles, axis labels, and tick labels with various font properties.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Explanation:
plt.title(): Set the title with font size, color, and location (loc='center' positions it in the
center).
plt.xlabel() and plt.ylabel(): Customize axis labels.
plt.xticks() and plt.yticks(): Customize tick labels (font size, rotation, and color).
3. Grid Customization
You can add or customize grids to make your plots easier to read.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
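# Sketch: enable and style the grid using the parameters described below
plt.plot(x, y)
plt.minorticks_on() # turn on minor ticks so which='both' has an effect
plt.grid(True, which='both', color='gray', linestyle='-', linewidth=0.5)
plt.show()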
Explanation:
plt.grid(True): Enables the grid.
which='both': Applies the grid to both major and minor ticks.
color='gray': Sets the grid color.
linestyle='-': Sets the grid line style.
linewidth=0.5: Sets the grid line width.
To adjust the figure size, you can set it using the figsize parameter when creating a plot.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Explanation:
figsize=(10, 5): Adjusts the figure size to 10 inches by 5 inches (width by height).
5. Customizing Legend
You can add and customize legends to label different data series.
# Data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [2, 3, 5, 7, 11]
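# Sketch: plot two labeled series, then add a customized legend (see below)
plt.plot(x, y1, label='Squares')
plt.plot(x, y2, label='Primes')
plt.legend(loc='upper left', fontsize=12, frameon=False)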
# Show plot
plt.show()
Explanation:
plt.legend(): Adds a legend with various customizations.
o loc='upper left': Sets the position of the legend.
o fontsize=12: Customizes the font size of the legend.
o frameon=False: Removes the box around the legend.
Matplotlib allows you to apply predefined styles to your plots. You can set a style globally or
apply it to a single plot.
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
When working with multiple plots (subplots), you can control their appearance individually.
# Data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [25, 20, 15, 10, 5]
# Show plot
plt.show()
You can adjust the limits of the axes using set_xlim() and set_ylim().
# Data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
You can change the background color of the plot or individual axes.
Matplotlib allows you to create 3D plots using the mpl_toolkits.mplot3d module. This is
useful for visualizing data with three variables or when you need to represent spatial
relationships in three dimensions. The most common 3D plots are:
1. 3D Line Plot
2. 3D Scatter Plot
3. 3D Surface Plot
4. 3D Wireframe Plot
5. 3D Contour Plot
1. Setting up a 3D Plot
To create a 3D plot, you first need to import the Axes3D class from mpl_toolkits.mplot3d.
Then, you can create a 3D axis object and plot your data.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
# Create a figure
fig = plt.figure()
# Add a 3D axis object
ax = fig.add_subplot(111, projection='3d')
# Show plot
plt.show()
2. 3D Line Plot
A 3D line plot is used to visualize a set of data points in three dimensions, connected by lines.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
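# Sketch: helix-shaped data (the specific curve is just an example)
theta = np.linspace(0, 4 * np.pi, 100)
z = np.linspace(0, 2, 100)
x = np.sin(theta)
y = np.cos(theta)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(x, y, z, label='3D helix')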
# Add labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
# Add title
ax.set_title('3D Line Plot')
# Show legend
ax.legend()
# Show plot
plt.show()
Explanation:
ax.plot(x, y, z): Plots the data as a 3D line, with x, y, and z being the coordinates in
3D space.
ax.set_xlabel(), ax.set_ylabel(), ax.set_zlabel(): Set labels for the x, y, and z axes.
ax.legend(): Adds a legend.
3. 3D Scatter Plot
A 3D scatter plot is useful for visualizing the relationship between three continuous variables.
# Add labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
# Add title
ax.set_title('3D Scatter Plot')
# Show plot
plt.show()
4. 3D Surface Plot
A surface plot represents a 3D surface and is often used for visualizing functions of two
variables (like z = f(x, y)).
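A minimal sketch consistent with the explanation below (the function sin(sqrt(x^2 + y^2)) is just an example surface):
import numpy as np
import matplotlib.pyplot as plt
# Build a grid and evaluate an example function z = f(x, y)
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')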
# Add labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
# Add title
ax.set_title('3D Surface Plot')
# Show plot
plt.show()
Explanation:
np.meshgrid(x, y): Creates a grid of x and y values for plotting the surface.
ax.plot_surface(x, y, z): Plots the surface using the values of x, y, and z.
cmap='viridis': Specifies a colormap to color the surface.
5. 3D Wireframe Plot
A wireframe plot is similar to a surface plot, but it shows only the edges of the surface, not
the filled faces.
# Add labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
# Add title
ax.set_title('3D Wireframe Plot')
# Show plot
plt.show()
6. 3D Contour Plot
A 3D contour plot shows contour lines on a 3D surface, which can be useful for visualizing
the gradients of a surface.
# Add labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
# Add title
ax.set_title('3D Contour Plot')
# Show plot
plt.show()
7. Customizing 3D Plots
Basemap is a powerful library used for plotting geographic data with Matplotlib. It provides
tools for creating maps, displaying geographical data, and adding various features to maps,
like coastlines, country boundaries, and more. Basemap is part of the mpl_toolkits module,
but note that Basemap has been officially deprecated in favor of other libraries like Cartopy.
However, Basemap is still widely used, and you can install it and start working with it for
various mapping tasks.
1. Installing Basemap
To use Basemap, you'll need to install it. With pip the package can typically be installed
as pip install basemap, while Anaconda users can use conda install -c conda-forge basemap.
To get started, you need to create a map projection, and then you can add features like
coastlines, countries, etc.
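A minimal sketch, assuming Basemap is installed (the projection and center match the explanation below):
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# Orthographic ("globe-like") projection centered at latitude 0, longitude 0
m = Basemap(projection='ortho', lat_0=0, lon_0=0)
m.drawcoastlines()
m.drawcountries()
plt.title("Orthographic Projection")
plt.show()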
Explanation:
projection='ortho': Sets the map projection. The ortho projection creates a view
similar to a globe.
lat_0=0, lon_0=0: The center of the map is set at latitude 0 and longitude 0.
m.drawcoastlines(): Draws the coastlines on the map.
m.drawcountries(): Draws country borders.
You can plot geographic points or draw lines and polygons on the map.
To plot a point on the map, use m.plot() or m.scatter() to display geographic coordinates.
# Draw coastlines
m.drawcoastlines()
# Add legend
plt.legend()
Explanation:
m.scatter(): Plots a point at the specified x and y coordinates (converted from latitude
and longitude).
color='red', marker='D': Sets the color and shape of the marker.
s=100: Sets the size of the marker.
plt.legend(): Adds a legend to the plot.
You can plot a line (e.g., flight path) between two geographic points.
# Draw coastlines
m.drawcoastlines()
Explanation:
m.plot(): Plots a line between two points (New York and London in this case).
marker='o': Adds markers to each endpoint.
color='blue', linewidth=2: Customizes the line style.
You can add additional map features such as rivers, lakes, or political boundaries.
Shapefiles are commonly used formats for geographic data (such as countries, states, cities,
etc.). You can load shapefiles using Basemap and plot them.
Explanation:
m.readshapefile(): Loads a shapefile and plots its data. You need to replace
'path_to_shapefile/shapefile_name' with the actual path to your shapefile.
6. Map Projections
You can change the projection by specifying it when creating the Basemap object:
# Example of Mercator projection
m = Basemap(projection='merc', lat_0=0, lon_0=0)
You can plot geographic data using color maps to represent values over geographic locations
(e.g., population density, temperature).
Explanation:
m.pcolormesh(): Plots the data as a color mesh on the map using a colormap
(cmap='coolwarm').
data: Represents the values to be mapped (e.g., temperature, population).
shading='auto': Adjusts shading to improve map visualization.
Seaborn is a Python data visualization library built on top of Matplotlib that provides a high-
level interface for drawing attractive and informative statistical graphics. Seaborn simplifies
creating complex visualizations like heatmaps, time series, categorical plots, and regression
plots, among others.
1. Installing Seaborn
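Seaborn can typically be installed from the command line with:
pip install seaborn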
2. Basics of Seaborn
Seaborn works well with Pandas DataFrames and can automatically handle the creation of
plots based on DataFrame columns. Here's a basic overview of the core plot types and how to
use them.
These plots help visualize the distribution of a dataset (e.g., histograms, KDE plots, etc.).
# Example dataset
data = sns.load_dataset("tips")
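# Sketch: histogram of the total_bill column with a KDE curve
# (assumes: import seaborn as sns; import matplotlib.pyplot as plt)
sns.histplot(data["total_bill"], kde=True)
plt.title("Distribution of Total Bill Amounts")
plt.show()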
Explanation:
sns.histplot(): Plots the histogram with optional kernel density estimation (KDE).
kde=True: Adds the KDE plot (a smooth version of the histogram).
A KDE plot smooths out the histogram and shows the probability density of the data.
sns.kdeplot(data["total_bill"], fill=True)  # fill=True replaces the deprecated shade=True
plt.title("KDE Plot of Total Bill Amounts")
plt.show()
Seaborn provides various plots to visualize relationships in categorical data, such as bar plots,
box plots, and violin plots.
A box plot is useful for visualizing the distribution of data and identifying outliers.
sns.boxplot(): Creates a box plot with x being the categorical variable and y being the
numerical variable.
data=data: Specifies the dataset.
A violin plot combines aspects of a box plot and a KDE plot. It provides a more detailed view
of the distribution.
Explanation:
sns.violinplot(): Creates a violin plot that shows both the distribution and summary
statistics of the data.
Explanation:
sns.barplot(): Creates a bar plot where the heights of the bars represent the average of
the numerical variable for each category.
The pair plot is one of the most powerful plots in Seaborn, allowing you to visualize
relationships between multiple variables at once.
sns.pairplot(data)
plt.show()
4.4. Heatmap
A heatmap visualizes matrix-like data, with colors representing the values in the matrix. This
is useful for visualizing correlations between variables or displaying a confusion matrix in
machine learning.
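A minimal sketch using the numeric columns of the tips dataset loaded above:
# Correlation matrix of the numeric columns, drawn as a heatmap
corr = data[["total_bill", "tip", "size"]].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()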
A regression plot is used to visualize the relationship between two variables, with an optional
fitted regression line.
Explanation:
sns.regplot(): Plots data points along with a regression line and confidence interval.
Explanation:
sns.scatterplot(): Plots individual data points (no regression line) for two continuous
variables.
5. Customization
Titles and Axis Labels: You can set titles and labels using Matplotlib commands.
plt.title("Your Title")
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
Themes and Palettes: Seaborn provides several built-in themes and color palettes to improve the aesthetics of your plots.
sns.set_style("darkgrid")
sns.set_palette("pastel")
Plotting Multiple Plots: You can easily create multiple subplots by using
Matplotlib's plt.subplots().
6. Advanced Plots
Violin Plot: similar to a box plot, but with a density plot overlaid (sns.violinplot())
Bar Plot: shows the average value of a numerical variable per category (sns.barplot())
Data wrangling is an essential step in the data analysis pipeline, where raw data is
transformed into a usable format for analysis. Python offers a variety of libraries to help with
this process, each designed to handle specific tasks efficiently.
One of the most widely used libraries for data wrangling is Pandas, which provides flexible
data structures like DataFrames and Series. Pandas enables users to manipulate, clean, and
analyze structured data with ease. It offers powerful functions for handling missing data,
filtering and transforming columns, merging datasets, and reading from multiple file formats
like CSV, Excel, and SQL. For example, it can automatically handle date parsing, group data
for aggregation, or reshape data using pivot tables, making it a central tool for data
wrangling.
NumPy is another fundamental library for working with arrays and numerical data. Although
it is more focused on mathematical operations, it plays a critical role in data wrangling when
dealing with large, multidimensional datasets. NumPy allows for the efficient handling of
arrays and matrices and provides functions for manipulating elements across these arrays,
which is particularly useful when working with large amounts of numerical data or when
performing statistical operations.
For tasks related to visualizing the data and identifying issues like outliers or missing values,
Matplotlib and Seaborn are commonly used. These libraries are designed for creating a wide
range of static, animated, and interactive visualizations. Seaborn, built on top of Matplotlib,
offers a higher-level interface for drawing attractive and informative statistical graphics. By
plotting histograms, box plots, and scatter plots, these libraries help in visually detecting
anomalies and distributions in the data, which is an important part of the wrangling process.
When working with Excel files, Openpyxl is a go-to library. It allows for reading and writing
Excel files (specifically .xlsx format) and enables users to automate tasks like modifying cell
values, creating new sheets, or applying formulas. This is especially helpful for data analysts
who work with reports or need to automate the handling of Excel-based data.
For large-scale data processing, Dask provides a parallel computing framework that scales
Pandas and NumPy operations to larger datasets. Dask enables efficient data wrangling by
handling data that doesn't fit into memory by breaking it into smaller chunks and processing
them in parallel. This can be especially useful for big data tasks and can significantly speed
up operations.
Another helpful library is Pyjanitor, an extension of Pandas, which simplifies common data
cleaning tasks like renaming columns, removing unwanted columns, and applying functions
across datasets in a more readable way. This makes the wrangling process more concise and
the code easier to manage.
Regular expressions (with the re module) are another valuable tool for wrangling textual data.
Regular expressions allow for complex pattern matching, string extraction, and replacement,
making it easier to clean or parse data from sources that contain free-form text, like logs or
user inputs.
For web scraping and cleaning data from HTML or XML sources, BeautifulSoup and lxml
are popular libraries. They enable users to parse and extract information from web pages or
XML documents, which is especially useful for gathering raw data from the web before
processing it into a structured format for analysis.
When dealing with databases, SQLAlchemy is a robust library for interacting with relational
databases. It allows Python code to perform SQL queries directly on databases, providing an
object-relational mapping (ORM) that simplifies database operations and enables efficient
data wrangling directly from database tables.
For those handling large, hierarchical datasets like those used in scientific research or big
data applications, Pytables is a library that supports efficient storage and retrieval of large
datasets in formats like HDF5, enabling quick access to data even when it exceeds memory
limitations.
Lastly, Geopandas extends Pandas to handle geospatial data, making it easier to work with
data that has a geographic component, such as maps or location-based information.
Geopandas integrates well with spatial databases and can perform complex spatial operations,
like geometric transformations or spatial joins, which are essential when wrangling geospatial
datasets.
1. Pandas
Purpose: It is the most popular library for data manipulation and analysis. It provides
data structures like DataFrame and Series, which are ideal for handling structured
data.
Key features:
o Data cleaning (handling missing values, duplicates, etc.)
o Data transformation (reshaping, merging, aggregating, etc.)
o Data filtering and indexing
o Reading/writing from/to various file formats (CSV, Excel, SQL, etc.)
Example:
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace=True) # Drop missing values
df['column'] = df['column'].apply(lambda x: x.lower()) # Transform column values
2. NumPy
Purpose: While primarily known for numerical computing, NumPy is essential for
manipulating arrays and performing element-wise operations efficiently.
Key features:
o Handling large multi-dimensional arrays and matrices
o Mathematical functions (linear algebra, statistics, etc.)
o Supports data transformations and conversions
Example:
import numpy as np
arr = np.array([1, 2, np.nan, 4])
arr = np.nan_to_num(arr) # Replace NaN values with 0
3. Matplotlib / Seaborn
Purpose: While these libraries are mainly used for data visualization, they also play a
role in exploring data, which is an important part of the wrangling process (detecting
outliers, understanding distributions, etc.).
Key features:
o Creating visualizations (scatter plots, bar charts, histograms, etc.)
o Analyzing relationships between features
o Customizing plots for clarity
Example:
4. Openpyxl
Purpose: Openpyxl is used for reading and writing Excel files (.xlsx format).
Key features:
o Working with Excel files
o Editing sheets, cells, and formatting
o Automating Excel tasks
Example:
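A minimal sketch (the filename 'report.xlsx' is a placeholder):
from openpyxl import load_workbook
wb = load_workbook('report.xlsx')
ws = wb.active
ws['A1'] = 'Updated value'  # edit a single cell
wb.save('report.xlsx')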
5. Dask
Purpose: Dask is used for parallel computing and handling larger-than-memory data.
It is a more scalable version of Pandas and NumPy.
Key features:
o Parallel and distributed computing for big data
o Supports out-of-core computations
o DataFrame and array-like operations
Example:
import dask.dataframe as dd
df = dd.read_csv('large_data.csv')
df = df[df['column'] > 10] # Filter data efficiently on a large scale
6. Pyjanitor
Example:
import janitor
df = df.clean_names() # Standardizes column names
7. Regex (re)
Purpose: The re module helps with working with regular expressions, useful for
extracting, searching, or replacing patterns in string data.
Key features:
o Pattern matching and string manipulation
o Extracting information based on patterns
Example:
import re
df['phone_number'] = df['phone_number'].apply(lambda x: re.sub(r'\D', '', x))  # Remove non-digit characters
8. BeautifulSoup / lxml
Purpose: For parsing and cleaning data from HTML and XML documents.
Key features:
o Extract data from HTML/XML
o Clean up messy web data
Example:
9. SQLAlchemy
Purpose: SQLAlchemy allows you to work with SQL databases directly from Python,
often used to retrieve and manipulate data from databases.
Key features:
o Interfacing with relational databases
o Handling queries, joins, and aggregations directly within Python
Example:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df = pd.read_sql('SELECT * FROM table', engine)
10. Pytables
Example:
import tables
h5file = tables.open_file('data.h5', mode='r')
11. Geopandas
Example:
Indexing and selection are fundamental concepts in data wrangling and manipulation,
especially when working with data structures like Pandas DataFrames and Series. These
concepts allow you to access and modify specific rows, columns, or subsets of data based on
certain conditions or labels. Here’s a breakdown of how indexing and selection work in
Python, particularly using Pandas:
The .loc method in Pandas is used for label-based indexing. It allows you to access data by
specifying the row and column labels. The main advantage of .loc is that it lets you work with
the explicit row and column labels rather than integer-based positions.
Selecting a single row by label: To select a single row, pass the row label inside the
.loc[] method.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
}, index=['row1', 'row2', 'row3'])
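# Select the row labeled 'row1' (returns a Series)
print(df.loc['row1'])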
Selecting a single column by label: You can also select a column by passing its
label.
Selecting multiple rows and columns: You can specify multiple row labels and
column labels within the .loc[] method.
# Selecting rows 'row1' and 'row2', and columns 'A' and 'B'
subset = df.loc[['row1', 'row2'], ['A', 'B']]
print(subset)
The .iloc method is used for position-based indexing. It allows you to select data based on
integer positions (i.e., row and column indices) rather than labels.
3. Conditional Selection
Pandas also supports conditional selection, which allows you to filter data based on certain
conditions or boolean expressions.
Selecting rows based on a condition: You can apply conditions to a DataFrame and
use the resulting boolean Series to filter the rows.
Using multiple conditions: You can combine multiple conditions using & (AND) or |
(OR), with the conditions enclosed in parentheses.
# Selecting rows where column 'A' is greater than 1 and column 'B' is less than 6
filtered_df = df[(df['A'] > 1) & (df['B'] < 6)]
print(filtered_df)
Sometimes you may want to select a specific value from a DataFrame, based on both row and
column labels or indices.
Using .loc for specific value selection: You can specify both row and column labels
to get a specific value.
Using .iloc for specific value selection: You can specify the row and column
positions (integer indices).
# Selecting the value at position (1, 0) which corresponds to 'row2' and column 'A'
value = df.iloc[1, 0]
print(value)
Indexing and selection not only help retrieve data but also modify it. After selecting rows or
columns, you can modify values directly.
Modifying a column: You can assign a new value to a column after selecting it by
label.
Modifying rows based on a condition: You can modify specific rows based on a
condition.
# Setting the value in column 'B' to 100 for rows where 'A' is greater than 10
df.loc[df['A'] > 10, 'B'] = 100
print(df)
You can slice rows and columns in a DataFrame similar to how you slice lists in Python.
Slicing is useful when you need to work with a continuous range of rows or columns.
7. Multi-Indexing
For more complex datasets, you might encounter MultiIndex, where rows and/or columns
are labeled with more than one level. Pandas provides support for multi-level indexing to
access more complex datasets.
Operating on data involves various actions like transforming, cleaning, aggregating, and
manipulating data to make it suitable for analysis. In Python, particularly with libraries like
Pandas and NumPy, a wide range of operations can be performed on data, including
mathematical, statistical, and logical operations. Here's an overview of common operations
you might need to perform on data using these libraries:
1. Mathematical Operations
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
import numpy as np
df['sqrt_A'] = np.sqrt(df['A']) # Calculate the square root of column 'A'
print(df)
2. Aggregation Operations
Using groupby() for aggregation: Grouping data by one or more columns allows
you to apply aggregation functions like sum(), mean(), count(), etc.
df = pd.DataFrame({
'Category': ['A', 'B', 'A', 'B', 'A'],
'Values': [10, 20, 30, 40, 50]
})
3. Filtering Data
Filtering data is essential for selecting rows based on specific conditions or criteria.
Filtering rows using boolean indexing: You can filter data by applying logical
conditions to columns.
df = pd.DataFrame({
'A': [10, 20, 30, 40],
'B': [1, 2, 3, 4]
})
Using query() for more complex conditions: The query() method allows for more
readable code when filtering data.
Missing data is a common issue in real-world datasets. Pandas provides methods to handle
missing values by either removing or filling them.
Identifying missing data: You can detect missing values using isnull() or notnull()
methods.
df = pd.DataFrame({
'A': [1, 2, None, 4],
'B': [None, 5, 6, 7]
})
Removing missing data: You can remove rows or columns that contain missing
values using dropna().
Filling missing data: Pandas also allows you to fill missing values using fillna(). You
can fill with a constant value or use forward/backward filling.
Merging DataFrames: You can merge two DataFrames based on a common column
(like SQL joins).
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Age': [24, 27, 22]
})
6. String Operations
Pandas provides various string methods for manipulating text data in columns.
String operations with .str: You can perform operations like string matching,
replacement, and splitting on columns containing strings.
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie']
})
Regex-based string operations: You can also use regular expressions for more
complex string manipulation.
7. Data Transformation
Sometimes, you need to apply custom transformations to your data, such as scaling,
normalizing, or applying mathematical functions.
Applying custom functions: You can use the apply() method to apply a function
across a DataFrame or Series.
df = pd.DataFrame({
'A': [1, 2, 3, 4]
})
Lambda functions: Lambdas can be used for more compact and simple
transformations.
Missing Data
Handling missing data is one of the most critical tasks in data wrangling, as real-world
datasets often contain missing or incomplete values. In Python, Pandas provides several
ways to identify, handle, and deal with missing data efficiently. Let's go through some of the
common techniques and methods for working with missing data:
In Pandas, missing data is represented by NaN (Not a Number) for numerical data or None
for other data types like strings. You can easily identify missing data using the following
functions:
isnull() and notnull(): The isnull() method returns a DataFrame of boolean values,
where True represents missing values, and False represents non-missing values. The
notnull() method is the inverse (returns True for non-missing values).
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, None, 4],
'B': [None, 2, 3, 4]
})
isna(): The isna() method works exactly like isnull(). It checks for missing values
(NaN or None) in a DataFrame.
print(df.isna())
There are situations where it might be best to remove rows or columns that contain missing
data, especially if they are not critical for your analysis.
dropna(): The dropna() method removes missing data from a DataFrame. You can
remove rows or columns with missing values depending on your use case.
o Removing rows with missing data:
The default behavior of dropna() is to remove rows that contain any NaN or
None values.
df_cleaned = df.dropna()
print(df_cleaned)
o Removing columns with missing data: You can remove columns with
missing values by setting the axis argument to 1.
df_cleaned = df.dropna(axis=1)
print(df_cleaned)
In many cases, it might be more appropriate to fill the missing values rather than remove
them, especially when the missing data is meaningful (like time-series data). Pandas provides
several ways to fill missing values:
fillna(): The fillna() method is used to fill missing values with a specified value or
method.
o Filling with a constant value:
You can replace all missing values with a specific value, such as zero or the
mean of a column.
In certain scenarios, such as time series analysis, you might want to fill missing data using
interpolation, which estimates missing values based on neighboring data points.
df = pd.DataFrame({
'A': [1, None, 3, 4],
'B': [None, 2, 3, 4]
})
You can also use other methods of interpolation, such as polynomial or spline
interpolation, by adjusting the method parameter.
# Polynomial interpolation of degree 2
df_interpolated = df.interpolate(method='polynomial', order=2)
print(df_interpolated)
In time series data, missing values can be handled in a way that considers the sequential
nature of the data. The fillna() and interpolate() methods can be particularly useful for this.
In machine learning and statistics, imputation refers to filling missing data with statistically
significant values (e.g., mean, median, or using regression). This is commonly done when
preparing data for predictive modeling.
Imputation using mean, median, or mode: You can fill missing data with the mean,
median, or mode of a column using fillna().
Alternatively, for more complex imputation methods (like KNN imputation), you can
use libraries like Scikit-learn to perform imputation with machine learning
algorithms.
For categorical data, missing values can be imputed with the mode (the most frequent
category), or they can be replaced with a new category, such as 'Unknown'.
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])
print(df)
Replacing with a new category: You can also replace missing values in categorical
columns with a placeholder category like "Unknown".
df['Category'] = df['Category'].fillna('Unknown')
print(df)
Hierarchical indexing
Hierarchical indexing, also known as MultiIndexing in Pandas, allows you to work with
high-dimensional data in a 2D DataFrame or Series. Instead of having just one level of
indexing (such as a single row or column index), hierarchical indexing enables you to have
multiple levels of indexes for rows or columns. This feature is particularly useful when you're
dealing with complex data structures like time series, panel data, or multi-dimensional
datasets.
You can create a MultiIndex by passing a list of tuples to the index parameter when creating
a DataFrame, or by using the pd.MultiIndex.from_tuples() method.
import pandas as pd
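# Sketch reconstructed to match the output shown below
index = pd.MultiIndex.from_tuples(
    [('A', 'apple'), ('A', 'orange'), ('B', 'banana'), ('B', 'grape')],
    names=['Category', 'Fruit'])
data = pd.DataFrame({'Price': [1.2, 2.3, 0.8, 1.0]}, index=index)
print(data)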
Output:
Price
Category Fruit
A apple 1.2
orange 2.3
B banana 0.8
grape 1.0
In this example, the rows are indexed by two levels: Category and Fruit.
In Pandas, hierarchical indexing allows you to add multiple levels to the row and/or column
index. This enables more flexible querying and analysis of the data.
print(data)
Output:
Price Quantity
Category Fruit
A apple 1.2 10
orange 2.3 15
banana 0.8 5
B apple 1.0 12
orange 1.4 7
banana 0.9 8
Hierarchical indexes make it easy to select subsets of data using one or more levels of the
index. You can use loc or xs to access data:
Output:
Price Quantity
Fruit
apple 1.2 10
orange 2.3 15
banana 0.8 5
The xs() (cross-section) method is used to access data at a particular level of the index. You
can specify the level using the level parameter.
Output:
Price Quantity
Category
A 1.2 10
B 1.0 12
You can swap the levels of a hierarchical index using the swaplevel() method. This allows
you to rearrange the levels of the index.
Output:
Price Quantity
Fruit Category
apple A 1.2 10
B 1.0 12
orange A 2.3 15
B 1.4 7
banana A 0.8 5
B 0.9 8
5. Sorting by Index
With a hierarchical index, you can sort the data by one or more levels using the sort_index()
method. By default, sort_index() sorts by the outermost level, but you can specify the levels
to sort by.
Output:
Price Quantity
Category Fruit
A apple 1.2 10
banana 0.8 5
orange 2.3 15
B apple 1.0 12
banana 0.9 8
orange 1.4 7
If you no longer need the hierarchical index, you can reset it using the reset_index() method.
This will convert the index levels back into regular columns.
Output:
You can use stacking and unstacking to reshape the data. Stacking moves one level of the
column index to the row index, while unstacking moves one level of the row index to the
column index.
Stacking:
Output:
Category Fruit
A apple Price 1.2
Quantity 10
orange Price 2.3
Quantity 15
banana Price 0.8
Quantity 5
B apple Price 1.0
Quantity 12
orange Price 1.4
Quantity 7
banana Price 0.9
Quantity 8
dtype: float64
Unstacking:
Output:
Price Quantity
Fruit apple orange banana apple orange banana
Category
A 1.2 2.3 0.8 10 15 5
B 1.0 1.4 0.9 12 7 8
Combining datasets
Combining datasets is a fundamental aspect of data wrangling, especially when dealing with
large datasets that are split across multiple files or tables. In Python, Pandas provides a
variety of methods to combine datasets, including concatenation, merging, and joining.
These operations allow you to combine datasets along different axes (rows or columns) or by
matching rows based on keys.
The concat() function in Pandas is used to concatenate or stack multiple DataFrames along a
particular axis (either vertically or horizontally). This is often used when the datasets have the
same structure (same columns) but come from different sources.
Concatenating along rows (axis=0): This stacks DataFrames vertically (one on top
of the other), which is useful when you have multiple datasets with the same columns
but different rows (e.g., data from different months or years).
import pandas as pd
Output:
A B
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12
Concatenating along columns (axis=1): This stacks DataFrames side by side, which
is useful when the datasets have different columns but the same rows (e.g., different
features or variables).
# Concatenate along columns (axis=1)
concatenated_df_columns = pd.concat([df1, df2], axis=1)
print(concatenated_df_columns)
Output:
A B A B
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
Ignoring the index: If you don't want to keep the original index, you can reset it by
passing the ignore_index=True argument.
Output:
A B
0 1 4
1 2 5
2 3 6
3 7 10
4 8 11
5 9 12
Merging datasets is a more complex operation and is often used when you need to combine
datasets based on common columns or indexes (i.e., joining tables on a key). The merge()
function in Pandas is similar to SQL joins, and it allows for different types of joins: inner,
outer, left, and right.
Inner Join: Combines only the rows with matching keys in both datasets (default
behavior).
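A sketch of the DataFrames and the inner join, reconstructed from the outputs shown in this section:
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [24, 25, 23]})
# Inner join: keep only IDs that appear in both DataFrames
inner_join = pd.merge(df1, df2, on='ID', how='inner')
print(inner_join)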
Output:
ID Name Age
0 2 Bob 24
1 3 Charlie 25
Left Join: Keeps all rows from the left DataFrame and matches the rows from the
right DataFrame where possible (if no match, NaN values are used).
Output:
ID Name Age
0 1 Alice NaN
1 2 Bob 24.0
2 3 Charlie 25.0
Right Join: Keeps all rows from the right DataFrame and matches the rows from the
left DataFrame where possible.
Output:
ID Name Age
0 2 Bob 24
1 3 Charlie 25
2 4 NaN 23
Outer Join: Keeps all rows from both DataFrames, filling in NaN where there is no
match.
Output:
ID Name Age
0 1 Alice NaN
1 2 Bob 24.0
2 3 Charlie 25.0
3 4 NaN 23.0
Output:
Name Age
1 Alice NaN
2 Bob 24.0
3 Charlie 25.0
Joining on specific columns: If you want to join on columns rather than the index,
you can specify the column name with on.
Output:
Name Age
ID
1 Alice NaN
2 Bob 24.0
3 Charlie 25.0
The append() function was similar to concat() but was primarily used to add rows from one
DataFrame to another, as a simpler way to concatenate DataFrames vertically (along rows).
Note that DataFrame.append() has been deprecated and removed in recent versions of Pandas,
so pd.concat() is now the recommended approach.
Output:
A
0 1
1 2
2 3
3 4
4 5
5 6
Aggregation and Grouping are essential techniques in data analysis that allow you to
perform calculations over subsets of your data, which is particularly useful when working
with large datasets. In Pandas, the groupby() function provides an easy and efficient way to
group data based on one or more columns and then apply aggregation functions such as sum,
mean, count, etc., to those groups.
1. GroupBy Basics
The groupby() function in Pandas is used to split the data into groups based on some criteria.
After splitting, you can perform aggregation or transformation operations on each group.
Basic Syntax:
df.groupby('column_name')
This returns a GroupBy object, which you can then use to apply aggregation functions.
Let's say we have the following DataFrame with sales data, and we want to group the data by
the 'Region' column:
import pandas as pd
# Sample data
data = {'Region': ['North', 'South', 'East', 'North', 'South', 'East'],
'Sales': [250, 150, 300, 450, 200, 500]}
df = pd.DataFrame(data)
# Grouping by 'Region'
grouped = df.groupby('Region')
print(grouped)
This will create a GroupBy object, which you can further apply operations to.
2. Aggregation Functions
Once the data is grouped, you can perform aggregation operations to summarize the data.
You can use functions such as sum(), mean(), count(), min(), max(), etc., to compute
summary statistics on each group.
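For example, summing Sales per region using the grouped object created above:
print(grouped['Sales'].sum())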
Output:
Region
East 800
North 700
South 350
Name: Sales, dtype: int64
In this example, we grouped the data by Region and then calculated the sum of Sales for each
region.
Output:
Region
East 400.0
North 350.0
South 175.0
Name: Sales, dtype: float64
Output:
Region
East 2
North 2
South 2
Name: Sales, dtype: int64
This gives you the count of non-null values for the Sales column in each group.
You can also apply multiple aggregation functions to the grouped data using the agg()
function. This allows you to compute multiple statistics simultaneously.
Output:
In this example, for each region, we calculated the sum, mean, and max of the Sales column.
You can group by more than one column by passing a list of column names to the groupby()
function. This allows you to perform more detailed grouping.
Let’s say you have a dataset with both 'Region' and 'Product' columns, and you want to group
by both columns:
df = pd.DataFrame(data)
Region Product
East A 300
B 500
North A 250
B 450
South A 200
B 150
Name: Sales, dtype: int64
This shows the sum of sales for each combination of Region and Product.
Transformations: You can use the transform() method to apply a function to each
group independently, but unlike aggregation, it returns a DataFrame with the same
shape as the original.
# Applying a transformation: Standardize sales (subtract the mean and divide by the standard
deviation)
sales_standardized = grouped['Sales'].transform(lambda x: (x - x.mean()) / x.std())
print(sales_standardized)
Output:
0 -0.707107
1 0.707107
2 -0.707107
3 0.707107
4 -0.707107
5 0.707107
Name: Sales, dtype: float64
Filtering: You can filter groups based on a condition using the filter() method. For
example, to keep only those regions where the sum of sales is greater than 400:
Output:
In this example, we keep only the groups where the sum of Sales is greater than 400.
6. Using pivot_table() for Aggregation
For more complex aggregation operations, Pandas provides the pivot_table() function, which
is similar to Excel pivot tables. It allows you to aggregate data by multiple columns and apply
various aggregation functions.
Output:
Product A B
Region
East 300 500
North 250 450
South 200 150
In this case, we created a pivot table where Region is used as the index, Product as the
columns, and Sales as the values. The sum aggregation function was used to calculate the
total sales.
When working with real-world data, it's common to encounter missing values (NaN). Pandas
provides options for handling these missing values during aggregation.
Ignoring NaN values: By default, Pandas ignores NaN values during aggregation
functions. For example, the sum() function will ignore NaN values and compute the
sum of the available data.
Filling NaN values: You can fill NaN values before aggregation using the fillna()
method.
Pivot Tables
A pivot table is a data summarization tool that is commonly used in data analysis to
automatically organize and aggregate data. In Pandas, the pivot_table() function provides a
powerful way to reshape and summarize datasets by grouping data, applying aggregation
functions, and transforming the data into a more readable and useful format. Pivot tables are
particularly useful when you need to summarize and compare information across multiple
dimensions (rows and columns).
Let’s start with a simple dataset and create a pivot table that summarizes sales data by region
and product:
import pandas as pd
# Sample data
data = {
'Region': ['North', 'South', 'East', 'North', 'South', 'East'],
'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
'Sales': [250, 150, 300, 450, 200, 500],
'Quantity': [30, 20, 40, 60, 25, 70]
}
df = pd.DataFrame(data)
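# Pivot table: total Sales for each Region (rows) and Product (columns)
pivot_table = pd.pivot_table(df, values='Sales', index='Region',
                             columns='Product', aggfunc='sum')
print(pivot_table)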
Output:
Product A B
Region
East 300 500
North 250 450
South 200 150
You can use multiple aggregation functions in a single pivot table by passing a list of
functions to the aggfunc parameter.
# Create a pivot table with multiple aggregation functions
pivot_table_multi_agg = pd.pivot_table(df, values='Sales', index='Region',
columns='Product', aggfunc=['sum', 'mean'])
print(pivot_table_multi_agg)
Output:
sum mean
Product A B A B
Region
East 300 500 300.0 500.0
North 250 450 250.0 450.0
South 200 150 200.0 150.0
Here, we used both sum and mean as aggregation functions, which gives us both the total and
the average sales by region and product.
You can group by multiple columns for both the rows (index) and the columns (columns) to
get more detailed summaries.
df = pd.DataFrame(data)
Output:
Product A B
Region Category
East Electronics 300 NaN
Furniture NaN 500
North Electronics 250 NaN
Furniture NaN 450
South Electronics 200 NaN
Furniture NaN 150
In this case, we used both Region and Category as index columns and Product as columns.
This allows us to see the sales for each combination of Region and Category by product type.
Sometimes, a pivot table may contain missing data (NaN) when certain combinations of the
index and column values don’t exist in the original data. You can fill or handle these missing
values using the fill_value parameter in the pivot_table() function.
Output:
Product A B
Region
East 300 0
North 250 450
South 200 150
In this example, missing sales values for Product B in East and Product A in South were
replaced with 0 using the fill_value parameter.
You can also aggregate multiple columns of data using the values parameter. Let’s say we
want to aggregate both Sales and Quantity:
Output:
Quantity Sales
Product A B A B
Region
East 40 70 300 500
North 30 60 250 450
South 25 20 200 150
This creates a pivot table with Sales and Quantity as the values, and we can aggregate both
columns separately for each group.
df = pd.DataFrame(data)
Output:
Product A B
Year
2021 750 600
Here, we created a pivot table that summarizes the total sales for each product by year.
In addition to pivot_table(), Pandas also provides the pivot() function, which is a simpler
version of pivot tables but less flexible. It requires the data to be unique for each combination
of the index and columns. If there are duplicates in the data, you will encounter an error.
# Using pivot() function (simpler than pivot_table, but only for unique combinations)
pivot_simple = df.pivot(index='Region', columns='Product', values='Sales')
print(pivot_simple)
Output:
Product A B
Region
East 300 500
North 250 450
South 200 150
DESCRIPTIVE ANALYTICS AND INFERENTIAL STATISTICS
Descriptive Analytics and Inferential Statistics are two key areas in data analysis, each
serving different purposes but complementing each other. Here's an overview of both:
Descriptive Analytics
Descriptive analytics focuses on summarizing and interpreting data to understand its past and
present state. It does not make predictions or generalizations but helps you understand trends,
patterns, and distributions within the data. The goal is to describe the main features of a
dataset in a simple and interpretable way.
Inferential Statistics
Inferential statistics involves using data from a sample to make generalizations or predictions
about a larger population. Unlike descriptive analytics, which simply describes data,
inferential statistics draws conclusions based on the sample data, accounting for the inherent
uncertainty in making those conclusions.
1. Purpose:
o Descriptive Analytics aims to summarize and describe data.
o Inferential Statistics aims to make predictions, generalizations, or test
hypotheses about a population based on sample data.
2. Techniques:
o Descriptive analytics uses measures like mean, median, mode, and graphs to
describe the data.
o Inferential statistics uses probability theory, hypothesis testing, and regression
analysis to draw conclusions about a population.
3. Output:
o Descriptive analytics provides straightforward summaries and visualizations.
o Inferential statistics provides estimates, predictions, and tests of significance.
Example:
Descriptive Analytics: You might calculate the average score of students in a class,
or create a histogram showing how the scores are distributed.
Inferential Statistics: You might use a t-test to compare the average scores of two
different classes and determine if the difference is statistically significant, making an
inference about the entire student population.
Frequency distributions
A frequency distribution is a way to organize and summarize data to show how often each
value or range of values occurs in a dataset. It is a key concept in descriptive statistics
because it helps you understand the distribution of data in a clear and structured manner.
1. Class Intervals (for grouped data): These are the ranges of values in which the data
points fall. For example, if you are looking at ages, class intervals might be 20–29,
30–39, etc.
2. Frequency: This is the count of how many data points fall within each class interval
or value.
3. Relative Frequency: The proportion of the total number of observations that fall
within each class interval, calculated as:
Relative Frequency = Frequency of class / Total number of data points
4. Cumulative Frequency: The running total of frequencies, which shows how many
data points fall below a certain value or class interval. It's helpful for understanding
how data accumulates.
Example:
Age Frequency
22 2
25 2
27 1
28 1
30 3
31 1
Example: For a dataset of exam scores: [55, 73, 85, 60, 48, 91, 77, 63, 71, 82, 49, 58]
1. Organize the Data: List the data points (either individually or grouped) in ascending
order.
2. Decide on Class Intervals (for grouped data): If you are working with continuous
data, choose appropriate class intervals (e.g., 10–20, 21–30) based on the range of
your data.
3. Count the Frequency: Count how many data points fall within each class interval or
unique value.
4. Calculate Relative Frequency (optional): Divide each class frequency by the total
number of observations to get the relative frequency.
5. Create a Table or Histogram: Summarize the results in a table or use a histogram to
visually represent the frequency distribution.
Data: [55, 73, 85, 60, 48, 91, 77, 63, 71, 82, 49, 58, 64, 72, 79, 66, 87, 92, 90, 56, 68, 77, 81,
70, 75, 83, 61, 59, 74, 69, 78, 62, 88, 93, 64, 76, 57, 89, 74, 80, 65, 79, 84, 65, 70, 81, 83, 79,
91, 85]
Graphical Representation:
Histogram: A bar graph where each bar represents a class interval, with the height of
the bar corresponding to the frequency of that interval.
Pie Chart: Used to represent relative frequencies visually as segments of a circle.
Ogive (Cumulative Frequency Graph): A graph of cumulative frequencies plotted
against the upper class boundary.
Outliers
Outliers are data points that differ significantly from the rest of the data in a dataset. They
are values that are much smaller or larger than the other observations, which can potentially
distort statistical analysis or lead to incorrect conclusions.
Types of Outliers:
1. Univariate Outliers: These are outliers in a single variable. For example, in a dataset
of exam scores, a score of 1000 might be an outlier if the other scores are mostly
between 50 and 90.
2. Multivariate Outliers: These outliers appear when considering multiple variables
together. For instance, a person who is unusually tall and also weighs very little
compared to the rest of a population might be a multivariate outlier.
Impact on Analysis: Outliers can skew results and affect the calculation of statistical
measures such as the mean, variance, and standard deviation. For example, an
extremely high salary in a dataset can make the average salary seem much higher than
most of the individuals in the dataset.
Indication of Data Issues: Outliers could be errors, misreported data, or data entry
mistakes. Alternatively, they could represent important variations that need further
investigation.
Influence on Model Performance: In predictive modeling or machine learning,
outliers can have a significant effect on model accuracy, particularly if the model is
sensitive to extreme values (e.g., linear regression).
Identifying Outliers:
Outliers can be identified using various methods, including visual inspection, statistical
techniques, and algorithms.
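One common statistical technique is the 1.5 x IQR rule; a minimal sketch (the scores are an illustrative sample):
import pandas as pd
scores = pd.Series([50, 52, 55, 55, 56, 58, 60, 60, 70, 75, 100])
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]
print(outliers)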
Once outliers are identified, it’s important to decide how to handle them. Here are some
common approaches:
1. Remove Outliers:
o If the outliers are due to data entry errors or are irrelevant to the analysis, they
might be removed from the dataset.
o This is particularly useful in cases where the outliers distort the results of
statistical tests or predictive models.
2. Transform the Data:
o Applying transformations like log transformations, square roots, or
winsorization (capping extreme values at chosen percentiles) can
reduce the effect of outliers.
3. Keep Outliers and Investigate Further:
o If the outliers represent rare but important variations, keeping them and further
investigating might lead to valuable insights. For example, outliers in medical
research could represent rare diseases or unique cases.
4. Use Robust Methods:
o Some statistical techniques (such as robust regression or tree-based methods
like decision trees) are less sensitive to outliers. These methods can be used to
avoid the influence of outliers in your analysis.
5. Impute Values:
o Instead of removing outliers, you might choose to replace them with a more
typical value (such as the median, mean, or a model-based imputation).
Interpreting Distributions
Interpreting distributions is essential for understanding the underlying patterns, trends, and
characteristics of a dataset. A distribution shows how data points are spread across different
values or intervals, and interpreting it helps reveal important insights about the data.
Here’s a guide to interpreting different types of distributions and understanding the key
features:
1. Types of Distributions:
Range: The difference between the maximum and minimum values. It gives a sense
of the overall spread but is sensitive to outliers.
Variance and Standard Deviation: Measure how spread out the values are from the
mean. A larger variance or standard deviation indicates a more dispersed distribution,
while a smaller one indicates that the values are clustered around the mean.
Interquartile Range (IQR): The range between the first (Q1) and third (Q3)
quartiles (middle 50% of the data), which is less affected by outliers than the range.
Skewness:
Positive Skew (Right Skew): The right tail is longer than the left, and most data
points are concentrated on the lower side. The mean is greater than the median.
Negative Skew (Left Skew): The left tail is longer than the right, and most data
points are concentrated on the higher side. The mean is less than the median.
Kurtosis:
Kurtosis describes the "tailedness" of the distribution. It tells you how much data is
in the tails and how sharp or flat the peak is.
o Leptokurtic: High kurtosis (peaked with heavy tails), indicating outliers are
more likely.
o Platykurtic: Low kurtosis (flatter distribution), indicating fewer outliers.
Histograms:
o A histogram is a bar chart where each bar represents the frequency of data
points within a certain range or bin.
o It is used to visually examine the distribution of data—whether it’s normal,
skewed, bimodal, etc.
o Interpretation: Look at the shape of the histogram. Is it symmetric, skewed,
or does it have multiple peaks? The width and number of bins can affect the
interpretation.
Box Plot (Box-and-Whisker Plot):
o A box plot shows the distribution of data based on five key summary statistics:
minimum, Q1 (lower quartile), median, Q3 (upper quartile), and maximum.
o Outliers are typically represented as individual points outside the "whiskers"
of the box plot.
o Interpretation: The spread of the box shows how data is distributed, and the
length of the whiskers helps identify possible outliers.
Density Plot:
o A smoothed version of a histogram, showing the estimated probability density
of the variable.
o It provides a clearer view of the distribution, especially for continuous data.
o Interpretation: You can observe the shape of the distribution more clearly,
such as whether it is skewed or has multiple peaks.
Q-Q Plot (Quantile-Quantile Plot):
o A Q-Q plot compares the quantiles of the dataset with the quantiles of a
theoretical distribution (often the normal distribution).
o Interpretation: If the data points form a straight line, the data follows the
theoretical distribution. Deviations from the line indicate departures from that
distribution (e.g., skewness, kurtosis).
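The sketch below draws the three plots just described for one sample (the sample is assumed, generated from a normal distribution for illustration); scipy.stats.probplot produces the Q-Q plot against a normal distribution:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
data = np.random.normal(loc=60, scale=10, size=200)   # assumed sample
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(data, bins=20)                            # histogram
axes[0].set_title("Histogram")
axes[1].boxplot(data)                                  # box plot
axes[1].set_title("Box Plot")
stats.probplot(data, dist="norm", plot=axes[2])        # Q-Q plot against a normal distribution
axes[2].set_title("Q-Q Plot")
plt.tight_layout()
plt.show()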
Worked Example:
Data: [50, 52, 55, 55, 56, 58, 60, 60, 70, 75, 100]
Range = 100 - 50 = 50
Variance and Standard Deviation give a sense of how spread out the data is around
the mean.
A histogram shows a possible right-skew, with a long tail at the higher end (due to the
score of 100).
Check for Skewness:
The data is right-skewed because the mean (≈62.8) is higher than the median (58),
and the highest score is pulling the mean upward.
The score of 100 could be considered an outlier because it’s far above the other
scores. You could investigate whether this is an error or if it represents a truly
exceptional student.
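The skewness conclusion can be checked directly in numpy, using the same scores as in the example:
import numpy as np
scores = np.array([50, 52, 55, 55, 56, 58, 60, 60, 70, 75, 100])
print("Mean:", scores.mean())          # ≈ 62.8
print("Median:", np.median(scores))    # 58
print("Std dev:", scores.std())        # spread around the mean
# Mean > median, consistent with a right-skewed distribution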
Graphs
Graphs are powerful tools for visualizing data, allowing us to quickly understand patterns,
trends, and relationships in a dataset. Different types of graphs serve different purposes
depending on the kind of data and the insights you're looking for. Below are some of the most
commonly used graphs and charts for data visualization, along with guidance on when to use
each.
1. Histogram
When to Use: To show the distribution (shape, spread, and skewness) of a single numerical variable.
Example Graph:
Exam scores: [50, 52, 53, 60, 65, 70, 75, 78, 80, 85]
Group data into bins like 50–60, 61–70, etc., and plot the frequency of scores within
each bin.
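A minimal matplotlib sketch of that histogram (the bin edges are an assumption, chosen to match the 50–60, 60–70, … grouping described above):
import matplotlib.pyplot as plt
scores = [50, 52, 53, 60, 65, 70, 75, 78, 80, 85]
bins = [50, 60, 70, 80, 90]                   # assumed bin edges
plt.hist(scores, bins=bins, edgecolor="black")
plt.xlabel("Exam score")
plt.ylabel("Frequency")
plt.title("Histogram of exam scores")
plt.show()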
2. Bar Chart
When to Use: To compare a numerical value across discrete categories (e.g., sales by region or counts per product), with one bar per category and the bar height equal to the value.
Example Graph: One bar per product category, with the height of each bar showing that category's total sales.
3. Pie Chart
When to Use: To show how a whole is divided into parts, i.e., the proportion or percentage each category contributes to the total; works best with only a few categories.
Example Graph: A circle divided into slices, one per market segment, with each slice's angle proportional to that segment's share.
4. Line Chart
Purpose: A line chart is used to display trends over time or continuous data points.
Use Case: Best for time series data or data that represents continuous values.
Interpretation:
o The x-axis typically represents time (e.g., months, years), and the y-axis
represents the data values.
o The line connects data points, showing trends, patterns, and fluctuations.
Example: Showing stock prices over a year.
5. Scatter Plot
Purpose: A scatter plot shows the relationship between two variables by plotting
points on a two-dimensional plane.
Use Case: Best for understanding correlations or relationships between variables.
Interpretation:
o Each point represents an observation.
o The x and y axes represent two variables, and the spread of points shows the
relationship between them (e.g., positive, negative, or no correlation).
Example: Showing the relationship between study hours and exam scores.
7. Area Chart
Purpose: An area chart is similar to a line chart, but the area beneath the line is filled
in, emphasizing the magnitude of changes over time.
Use Case: Best for showing cumulative data or how different categories contribute to
the total over time.
Interpretation:
o Like a line chart, but with the area beneath each line filled.
o Helps to visualize the relative contributions of different data series.
Example: Showing total sales of products, with each product represented by a
different color area.
8. Heatmap
Purpose: A heatmap uses color to represent the values of a matrix or table, showing
the intensity or concentration of values.
Use Case: Best for showing the relationship between two categorical variables or
visualizing correlations in large datasets.
Interpretation:
o Each cell in the matrix is colored based on its value, with color intensity
representing magnitude.
Example: Displaying correlations between different variables.
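A small sketch of such a correlation heatmap using matplotlib alone (the DataFrame below is an assumed example generated at random):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "hours_studied": rng.normal(5, 2, 100),
    "sleep_hours": rng.normal(7, 1, 100),
    "exam_score": rng.normal(70, 10, 100),
})                                            # assumed example data
corr = df.corr()                              # correlation matrix
plt.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Correlation")
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation heatmap")
plt.tight_layout()
plt.show()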
9. Bubble Chart
When to Use:
When you have three variables and want to visualize their relationships.
To highlight the magnitude of one of the variables.
Example Graph:
10. Violin Plot
Purpose: A violin plot combines aspects of a box plot and a density plot, showing the
distribution of data across different categories.
Use Case: Best for comparing distributions of numerical data across different groups
or categories.
Interpretation:
o The "violin" shape shows the distribution of the data, with thicker areas
representing where the data is more concentrated.
o The vertical line inside the violin represents the median.
Example: Comparing the distribution of exam scores for different groups.
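A minimal matplotlib sketch of that comparison (the two groups' scores are assumed, generated at random for illustration):
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(1)
group_a = rng.normal(70, 8, 100)     # assumed exam scores, Group A
group_b = rng.normal(62, 12, 100)    # assumed exam scores, Group B
plt.violinplot([group_a, group_b], showmedians=True)
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Exam score")
plt.title("Violin plot of exam scores by group")
plt.show()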
Averages
Averages are a central concept in statistics used to describe the central or typical value of a
dataset. There are several ways to compute averages, and the specific type used depends on
the nature of the data and the insights you're seeking. The most common types of averages
are:
Definition: The mean is the sum of all data points divided by the number of data
points.
Formula:
Mean = (Σ xi) / N
where N is the number of data points and Σ xi is the sum of all values in the
dataset.
Use Case: The mean is widely used in datasets with no extreme outliers. It provides a
good measure of central tendency when the data is symmetric.
Example:
o Data: [2, 4, 6, 8, 10]
o Mean = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6.
Advantages:
It uses all the data points, so it provides a good overall summary of the data.
Easy to calculate.
Disadvantages:
Sensitive to outliers and skewed data, which can pull the mean away from the typical value.
2. Median
Definition: The median is the middle value of a dataset when the data is ordered from
least to greatest. If there is an even number of data points, the median is the average
of the two middle numbers.
Use Case: The median is more robust to outliers and skewed distributions. It is used
when the dataset contains outliers or is highly skewed.
Steps to Calculate:
If the number of data points (N) is odd: The median is the value in the middle.
If N is even: The median is the average of the two middle values.
Example:
o Data (odd number): [1, 3, 5, 7, 9]
o Median = 5 (middle value).
o Data (even number): [1, 3, 5, 7]
o Median = (3 + 5) / 2 = 4.
Advantages:
Not affected by outliers or extreme values, so it reflects the typical value even in skewed data.
Disadvantages:
It may not represent the data as well when the dataset is symmetric and lacks outliers.
Less sensitive to the shape of the distribution compared to the mean.
3. Mode
Definition: The mode is the value that occurs most frequently in a dataset. A dataset
can have more than one mode if multiple values appear with the same highest
frequency (bimodal or multimodal).
Use Case: The mode is useful for categorical data or when you are interested in the
most common value.
Example:
Data: [2, 4, 4, 6, 8, 10]
Mode = 4 (appears twice).
Data: [1, 2, 2, 3, 3, 4]
Mode = 2 and 3 (bimodal).
Advantages:
Can be used with categorical (non-numeric) data and is unaffected by extreme values.
Disadvantages:
There might be no mode if all values occur with the same frequency.
It may not provide much insight in datasets with a lot of variation.
4. Weighted Mean
Definition: The weighted mean is a version of the mean where each data point is
given a weight, reflecting its importance or frequency.
Formula:
Weighted Mean = Σ(xi · wi) / Σ(wi)
where xi is the data value and wi is the weight associated with that value.
Use Case: Useful when some data points are more important or frequent than others.
Example:
o Data: [2, 4, 6] with weights [1, 2, 3]
o Weighted Mean = (2×1 + 4×2 + 6×3) / (1 + 2 + 3) = (2 + 8 + 18) / 6 = 28 / 6 ≈
4.67.
Advantages:
Provides a more accurate measure when some data points carry more significance
than others.
Disadvantages:
The result depends on how the weights are chosen, which can be subjective.
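numpy's average function reproduces the weighted-mean calculation above directly:
import numpy as np
values = [2, 4, 6]
weights = [1, 2, 3]
weighted_mean = np.average(values, weights=weights)
print(weighted_mean)    # ≈ 4.67, matching (2·1 + 4·2 + 6·3) / 6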
5. Geometric Mean
Definition: The geometric mean is the nth root of the product of all values in a
dataset, where n is the number of values. It is commonly used for data that involves
growth rates or percentages.
Formula:
Geometric Mean = (x1 × x2 × ... × xn)^(1/n)
where xi are the data points and n is the number of data points.
Use Case: Common in finance (e.g., average rate of return) and when dealing with
data that involves multiplication or growth.
Advantages:
Appropriate for averaging growth rates and ratios, since it reflects compounding.
Disadvantages:
Only defined for positive values; a zero or negative value makes it unusable.
6. Harmonic Mean
Definition: The harmonic mean is the reciprocal of the arithmetic mean of the
reciprocals of the data points. It is particularly useful when averaging rates or ratios.
Formula:
Harmonic Mean = n / (1/x1 + 1/x2 + ... + 1/xn)
where n is the number of data points and xi are the individual data points.
Use Case: Commonly used in averaging rates, speeds, or other quantities where the
reciprocal is more meaningful.
Example:
o Travelling equal distances at 60 km/h and 40 km/h gives an average speed of Harmonic Mean = 2 / (1/60 + 1/40) = 48 km/h.
Advantages:
Useful for rates or ratios, especially when dealing with large variations in values.
Disadvantages:
Strongly influenced by very small values and undefined if any value is zero.
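scipy.stats provides both of these means directly; the sample values below are assumptions for illustration:
from scipy.stats import gmean, hmean
growth_factors = [1.05, 1.10, 0.95]     # assumed yearly growth factors
speeds = [60, 40]                       # assumed speeds over equal distances (km/h)
print("Geometric mean:", gmean(growth_factors))   # average multiplicative growth per year
print("Harmonic mean:", hmean(speeds))            # average speed over the trip = 48 km/h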
The choice of average depends on the nature of the data and the question being asked:
Mean: Use when data is symmetric and free from extreme outliers.
Median: Use when the data is skewed or contains outliers that may distort the mean.
Mode: Use when identifying the most frequent or popular value is important,
especially for categorical data.
Weighted Mean: Use when certain data points are more important than others.
Geometric Mean: Use when dealing with growth rates, percentages, or data that
involves multiplicative processes.
Harmonic Mean: Use when averaging rates or ratios, particularly for speed or time-
based calculations.
Describing Variability
Describing variability is an essential part of statistical analysis, as it tells us how spread out
or dispersed the data is. Understanding the variability helps us comprehend the consistency or
inconsistency within a dataset and how much individual data points differ from a central
measure (such as the mean or median).
1. Range
Definition: The difference between the largest and smallest values in the dataset. It is simple to compute, but it depends on only two values and is highly sensitive to outliers.
2. Variance
Definition: Variance measures the average squared deviation of each data point from
the mean. It provides a more nuanced measure of variability by accounting for how
far data points are from the mean, squared to emphasize larger differences.
Use Case: Variance is useful when you need to quantify the overall spread or
dispersion of the data, especially in more advanced statistical analyses.
Example:
o Data: [1, 3, 5, 7]
o Mean = (1 + 3 + 5 + 7) / 4 = 4.
o Variance = ((1-4)² + (3-4)² + (5-4)² + (7-4)²) / 4 = (9 + 1 + 1 + 9) / 4 = 5.
Advantages:
Uses every data point and emphasizes larger deviations from the mean.
Disadvantages:
The units of variance are squared, which can make it difficult to interpret directly in
the context of the original data (e.g., squared dollars or squared units).
3. Standard Deviation
Definition: The standard deviation is the square root of the variance. It is the most
commonly used measure of variability because it is in the same units as the original
data, making it more interpretable.
Use Case: The standard deviation is used when you want a measure of spread that is
interpretable in the same units as the original data, making it easier to understand than
variance.
Example:
o Data: [1, 3, 5, 7]
o Variance = 5.
o Standard Deviation = √5 ≈ 2.24.
Advantages:
Expressed in the same units as the data, making it easier to interpret than the variance.
Disadvantages:
Like the mean and the variance, it is sensitive to outliers.
4. Interquartile Range (IQR)
Definition: The interquartile range is the difference between the 75th percentile (Q3)
and the 25th percentile (Q1) of a dataset. It is a measure of variability that captures
the spread of the middle 50% of the data, ignoring outliers.
IQR=Q3−Q1
Use Case: The IQR is used to describe the spread of the middle 50% of data and is
less influenced by extreme values or outliers than the range.
Example:
o Data: [1, 3, 5, 7, 9, 11, 13]
o Q1 = 3, Q3 = 9.
o IQR = 9 - 3 = 6.
Advantages:
Robust against outliers.
Focuses on the central distribution of the data.
Disadvantages:
Ignores the lowest and highest 25% of the data, so it says nothing about the tails.
5. Coefficient of Variation (CV)
Definition: The coefficient of variation is the ratio of the standard deviation to the
mean, expressed as a percentage. It is used to measure the relative variability of data,
especially when comparing datasets with different units or scales.
Advantages:
Useful for comparing variability across datasets with different means or units.
Disadvantages:
Becomes unstable or meaningless when the mean is at or near zero.
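All of the variability measures above can be computed with numpy in a few lines; the data is an assumed example (the same values as in the IQR illustration):
import numpy as np
data = np.array([1, 3, 5, 7, 9, 11, 13])    # assumed example data
data_range = data.max() - data.min()
variance = data.var()                        # population variance
std_dev = data.std()                         # population standard deviation
q1, q3 = np.percentile(data, [25, 75])       # note: numpy interpolates quartiles (Q1=4, Q3=10 here), but the IQR is still 6
iqr = q3 - q1
cv = std_dev / data.mean() * 100             # coefficient of variation, as a percentage
print(data_range, variance, std_dev, iqr, cv)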
Normal Distributions
A normal distribution (also called a Gaussian distribution) is one of the most important
and widely used probability distributions in statistics. It is often used to represent real-valued
random variables whose distributions are symmetric and bell-shaped.
1. Symmetry: The distribution is symmetric about the mean, meaning the left side is a
mirror image of the right side.
2. Bell-shaped Curve: The shape of the normal distribution is bell-shaped, with the
highest point at the mean.
3. Mean, Median, Mode are Equal: In a perfectly normal distribution, the mean,
median, and mode are all the same value and are located at the center of the
distribution.
4. Defined by Mean and Standard Deviation: A normal distribution is fully described
by two parameters:
o Mean (μ): The average or central value.
o Standard Deviation (σ): A measure of the spread or dispersion of the
distribution. A larger standard deviation means the data is spread out more
widely, while a smaller standard deviation means it is clustered around the
mean.
5. Empirical Rule (68-95-99.7 Rule):
o Approximately 68% of the data falls within 1 standard deviation of the
mean.
o Approximately 95% of the data falls within 2 standard deviations of the
mean.
o Approximately 99.7% of the data falls within 3 standard deviations of the
mean.
6. Tails: The tails of the distribution extend infinitely in both directions, but they get
closer and closer to the horizontal axis as they move away from the mean. This means
that extreme values (outliers) are possible, though they become less likely the further
from the mean you go.
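The empirical rule in point 5 above can be verified with scipy.stats.norm; the mean and standard deviation here are arbitrary assumptions, since the rule holds for any normal distribution:
from scipy.stats import norm
mu, sigma = 0, 1     # assumed parameters of the normal distribution
for k in (1, 2, 3):
    prob = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"Within {k} standard deviation(s): {prob:.1%}")
# Prints approximately 68.3%, 95.4%, and 99.7%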
Applications of the Normal Distribution
The normal distribution is used in many fields and is particularly important because many
natural phenomena and measurement errors tend to follow a normal distribution. Some
common examples of where normal distributions appear include:
Heights and weights of people: The distribution of human heights tends to be normal
in most populations.
IQ scores: Intelligence quotient scores are typically normally distributed.
Measurement errors: The errors in measurements (e.g., from instruments) are often
normally distributed.
Financial models: In finance, returns on assets are often assumed to follow a normal
distribution (although this assumption can be debated for extreme market
movements).
Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
related. It quantifies the degree to which the variables move in relation to each other.
Types of Correlation:
1. Positive Correlation: As one variable increases, the other variable also increases. For
example, height and weight tend to have a positive correlation.
2. Negative Correlation: As one variable increases, the other decreases. For example,
the number of hours spent on a task and the time left to complete it might be
negatively correlated.
3. No Correlation: There is no predictable relationship between the variables.
Scatter Plots
1. Positive Relationship: If the points tend to rise from left to right, this suggests a
positive correlation. As one variable increases, the other also increases.
2. Negative Relationship: If the points tend to fall from left to right, this suggests a
negative correlation. As one variable increases, the other decreases.
3. No Relationship: If the points are scattered randomly with no clear pattern, this
suggests no correlation.
Example:
Suppose we plot the data of "hours studied" (X) vs. "exam scores" (Y). A scatter plot might
show a positive correlation if the points form an upward sloping pattern.
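A short sketch of that scatter plot together with the correlation coefficient; the hours/scores values are assumed for illustration:
import numpy as np
import matplotlib.pyplot as plt
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])             # assumed hours studied
scores = np.array([52, 55, 60, 63, 70, 72, 78, 85])    # assumed exam scores
r = np.corrcoef(hours, scores)[0, 1]                   # Pearson correlation coefficient
print("Correlation:", r)                                # close to +1 → strong positive correlation
plt.scatter(hours, scores)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title(f"Scatter plot (r = {r:.2f})")
plt.show()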
1. What is Regression?
Regression is a way of modeling the relationship between variables. It is used to predict the
value of a dependent variable (also known as the response variable) based on one or more
independent variables (also known as predictor variables).
Simple Linear Regression: Involves one independent variable and one dependent
variable.
Multiple Linear Regression: Involves multiple independent variables and one
dependent variable.
In this explanation, we will focus on Simple Linear Regression, where there is just one
independent variable.
2. Regression Line
A regression line (also known as a line of best fit) is a straight line that best represents the
relationship between the independent variable (X) and the dependent variable (Y). The line is
drawn through the scatter plot of the data points in such a way that it minimizes the errors in
prediction.
For simple linear regression, the relationship between X and Y can be modeled by the
equation of a straight line:
Y = b0 + b1·X
where b0 is the intercept (the predicted value of Y when X = 0) and b1 is the slope (the change in Y for a one-unit increase in X).
The least squares regression line is the line that minimizes the sum of the squared
differences (also called residuals) between the observed data points and the values predicted
by the regression line. These residuals represent the vertical distance between each data point
and the regression line.
The term "least squares" comes from the method used to calculate the line: minimizing the
sum of the squared residuals.
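numpy.polyfit performs exactly this minimization for a straight line; the sketch below reuses the assumed hours/scores data from the correlation example above:
import numpy as np
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 55, 60, 63, 70, 72, 78, 85])
slope, intercept = np.polyfit(hours, scores, deg=1)     # least squares line Y = intercept + slope·X
print(f"Fitted line: Y = {intercept:.2f} + {slope:.2f}·X")
residuals = scores - (intercept + slope * hours)         # vertical distances from the points to the line
print("Sum of squared residuals:", np.sum(residuals ** 2))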
The Standard Error of Estimate (SEE) is a measure of the accuracy of predictions made by
a regression model. It quantifies how much the actual data points deviate from the values
predicted by the regression line. In other words, the SEE provides an estimate of the typical
error (or residual) in predicting the dependent variable using the regression model.
The Standard Error of Estimate tells you how close the data points are to the
regression line. A small SEE indicates that the data points are close to the line,
suggesting that the regression model is a good fit for the data. A large SEE indicates
a poor fit, with large deviations between the actual data and the predicted values.
SEE in context:
o A high SEE means the predictions made by the regression model have a lot of
error.
o A low SEE means the model is making more accurate predictions with smaller
residuals.
Unit of SEE: The SEE is in the same unit as the dependent variable Y, which
makes it interpretable in the context of the data. For example, if you're predicting test
scores (in points), the SEE will tell you the average prediction error in points.
R², also known as the coefficient of determination, is a key statistical measure that helps
evaluate the goodness of fit of a regression model. It represents the proportion of the variance
in the dependent variable that is explained by the independent variable(s) in the
regression model.
In simple terms, R² tells you how well the regression line (or model) fits the data.
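Both SEE and R² can be computed directly from the residuals of a fitted line; the sketch below reuses the assumed hours/scores data from the examples above:
import numpy as np
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 55, 60, 63, 70, 72, 78, 85])
slope, intercept = np.polyfit(hours, scores, deg=1)
predicted = intercept + slope * hours
residuals = scores - predicted
n = len(scores)
see = np.sqrt(np.sum(residuals ** 2) / (n - 2))          # standard error of estimate (n − 2 df in simple regression)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((scores - scores.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot                          # proportion of variance explained
print("SEE:", see)
print("R²:", r_squared)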
Multiple Regression
Multiple Regression is an extension of simple linear regression where multiple independent
variables (predictors) are used to predict a dependent variable (outcome). It is a statistical
technique used to understand the relationship between two or more predictor variables and
the dependent variable.
In multiple regression, the goal is to model the dependent variable Y as a linear combination
of the independent variables X1,X2,...,Xp.
Regression Toward the Mean refers to the phenomenon in statistics where extreme values
(either very high or very low) in one variable tend to be closer to the average (mean) of the
population in subsequent measurements. In other words, extreme data points in the first
measurement are likely to be closer to the mean in future measurements, even without any
intervention.
This concept is particularly relevant in the context of multiple regression when you have
outliers or extreme values in your data.
Regression toward the mean occurs because extreme values are often the result of a
combination of factors, including random variation. These extreme values are likely to be
followed by values that are closer to the overall average, which reflects the inherent
variability in data.
For example, imagine that you are studying the relationship between height and weight of a
group of individuals, and you observe that someone is extremely tall and heavy (an extreme
outlier). If you then measure this person again (or predict their weight based on height), the
new measurement is likely to be closer to the average for the population of individuals,
because extreme deviations are less likely to persist in future data.
In the context of multiple regression, regression toward the mean can occur when predictors or outcomes take extreme values relative to their averages.
For instance, if you have a regression model predicting student performance based on
variables like study hours and previous exam scores, and a student has extremely high
study hours, the predicted performance will still likely regress toward the mean of the class,
even if study hours were a significant predictor.
Example: Suppose two groups of students take the same exam, and:
Group A has an extremely high average score due to random factors, such as luck or
an easier exam.
Group B has an extremely low average score, due to bad luck or other random
factors.
If you were to predict the scores of students in both groups in the following year, you might
expect that:
The students in Group A will have scores closer to the mean of the general
population (not as high as their initial extreme scores).
Similarly, students in Group B will likely improve their scores, moving toward the
mean, assuming their extremely low scores were not due to underlying systematic
issues.
This means that, despite their extreme performances, both groups are expected to "regress"
toward the average performance level, especially when random factors (such as difficulty of
the exam) were involved.
Over time, extreme values often normalize: In various real-world contexts, the
process of regression toward the mean means that outliers or extreme performances
often revert to a more typical, average level over time.
Caution in interpreting extreme predictions: When using regression models,
especially with extreme predictors, we must be cautious about over-interpreting
extreme predictions. These predictions might often regress toward the mean,
especially if the extreme values are due to randomness rather than actual underlying
trends.
Application in policy and decision-making: In policy analysis, education, and
economics, it is important to recognize that interventions or predictions based on
extreme cases may not reflect future outcomes as strongly as initially expected.
INFERENTIAL STATISTICS
1. Population and Sample
Population: The entire group you're interested in studying (e.g., all the people in a
country).
Sample: A subset of the population, used to make inferences about the population.
2. Estimation
Point Estimation: A single value that estimates a population parameter (e.g., the
sample mean is a point estimate of the population mean).
Confidence Interval: A range of values within which the population parameter is
expected to lie, with a certain level of confidence (e.g., 95% confidence interval).
3. Hypothesis Testing
A formal procedure for using sample data to decide whether to reject a claim (the null hypothesis) about a population parameter; covered in detail in the HYPOTHESIS TESTING section below.
4. Types of Tests
t-test: Compares the means of two groups to determine if they are significantly
different.
Chi-square test: Tests the relationship between categorical variables.
ANOVA (Analysis of Variance): Compares the means of three or more groups.
Regression Analysis: Examines the relationship between dependent and independent
variables.
5. Sampling Methods
Random Sampling: Every member of the population has an equal chance of being
selected.
Stratified Sampling: Divides the population into subgroups (strata) and takes
samples from each.
Cluster Sampling: Divides the population into clusters and randomly selects clusters
to sample.
6. Types of Errors
Type I Error (False Positive): Rejecting the null hypothesis when it is actually true.
Type II Error (False Negative): Failing to reject the null hypothesis when it is
actually false.
7. Power of a Test
Statistical Power: The probability of correctly rejecting the null hypothesis when it is
false. High power reduces the likelihood of Type II errors.
POPULATIONS
In statistics, a population refers to the entire set of individuals, items, or data points that are
of interest for a particular study or analysis. It represents the complete group that you want to
draw conclusions about. Understanding populations is fundamental in inferential statistics
because you typically can't study the entire population (due to time, cost, or other practical
limitations), so you take a sample from it and make inferences based on that.
SAMPLES
A sample in statistics refers to a subset of individuals or data points taken from a larger
population. Since it’s often impractical or impossible to collect data from an entire
population, a sample is used to make inferences about the population. The key is that the
sample should be representative of the population so that conclusions drawn from the sample
can reasonably be generalized to the entire population.
1. Sample Size
Sample size refers to the number of observations or data points included in a sample. It’s
denoted by n.
A larger sample size generally provides more reliable estimates of the population parameters,
reducing the chance of errors.
2. Sampling Methods
There are various techniques for selecting a sample from a population. Here are some
common sampling methods:
Simple Random Sampling:
o Every member of the population has an equal chance of being selected.
o This is the most straightforward method, ensuring that the sample is unbiased.
o Example: Randomly selecting names from a list of all students in a school.
Stratified Sampling:
o The population is divided into subgroups or strata that share a similar characteristic
(e.g., age, gender, income level), and then random samples are taken from each
subgroup.
o This method ensures that each subgroup is represented in the sample.
o Example: Sampling individuals from different income levels when studying the
purchasing behavior of consumers.
Cluster Sampling:
o The population is divided into clusters (often based on geography or other natural
divisions), and entire clusters are randomly selected for the sample.
o This method is cost-effective but can introduce bias if clusters aren't homogeneous.
o Example: Randomly selecting cities and studying all households within those cities.
Systematic Sampling:
o Every k-th element in the population is selected. For example, you might pick every
10th person on a list.
o This is simple to implement, but can introduce bias if there's an underlying pattern in
the population.
o Example: Surveying every 5th person entering a store.
Convenience Sampling:
o The sample is chosen based on convenience rather than randomness. This is a non-
random sampling method, which can introduce significant bias.
o Example: Surveying people who are easy to reach, like asking students in a classroom
for their opinions on a topic.
3. Representative Sample
A representative sample reflects the key characteristics of the population (e.g., its mix of ages, genders, or regions), so that conclusions drawn from the sample can reasonably be generalized to the population.
4. Sampling Error
Sampling error refers to the difference between the sample statistic and the population
parameter it’s estimating.
It’s natural for a sample statistic (like the sample mean) to differ from the population
parameter (like the population mean), and this difference is known as sampling error.
Larger samples tend to have smaller sampling errors because they better reflect the
population.
5. Types of Data
Quantitative Data: Data that can be measured and expressed numerically (e.g., age, weight,
income).
Qualitative Data: Data that describes characteristics or categories (e.g., gender, color,
opinions).
6. Sample vs. Population
The sample is a subset, and the population is the entire group you're trying to study.
For example, if you’re conducting a survey on customer satisfaction for a company, the
population might be all customers of the company, and the sample would be the subset of
customers you select to survey.
7. Importance of Sampling
Sampling allows you to make inferences about a population without the need to
survey every individual or collect every possible data point.
By carefully selecting a sample, you can estimate population parameters such as the
mean, variance, or proportion.
For example:
o If you want to estimate the average height of all high school students in a country,
surveying every student would be impractical. Instead, a well-chosen sample can
provide a reliable estimate of the population's average height.
8. Bias in Sampling
If the sample is not chosen properly, the results can be biased, meaning they don't accurately
reflect the population. Some common types of sampling bias include:
o Selection Bias: When certain members of the population are more likely to be
selected for the sample than others.
o Non-response Bias: When people selected for the sample do not respond, and their
non-response is related to the variables of interest.
o Survivorship Bias: Focusing on individuals or items that have "survived" a particular
process, while overlooking those that didn't.
9. Extrapolation
The process of using data from the sample to make predictions or generalizations about the
population is called extrapolation.
The validity of extrapolation depends on how well the sample represents the population.
RANDOM SAMPLING
Random sampling is a selection method in which chance alone determines which members of the population are included in the sample. Its key characteristics are:
1. Equal Probability: Each member of the population has the same chance of being
selected. This reduces bias and increases the likelihood that the sample will represent
the population accurately.
2. Unbiased Selection: Since the selection is random, no particular group or individual
is favored over another, helping to avoid systematic errors or bias in the sample.
There are different ways to implement random sampling depending on the structure of the
population and the research goals; the most common are the methods described above: simple random, stratified, cluster, and systematic sampling.
SAMPLING DISTRIBUTION
A sampling distribution is the probability distribution of a given statistic (such as the
sample mean, sample proportion, or sample standard deviation) based on random samples
drawn from a population. It's a key concept in inferential statistics because it provides the
foundation for making statistical inferences about population parameters using sample
statistics.
The sampling distribution describes how a sample statistic (e.g., sample mean) varies from
sample to sample.
It shows the distribution of a statistic over all possible random samples that could be drawn
from a population.
For example, if you were to repeatedly take random samples from a population and calculate
the sample mean for each sample, the sampling distribution of the sample mean would be
the distribution of all those sample means.
Imagine you are studying the average height of students at a university. You can't measure
every student (the population), so you take multiple random samples, each with a certain
number of students.
For each sample, you calculate the mean height.
The sampling distribution of the sample mean would be the distribution of all those sample
means, showing how the sample means differ from each other.
Mean of the Sampling Distribution (µₓ̄): The mean of the sample statistic (e.g.,
sample mean) will be equal to the population parameter (e.g., population mean). This
is known as the expected value of the statistic.
o Formula: The mean of the sampling distribution of the sample mean is equal to the
population mean: µx̄ = µ
o This means that the average of all possible sample means will be equal to the
population mean.
Standard Error (SE): The standard deviation of the sampling distribution is called
the standard error. It measures how much the sample statistic (like the sample mean)
is expected to vary from the population parameter. For the sample mean, SE = σ / √n, where σ is the population standard deviation and n is the sample size.
The Central Limit Theorem (CLT) is one of the most important principles in statistics. It
states that, for large sample sizes (usually n ≥ 30), the sampling distribution of the sample
mean will be approximately normal, regardless of the original population distribution.
o Important points about the CLT:
If you take sufficiently large random samples from a population, the
sampling distribution of the sample mean will be approximately normal, even
if the population distribution is not normal.
The mean of the sampling distribution will be equal to the population mean.
The standard deviation (standard error) of the sampling distribution decreases
as the sample size increases.
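A small simulation illustrates the CLT: sample means drawn from a clearly non-normal (exponential) population still form an approximately normal sampling distribution. All parameters below are assumptions for illustration:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
population = rng.exponential(scale=10, size=100_000)    # skewed, non-normal population
sample_means = [rng.choice(population, size=30).mean() for _ in range(2000)]
plt.hist(sample_means, bins=40)
plt.title("Sampling distribution of the mean (n = 30)")
plt.xlabel("Sample mean")
plt.ylabel("Frequency")
plt.show()
print("Population mean:", population.mean())
print("Mean of sample means:", np.mean(sample_means))                   # ≈ population mean
print("Standard error (observed):", np.std(sample_means))
print("Standard error (theory, σ/√n):", population.std() / np.sqrt(30))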
Statistical Inference: The sampling distribution provides the foundation for making
inferences about population parameters based on sample statistics.
Confidence Intervals: It is used to calculate confidence intervals for population parameters,
giving us a range within which the population parameter is likely to fall.
Hypothesis Testing: The sampling distribution is crucial for hypothesis testing, allowing us
to assess the likelihood that a sample statistic could have occurred by chance.
Understanding Variation: It helps us understand how much variation to expect in sample
statistics and the degree of uncertainty associated with estimates.
Let's consider an example where the population is the scores of all students in a large
university's final exam, and we are interested in the sampling distribution of the sample
mean.
Population Parameters: Suppose the population mean is 75 and the population standard
deviation is 10.
Sample Size: We decide to take random samples of 30 students (n = 30) and calculate the
mean score for each sample.
Sampling Distribution: If we repeat this sampling process many times, the sampling
distribution of the sample mean will be approximately normal (thanks to the Central Limit
Theorem).
o The mean of the sampling distribution will be 75 (same as the population mean).
o The standard error will be 10 / √30 ≈ 1.83, so most sample means will fall close to 75.
The standard error of the mean (SEM) is a measure of how much variability or uncertainty
there is in the sample mean as an estimate of the population mean. In other words, it indicates
how much the sample means are expected to vary from the true population mean if you were
to take many samples from the population.
The standard error of the mean helps us understand the precision of the sample mean. A
smaller standard error means that the sample mean is likely to be closer to the population
mean, while a larger standard error means there is more variability in the sample means, and
the estimate might be less precise.
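The SEM is simply the sample standard deviation divided by the square root of the sample size; the sample below is an assumed example:
import numpy as np
sample = np.array([72, 75, 78, 69, 80, 74, 71, 77, 76, 73])   # assumed sample of exam scores
n = len(sample)
sem = sample.std(ddof=1) / np.sqrt(n)      # sample standard deviation (ddof=1) divided by √n
print("Sample mean:", sample.mean())
print("Standard error of the mean:", sem)
# scipy.stats.sem(sample) gives the same result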
HYPOTHESIS TESTING
1. Null Hypothesis (H₀):
o The null hypothesis is a statement of no effect, no difference, or no
relationship. It’s the hypothesis that there is nothing happening and that any
observed effect is due to random chance.
o For example, in testing whether a new drug works, the null hypothesis might
state that the drug has no effect.
2. Alternative Hypothesis (H₁ or Hₐ):
o The alternative hypothesis is the opposite of the null hypothesis. It suggests
that there is a real effect, difference, or relationship.
o For example, the alternative hypothesis might state that the drug does have an
effect.
3. Test Statistic:
o A test statistic is a standardized value used to decide whether to reject the null
hypothesis. It’s calculated from sample data and compares it to the distribution
of possible values under the null hypothesis.
o Common test statistics include:
t-statistic (for t-tests)
z-statistic (for z-tests)
chi-square statistic (for chi-square tests)
F-statistic (for ANOVA)
4. Significance Level (α):
o The significance level (denoted as α) is the threshold used to decide whether
to reject the null hypothesis. It represents the probability of rejecting the null
hypothesis when it is actually true (Type I error).
o Common significance levels are 0.05, 0.01, and 0.10. For example, if α = 0.05,
you are willing to accept a 5% chance of making a Type I error.
5. P-Value:
o The p-value is the probability of obtaining a test statistic at least as extreme as
the one calculated from the sample, assuming the null hypothesis is true.
o If the p-value is smaller than the significance level (α), you reject the null
hypothesis. If the p-value is larger, you fail to reject the null hypothesis.
o A small p-value indicates strong evidence against the null hypothesis, while a
large p-value suggests weak evidence.
6. Critical Value:
o The critical value is a point on the test statistic’s distribution that defines the
boundary for rejecting the null hypothesis. If the test statistic exceeds the
critical value, the null hypothesis is rejected.
7. Type I and Type II Errors:
o Type I Error (False Positive): Rejecting the null hypothesis when it is
actually true.
o Type II Error (False Negative): Failing to reject the null hypothesis when it
is actually false.
8. Power of a Test:
o The power of a test is the probability of correctly rejecting the null hypothesis
when it is false. A higher power means the test is more likely to detect an
effect if there is one.
Common Types of Hypothesis Tests:
One-sample t-test: Used to compare the sample mean to a known population mean
when the population standard deviation is unknown.
Z-test: Used to compare the sample mean to the population mean when the
population standard deviation is known.
Two-sample t-test: Used to compare the means of two independent samples.
Paired t-test: Used to compare means from the same group at different times or under
different conditions.
Chi-square test: Used for categorical data to test the association between two
variables or goodness of fit.
ANOVA (Analysis of Variance): Used to compare means among three or more
groups.
Types of Z-tests:
1. One-Sample Z-test: Used when comparing the sample mean to a known population
mean.
2. Two-Sample Z-test: Used when comparing the means of two independent groups.
3. Z-test for Proportions: Used when comparing proportions from two different
groups.
Two-sample Z-test: Used when comparing the means of two independent samples.
Hypothesis:
o Null hypothesis (H₀): The means of the two groups are equal.
o Alternative hypothesis (H₁): The means of the two groups are not equal.
Formula:
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
Z-test for proportions:
Hypothesis:
o Null hypothesis (H₀): The proportion is equal to a specified value or the
proportions of two groups are equal.
o Alternative hypothesis (H₁): The proportion is not equal to the specified
value or the proportions of two groups are not equal.
Formula:
One sample: z = (p̂ − p0) / √(p0(1 − p0)/n); two samples: z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)), where p̂ is the pooled proportion.
Problem:
A manufacturer claims that the average lifetime of a particular type of battery is 500 hours. A
random sample of 40 batteries is selected, and the sample mean lifetime is found to be 485
hours. The population standard deviation is known to be 100 hours. Test the manufacturer's
claim at a 5% significance level.
Decision rule: Treating this as a two-tailed one-sample Z-test at α = 0.05, reject H₀ if |z| > 1.96. The test statistic is z = (485 − 500) / (100 / √40) ≈ −0.95.
Interpretation: Since |−0.95| < 1.96 (two-tailed p ≈ 0.34), we fail to reject H₀; the sample does not provide sufficient evidence against the manufacturer's claim of a 500-hour average lifetime.
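The same test can be reproduced in Python as a quick check (a minimal sketch using scipy for the critical value and p-value):
import numpy as np
from scipy.stats import norm
mu0, xbar, sigma, n, alpha = 500, 485, 100, 40, 0.05
z = (xbar - mu0) / (sigma / np.sqrt(n))        # test statistic
p_value = 2 * (1 - norm.cdf(abs(z)))           # two-tailed p-value
z_critical = norm.ppf(1 - alpha / 2)           # ≈ 1.96
print("z =", z)                                 # ≈ -0.95
print("p-value =", p_value)                     # ≈ 0.34
print("Reject H0?", abs(z) > z_critical)        # False → fail to reject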
one-tailed and two-tailed tests
In hypothesis testing, one-tailed and two-tailed tests are used to determine whether a sample
mean (or proportion) is significantly different from the population mean (or proportion) in a
particular direction or in both directions. The main difference between these two types of
tests lies in the directionality of the hypothesis.
1. One-Tailed Test
A one-tailed test is used when we are interested in whether the sample mean differs from the population mean in one specific direction only.
Right-Tailed Test (Upper-Tailed Test): This test checks if the sample mean is significantly
greater than the population mean. The critical region is in the right tail of the distribution.
o Example: You are testing if the average salary of employees in a company is greater
than $50,000.
Left-Tailed Test (Lower-Tailed Test):
This test checks if the sample mean is significantly less than the population mean. The
critical region is in the left tail of the distribution.
o Example: You are testing if the average temperature in a city is less than 30°C.
2. Two-Tailed Test
A two-tailed test is used when we are interested in determining if the sample mean is
significantly different from the population mean, without specifying the direction of the
difference. The critical regions are in both the left and right tails of the distribution.
UNIT V ANALYSIS OF VARIANCE AND PREDICTIVE ANALYTICS
The F-test is a statistical test used to compare two or more population variances or to assess
the goodness of fit in models. It is commonly used in analysis of variance (ANOVA),
regression analysis, and to test hypotheses about variances.
Common Uses of the F-test:
1. ANOVA (Analysis of Variance):
o The F-test is used in ANOVA to determine if there are any significant
differences between the means of three or more groups.
o The null hypothesis typically assumes that all group means are equal, while
the alternative hypothesis suggests that at least one group mean differs from
the others.
2. Testing Equality of Variances:
o The F-test can be used to test if two populations have the same variance.
o The null hypothesis typically states that the variances are equal, while the
alternative hypothesis suggests that the variances are not equal.
3. Regression Analysis:
o In multiple regression, the F-test is used to test if the regression model as a
whole is a good fit for the data.
o The null hypothesis assumes that all regression coefficients are equal to zero
(i.e., the model has no explanatory power).
ANOVA
ANOVA (Analysis of Variance) is a statistical method used to test differences between the
means of three or more groups. It helps determine whether there are any statistically
significant differences between the means of the groups being compared.
Key Concepts of ANOVA:
Types of ANOVA:
1. One-Way ANOVA:
o Used when comparing the means of three or more independent groups based
on one factor (independent variable).
o Example: Comparing the test scores of students from three different teaching
methods.
2. Two-Way ANOVA:
o Used when comparing the means of groups based on two factors (independent
variables).
o It can also assess the interaction effect between the two factors.
o Example: Comparing the test scores of students based on both teaching
method and gender.
3. Repeated Measures ANOVA:
o Used when the same subjects are tested under different conditions or at
different times.
o Example: Measuring the effect of a drug on the same group of patients at
multiple time points.
Assumptions of ANOVA:
Independence: The samples or groups should be independent of each other.
Normality: The data in each group should be approximately normally distributed.
Homogeneity of Variances: The variances across the groups should be
approximately equal (this is known as homoscedasticity).
ANOVA Steps:
1. Calculate Group Means:
o Compute the mean for each group.
2. Calculate Overall Mean:
o Compute the overall mean (grand mean) of all the data combined.
3. Calculate the Sum of Squares:
o Total Sum of Squares (SST): Measures the total variation in the data.
o Between-Group Sum of Squares (SSB): Measures the variation due to the
differences between the group means and the overall mean.
o Within-Group Sum of Squares (SSW): Measures the variation within each
group (i.e., how individual observations vary from their group mean).
4. Compute the F-statistic:
o F = MSB / MSW, where MSB = SSB / (k − 1) and MSW = SSW / (N − k) are the mean squares for k groups and N total observations.
5. Make a Decision:
o Compare the calculated F-statistic to the critical value from the F-distribution
table at the desired significance level (usually 0.05).
o If the calculated F-statistic is greater than the critical value, reject the null
hypothesis (indicating that there is a significant difference between the group
means).
Example of One-Way ANOVA:
Imagine we have three groups of people who were given different diets, and we want to test if
their weight loss differs. The groups are:
Group 1 (Diet A)
Group 2 (Diet B)
Group 3 (Diet C)
We would:
1. Calculate the mean weight loss for each group.
2. Compute the overall (grand) mean of weight loss.
3. Calculate the sums of squares (SST, SSB, SSW).
4. Compute the F-statistic.
5. Compare the F-statistic with the critical value from the F-distribution table to decide
if the differences are significant.
Interpretation:
If the F-statistic is large, it suggests that the between-group variability is large relative
to the within-group variability, indicating that at least one group mean is different.
If the F-statistic is small, it suggests that the group means are not significantly
different.
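scipy.stats.f_oneway carries out these steps for the diet example; the weight-loss numbers below are assumptions purely for illustration:
from scipy.stats import f_oneway
diet_a = [3.1, 2.8, 3.5, 3.0, 2.9]    # assumed weight loss (kg), Diet A
diet_b = [2.2, 2.5, 2.0, 2.4, 2.3]    # assumed weight loss (kg), Diet B
diet_c = [3.8, 4.0, 3.6, 3.9, 4.1]    # assumed weight loss (kg), Diet C
f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)
print("F =", f_stat, "p =", p_value)
# If p < 0.05, reject H0 and conclude that at least one diet's mean weight loss differs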
Two-factor experiments
Two-factor experiments involve testing two independent variables (factors) simultaneously to
understand how they individually and interactively affect a dependent variable (response).
These types of experiments are especially useful when you want to assess not just the
individual effect of each factor, but also whether there is an interaction effect between the two
factors.
In a two-factor experiment, you have:
Two independent variables (factors): These could be categorical or continuous. For
example, in a study on plant growth, factors could be "soil type" and "fertilizer type."
Levels of the factors: Each factor will have different levels. For example, "soil type"
could have two levels (e.g., sandy, loamy), and "fertilizer type" could have three
levels (e.g., organic, chemical, none).
Response variable (dependent variable): This is the outcome or measurement you're
interested in, such as plant height or crop yield.
Example of a Two-Factor Experiment:
Imagine you want to study the effects of two factors on the growth of plants:
1. Factor 1: Type of fertilizer (with 2 levels: Organic, Synthetic)
2. Factor 2: Amount of water (with 3 levels: Low, Medium, High)
You would test the different combinations of these two factors:
Organic Fertilizer + Low Water
Organic Fertilizer + Medium Water
Organic Fertilizer + High Water
Synthetic Fertilizer + Low Water
Synthetic Fertilizer + Medium Water
Synthetic Fertilizer + High Water
Key Concepts:
1. Main Effects: These represent the individual effects of each factor (independent
variable) on the dependent variable.
o Main Effect of Factor 1 (Fertilizer): Does the type of fertilizer (organic vs.
synthetic) affect plant growth?
o Main Effect of Factor 2 (Water): Does the amount of water (low, medium,
high) affect plant growth?
2. Interaction Effect: This is the combined effect of the two factors on the dependent
variable. The interaction effect assesses whether the effect of one factor depends on
the level of the other factor.
o For example, the effect of fertilizer on plant growth might differ depending on
the amount of water. If plants with organic fertilizer grow well under high
water but poorly under low water, there is an interaction between the two
factors.
Types of Two-Factor Designs:
1. Two-Factor Design with Replication:
o In this design, each combination of the two factors is repeated multiple times
(replications) to reduce the impact of random variation. This helps provide
more reliable results.
2. Two-Factor Design without Replication:
o Each combination of the factors is tested only once. This design can be less
reliable because the results could be influenced by uncontrolled variables or
randomness.
Statistical Analysis of Two-Factor Experiments:
In a two-factor experiment, you typically perform a two-way analysis of variance (ANOVA).
This allows you to assess:
Main effects of the two factors: How each factor (independently) affects the
dependent variable.
Interaction effect: Whether the effect of one factor depends on the level of the other
factor.
Steps in Two-Way ANOVA:
1. Hypotheses:
o Null Hypothesis (H₀): No effect from either factor or their interaction. (i.e.,
Factor 1 has no effect, Factor 2 has no effect, and there is no interaction
effect).
o Alternative Hypothesis (H₁): At least one of the effects (main effects or
interaction) is significant.
2. Two-Way ANOVA Table: This table typically contains:
o Sum of Squares (SS): The variation attributable to each factor and the
interaction term.
o Degrees of Freedom (df): The number of levels minus one for each factor and
the interaction term.
o Mean Squares (MS): Sum of Squares divided by their respective degrees of
freedom.
o F-statistics: The ratio of the Mean Square for each effect divided by the Mean
Square for error (within-group variation).
3. Decision Rule:
o Compare the F-statistic for each effect (Factor 1, Factor 2, and Interaction)
with the critical value from the F-distribution.
o If the F-statistic is larger than the critical value, reject the null hypothesis for
that effect.
Example of Two-Way ANOVA Analysis:
Let’s continue with the plant growth example:
Factor 1 (Fertilizer): Organic vs. Synthetic
Factor 2 (Water): Low, Medium, High
The ANOVA table might look something like this (hypothetical data):
Source | Sum of Squares | df | Mean Square | F | p-value
Interaction (Fertilizer × Water) | 50 | 2 | 25 | 1.2 | 0.30
(The main effects of Fertilizer and Water, and the error term, would have their own rows in the full table.)
Decision Rule: Compare the computed F-value with the critical F-value from the F-
distribution table. If the computed F-value is greater than the critical F-value, reject
the null hypothesis.
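A sketch of the corresponding two-way ANOVA in statsmodels; the plant-growth values below are assumed purely for illustration (2 fertilizers × 3 water levels, 2 replications each):
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.DataFrame({
    "fertilizer": ["Organic"] * 6 + ["Synthetic"] * 6,
    "water": ["Low", "Low", "Medium", "Medium", "High", "High"] * 2,
    "growth": [10, 11, 14, 15, 18, 19, 9, 10, 13, 12, 20, 22],   # assumed plant growth
})
model = smf.ols("growth ~ C(fertilizer) * C(water)", data=df).fit()   # main effects + interaction
anova_table = sm.stats.anova_lm(model, typ=2)                          # two-way ANOVA table
print(anova_table)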
2. F-test in Analysis of Variance (ANOVA)
Purpose: To test if there are any significant differences between the means of three or
more groups.
Scenario: You want to determine whether different teaching methods lead to different
average scores among students.
Null Hypothesis (H₀): All group means are equal.
o H₀: μ1 = μ2 = μ3 = ... = μk
Alternative Hypothesis (H₁): At least one group mean is different.
3. F-test in Regression Analysis (Overall Significance)
Purpose: To test if the overall regression model is significant. In other words,
whether at least one of the independent variables significantly explains the variability
in the dependent variable.
Scenario: You want to determine whether the combination of independent variables
(e.g., hours studied and number of practice tests taken) predicts the dependent
variable (e.g., exam scores).
Null Hypothesis (H₀): All regression coefficients are equal to zero (i.e., the
independent variables have no effect).
Decision Rule: If the computed F-statistic exceeds the critical value from the F-
distribution table, reject the null hypothesis. This would indicate that the independent
variables collectively explain a significant portion of the variation in the dependent
variable.
Example: You might perform an F-test to evaluate whether the number of study hours and
practice tests together predict exam scores.
Visualizing F-tests:
F-distribution: The F-statistic follows the F-distribution, which is positively skewed
and depends on two degrees of freedom: one for the numerator and one for the
denominator.
Critical F-value: The critical value is determined based on the significance level
(e.g., 0.05) and the degrees of freedom for both the numerator and denominator. If the
F-statistic exceeds the critical value, the null hypothesis is rejected.
Applications of least-squares fitting:
Linear regression: Fit a line to a set of data points.
Curve fitting: Fit more complex models (e.g., polynomials) to data.
Signal processing: Estimate parameters of a model from noisy data.
Goodness Of Fit
Goodness of fit is a statistical measure used to assess how well a model (like a regression
model) fits the data. In the context of linear regression, the goodness of fit tells you how well
the predicted values from the model align with the observed data points.
Here are some key metrics commonly used to evaluate the goodness of fit:
R² (coefficient of determination): The proportion of variance in the dependent variable explained by the model.
Adjusted R²: R² adjusted for the number of predictors, penalizing variables that add no explanatory power.
Standard Error of Estimate / RMSE: The typical size of the residuals, expressed in the units of the dependent variable.
Testing a linear model – weighted resampling
Testing a linear model using weighted resampling involves adjusting how data points are
sampled or weighted during the model evaluation process. This technique can be particularly
useful when dealing with imbalanced data or when certain observations are considered more
important than others.
Weighted Resampling and its Purpose
In a linear regression model (or any statistical model), we may want to:
Assign different importance (weights) to data points depending on factors like
reliability, frequency, or relevance.
Handle imbalanced data where some classes or regions of the data might be
underrepresented.
Perform resampling (such as bootstrap or cross-validation) in a way that gives more
influence to certain data points.
Weighted Resampling Process
Weighted resampling can be done in several ways, including:
1. Weighted Least Squares (WLS):
o This is a variant of ordinary least squares (OLS) where each data point is
given a weight. The idea is to give more importance to some points during the
fitting process. For example, points with smaller measurement errors might be
given higher weights, while noisy or less reliable data points might get lower
weights.
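A minimal sketch of weighted least squares with statsmodels, where the data and the weights are assumptions for illustration (the third point is treated as noisy and down-weighted):
import numpy as np
import statsmodels.api as sm
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.3])         # assumed observations
weights = np.array([1.0, 1.0, 0.2, 1.0, 1.0])    # assumed reliability weights
X = sm.add_constant(x)                            # design matrix with an intercept column
wls_model = sm.WLS(y, X, weights=weights).fit()
print(wls_model.params)                           # intercept and slope
print(wls_model.summary())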
Performing regression using the Statsmodels library in Python is a common approach for
fitting and analyzing statistical models. Statsmodels provides a rich set of tools for linear
regression, generalized linear models, and other types of regression analysis.
Steps for Linear Regression using Statsmodels
Let’s walk through the basic steps for performing a linear regression using Statsmodels.
1. Install Statsmodels (if you haven't already):
You can install Statsmodels using pip:
pip install statsmodels
2. Import Required Libraries:
You'll need the following libraries:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
sm (from statsmodels.api): Used for general regression models and results.
smf (from statsmodels.formula.api): Allows for a higher-level interface for specifying
models using formulas (similar to R).
3. Prepare Your Data:
Let’s assume you have a dataset with some independent variables (features) and a dependent
variable (target). For this example, let’s create a simple synthetic dataset.
# Create a synthetic dataset
data = {
'X1': [1, 2, 3, 4, 5],
'X2': [2, 4, 6, 8, 10],
'Y': [3, 6, 7, 8, 11]
}
df = pd.DataFrame(data)
X1 and X2 are the independent variables (predictors).
Y is the dependent variable (response).
4. Linear Regression Model:
We will use sm.OLS (Ordinary Least Squares) to fit a linear regression model. Before doing
this, we need to add a constant (intercept) to the features.
# Add a constant (intercept) to the model
X = sm.add_constant(df[['X1', 'X2']])  # independent variables plus an intercept column
y = df['Y']                            # dependent variable
# Note: X2 = 2·X1 in this toy data, so the summary will warn about multicollinearity
5. Fit the Model and View the Results:
model = sm.OLS(y, X).fit()             # fit ordinary least squares
print(model.summary())                 # coefficients, R-squared, p-values, etc.
6. Make Predictions:
# New observations must contain the same columns (including the constant);
# the values below are assumed for illustration
new_data = sm.add_constant(pd.DataFrame({'X1': [6, 7], 'X2': [12, 14]}))
predictions = model.predict(new_data)
print(predictions)
7. Other Regression Types in Statsmodels:
Statsmodels also allows you to fit various other types of regression models, including:
Logistic Regression: For binary (0/1) outcomes; the line below only illustrates the syntax, since Y in this toy dataset is not binary.
logit_model = smf.logit('Y ~ X1 + X2', data=df).fit()
print(logit_model.summary())
Poisson Regression: For count data.
poisson_model = smf.poisson('Y ~ X1 + X2', data=df).fit()
print(poisson_model.summary())
Regression with robust (heteroskedasticity-consistent) standard errors; for outlier-robust fitting, statsmodels also provides RLM.
robust_model = smf.ols('Y ~ X1 + X2', data=df).fit(cov_type='HC3')
print(robust_model.summary())
8. Model Diagnostics:
You can check various diagnostic measures to assess the quality of the model:
# Residuals plot
import matplotlib.pyplot as plt
plt.scatter(model.fittedvalues, model.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()
# Sample data
data = {
'X': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Y': [2, 4, 9, 16, 25, 36, 49, 64, 81, 100] # Quadratic relationship (Y = X^2)
}
# Create a DataFrame
df = pd.DataFrame(data)
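This quadratic dataset appears intended for a curve-fitting example (one of the least-squares applications listed earlier). A possible continuation, assuming that intent, fits a second-degree polynomial with the statsmodels formula API:
import pandas as pd
import statsmodels.formula.api as smf
# df is the DataFrame created just above (Y = X^2 exactly, so the fit will be essentially perfect)
poly_model = smf.ols('Y ~ X + I(X**2)', data=df).fit()
print(poly_model.params)                                   # intercept, X and X² coefficients
print(poly_model.predict(pd.DataFrame({'X': [11, 12]})))   # predicted values for new X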
Moving averages smooth a time series by averaging nearby observations.
Simple Moving Average (SMA): The unweighted mean of the most recent k observations,
SMA_t = (y_(t−k+1) + ... + y_t) / k
where y_i is the observed value at time i, t is the current time point, and k is the window size.
Exponential Moving Average (EMA): The exponential moving average gives more
weight to more recent observations, making it more sensitive to recent changes in the data.
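In pandas, both moving averages are one-liners; the series below is an assumed example:
import pandas as pd
ts = pd.Series([10, 12, 13, 12, 15, 16, 18, 17, 19, 21])    # assumed time series
sma = ts.rolling(window=3).mean()                            # simple moving average over a 3-point window
ema = ts.ewm(span=3, adjust=False).mean()                    # exponential moving average, more weight on recent points
print(pd.DataFrame({"value": ts, "SMA(3)": sma, "EMA(3)": ema}))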
Survival analysis is a branch of statistics that deals with analyzing time-to-event data. The
primary goal is to understand the time it takes for an event of interest to occur. This type of
analysis is particularly useful when studying the duration until one or more events happen,
such as the time until a patient recovers from a disease, the time until a machine breaks down,
or the time until an individual defaults on a loan.
In survival analysis, the "event" typically refers to something of interest, like:
Death (in medical research),
Failure of a machine (in engineering),
Default on a loan (in finance),
Customer churn (in business).
1. Survival Function (S(t)): The survival function represents the probability that the
event of interest has not occurred by a certain time t. It is defined as S(t) = P(T > t), where T is the time at which the event occurs.
2. Censoring: In survival analysis, censoring occurs when the event of interest has not
happened by the end of the observation period. There are two common types of
censoring:
o Right censoring: When the subject has not yet experienced the event by the
end of the study.
o Left censoring: When the event occurred before the subject entered the study.
Censoring is an important feature of survival analysis, as it reflects the fact that we
don't always know the exact time of the event for every individual.
3. Kaplan-Meier Estimator: The Kaplan-Meier estimator is a non-parametric method
used to estimate the survival function from observed survival times, especially when
there is censoring. It provides an empirical estimate of the survival function.
4. Cox Proportional Hazards Model: The Cox model is a regression model that relates
the survival time to one or more predictor variables. It assumes that the hazard at any
time t is a baseline hazard multiplied by an exponential function of the predictor
variables. The model does not require the assumption of a specific survival
distribution, making it a widely used approach.
5. Log-Rank Test: The log-rank test is a statistical test used to compare the survival
distributions of two or more groups. It is commonly used in clinical trials to test
whether different treatment groups have different survival experiences.
Applications of Survival Analysis
Medical Research: Estimating patient survival times after treatment or the time until
the onset of a disease.
Engineering: Predicting the time until failure of machinery or components, such as
the lifespan of a battery or mechanical part.
Business: Estimating the time until a customer churns or a product is returned.
Finance: Analyzing the time until a loan defaults or the bankruptcy of a company.
Survival Analysis Example in Python
Here’s a simple example using Kaplan-Meier estimator and Cox Proportional
Hazards Model in Python.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.datasets import load_rossi
# Example: Rossi dataset, a dataset on recidivism (criminal re-offending)
data = load_rossi()
# Kaplan-Meier Estimator: Estimate the survival function
kmf = KaplanMeierFitter()
kmf.fit(durations=data['week'], event_observed=data['arrest'])
# Plot the Kaplan-Meier survival curve
plt.figure(figsize=(10, 6))
kmf.plot_survival_function()
plt.title("Kaplan-Meier Survival Curve")
plt.xlabel("Weeks")
plt.ylabel("Survival Probability")
plt.show()
# Cox Proportional Hazards Model: Fit the model
cph = CoxPHFitter()
cph.fit(data, duration_col='week', event_col='arrest')
# Display the summary of the Cox model
cph.print_summary()
# Plot the baseline survival function estimated by the Cox model
cph.baseline_survival_.plot()
plt.title("Baseline Survival Function (Cox Model)")
plt.show()
Explanation:
1. Kaplan-Meier Estimator: We use the KaplanMeierFitter from the lifelines package
to estimate the survival function for the dataset. This plot shows the survival
probability over time.
2. Cox Proportional Hazards Model: The CoxPHFitter is used to model the
relationship between the predictors (e.g., age, gender, etc.) and the time to event (e.g.,
recidivism).
Interpreting Results:
Kaplan-Meier Curve: The plot shows how the survival probability decreases over
time.
Cox Model Summary: The summary provides insights into how each predictor
variable influences the time to event (e.g., the effect of a specific treatment on
survival).