R & Python notes
Unit 1: Introduction to R
Basic Concept in R, Data Structure, Import of Data. Graphic Concept in R: Graphic System,
Graphic Parameter Settings, Margin Settings for Figures and Graphics, Multiple Charts, More
Complex Assembly and Layout, Font Embedding, Output with cairo pdf, Unicode in figures,
Colour settings, R packages and functions related to visualization.
Computing an overall summary of a variable and an entire data frame, summary() function,
sapply() function, stat.desc() function, Case of missing values, Descriptive statistics by groups,
Simple frequency distribution: one categorical variable, Two-way contingency table: Two
categorical variables, Multiway tables: More than two categorical variables.
Bar Chart Simple, Bar Chart with Multiple Response Questions, Column Chart with two-line
labeling, Column chart with 45° labeling, Profile Plot, Dot Chart for 3 variables, Pie Chart and
Radial Diagram, Chart Tables, Distributions: Histogram overlay, Box Plots for group,
Pyramids with multiple colors, Pyramid: emphasis on the outer and inner area, Pyramid with
added line, Aggregated Pyramids, Simple Lorenz curve.
Jupyter Notebook, Python Functions, Python Types and Sequences, Python More on Strings,
Reading and Writing CSV files, Advanced Python Objects, map(), Numpy, Pandas, Series
Data Structure, Querying a Series, The DataFrame Data Structure, DataFrame Indexing and
Loading, Querying a DataFrame, Indexing Dataframes, Merging Dataframes
Time Series, Date and Time, Data Types and Tools, Time Series Basics, Date Ranges,
Frequencies, and Shifting, Time Zone Handling, Periods and Period Arithmetic, Resampling
and Frequency Conversion, Time Series Plotting, Moving Window Functions, Natural
Language Processing, Image Processing, Machine Learning K Nearest Neighbors Algorithm
for Classification, Clustering
Using Matplotlib Create line plots, area plots, histograms, bar charts, pie charts, box plots and
scatter plots and bubble plots. Advanced visualization tools such as waffle charts, word clouds,
seaborn and Folium for visualizing geospatial data. Creating choropleth maps
Unit 1: Introduction to R
Basic Concepts of R
Overview
R is a statistical programming language that provides different categories of functionality in
libraries (also called packages). For applying statistical analysis, one often needs sample data.
R ships with many real-life built-in sample datasets that can be used for analysing the statistical
computations and algorithms. To develop these computations, one needs to know regular
programming constructs like variables, data types, operators, loops, etc.
Most of the programming constructs available in R are also available in T-SQL. Our intention
is not to learn R in full, but to learn the R constructs that enable us to consume the unique R
libraries and data processing / computation mechanisms that are not available in T-SQL. In
this lesson, we will learn the basic concepts of R, just sufficient for us to apply R functions
and packages against a SQL Server data repository.
The next step is to explore the different default libraries available in Microsoft R Open. You
can load any given library by using the library() function; we will look at an example of its use
very shortly.
After exploring the list of packages available in R, the next step is to explore the list of
datasets that you can use. The available datasets, classified by package, can be listed with the
data() function.
--Example: Variables
execute sp_execute_external_script
@language = N'R',
@script = N'
# assign sample values so the script runs (values are illustrative)
var1 <- "Hello"
Var1 <- "World"
var2 <- 10
var3 <- 20
var4 <- TRUE
print(var1)
print(Var1)
print(var2 + var3)
print(var4)
print(class(var1))
print(class(var2))
print(class(var4))
';
Executing the above code prints the value and class of each variable. Below are the points you
can derive from the above example:
● Variables can be created using the <- (assignment) operator.
● Variables are case-sensitive. Var1 and var1 are considered different variables.
● The data type of a variable is determined by the type of data stored in it.
● You can inquire about the value of a variable using the print() function.
● The class() function can be used on variables to determine the data type, which is
classified in three major types – character, numeric and logical.
● There are other data structure types too, but we will be limiting our discussion to these
three basic types.
Operator  Description
+         Addition
-         Subtraction
*         Multiplication
/         Division
^         Exponentiation
%%        Modulus
<         Less than
==        Exactly equal to
!=        Not equal to
!         NOT
|         OR
&         AND
Though these operators should be easy to understand, below is a basic example of how you
may use these operators.
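A minimal sketch applying these operators to literal values (the values chosen are illustrative):

```r
# Arithmetic operators
print(7 + 3)   # 10
print(7 - 3)   # 4
print(7 * 3)   # 21
print(7 / 2)   # 3.5
print(2 ^ 3)   # 8
print(7 %% 3)  # 1 (remainder of 7 / 3)

# Comparison and logical operators
print(3 < 5)          # TRUE
print(3 == 3)         # TRUE
print(3 != 5)         # TRUE
print(!TRUE)          # FALSE
print(TRUE | FALSE)   # TRUE
print(TRUE & FALSE)   # FALSE
```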
Here we have used these operators on actual values. You can use these operators in the same
way on variables too.
There is a high possibility that we may have to loop through the data for applying some
statistical computations. So, we need to learn at least one looping technique in R. Below is a
simple example of a while loop. In this example, we are assigning the value of 0 to variable i.
We are printing the value of “i” in the loop and incrementing the value of i. We are also placing
a condition that if the value of i reaches 3, then break out of the loop using the “break”
statement.
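The loop described above can be sketched as:

```r
# Assign 0 to i; print and increment it inside the loop,
# breaking out once i reaches 3
i <- 0
while (TRUE) {
  print(i)
  i <- i + 1
  if (i == 3) {
    break
  }
}
# Prints 0, 1, 2
```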
Graphic Concept in R
1. Graphic System
The graphic system in R is a powerful tool for creating high-quality graphics and visualisations.
It is based on the grid graphics system, which allows for the creation of complex graphics by
breaking them down into smaller components.
The grid graphics system is built around two main types of objects: viewports and grobs. A
viewport is a rectangular area of the plotting region that can contain one or more grobs, which
are graphical objects such as lines, text, or shapes.
Viewports can be nested inside each other to create more complex layouts. For example, a
viewport might contain a grid of smaller viewports, each of which contains one or more grobs.
The grid package provides a set of functions for creating and manipulating viewports and
grobs. Some of the key functions include grid.newpage(), viewport(), pushViewport(),
grid.rect(), grid.lines(), grid.text(), and gpar().
Overall, the grid graphics system in R provides a flexible and powerful tool for creating a wide
range of graphics and visualisations.
Here is an example of how the grid graphics system can be used to create a simple plot:
Example code:
library(grid)
# Start a new page and set up a centred 0.8 x 0.8 viewport
grid.newpage()
pushViewport(viewport(x=0.5, y=0.5, width=0.8, height=0.8))
# Draw a rectangle and a line, setting colour and line width via gpar()
grid.rect()
grid.lines(x=c(0.1, 0.9), y=c(0.1, 0.9), gp=gpar(col="blue", lwd=2))
# Add labels
grid.text("X", x=0.9, y=0.5, gp=gpar(col="black", fontsize=20))
grid.text("Y", x=0.5, y=0.9, gp=gpar(col="black", fontsize=20))
In this example, we first load the grid library, which provides the functions for creating grid-
based graphics. We then create a new plot using the grid.newpage() function.
Next, we set up the plot area using the viewport() function, which specifies the size and position
of the plot. In this case, we set the width and height to 0.8, and position the plot at the centre
of the page.
We then draw a rectangle and a line using the grid.rect() and grid.lines() functions, respectively.
We specify the colour and line width of the line using the gpar() function, which creates a
graphical parameter object.
Finally, we add labels to the x and y axes using the grid.text() function, again specifying the
colour and font size using the gpar() function.
This is just a simple example, but the grid graphics system can be used to create much more
complex and sophisticated graphics in R.
2. Graphic Parameter Settings
Graphical parameters are typically set using the par() function, which takes a list of parameter-
value pairs as its argument. For example, the following code sets the line width to 2 and the
line colour to red:
Example code:
par(lwd=2, col="red")
Some of the most commonly used graphical parameters in R include col (colour), lwd (line
width), lty (line type), pch (plotting symbol), cex (text and symbol size), mar (plot margins),
and mfrow (multi-figure layout).
These parameters can be set globally using the par() function, or they can be set on a per-
element basis using functions like points(), lines(), and text(). For example, the following code
sets the colour and size of individual points:
Example code:
x <- 1:10
y <- x^2
plot(x, y)
points(x, y, col="blue", cex=2)
In this example, we first plot a simple line plot using the plot() function. We then add individual points
to the plot using the points() function, specifying the color and size of the points using the col and cex
parameters.
Overall, graphical parameter settings provide a powerful way to modify the appearance of graphics in
R, allowing you to create visually appealing and informative visualisations.
3. Margin Settings for Figures and Graphics
The margin settings in R are controlled by four graphical parameters: mar, mai, oma, and
mgp. Each of these parameters controls a different aspect of the plot margins, and (except for
mgp) takes its four values in the order c(bottom, left, top, right):
mar: This parameter controls the size of the plot margins, in lines of text. The default value is
c(5, 4, 4, 2) + 0.1, which means that the bottom margin is about 5 lines tall, the left and top
margins are about 4 lines, and the right margin is about 2 lines.
mai: This parameter controls the same plot margins as mar, but measured in inches. For
example, mai=c(0.5, 0.5, 0.5, 0.5) gives a 0.5-inch margin on all sides of the plot.
oma: This parameter controls the size of the outer margins of the entire graphics device, in
lines of text. The default value is c(0, 0, 0, 0), which means that there is no outer margin.
mgp: This parameter controls the placement of the axis title, axis labels, and axis line, in lines
out from the plot edge. The default value is c(3, 1, 0), which means that the axis title sits 3
lines out, the axis labels 1 line out, and the axis line at the plot edge.
To change the margin settings of a plot or graphic in R, you can modify these parameters using
the par() function. For example, to increase the size of the outer margins of a plot, you could
use the following code:
Example code:
par(mar=c(8, 6, 6, 4))
This code sets the bottom margin to 8 lines, the left margin to 6 lines, the top margin to 6
lines, and the right margin to 4 lines. Similarly, to set the plot margins in inches, you could use
the following code:
Example code:
par(mai=c(1, 1, 1, 1))
This code sets the plot margins to 1 inch on all sides. By adjusting these margin settings, you
can control the spacing around your plot or graphic and ensure that it looks great in your final
output.
4. Multiple Charts
In R, you can create multiple charts or plots within a single graphic device by using functions
like par() and layout().
One way to create multiple charts within a single graphic device is to use the par() function to
set the layout of the plots. The par() function can be used to specify the number of rows and
columns of plots, as well as the size and spacing of each plot. For example, the following code
creates a graphic device with two plots arranged in a 1x2 grid:
Example code:
par(mfrow=c(1, 2))
plot(1:10, rnorm(10))
plot(rnorm(100))
In this code, we first use the par() function to set mfrow to c(1, 2), which creates a grid with
one row and two columns. We then create two plots using the plot() function, and they are
automatically arranged in the grid.
Another way to create multiple plots is to use the layout() function. The layout() function allows
you to specify a grid of plots using a matrix of numbers, where each number represents the size
of the corresponding plot. For example, the following code creates a graphic device with two
plots arranged in a 2x1 grid:
Example code:
layout(matrix(c(1,2), nrow=2))
plot(1:10, rnorm(10))
plot(rnorm(100))
In this code, we use the layout() function to specify a grid with two rows and one column, and
then create two plots using the plot() function. The first plot is assigned to the first cell in the
grid, and the second plot is assigned to the second cell.
Overall, creating multiple charts or plots within a single graphic device in R is a useful way to
compare data or visualisations side-by-side, and can help to create more informative and
visually appealing visualisations.
5. More Complex Assembly and Layout
The gridExtra package provides a set of functions for arranging multiple charts or plots in a
grid-like layout. For example, the grid.arrange() function can be used to arrange multiple charts
or plots in a grid, while the arrangeGrob() function can be used to arrange multiple graphical
objects (such as plots, text, and images) in a grid. These functions allow you to control the
spacing and alignment of the grid cells, as well as the overall size and aspect ratio of the grid.
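A sketch of these two functions (the plots used here are illustrative):

```r
library(ggplot2)
library(gridExtra)

# Two simple ggplot2 plots to arrange
p1 <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()
p2 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# grid.arrange() draws both plots side by side in a 1 x 2 grid
grid.arrange(p1, p2, ncol = 2)

# arrangeGrob() builds the same layout as a graphical object without
# drawing it, e.g. for saving with ggsave()
g <- arrangeGrob(p1, p2, ncol = 2)
```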
The ggplot2 package also provides a powerful system for creating complex visualisations in R.
In ggplot2, you can create plots using a layered grammar of graphics, where each layer
represents a different component of the plot (such as the data, the aesthetics, and the geometry).
This allows you to easily add and customise different components of the plot, such as adding
multiple layers of data or adjusting the layout and spacing of the plot elements.
For example, the following code creates a complex visualisation using ggplot2 to display
multiple layers of data on a single plot:
Example code:
library(ggplot2)
library(dplyr)
data(mpg)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm") +
  facet_wrap(~manufacturer, nrow = 3) +
  theme(legend.position = "none")
In this code, we first load the “ggplot2” and “dplyr” libraries, and then load the mpg dataset
that comes with “ggplot2”. We then use the “ggplot()” function to create a base plot with the
“displ” and hwy variables mapped to the x and y axes, respectively. We then add multiple
layers to the plot, including a point layer with color mapped to the ‘class’ variable, a smoothed
line layer using a linear regression method, and a faceted layer that displays separate panels for
each manufacturer in the dataset. Finally, we use the ‘theme()’ function to remove the legend
from the plot.
Overall, by using tools like ‘gridExtra’ and ‘ggplot2’, you can create more complex and
sophisticated visualisations in R that can help to better communicate your data and insights.
6. Font Embedding
When creating visualisations in R, it is sometimes necessary to use custom fonts to achieve a
desired style or aesthetic. However, when sharing your visualisations with others, it is
important to ensure that the fonts are embedded in the graphics to ensure that they are displayed
correctly on other systems.
R provides several options for embedding fonts in graphics, including using the extrafont
package or the Cairo graphics device. The extrafont package allows you to install and use
custom fonts in R, and also provides functions for embedding fonts in graphics. The Cairo
graphics device, on the other hand, provides a high-quality graphics output that can be used to
embed fonts in graphics in a platform-independent way.
For example, the following code demonstrates how to use the Cairo graphics device to create
a plot with a custom font and embed the font in the output:
Example code:
library(ggplot2)
library(extrafont)

# Import the custom font file (myfont.ttf) and load it into R's font database
font_import(pattern = "myfont", prompt = FALSE)
loadfonts()

# Set up plot with custom font
p <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  ggtitle("My Plot") +
  theme(plot.title = element_text(family = "MyFont", size = 20))

# Output with the Cairo graphics device and save
library(Cairo)
CairoPDF("myplot.pdf", width = 6, height = 4)
print(p)
dev.off()
In this code, we first load the ggplot2 and extrafont libraries, and then use the font_import()
function to import a custom font file (myfont.ttf) into R. We then use the loadfonts() function
to load the font into R's font database.
Next, we create a ggplot2 plot (p) that uses the custom font in the plot title by specifying the
font family as "MyFont". We then use the CairoPDF() function to create a PDF output file
(myplot.pdf) with the Cairo graphics device. This function takes several arguments, including
the file name and the width and height of the output in inches.
Finally, we use the print() function to print the plot to the Cairo graphics device, and then use
the dev.off() function to close the graphics device and save the output to the file.
Overall, embedding fonts in R graphics can help to ensure that your visualizations are displayed
correctly on other systems and can help to maintain a consistent style or aesthetic across your
work.
7. Output with Cairo PDF
To create a PDF output with the Cairo graphics device, you can use the CairoPDF() function,
which takes several arguments including the output file name and the width and height of the
output in inches. Here's an example:
Example code:
library(ggplot2)
library(Cairo)

# Create plot
p <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  ggtitle("My Plot")

# Write the plot to a 6 x 4 inch PDF file
CairoPDF("myplot.pdf", width = 6, height = 4)
print(p)
dev.off()
This code creates a simple scatter plot using ggplot2, and then outputs the plot to a PDF file
(myplot.pdf) using the CairoPDF() function. The output file will be 6 inches wide by 4 inches
tall; because PDF is a vector format, no raster resolution (dpi) needs to be specified.
8. Unicode in figures
Unicode is a standard for encoding characters and symbols from various writing systems,
including Latin, Cyrillic, Arabic, Chinese, and more. In R, Unicode characters can be used in
graphics to add symbols or text in various languages.
To use Unicode in R graphics, you can use the expression() function or the bquote() function
to create expressions that include Unicode characters. Here's an example:
Example code:
library(ggplot2)

# Scatter plot whose title uses Greek letters via a plotmath expression
p <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  ggtitle(expression(paste(mu, " = 0, ", sigma^2, " = 1")))

# Display plot
print(p)
In this code, we create a scatter plot using ggplot2, and then add a plot title that includes
Unicode symbols using the expression() function. The title includes the Greek letter "mu", the
symbol for "equals", and the Greek letter "sigma" with a superscript "2". When the plot is
displayed, the Unicode symbols are rendered correctly.
9. Colour settings
Colour settings are an important aspect of creating effective and visually appealing
visualisations in R. R provides a wide range of built-in colour palettes, as well as functions for
creating custom colour palettes.
For example, the ggplot2 package provides several built-in colour palettes that can be used to
colourize plots, such as scale_color_brewer() and scale_color_gradient(). Here's an example of
how to use the scale_color_brewer() function to colourize a scatter plot:
Example code:
library(ggplot2)

# Scatter plot with points coloured by number of cylinders
p <- ggplot(mtcars, aes(x = mpg, y = wt, colour = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Set1")

# Display plot
print(p)
In this code, we create a scatter plot using ggplot2 and colour the points by the number of
cylinders using the colour aesthetic. We then use the scale_color_brewer() function to apply a
colour palette from the ColorBrewer library to the plot. The palette argument specifies which
ColorBrewer palette to use.
10. R packages and functions related to visualisation
ggplot2: ggplot2 is a widely used package for creating elegant and customizable data
visualisations. It uses a grammar of graphics approach, which allows you to specify the
components of a plot separately (e.g., data, aesthetics, geometric objects, and statistical
transformations) and then combine them into a final plot.
lattice: lattice is another popular package for creating data visualisations, particularly for
multivariate data. It provides a range of high-level plotting functions for creating trellis plots,
which display multiple panels of data arranged in a grid.
plotly: plotly is an interactive visualisation library that allows you to create interactive web-
based plots in R. It provides a wide range of chart types, including scatter plots, line charts, bar
charts, heatmaps, and more.
ggvis: ggvis is a package for creating interactive visualisations using ggplot2 syntax. It uses
reactive programming to enable linked brushing and filtering, which allows you to dynamically
update visualisations based on user input.
leaflet: leaflet is a package for creating interactive maps in R. It provides a wide range of
options for customising maps, including base maps, markers, pop ups, and overlays.
dygraphs: dygraphs is a package for creating interactive time series plots in R. It provides a
range of options for customising time series plots, including zooming, panning, and
highlighting.
cowplot: cowplot is a package for creating complex plots by combining multiple plots together
into a single figure. It provides functions for arranging and annotating plots, as well as for
customising the appearance of the final figure.
viridis: viridis is a package for creating visually appealing color maps in R. It provides a range
of colour maps that are designed to be perceptually uniform and easy to interpret, as well as
functions for customising the appearance of colour maps.
patchwork: patchwork is a package for creating complex plots by combining multiple plots
together into a single figure. It provides a flexible grammar for arranging and annotating plots,
as well as for customising the appearance of the final figure.
gganimate: gganimate is a package for creating animated plots using ggplot2 syntax. It
provides a range of options for customising the appearance and behaviour of animated plots,
as well as for controlling the animation speed and direction.
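As one small illustration of these packages, the following sketch applies a viridis colour map to a ggplot2 scatter plot (the variable mapping is illustrative):

```r
library(ggplot2)
library(viridis)

# Colour points by horsepower using a perceptually uniform colour map
p <- ggplot(mtcars, aes(x = mpg, y = wt, colour = hp)) +
  geom_point(size = 3) +
  scale_color_viridis()
print(p)
```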
Unit 2: Descriptive Analysis using R
This article explains how to compute the main descriptive statistics in R and how to present
them graphically. To learn more about the reasoning behind each descriptive statistics, how
to compute them by hand and how to interpret them, read the article “Descriptive statistics
by hand”.
To briefly recap what has been said in that article, descriptive statistics (in the broad sense
of the term) is a branch of statistics aiming at summarising, describing and presenting a
series of values or a dataset. Descriptive statistics is often the first step and an important
part in any statistical analysis. It allows us to check the quality of the data and it helps to
“understand” the data by having a clear overview of it. If well presented, descriptive
statistics is already a good starting point for further analyses. There exist many measures
to summarise a dataset. They are divided into two types:
Location measures give an understanding of the central tendency of the data, whereas
dispersion measures give an understanding of the spread of the data. In this article, we
focus only on the implementation in R of the most common descriptive statistics and their
visualisations (when deemed appropriate).
Description                              R function
Mean                                     mean()
Standard deviation                       sd()
Variance                                 var()
Minimum                                  min()
Maximum                                  max()
Median                                   median()
Range of values (minimum and maximum)    range()
Sample quantiles                         quantile()
Generic summary                          summary()
Interquartile range                      IQR()
summary() function
The summary() function in R is a powerful tool for generating descriptive statistics of a given
object, such as a vector, data frame, or statistical model. It provides a concise summary of the
central tendency, dispersion, and distribution of the data. Here's an overview of how the
summary() function works:
Syntax:
summary(object)
Parameters:
object: The R object for which you want to generate the summary statistics, such as a vector,
data frame, or statistical model.
Usage:
The summary() function automatically detects the type of the object and provides an
appropriate summary based on its class. Here are a few examples:
# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
summary(numeric_vector)
Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2       3       3       4       5
# Data frame
data_frame <- data.frame(Age = c(25, 30, 35, 40, 45),
Height = c(165, 170, 175, 180, 185),
Weight = c(60, 65, 70, 75, 80))
summary(data_frame)
Output:
      Age         Height        Weight
 Min.   :25   Min.   :165   Min.   :60
 1st Qu.:30   1st Qu.:170   1st Qu.:65
 Median :35   Median :175   Median :70
 Mean   :35   Mean   :175   Mean   :70
 3rd Qu.:40   3rd Qu.:180   3rd Qu.:75
 Max.   :45   Max.   :185   Max.   :80
sapply() function
The sapply() function in R is used to apply a given function to each element of a vector or list
and returns a simplified version of the result. It is a convenient way to apply functions to
multiple elements simultaneously and obtain the output in a compact format.
Syntax:
sapply(X, FUN, ...)
Parameters:
X: A vector or list.
FUN: The function to apply to each element of X.
Usage:
Here's an example of using sapply() to apply the mean() function to each column of a data
frame:
R code
# Data frame
data <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))
# Applying mean() function to each column
result <- sapply(data, mean)
# Output
result
Output:
A B C
2 5 8
In this example, the mean() function is applied to each column of the data data frame using
sapply(). The result is a vector containing the means of each column.
stat.desc() function:
The stat.desc() function is not a built-in function in R, but it is available in some packages, such
as "pastecs" or "psych". It provides a comprehensive summary of descriptive statistics for a
numeric vector or a data frame, including measures such as mean, median, standard deviation,
minimum, maximum, quartiles, skewness, and kurtosis.
Syntax:
stat.desc(x, basic = TRUE, desc = TRUE, norm = FALSE, p = 0.95)
Parameters:
x: A numeric vector or data frame.
basic: Logical value indicating whether to include basic statistics (number of values, number
of null and missing values, min, max, range, sum). Default is TRUE.
desc: Logical value indicating whether to include additional descriptive statistics (variance,
skewness, etc.). Default is TRUE.
norm: Logical value indicating whether to include tests of normality. Default is FALSE.
p: Confidence level for the confidence intervals. Default is 0.95.
Usage:
Here's an example using the stat.desc() function from the "pastecs" package to compute
descriptive statistics for a numeric vector:
R code
install.packages("pastecs")
library(pastecs)
# Numeric vector
vector <- c(1, 2, 3, 4, 5)
# Compute descriptive statistics
result <- stat.desc(vector)
# Output
result
Output (the first values returned; further statistics such as median, mean, var and std.dev follow):
 nbr.val nbr.null  nbr.na     min     max   range     sum
       5        0       0       1       5       4      15
Case of missing values
1. Types of missingness:
● Missing Completely at Random (MCAR): The missingness is unrelated to any other
variables or the observed data. It is a random occurrence.
● Missing at Random (MAR): The missingness is related to other observed variables but
not to the missing values themselves.
● Missing Not at Random (MNAR): The missingness is related to the missing values
themselves, which are typically related to unobserved factors or reasons.
2. Identifying missing values:
● The is.na() function in R can be used to identify missing values. It returns a logical
vector with TRUE for missing values and FALSE for non-missing values.
● Other functions like complete.cases() and anyNA() can also be used to check for
missing values in vectors or data frames.
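A minimal sketch of these identification functions:

```r
# Vector containing a missing value
x <- c(1, NA, 3)

print(is.na(x))           # FALSE  TRUE FALSE
print(complete.cases(x))  # TRUE FALSE  TRUE
print(anyNA(x))           # TRUE
```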
3. Handling missing values:
● Removing missing values: If missing values are minimal and do not significantly affect
the analysis, you can remove the observations or variables with missing values using
functions like na.omit() or subsetting techniques.
● Imputing missing values: Imputation involves estimating missing values based on
observed data. Common imputation methods include mean imputation, median
imputation, hot-deck imputation, regression imputation, and multiple imputation.
● Creating a missing indicator: Instead of imputing or removing missing values, you can
create a binary indicator variable that denotes the presence or absence of missing values
for each observation or variable.
4. Packages and built-in options:
● R packages like "mice," "Amelia," and "missForest" provide comprehensive tools for
imputing missing values using various algorithms.
● Many R functions have built-in options to handle missing values. For example, the
na.rm argument in functions like mean(), median(), and sum() can be set to TRUE to
exclude missing values from calculations.
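For example, the na.rm option and na.omit() can be sketched as:

```r
x <- c(1, NA, 3)

# Aggregation functions propagate NA by default
print(mean(x))               # NA

# Setting na.rm = TRUE excludes the missing value from the calculation
print(mean(x, na.rm = TRUE)) # 2
print(sum(x, na.rm = TRUE))  # 4

# na.omit() drops the missing entries entirely
print(na.omit(x))
```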
5. Sensitivity analysis:
● It is important to consider the potential impact of missing values on the validity of the
analysis. Conducting sensitivity analyses by comparing results with and without
imputation can help assess the robustness of the findings.
Handling missing values requires careful consideration and should be based on the specific
dataset and research question. It is important to understand the nature of missingness, select
appropriate techniques, and interpret the results appropriately to ensure the accuracy and
validity of data analyses in R.
Descriptive statistics by groups
1. Grouping variable:
A grouping variable is a categorical variable that defines the groups for which you want to
compute descriptive statistics. It divides the data into distinct subsets based on its different
levels or categories.
2. Split-Apply-Combine strategy:
The most common approach to compute descriptive statistics by groups in R is the split-apply-
combine strategy. It involves splitting the data into groups, applying a summary function to
each group, and then combining the results.
3. Base R functions:
● aggregate(): This function allows you to compute summary statistics by group using a
formula syntax. It takes a formula specifying the variable(s) to summarise and the
grouping variable(s), along with the summary function to apply.
● by(): The by() function applies a function to subsets of a data frame split by a factor
variable. It takes the data frame, the grouping variable(s), and the function to apply as
arguments.
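A sketch of both base R approaches using the built-in mtcars dataset:

```r
data(mtcars)

# aggregate(): mean mpg for each number of cylinders, via formula syntax
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(agg)

# by(): apply summary() to mpg within each cylinder group
res <- by(mtcars$mpg, mtcars$cyl, summary)
print(res)
```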
4. dplyr package:
The dplyr package provides a set of functions for data manipulation and transformation. It
offers an intuitive syntax for computing descriptive statistics by groups. Key functions include:
● group_by(): This function groups the data by one or more variables.
● summarise(): It applies summary functions to the grouped data to compute descriptive
statistics.
● %>% (pipe operator): It allows you to chain operations together, making the code
more readable and concise.
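Put together, the dplyr approach looks like this sketch:

```r
library(dplyr)
data(mtcars)

# Mean, standard deviation and count of mpg per number of cylinders
stats_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            sd_mpg   = sd(mpg),
            n        = n())
print(stats_by_cyl)
```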
These approaches enable you to compute various descriptive statistics, such as means, medians,
standard deviations, quantiles, counts, or proportions, for different groups within your data. By
examining these statistics, you can gain insights into the distribution and characteristics of
variables across different groups, facilitating comparisons and further analysis.
Simple frequency distribution: one categorical variable
1. Categorical Variable:
● A categorical variable is a variable that represents different categories or groups.
● It can be represented as a character, factor, or integer variable in R.
2. Creating the Frequency Distribution:
R code
# Create a vector with categorical data (illustrative values)
categories <- c("A", "B", "A", "C", "B", "A", "C", "B", "A", "C")
# Tabulate the frequency of each category
frequency <- table(categories)
print(frequency)
# Create a bar plot of the frequency distribution
barplot(frequency)
The bar plot will display bars of different heights corresponding to the frequency counts of
each category.
By creating a simple frequency distribution, you gain insights into the distribution and relative
frequencies of categories within a categorical variable, which can be helpful for exploratory
data analysis and understanding the characteristics of the data.
Two-way contingency table: two categorical variables
Categorical Variables:
● A two-way contingency table involves two categorical variables.
● Each variable consists of categories or levels that represent different groups or
attributes.
R code
# Two categorical variables
variable1 <- c("A", "B", "B", "C", "A", "A", "B", "C", "C", "C")
variable2 <- c("X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "X")
# Create and print the two-way contingency table
contingency <- table(variable1, variable2)
print(contingency)
# Stacked bar plot of the table
barplot(contingency, legend.text = TRUE)
The stacked bar plot will display bars representing each category of the first variable, with the
height of the bars representing the frequencies, and each bar subdivided into segments for each
category of the second variable.
By creating a two-way contingency table, you gain insights into the relationship and association
between two categorical variables, enabling further analysis and understanding of the data.
Multiway tables: more than two categorical variables
Categorical Variables:
Example (R code):
# Create three vectors with categorical data
variable1 <- c("A", "B", "B", "C", "A", "A", "B", "C", "C", "C")
variable2 <- c("X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "X")
variable3 <- c("P", "Q", "Q", "P", "R", "R", "R", "Q", "P", "Q")
# Create and print the multiway table of all three variables
multiway <- table(variable1, variable2, variable3)
print(multiway)
Interpreting the Multiway Table:
The resulting multiway table displays the joint frequencies or counts for all combinations of
categories across all variables.
The dimensions of the table correspond to the categories of each variable.
Example Output:
, , variable3 = P

         variable2
variable1 X Y
        A 1 0
        B 0 0
        C 1 1

, , variable3 = Q

         variable2
variable1 X Y
        A 0 0
        B 1 1
        C 1 1

, , variable3 = R

         variable2
variable1 X Y
        A 1 1
        B 1 0
        C 0 0
Visualising the Multiway Table:
Visualisations for multiway tables can become complex due to the involvement of multiple
variables.
Techniques such as mosaic plots, heatmaps, or stacked bar plots can be used to visualize the
joint distribution across multiple categorical variables.
The mosaic plot represents the joint distribution of categories across multiple variables using
areas proportional to the frequencies.
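A minimal sketch of a mosaic plot for the multiway table built above (the vectors are repeated here so the example is self-contained):

```r
variable1 <- c("A", "B", "B", "C", "A", "A", "B", "C", "C", "C")
variable2 <- c("X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "X")
variable3 <- c("P", "Q", "Q", "P", "R", "R", "R", "Q", "P", "Q")

multiway <- table(variable1, variable2, variable3)

# Tile areas are proportional to the joint frequencies
mosaicplot(multiway, main = "Joint distribution", color = TRUE)
```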
By creating a multiway table, you gain a comprehensive understanding of the joint distribution
and relationships among multiple categorical variables. It enables you to explore and analyse
complex associations within your data.
Unit 3: Visualisation of Data in R
Bar Chart Simple, Bar Chart with Multiple Response Questions, Column Chart
with two-line labeling, Column chart with 45° labeling, Profile Plot, Dot Chart
for 3 variables, Pie Chart and Radial Diagram, Chart Tables, Distributions:
Histogram overlay, Box Plots for group, Pyramids with multiple colors, Pyramid:
emphasis on the outer and inner area, Pyramid with added line, Aggregated
Pyramids, Simple Lorenz curve
R - code
# Create a vector with categorical data
categories <- c("A", "B", "C", "D", "E")
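The explanation below also refers to a frequencies vector and a barplot() call; a minimal sketch (the frequency values are illustrative assumptions, not from the source):

```r
# Categories and illustrative frequencies
categories <- c("A", "B", "C", "D", "E")
frequencies <- c(12, 8, 15, 5, 10)

# Draw the bar chart, labelling the x-axis with the categories
barplot(frequencies, names.arg = categories,
        main = "Simple Bar Chart",
        xlab = "Category", ylab = "Frequency",
        col = "steelblue")
```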
In the code above, we first define a vector categories containing the different categories, and a
vector frequencies containing the corresponding frequencies or values for each category. We
then use the barplot() function to create the bar chart. The frequencies vector is passed as the
first argument, and the names.arg parameter is used to label the x-axis with the categories.
You can further customize the bar chart by modifying additional parameters. For example, you
can set the title of the chart using the main parameter, adjust the colors of the bars using the col
parameter, add grid lines using the grid parameter, and more. Refer to the documentation of
the barplot() function for a full list of available customization options.
By creating a simple bar chart, you can visually compare the frequencies or values of different
categories, making it easier to understand the distribution and relative magnitudes of your data.
R - code
# Create a data frame with the responses
data <- data.frame(
Respondent = c(1:10),
Option1 = c(1, 1, 0, 1, 0, 1, 0, 0, 1, 1),
Option2 = c(1, 0, 1, 0, 1, 1, 0, 1, 0, 1),
Option3 = c(0, 1, 1, 0, 1, 0, 1, 1, 0, 0)
)
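The frequency calculation and bar chart described below can be sketched as follows (repeating the data frame so the snippet runs on its own; the colors are illustrative):

```r
# Responses: 1 = option selected, 0 = not selected
data <- data.frame(
  Respondent = 1:10,
  Option1 = c(1, 1, 0, 1, 0, 1, 0, 0, 1, 1),
  Option2 = c(1, 0, 1, 0, 1, 1, 0, 1, 0, 1),
  Option3 = c(0, 1, 1, 0, 1, 0, 1, 1, 0, 0)
)

# Count how many respondents selected each option
option_frequencies <- colSums(data[, -1])

# One bar per option, coloured and labelled
barplot(option_frequencies,
        names.arg = names(option_frequencies),
        col = c("tomato", "gold", "seagreen"),
        xlab = "Option", ylab = "Number of respondents")
```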
In this example, we have a dataframe data with the respondent ID in the first column
(Respondent) and subsequent columns (Option1, Option2, Option3) representing the different
options. Each cell indicates whether the respondent selected that option (1) or not (0).
We use the colSums() function to calculate the frequencies of each option. By applying it to
data[, -1], we exclude the respondent ID column from the calculation. The result is stored in
the option_frequencies vector.
Next, we create a bar chart using the barplot() function, passing option_frequencies as the
first argument so that each option gets its own bar. The col parameter assigns colors to the
bars, and names.arg labels the x-axis with the option names. The xlab and ylab parameters set
the labels for the x-axis and y-axis, respectively.
The resulting bar chart will display one bar per option, with the bar heights showing how many
respondents selected each option. The different colors differentiate between the options,
allowing for easy comparison of frequencies across the categories.
R - code
# Create a vector with categories
categories <- c("Category 1", "Category 2", "Category 3")
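The complete column chart, with the values and the two-line x-axis labels produced by axis(), can be sketched as follows (the values and the second label lines are illustrative assumptions):

```r
categories <- c("Category 1", "Category 2", "Category 3")
values <- c(25, 40, 32)

# Draw the columns without default labels; barplot() returns
# the x-positions of the bar midpoints
midpoints <- barplot(values, xlab = "", ylab = "Value",
                     main = "Column Chart with Two-Line Labels")

# Add two-line labels: "\n" splits each label across two lines
axis(1, at = midpoints, tick = FALSE, las = 1, padj = 0.5,
     labels = c("Category 1\n(first)", "Category 2\n(second)",
                "Category 3\n(third)"))
```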
In this example, we define a vector categories representing the different categories and a vector
values containing the corresponding values for each category.
We create a column chart using the barplot() function, passing the values vector as the first
argument and the categories vector as the names.arg parameter. The xlab and ylab parameters
set the labels for the x-axis and y-axis, respectively. The main parameter sets the title of the
chart.
To add two-line labeling on the x-axis, we use the axis() function. The at parameter specifies
the position of the axis labels, and labels assigns custom labels to those positions. In this case,
we provide a vector with two elements, where the first element includes a line break (\n) to
split the label into two lines. The las parameter sets the orientation of the axis labels, with 1
indicating horizontal orientation.
By customizing the axis labels with two lines, you can provide more detailed information or
add line breaks to improve the readability of the labeling in your column chart.
R - code
# Create a vector with categories
categories <- c("Category 1", "Category 2", "Category 3")
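A working sketch of the chart with 45-degree labels, using text() with srt = 45 to rotate the labels, since base R's las parameter only offers horizontal or perpendicular orientations (the category names and values are illustrative assumptions):

```r
categories <- c("First category", "Second category", "Third category")
values <- c(18, 27, 22)

# Draw the columns without x-axis labels
midpoints <- barplot(values, ylab = "Value",
                     main = "Column Chart with 45-Degree Labels")

# Draw the labels just below the axis, rotated by 45 degrees;
# xpd = TRUE allows drawing in the margin outside the plot region
text(x = midpoints, y = par("usr")[3] - 1, labels = categories,
     srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
```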
In this example, we define a vector categories representing the different categories and a vector
values containing the corresponding values for each category.
We create a column chart using the barplot() function, passing the values vector as the first
argument and the categories vector as the names.arg parameter. The xlab and ylab parameters
set the labels for the x-axis and y-axis, respectively. The main parameter sets the title of the
chart.
To rotate the x-axis labeling, base R's las parameter only supports horizontal (las = 1) or
perpendicular (las = 2) labels. To rotate the labels by exactly 45 degrees, we suppress the
default axis labels and draw them with the text() function, where the srt parameter sets the
rotation angle to 45 and the cex parameter adjusts the size of the labels.
By customizing the axis labels with a 45-degree rotation, you can fit longer category names or
improve the readability of the labeling in your column chart. Adjust the cex parameter to
control the size of the labels according to your preferences.
Profile Plot
A profile plot is a visualization technique used to display the change in a continuous variable
across different levels of one or more categorical variables. In R, you can create profile plots
using various packages, such as ggplot2 or lattice.
Here's an example of how to create a profile plot using the ggplot2 package in R:
R - code
# Load the ggplot2 package
library(ggplot2)
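A complete sketch of the profile plot described below (the dataset values are illustrative assumptions):

```r
library(ggplot2)

# Continuous variable measured at three levels for two categories
data <- data.frame(
  Category = rep(c("Group A", "Group B"), each = 3),
  Level = factor(rep(c("Low", "Medium", "High"), times = 2),
                 levels = c("Low", "Medium", "High")),
  Value = c(5, 9, 14, 7, 6, 11)
)

# One line per category across the levels
p <- ggplot(data, aes(x = Level, y = Value,
                      group = Category, color = Category)) +
  geom_line() +
  geom_point() +
  labs(x = "Level", y = "Value", title = "Profile Plot") +
  theme_minimal()
print(p)
```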
In this example, we have a dataset data with three variables: Category, Level, and Value. The
Category variable represents the different groups or categories, the Level variable represents
the levels within each category, and the Value variable represents the continuous variable of
interest.
We use the ggplot() function to create a plot object and specify the dataset. The aes() function
defines the aesthetic mappings, where we map Level to the x-axis (x), Value to the y-axis (y),
Category to the grouping (group), and Category to the color aesthetic.
We then add geom_line() and geom_point() layers to draw the lines and points for each
category. The labs() function is used to set the labels for the x-axis, y-axis, and plot title.
Finally, we apply the theme_minimal() theme to style the plot.
The resulting plot will display a line for each category, showing the change in the continuous
variable (Value) across the different levels (Low, Medium, High). The points represent the
actual data points for each level and category, while the lines connect them to visualize the
overall trend or profile.
R - code
# Create a sample dataset
data <- data.frame(
Category = c("A", "B", "C"),
Variable1 = c(10, 15, 8),
Variable2 = c(12, 9, 14),
Variable3 = c(7, 13, 16)
)
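The dot chart described below can be sketched with stripchart() (repeating the data frame so the snippet runs on its own; the colors are illustrative):

```r
data <- data.frame(
  Category = c("A", "B", "C"),
  Variable1 = c(10, 15, 8),
  Variable2 = c(12, 9, 14),
  Variable3 = c(7, 13, 16)
)

# Dot chart of the three variables; overlapping dots are stacked
stripchart(data[, -1], method = "stack", pch = 19,
           col = c("red", "blue", "darkgreen"),
           xlim = c(0, max(data[, -1])),
           xlab = "Value", main = "Dot Chart for 3 Variables")
```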
In this example, we have a dataset data with four columns: Category and three variables
(Variable1, Variable2, Variable3). Each row represents a different category (A, B, C), and the
variables hold the corresponding values.
We use the stripchart() function to create a dot chart. The first argument, data[, -1], selects the
variables to plot (excluding the Category column). The method parameter is set to "stack" to
stack the dots for overlapping points. The pch parameter sets the point shape to a solid circle,
and the col parameter assigns different colors to each variable.
We set the x-axis limits (xlim) based on the maximum value of the variables. The xlab and ylab
parameters label the x-axis and y-axis, respectively. Finally, the main parameter sets the title
of the plot.
The resulting dot chart will display a stack of dots for each category, with each dot representing
a data point for a particular variable. The different colors differentiate the variables, allowing
for easy comparison between categories and variables.
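The pie chart example explained below can be sketched as follows (the category names and values are illustrative assumptions):

```r
categories <- c("Rent", "Food", "Travel", "Other")
values <- c(40, 25, 20, 15)

# Pie chart with category labels and a title
pie(values, labels = categories, main = "Pie Chart")
```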
In this example, we define a vector categories representing the category names and a vector
values containing the corresponding values for each category.
We use the pie() function to create the pie chart, passing the values vector as the first argument
and the labels parameter to assign the category names as labels. The main parameter sets the
title of the chart.
To create a radial diagram, you can use the radial.plot() function from the plotrix
package.
Here's an example: R - code
# Install and load the plotrix package
install.packages("plotrix")
library(plotrix)
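A minimal sketch of the radial diagram (the values are illustrative assumptions; rp.type = "p" draws a polygon through the points):

```r
library(plotrix)

categories <- c("A", "B", "C", "D", "E")
values <- c(3, 7, 5, 9, 4)

# Radial diagram: values plotted on a circular axis
radial.plot(values, labels = categories,
            rp.type = "p", line.col = "blue",
            main = "Radial Diagram")
```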
In this example, we first install and load the plotrix package, which provides the radial.plot()
function.
We define the categories vector with the category names and the values vector with the
corresponding values.
We use the radial.plot() function to create the radial diagram, passing the values vector as the
first argument and the labels parameter to assign the category names as labels. The line.col
parameter sets the color of the lines connecting the points, and the main parameter sets the title
of the diagram.
Both the pie chart and radial diagram provide a visual representation of data, with the pie chart
showing proportions of a whole and the radial diagram displaying values on a circular axis.
Choose the appropriate visualization based on the nature and purpose of your data.
Chart Tables
In R, you can create chart tables, which are tabular representations of data with additional
formatting and visual elements. There are several packages available in R that provide
functions to create chart tables, including gt, flextable, and kableExtra. Here's an example using
the gt package:
R - code
# Install and load the gt package
install.packages("gt")
library(gt)
# Create a gt table
table <- gt(data)
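Putting the full pipeline together as described below, here written with R's native |> pipe (the %>% pipe from magrittr works the same way). The data values are illustrative assumptions, and saving the table as an image with gtsave() additionally requires the webshot2 package, so that call is shown commented out:

```r
library(gt)

# Sample data (illustrative values)
data <- data.frame(
  Category = c("A", "B", "C"),
  Value1 = c(10.25, 15.5, 8.75),
  Value2 = c(12.1, 9.8, 14.3),
  Value3 = c(7.6, 13.2, 16.9)
)

# Build and format the chart table
table <- gt(data) |>
  tab_header(title = "Chart Table Example") |>
  fmt_number(columns = 2:4, decimals = 1)

# Save the formatted table as an image
# gtsave(table, "chart_table.png")
```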
In this example, we create a sample dataset data with four columns: Category, Value1, Value2,
and Value3.
We use the gt() function to create a gt table object from the data.
To format the table, we chain multiple functions using the %>% operator from the magrittr
package. The tab_header() function sets the title of the table. The fmt_number() function is
used to format numeric columns with a specified number of decimal places. In this case, we
format columns 2 to 4 with 1 decimal place. Related functions such as fmt_currency() and
fmt_percent() apply currency or percentage formatting in the same way. To apply additional
text styling, such as making specific cells bold, use the tab_style() function.
Finally, we save the table as an image file using the gtsave() function from the gt package.
This is just one example of creating a chart table in R using the gt package. You can explore
other packages like flextable and kableExtra for additional features and customization options
to create chart tables that suit your specific needs.
A histogram overlay is a useful visualization technique that allows you to compare
the distributions of multiple variables by overlaying their histograms on a single plot. In R, you
can achieve this using the ggplot2 package. Let's go through an example step by step:
R - code
# Load the ggplot2 package
library(ggplot2)
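The overlay code further below expects the data in long format: a value column holding every observation and a variable column naming which series each observation belongs to. A sketch of that setup (the means and standard deviations are illustrative assumptions):

```r
set.seed(42)  # for reproducibility

# 1000 draws per variable, stacked into long format
data <- data.frame(
  value = c(rnorm(1000, mean = 0, sd = 1),
            rnorm(1000, mean = 2, sd = 1.5)),
  variable = rep(c("Variable1", "Variable2"), each = 1000)
)
```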
In this example, we first load the ggplot2 package, which provides functions for creating data
visualizations. Then, we create a random dataset called data with two variables: Variable1 and
Variable2. We generate 1000 random values for each variable using the rnorm() function,
where rnorm() creates random numbers following a normal distribution. The mean and
standard deviation are specified to control the characteristics of the distributions.
R - code
# Create a histogram overlay using ggplot2
ggplot(data, aes(x = value, fill = variable)) +
geom_histogram(alpha = 0.5, bins = 30, color = "black", position = "identity") +
labs(x = "Value", y = "Frequency", title = "Histogram Overlay") +
theme_minimal()
In this code snippet, we use the ggplot() function to create a plot object and specify the dataset
data as the data source. The aes() function (short for aesthetics) is used to map the variables to
visual elements. We map the values of both Variable1 and Variable2 to the x-axis (x) using
value, and we assign the variable to the fill aesthetic, which will be used to differentiate the
two histograms.
The geom_histogram() function is used to create the histograms. We set alpha = 0.5 to make
the histograms semi-transparent, allowing both distributions to be visible. The bins parameter
controls the number of bins in the histograms, determining the granularity of the distribution
representation. The color parameter sets the outline color of the histograms.
We use the labs() function to set the x-axis label, y-axis label, and plot title. In this case, the x-
axis is labeled as "Value", the y-axis as "Frequency", and the title as "Histogram Overlay".
Finally, we apply the theme_minimal() theme to style the plot with a clean and minimalistic
appearance.
When you run this code, you will obtain a histogram overlay plot showing the distributions of
Variable1 and Variable2 overlaid on a single chart. The transparency and distinct colors help
to visualize and compare the two distributions. You can modify the code to suit your specific
data and visualization requirements, such as adjusting the number of bins, colors, or plot labels.
R - code
# Load the ggplot2 package
library(ggplot2)
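The box plot code below expects a data frame with Group and Value columns; a sketch of that dataset (the distribution parameters are illustrative assumptions):

```r
set.seed(123)

# Three groups of 100 observations each
data <- data.frame(
  Group = rep(c("Group A", "Group B", "Group C"), each = 100),
  Value = c(rnorm(100, mean = 5), rnorm(100, mean = 7),
            rnorm(100, mean = 6))
)
```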
# Create a box plot by group using ggplot2
ggplot(data, aes(x = Group, y = Value)) +
geom_boxplot() +
labs(x = "Group", y = "Value", title = "Box Plot by Group") +
theme_minimal()
In this example, we create a sample dataset called data with two columns: Group and Value.
Each group has 100 observations, and the Value column contains continuous variable values.
Using the ggplot() function, we create a plot object and specify the dataset data as the data
source. The aes() function maps the Group variable to the x-axis (x) and the Value variable to
the y-axis (y).
We then use geom_boxplot() to create the box plots. This function automatically computes and
visualizes the five-number summary (minimum, lower quartile, median, upper quartile, and
maximum) of the Value variable for each group.
The labs() function is used to set the x-axis label, y-axis label, and plot title. In this example,
the x-axis is labeled as "Group", the y-axis as "Value", and the title as "Box Plot by Group".
Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.
When you run this code, you will obtain a box plot showing the distribution of the Value
variable for each group. Each box plot represents the range and quartiles of the variable values
within each group, allowing for easy comparison of distributions between groups. You can
further customize the plot by modifying the labels, adding color or grouping options, or
adjusting the plot theme to suit your specific requirements.
R - code
library(ggplot2)
# Create data for the pyramids
categories <- c('Category 1', 'Category 2', 'Category 3')
values <- c(10, 20, 30)
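The full plot described below can be sketched as:

```r
library(ggplot2)

categories <- c("Category 1", "Category 2", "Category 3")
values <- c(10, 20, 30)
data <- data.frame(Category = categories, Value = values)

# Horizontal bars with one colour per category
p <- ggplot(data, aes(x = Category, y = Value, fill = Category)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_flip() +
  scale_fill_manual(values = c("red", "green", "blue")) +
  labs(x = "", y = "Value", title = "Pyramid with Multiple Colors") +
  theme_minimal()
print(p)
```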
We create the data for the pyramids with the categories and corresponding values.
Using ggplot(), we specify the data frame as the data source and map the Value variable to the
y-axis (y), and the Category variable to the fill aesthetic.
We use geom_bar() with stat = 'identity' to create the pyramids. The width = 1 argument sets
the width of the bars, and the color = 'black' argument adds a black border around each bar.
To create the pyramid effect, we use coord_flip() to flip the x and y axes, so the bars run horizontally.
The scale_fill_manual() function allows us to manually specify the colors for each category. In
this example, we assign the colors 'red', 'green', and 'blue' to the categories.
We use labs() to set the x-axis label to an empty string, the y-axis label to 'Value', and the plot
title to 'Pyramid with Multiple Colors'.
Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.
When you run this code, you will obtain pyramids with multiple colors. Each category is
represented by a segment of the pyramid, and the colors assigned to each category are specified
using the scale_fill_manual() function. You can modify the categories, values, colors, and other
parameters to create pyramids with multiple colors that suit your specific data and visualization
requirements.
To create a pyramid chart in R with emphasis on the outer and inner areas, you can use the
plotrix package. This package provides functions to create specialized plots, including pyramid
charts. Here's an example:
R - code
library(plotrix)
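A working sketch of pyramid.plot(); all values and labels here are illustrative assumptions:

```r
library(plotrix)

# Values for the left and right sides, and the centre labels
left   <- c(10, 15, 12, 8)
right  <- c(9, 14, 13, 10)
labels <- c("0-19", "20-39", "40-59", "60+")

# lxcol/rxcol emphasise the two sides with light shades
pyramid.plot(left, right, labels = labels,
             top.labels = c("Left", "Age", "Right"),
             lxcol = "lightblue", rxcol = "lightpink", gap = 2,
             main = "Pyramid Chart with Emphasis on Outer and Inner Area")
```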
We use the pyramid.plot() function from the plotrix package to create the pyramid chart. Its
first two arguments are numeric vectors holding the values for the left and right sides of the
pyramid, and the labels argument supplies the category labels shown down the centre. The
lxcol and rxcol arguments set the colors of the left and right segments; using light shades such
as 'lightblue' and 'lightpink' places visual emphasis on the outer and inner areas of the
pyramid. The top.labels argument adds headings above the two sides, and the gap argument
controls the space reserved in the middle for the labels.
We use the main argument to set the main title of the chart to 'Pyramid Chart with Emphasis
on Outer and Inner Area'.
When you run this code, you will obtain a pyramid chart with emphasis on the outer and inner
areas. The colors and cumulative values help draw attention to the relative sizes of the
segments. You can modify the categories, values, colors, and other parameters to create a
pyramid chart that suits your specific data and visualization requirements.
R - code
library(plotrix)
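A sketch of the pyramid with an added reference line (all values are illustrative assumptions):

```r
library(plotrix)

left   <- c(10, 15, 12, 8)
right  <- c(9, 14, 13, 10)
labels <- c("0-19", "20-39", "40-59", "60+")

pyramid.plot(left, right, labels = labels, gap = 2,
             lxcol = "lightblue", rxcol = "lightpink",
             main = "Pyramid with Added Line")

# Add a horizontal reference line at the chosen y-coordinate
abline(h = 2.5, lwd = 2, col = "red")
```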
We use the pyramid.plot() function from the plotrix package to create the pyramid chart,
passing the left-side and right-side values as the first two arguments, the category labels as
labels, and the lxcol and rxcol arguments to set the segment colors.
After creating the pyramid chart, we add a horizontal line using the abline() function. The h
argument specifies the y-coordinate of the line, lwd sets the line width, and col defines the line
color. In this example, we use a red line.
When you run this code, you will obtain a pyramid chart with an added horizontal line. The
line can be used to indicate a specific value or a reference point within the chart. You can
modify the categories, values, colors, line properties, and other parameters to customize the
pyramid chart with the added line to suit your specific data and visualization requirements.
Aggregated Pyramids
To create aggregated pyramids in R, you can use the ggplot2 package. Aggregated pyramids
allow you to compare two sets of data side by side, each represented as a pyramid. Here's an
example:
R - code
library(ggplot2)
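The data for the two groups can be set up as follows (the values are illustrative assumptions):

```r
# Values for two groups across three categories
data <- data.frame(
  Group = rep(c("Group 1", "Group 2"), each = 3),
  Category = rep(c("Category 1", "Category 2", "Category 3"),
                 times = 2),
  Value = c(10, 20, 30, 15, 25, 20)
)
```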
# Create the aggregated pyramids
ggplot(data, aes(x = Category, y = ifelse(Group == 'Group 1', -Value, Value),
fill = Group)) +
geom_bar(stat = 'identity', position = 'identity') +
coord_flip() +
scale_fill_manual(values = c('green', 'blue')) +
labs(x = '', y = 'Value', title = 'Aggregated Pyramids') +
theme_minimal()
We create the data for the pyramids with the categories and values for two groups: Group 1
and Group 2. We combine the data into a single data frame.
Using ggplot(), we specify the data frame as the data source and map the Category variable to
the x-axis (x), the signed Value variable to the y-axis (y), and the Group variable to the fill
aesthetic.
We use geom_bar() with stat = 'identity' to create the pyramids. The position = 'identity'
argument ensures that the bars are positioned according to the Value variable.
To create the aggregated effect, we use ifelse() within the aes() mapping to assign negative
values to Group 1 and positive values to Group 2, effectively mirroring the two pyramids
around zero.
The scale_fill_manual() function allows us to manually specify the colors. In this example, we
assign one color to each group.
We use labs() to set the x-axis label to an empty string, the y-axis label to 'Value', and the plot
title to 'Aggregated Pyramids'.
Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.
When you run this code, you will obtain aggregated pyramids showing the comparison between
two groups. Each group occupies one side of the pyramid, with the categories aligned back to
back. You can modify the categories, values, colors, and other parameters to create aggregated
pyramids that match your specific data and visualization requirements.
R - code
library(ggplot2)
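A complete sketch of the Lorenz curve (the cumulative shares are illustrative assumptions):

```r
library(ggplot2)

# Cumulative population percentiles and wealth shares
cumulative_perc  <- c(0, 0.2, 0.4, 0.6, 0.8, 1.0)
cumulative_share <- c(0, 0.05, 0.15, 0.30, 0.55, 1.0)

data <- data.frame(Percentile = cumulative_perc,
                   Share = cumulative_share)

# Lorenz curve drawn as a step function
p <- ggplot(data, aes(x = Percentile, y = Share)) +
  geom_step() +
  labs(x = "Cumulative Percentile", y = "Cumulative Share",
       title = "Lorenz Curve") +
  theme_minimal()
print(p)
```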
We create the data for the Lorenz curve with the cumulative percentiles (cumulative_perc) and
cumulative shares (cumulative_share). These values represent the cumulative percentages of
the population and their corresponding cumulative shares of wealth or income, respectively.
Next, we create a data frame using the data.frame() function, specifying the percentiles as the
Percentile variable and the shares as the Share variable.
Using ggplot(), we specify the data frame as the data source and map the Percentile variable to
the x-axis (x) and the Share variable to the y-axis (y).
We use geom_step() to create the Lorenz curve, which connects the points with horizontal and
vertical lines to represent the cumulative distribution of the share.
The labs() function is used to set the x-axis label to "Cumulative Percentile", the y-axis label
to "Cumulative Share", and the plot title to "Lorenz Curve".
Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.
When you run this code, you will obtain a simple Lorenz curve plot. The curve represents the
cumulative distribution of the share, with the x-axis representing the cumulative percentiles
and the y-axis representing the cumulative share. You can modify the data values, labels, and
plot aesthetics to create a Lorenz curve that reflects your specific data and visualization
requirements.
Unit 4: Introduction to Python
Jupyter Notebook, Python Functions, Python Types and Sequences, Python More
on Strings, Reading and Writing CSV files, Advanced Python Objects, map(),
NumPy, Pandas, Series Data Structure, Querying a Series, The DataFrame Data
Structure, DataFrame Indexing and Loading, Querying a DataFrame, Indexing
Dataframes, Merging Dataframes
Jupyter Notebook
Jupyter Notebook is an interactive coding environment that allows you to create and share
documents containing live code, visualizations, explanatory text, and more. It is particularly
popular among Python users, although it supports multiple programming languages.
Here are some key aspects and features of Jupyter Notebook when used with Python:
1. Notebook Structure: Jupyter Notebook is organized into cells, where each cell can
contain either code or markdown text. Code cells are where you write and execute
Python code, while markdown cells allow you to add formatted text, headings, lists,
and images to provide explanations and documentation.
2. Code Execution: You can execute code cells individually or all at once. When a code
cell is executed, the Python interpreter runs the code and displays the output below the
cell. This allows you to iteratively develop and test your code in a step-by-step manner.
3. Kernel: Jupyter Notebook uses a kernel, which is responsible for executing code in a
specific programming language. For Python, the IPython kernel is used by default,
providing additional features such as tab completion, object introspection, and rich
media display.
4. Data Exploration and Visualization: Jupyter Notebook integrates seamlessly with
popular Python libraries for data manipulation and visualization, such as Pandas,
NumPy, Matplotlib, Seaborn, and Plotly. You can easily load, analyze, and visualize
data within the notebook using these libraries.
5. Rich Media Display: Jupyter Notebook allows you to display various types of media
directly in the notebook, including images, audio, and video. You can even embed
interactive visualizations and widgets to create dynamic and engaging content.
6. Notebook Extensions: Jupyter Notebook offers a wide range of extensions that
enhance its functionality and customization. These extensions provide additional
features like code linting, code folding, table of contents, and more, making your coding
experience more efficient and enjoyable.
7. Collaboration and Sharing: Jupyter Notebook facilitates collaboration by allowing
you to share your notebooks with others. You can share notebooks as standalone files,
publish them on platforms like GitHub, or use services like Jupyter Notebook Viewer
to share notebooks online. This enables others to run your code, view your
visualizations, and understand your analysis.
Jupyter Notebook provides an interactive and flexible environment for Python programming,
data analysis, and scientific computing. It promotes a reproducible workflow by combining
code, documentation, and visualizations in a single document. With its rich features and broad
community support, Jupyter Notebook has become a popular choice among Python users for
data exploration, prototyping, and sharing computational research.
Python Functions
In Python, a function is a block of reusable code that performs a specific task or set of tasks.
Functions provide modularity, code organization, and code reusability in Python programs.
They allow you to encapsulate a piece of code into a named block, which can be called and
executed multiple times throughout the program. Here are some key aspects and features of
functions in Python:
1. Function Definition: To create a function in Python, you use the def keyword followed
by the function name, parentheses (), and a colon :. You can also specify parameters
within the parentheses if the function needs input values. The function definition block
is indented below the def statement.
2. Function Parameters: Parameters are placeholders for values that can be passed to a
function. They define the input requirements of the function. You can have zero or more
parameters in a function definition. Parameters can have default values, making them
optional when calling the function.
3. Function Body: The body of the function contains the statements that define the task
the function performs. It is indented under the function definition. You can include any
valid Python code within the function body, such as variable declarations, calculations,
conditionals, loops, and other function calls.
4. Return Statement: Functions can optionally return a value using the return statement.
The return statement specifies the value or values that the function should provide as
output. When the return statement is encountered, the function execution terminates,
and the specified value(s) are returned to the caller.
5. Function Call: To execute a function, you call it by using its name followed by
parentheses (). If the function has parameters, you pass the corresponding arguments
within the parentheses. The function call evaluates the code within the function body,
and any return value is available for further use.
6. Scope: Functions have their own scope, which means that variables defined within a
function are only accessible within that function. Similarly, variables defined outside
the function have a global scope and can be accessed from any part of the program.
7. Function Documentation: You can add documentation to functions using docstrings,
which are triple-quoted strings placed immediately after the function definition.
Docstrings provide information about the purpose of the function, its parameters, return
values, and any other relevant details. They are used to generate documentation and
provide help for users of the function.
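The points above can be sketched in a short example (the function name and values are illustrative):

```python
def greet(name, greeting="Hello"):
    """Return a greeting for name.

    greeting is an optional parameter with a default value.
    """
    return f"{greeting}, {name}!"

# Calling the function: positional and keyword arguments
print(greet("Ada"))                 # Hello, Ada!
print(greet("Ada", greeting="Hi"))  # Hi, Ada!
```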
Functions play a vital role in structuring Python programs and promoting code reusability.
They help in breaking down complex tasks into smaller, more manageable parts. By
encapsulating code within functions, you can improve the readability, maintainability, and
efficiency of your Python code.
In Python, types refer to the classification of data objects. Python has built-in types that define
the characteristics and behavior of objects. Sequences, on the other hand, are a specific type of
data structure that holds an ordered collection of elements. Let's dive into more detail about
Python types and sequences:
Python Types:
1. Numeric Types: Python includes numeric types such as integers (int), floating-point
numbers (float), and complex numbers (complex).
2. Boolean Type: The boolean type (bool) represents either True or False values, which
are used for logical operations and conditional statements.
3. Strings: The string type (str) represents sequences of characters enclosed in single or
double quotes. Strings are immutable, meaning they cannot be modified once created.
4. Lists: Lists (list) are ordered collections of objects enclosed in square brackets []. They
can contain objects of different types and are mutable, allowing you to modify, add, or
remove elements.
5. Tuples: Tuples (tuple) are similar to lists but are immutable, meaning they cannot be
modified once created. They are typically used to represent fixed collections of
elements.
6. Sets: Sets (set) are unordered collections of unique elements. They do not allow
duplicate values and provide operations like union, intersection, and difference.
7. Dictionaries: Dictionaries (dict) are key-value pairs enclosed in curly braces {}. They
allow you to store and retrieve values based on unique keys, providing efficient lookup
operations.
Python Sequences:
1. Lists: Lists are mutable sequences that can hold objects of any type. They maintain the
order of elements and allow indexing and slicing operations.
2. Tuples: Tuples are immutable sequences similar to lists. They are useful for
representing fixed collections of elements and can be accessed using indexing and
slicing.
3. Strings: Strings are sequences of characters. They can be indexed and sliced like other
sequences and provide various string manipulation methods.
4. Ranges: Ranges (range) represent a sequence of numbers and are commonly used in
loops for iterating a specific number of times.
Understanding Python types and sequences is crucial for effective programming as they
provide the foundation for storing, manipulating, and processing data in Python programs.
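A short example tying these types and sequences together (all values are illustrative):

```python
# Core built-in types and sequences in action
numbers = [3, 1, 4, 1, 5]        # list: mutable, ordered
point = (2.0, 3.5)               # tuple: immutable, ordered
unique = set(numbers)            # set: duplicates removed
ages = {"Ada": 36, "Alan": 41}   # dict: key -> value lookup

numbers.append(9)                # lists can be modified in place
print(sorted(unique))            # [1, 3, 4, 5]
print(ages["Ada"])               # 36
print(point[0])                  # 2.0
print(list(range(3)))            # [0, 1, 2]
```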
Python More on Strings
Strings are an important data type in Python that represent sequences of characters. They are
immutable, meaning that once a string is created, it cannot be modified. Here are some
additional concepts and operations related to strings in Python:
String Creation: Strings can be created using single quotes ('), double quotes ("), or triple
quotes (''' or """). Triple quotes are used for multiline strings.
Python - code
single_quote = 'Hello'
double_quote = "World"
multiline = '''This is a
multiline string'''
String Length: The len() function returns the length of a string, which is the number of
characters in the string.
Python - code
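A minimal sketch of the missing snippet, using an example string:

```python
text = "Hello, World"
print(len(text))  # 12
```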
String Indexing: Individual characters within a string can be accessed using index positions.
Indexing starts from 0 for the first character and goes up to length - 1 for the last character.
Python - code
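A minimal sketch of the missing snippet:

```python
text = "Python"
print(text[0])   # P  (first character)
print(text[5])   # n  (last character, at length - 1)
print(text[-1])  # n  (negative indices count from the end)
```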
String Slicing: Substrings can be extracted from a string using slicing. The syntax for slicing
is start_index:end_index. The resulting substring includes characters from start_index up to,
but not including, end_index.
Python - code
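A minimal sketch of the missing snippet:

```python
text = "Hello, World"
print(text[0:5])  # Hello  (index 0 up to, but not including, 5)
print(text[7:])   # World  (omitting end_index slices to the end)
print(text[:5])   # Hello  (omitting start_index slices from 0)
```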
String Methods: Python provides numerous built-in methods for string manipulation, such as
converting cases, replacing characters, splitting and joining strings, finding substrings, and
more.
Python - code
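A minimal sketch of a few common string methods:

```python
text = "hello world"
print(text.upper())                     # HELLO WORLD
print(text.replace("world", "Python"))  # hello Python
print(text.split(" "))                  # ['hello', 'world']
print("-".join(["a", "b", "c"]))        # a-b-c
print(text.find("world"))               # 6
```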
These are just a few examples of string operations in Python. Strings are versatile and
commonly used in many programming tasks, such as text processing, data manipulation, and
input/output operations. Python provides a rich set of string methods and functionalities to
handle and manipulate strings efficiently.
Reading and Writing CSV files
Reading and writing CSV (Comma-Separated Values) files is a common task in data analysis
and manipulation. CSV files are a plain text format used to store tabular data, where each line
represents a row and the values within each line are separated by a delimiter, typically a comma.
To read data from a CSV file, you need to follow these steps:
● Import the csv module in Python.
● Open the CSV file using the open() function, specifying the file path and the mode as
'r' for reading.
● Create a CSV reader object using the reader() function from the csv module, passing
the opened file as the argument.
● Iterate over the rows in the CSV file using a loop. The reader object acts as an iterator,
returning each row as a list of strings.
● Process the data in each row as needed. You can access individual values by indexing
the row list.
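The steps above can be sketched as follows (data.csv is a throwaway file the example creates itself, so the snippet is self-contained):

```python
import csv

# Write a small file first so there is something to read
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age'])
    writer.writerow(['Alice', '30'])

# Read it back: the reader yields each row as a list of strings
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```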
Additional Considerations:
● You can specify different delimiters for CSV files, such as a semicolon or tab, using
the delimiter parameter in the reader() or writer() function.
● If a value in a CSV file contains the delimiter character itself or special characters like
newline or quotes, it is typically enclosed in quotes. The csv module handles these cases
automatically.
● The csv module provides various options for quoting, handling empty values, and
specifying the newline character. Refer to the official Python documentation for more
details on these options.
Overall, reading and writing CSV files in Python is straightforward with the help of the csv
module. It allows you to easily handle tabular data in a format that is widely supported across
different applications.
Advanced Python Objects
Advanced Python objects refer to the concepts and techniques used to create more complex
and specialized objects in Python programming. These concepts build upon the fundamentals
of object-oriented programming in Python and provide additional functionality and flexibility.
Here are some advanced Python object topics:
6. Decorators: Decorators are a powerful feature in Python that allow you to modify the
behavior of functions or classes without changing their source code. Decorators are
functions that wrap around other functions or classes and provide additional
functionality. They are commonly used for adding logging, caching, or authentication
to existing functions or classes.
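A minimal sketch of a logging decorator of the kind described above (names are illustrative):

```python
import functools

def log_calls(func):
    """Decorator that prints a message each time the wrapped function runs."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

result = add(2, 3)  # prints "calling add" before returning 5
```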
These are just a few examples of advanced Python object concepts. By understanding and
applying these concepts, you can write more flexible, modular, and reusable code in Python.
map() function
The map() function in Python is used to apply a given function to each element of an iterable
(e.g., a list, tuple, or string) and returns a new iterable with the transformed values. The map()
function takes two arguments: the function to be applied and the iterable. Here's an example:
Python - code
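A sketch of map() applied to a list (in Python 3, map() returns an iterator, so it is usually wrapped in list()):

```python
numbers = [1, 2, 3, 4]
squared = list(map(lambda x: x ** 2, numbers))  # apply the function to each element
print(squared)  # [1, 4, 9, 16]
```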
Numpy
Numpy is a Python library that provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions to operate on these arrays
efficiently. It is widely used in scientific computing and data analysis tasks. Numpy provides
a high-performance multidimensional array object called ndarray and various functions for
array manipulation, mathematical operations, linear algebra, and more. Here's an example:
Python - code
import numpy as np
# Create a 1-dimensional array
a = np.array([1, 2, 3, 4, 5])
# Compute summary statistics before printing them
mean = np.mean(a)
std_dev = np.std(a)
print("Mean:", mean)
print("Standard Deviation:", std_dev)
Pandas
Pandas is a Python library built on top of Numpy that provides high-level data manipulation
and analysis tools. It introduces two main data structures: Series (1-dimensional labeled array)
and DataFrame (2-dimensional labeled data structure). Pandas offers various functions and
methods to handle missing data, clean and transform data, perform grouping and aggregation,
merge and join datasets, and more. Here's an example:
Python - code
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Country': ['USA', 'Canada', 'UK']}
df = pd.DataFrame(data)
In summary, map() is a built-in function in Python used to apply a function to each element of
an iterable, Numpy is a library for numerical computing with support for multi-dimensional
arrays, and Pandas is a library for data manipulation and analysis, providing data structures and
functions for efficient handling of tabular data.
The Series Data Structure
The Series data structure is a fundamental component of the pandas library in Python. It
represents a one-dimensional labeled array that can hold any data type. The Series is similar to
a column in a spreadsheet or a single column of data in a table. It consists of two main
components: the data and the index.
Python - code
import pandas as pd
series = pd.Series(data, index)
data: The data can be a list, numpy array, dictionary, or scalar value. It represents the actual
values in the Series.
index (optional): The index provides labels to access the data elements. If not specified, a
default integer index will be assigned.
Python - code
import pandas as pd
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
In the example above, we created a Series using a list of numbers. The resulting Series has an
integer index (0 to 4) and displays the corresponding values.
Series objects have several key properties and methods that allow for easy data manipulation
and analysis. Some common operations include indexing, slicing, arithmetic operations, and
applying functions element-wise. For example:
Python - code
# Accessing elements by index
print(series[2]) # Output: 30
# Slicing the Series
print(series[1:4]) # Output: 20, 30, 40
# Arithmetic operations
print(series * 2) # Output: 20, 40, 60, 80, 100
# Applying a function element-wise
print(series.apply(lambda x: x**2)) # Output: 100, 400, 900, 1600, 2500
The Series data structure is an essential tool for handling and manipulating one-dimensional
labeled data in pandas. It provides a convenient way to store, access, and perform operations
on data, making it an integral part of data analysis workflows.
Querying a Series,
Querying a Series in pandas involves accessing and retrieving specific elements or subsets of
data from the Series based on certain conditions. Pandas provides several methods and
techniques for querying Series data.
1.Index-based Selection:
Single element: Use square brackets and provide the index label or position to access a single
element.
python code
series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(series['b']) # Output: 20
Multiple elements: Use a list of index labels or positions to retrieve multiple elements.
python code
Slicing: Use slicing notation to select a range of elements based on index positions.
python code
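Using the series defined above, multiple-element selection and position-based slicing might look like this:

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Multiple elements via a list of index labels
print(series[['a', 'c', 'e']])

# Slicing by integer position: elements at positions 1 and 2
print(series.iloc[1:3])
```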
2.Conditional Selection:
python code
Using logical operators: Combine multiple conditions using logical operators (e.g., & for
AND, | for OR).
python code
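A sketch of conditional selection on the same series, including combined conditions (note the parentheses, which are required around each condition):

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# A boolean condition keeps only the matching elements
large = series[series > 25]

# Combining conditions with & (AND); | works the same way for OR
middle = series[(series > 15) & (series < 45)]
print(large)
print(middle)
```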
3.Label-based Selection:
.loc[] indexer: Use the .loc[] indexer to access elements or subsets based on index labels.
Python code
print(series.loc['c']) # Output: 30
.loc[] with boolean indexing: Combine label-based selection with boolean indexing to filter
based on conditions.
python code
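A sketch of .loc[] with label lists and with a boolean mask, on the same series:

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Label-based selection of several elements
print(series.loc[['b', 'd']])

# .loc[] with a boolean mask filters on a condition
print(series.loc[series >= 30])
```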
These are some common methods for querying a Series in pandas. By using these techniques,
you can easily retrieve specific elements or subsets of data based on index labels, positions, or
conditions.
The DataFrame Data Structure
The DataFrame data structure is a fundamental component of the pandas library in Python. It
provides a flexible and efficient way to store and manipulate structured, two-dimensional data.
The DataFrame is similar to a table in a relational database or a spreadsheet in that it organizes
data in rows and columns.
Creating a DataFrame:
You can create a DataFrame in pandas using various methods, such as reading data from files
(e.g., CSV, Excel), converting from other data structures (e.g., lists, dictionaries), or generating
data programmatically. Here's an example of creating a DataFrame from a dictionary:
python code
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 28, 32],
'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)
Output:
    Name  Age      City
0   John   25  New York
1  Alice   28     Paris
2    Bob   32    London
In the example above, we created a DataFrame using a dictionary where each key represents a
column name, and the corresponding value is a list of data for that column. The resulting
DataFrame has three columns: 'Name', 'Age', and 'City', with the associated data.
DataFrames provide a wide range of functionalities for data manipulation and analysis. Some
common operations include indexing and slicing, filtering, merging and joining, reshaping,
grouping, and statistical calculations. DataFrames also offer built-in methods for handling
missing data, handling duplicates, and handling outliers.
python code
# Accessing columns
print(df['Name'])
# Slicing rows
print(df[1:3])
# Filtering rows based on a condition
print(df[df['Age'] > 25])
# Merging DataFrames
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
'Salary': [5000, 6000, 4000]})
merged_df = pd.merge(df, df2, on='Name')
# Grouping and aggregation (select the numeric column before averaging)
grouped_df = df.groupby('City')['Age'].mean()
# Statistical calculations
print(df['Age'].mean())
print(df['Age'].max())
These examples demonstrate a few of the many operations that can be performed on
DataFrames in pandas. The DataFrame data structure provides a powerful tool for data analysis
and manipulation in Python, making it a popular choice for working with structured data.
DataFrame Indexing and Loading
1.Indexing a DataFrame
Indexing allows you to select specific rows and columns from a DataFrame. There are several
ways to index a DataFrame in Python:
Using square brackets: You can use square brackets [] to access columns or a specific subset
of rows based on conditions.
python code
# Access a single column
df['column_name']
# Access multiple columns
df[['column1', 'column2']]
# Access rows based on conditions
df[df['column'] > 10]
Using loc and iloc: The .loc[] and .iloc[] indexers provide more advanced indexing
capabilities. .loc[] allows you to access rows and columns using labels, while .iloc[] uses
integer-based indexing.
python code
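A sketch of .loc[] and .iloc[] on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 28, 32],
                   'City': ['New York', 'Paris', 'London']})

# .loc[]: label-based — the row labeled 0, columns selected by name
print(df.loc[0, ['Name', 'City']])

# .iloc[]: integer-based — the first two rows and first two columns
print(df.iloc[0:2, 0:2])
```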
Using boolean indexing: Boolean indexing allows you to select rows based on a condition
using a Boolean expression.
python code
df[condition]
2.Loading a DataFrame
Pandas provides various methods to load data into a DataFrame from different sources, such
as CSV files, Excel files, databases, or even from an existing Python data structure.
CSV files: Use the pd.read_csv() function to load data from a CSV file into a DataFrame.
python code
import pandas as pd
df = pd.read_csv('data.csv')
Excel files: Use the pd.read_excel() function to load data from an Excel file into a DataFrame.
python code
import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Databases: Use the appropriate database connector (e.g., pymysql, psycopg2) to establish a
connection to the database and then use pd.read_sql() function to load data from a database
query into a DataFrame.
python code
import pandas as pd
import pymysql
# Establish a connection
connection = pymysql.connect(host='localhost', user='username', password='password',
database='database_name')
# Load data from a query into a DataFrame
query = "SELECT * FROM table_name"
df = pd.read_sql(query, connection)
Other sources: Pandas also provides functions to load data from various other sources, such
as JSON, HTML, and more. You can explore the pandas documentation for more details on
loading data from different sources.
These are some basic concepts of DataFrame indexing and loading in Python using the pandas
library. With these techniques, you can select specific data from a DataFrame based on your
requirements and load data from various sources to perform data analysis and manipulation.
Querying a DataFrame
Querying a DataFrame in Python refers to the process of extracting specific data or subsets of
data from a DataFrame based on certain conditions or criteria. Pandas, a popular data
manipulation library in Python, provides various methods to query DataFrames effectively.
Here are some common ways to query a DataFrame:
Basic Indexing: You can use basic indexing with square brackets [] to extract specific columns
or rows from a DataFrame. For example:
python code
Boolean Indexing: Boolean indexing allows you to filter rows based on specific conditions
using logical operators such as ==, >, <, >=, <=, and !=. For example:
python code
loc and iloc: The .loc[] and .iloc[] indexers provide more advanced querying capabilities. .loc[]
allows you to access rows and columns by label, while .iloc[] uses integer-based indexing. For
example:
python code
Query Method: Pandas provides the .query() method to query a DataFrame using a more
concise and expressive syntax. It allows you to write queries using a string-based syntax that
resembles SQL. For example:
python code
# Query using the query method
df.query('column1 > 5 and column2 == "value"')
GroupBy: The groupby() function allows you to group the DataFrame by one or more columns
and perform aggregation or apply functions to the grouped data. For example:
python code
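A sketch of groupby() with aggregation (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'City': ['Paris', 'London', 'Paris'],
                   'Salary': [6000, 4000, 5500]})

# Group rows by City and take the mean Salary within each group
avg_salary = df.groupby('City')['Salary'].mean()
print(avg_salary)
```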
These are just a few examples of how you can query a DataFrame in Python using pandas. The
library provides many more functionalities, such as filtering rows based on string matching,
handling missing values, combining multiple DataFrames, and more. Pandas documentation is
a valuable resource for learning about all the available querying methods and their parameters.
Here's an example that demonstrates how to query a DataFrame in Python using pandas:
python code
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Eve'],
'Age': [25, 28, 32, 30, 27],
'City': ['New York', 'Paris', 'London', 'Paris', 'Tokyo'],
'Salary': [5000, 6000, 4000, 5500, 4500]
}
df = pd.DataFrame(data)
# Querying based on conditions
# Filter rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print("Filtered DataFrame:")
print(filtered_df)
# Querying based on multiple conditions
# Filter rows where Age is greater than 25 and Salary is greater than 5000
filtered_df = df[(df['Age'] > 25) & (df['Salary'] > 5000)]
print("Filtered DataFrame with multiple conditions:")
print(filtered_df)
In this example, we create a DataFrame with columns 'Name', 'Age', 'City', and 'Salary'. Then
we demonstrate different querying techniques:
● Filtering rows based on a condition: We filter rows where the 'Age' column is greater
than 25 and store the result in the variable 'filtered_df'.
● Filtering rows based on multiple conditions: We filter rows where the 'Age' is greater
than 25 and the 'Salary' is greater than 5000.
● Querying using the query method: We use the query method to filter rows where the
'City' is 'Paris' and the 'Salary' is less than 6000.
● Grouping and aggregation: We group the DataFrame by 'City' and calculate the average
salary for each city using the groupby() function.
Each query demonstrates a different way to extract specific data from the DataFrame based on
conditions or grouping. You can modify these examples to suit your own DataFrame and query
requirements.
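The last two bullets describe operations that the code above does not show; on the same sample data they might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Eve'],
    'Age': [25, 28, 32, 30, 27],
    'City': ['New York', 'Paris', 'London', 'Paris', 'Tokyo'],
    'Salary': [5000, 6000, 4000, 5500, 4500]
})

# Query with a string expression: City is Paris and Salary below 6000
queried_df = df.query('City == "Paris" and Salary < 6000')
print(queried_df)

# Group by City and compute the average Salary per city
avg_by_city = df.groupby('City')['Salary'].mean()
print(avg_by_city)
```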
Indexing Dataframes
Indexing in Python DataFrames refers to accessing and manipulating data based on row and
column labels or positions. Pandas provides various indexing methods to retrieve specific data
from DataFrames. Here are some common indexing techniques:
1.Column Indexing:
Using square brackets []: You can access a single column or multiple columns by specifying
their column names inside square brackets. For example:
python code
2.Row Indexing:
Using loc and iloc: The .loc[] and .iloc[] indexers are used to access rows based on their labels
or integer positions, respectively.
python code
3.Conditional Indexing:
Boolean indexing: You can use Boolean expressions to filter rows based on specific
conditions.
Python code
Combining conditions: You can use logical operators such as & (and) and | (or) to combine
multiple conditions.
python code
Using loc and iloc together: You can combine row and column indexing using the .loc[] or
.iloc[] indexers.
python code
These are some of the commonly used indexing techniques in Python DataFrames. They allow
you to retrieve specific data based on labels or positions, filter rows based on conditions, and
access individual cells or subsets of data.
Merging Dataframes
1.Inner Merge:
An inner merge combines only the rows that have matching values in both DataFrames. It
retains only the common records between the DataFrames.
python code
2.Outer Merge:
An outer merge combines all the rows from both DataFrames and fills missing values with
NaN where there is no match.
python code
3.Left Merge:
A left merge includes all the rows from the left DataFrame and the matching rows from the
right DataFrame. Missing values are filled with NaN where there is no match.
python code
4.Right Merge:
A right merge includes all the rows from the right DataFrame and the matching rows from the
left DataFrame. Missing values are filled with NaN where there is no match.
python code
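The four merge types above can be sketched on two small illustrative DataFrames; the how parameter selects the merge type:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                    'Age': [25, 28, 32]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                    'Salary': [6000, 4000, 4500]})

inner = pd.merge(df1, df2, on='Name', how='inner')  # only Alice and Bob match
outer = pd.merge(df1, df2, on='Name', how='outer')  # all four names, NaN where absent
left = pd.merge(df1, df2, on='Name', how='left')    # John, Alice, Bob
right = pd.merge(df1, df2, on='Name', how='right')  # Alice, Bob, Eve
print(outer)
```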
You can merge DataFrames based on multiple columns by passing a list of column names to
the on parameter.
python code
If the column names are different in both DataFrames, you can use the left_on and right_on
parameters to specify the column names from each DataFrame to merge on.
python code
merged_df = pd.merge(df1, df2, left_on='column1', right_on='column2')
These are the basic techniques for merging DataFrames in Python using the Pandas library.
They allow you to combine data from different sources based on common columns or indices.
You can choose the appropriate merge type based on your data requirements.
Unit 5: Data Aggregation, processing and Group Operations
Time Series, Date and Time, Data Types and Tools, Time Series Basics, Date
Ranges, Frequencies, and Shifting, Time Zone Handling, Periods and Period
Arithmetic, Resampling and Frequency Conversion, Time Series Plotting,
Moving Window Functions, Natural Language Processing, Image Processing,
Machine Learning K Nearest Neighbors Algorithm for Classification, Clustering
Time Series
Time series data refers to a sequence of data points collected and recorded over time, where
each data point is associated with a specific timestamp or time interval. Time series data is
commonly encountered in various domains such as finance, economics, weather forecasting,
stock market analysis, and more.
In Python, the Pandas library provides powerful tools and data structures for working with time
series data. Here are some key concepts and techniques related to time series in Python:
1.DateTime Index:
The DateTime Index is a specialized Pandas index object that allows for indexing and slicing
of time series data based on dates and times. It provides convenient methods for working with
time-related data.
python code
import pandas as pd
dates = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
df = pd.DataFrame({'date': dates, 'value': [1, 2, 3, ...]})
df.set_index('date', inplace=True)
2.Resampling:
Resampling involves changing the frequency of the time series data. It can be used to convert
higher frequency data to lower frequency (downsampling) or lower frequency data to higher
frequency (upsampling). Common frequency aliases include 'D' for daily, 'M' for monthly, 'Y'
for yearly, etc.
python code
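A sketch of downsampling daily data to monthly means (newer pandas versions prefer the alias 'ME' over 'M' for month-end, but 'M' still works):

```python
import pandas as pd
import numpy as np

# One year of daily data
dates = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
ts = pd.Series(np.arange(len(dates)), index=dates)

# Downsample to monthly frequency, averaging within each month
monthly_mean = ts.resample('M').mean()
print(monthly_mean.head())
```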
3.Time-based Indexing:
The DateTime Index allows for intuitive indexing and slicing of time series data based on
specific dates, date ranges, or time intervals.
python code
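A sketch of label-based selection on a DateTime Index: a single day, a whole month (partial string), or a date range:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2021-01-01', end='2021-03-31', freq='D')
ts = pd.Series(np.arange(len(dates)), index=dates)

print(ts.loc['2021-01-15'])                 # a single day
print(ts.loc['2021-02'].head())             # all of February via a partial date string
print(ts.loc['2021-01-10':'2021-01-12'])    # an inclusive date range
```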
4.Time Shifting:
Time shifting involves moving the entire time series data forward or backward in time. It can
be useful for calculating time differences, lagging or leading indicators, or aligning data from
different time periods.
python code
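A sketch of shift(), including the common pattern of computing period-over-period differences:

```python
import pandas as pd

dates = pd.date_range(start='2021-01-01', periods=5, freq='D')
ts = pd.Series([1, 2, 3, 4, 5], index=dates)

shifted = ts.shift(1)    # values move one step forward; the first becomes NaN
diff = ts - ts.shift(1)  # day-over-day difference
print(shifted)
```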
5.Rolling Window Functions:
Rolling window functions compute statistics over a sliding window of consecutive data points.
They are useful for calculating moving averages, rolling sums, or other time-dependent
calculations.
python code
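A sketch of a rolling mean; the first window-1 entries are NaN because the window is not yet full:

```python
import pandas as pd

dates = pd.date_range(start='2021-01-01', periods=6, freq='D')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# 3-day moving average over a sliding window
rolling_mean = ts.rolling(window=3).mean()
print(rolling_mean)
```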
6.Time Series Visualization:
Python libraries such as Matplotlib and Seaborn provide tools for visualizing time series data.
Line plots, area plots, bar plots, and scatter plots can be used to visualize trends, patterns, and
anomalies in the data over time.
python code
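A sketch of a time series line plot with pandas and Matplotlib (the Agg backend and file name are assumptions so the script runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render to a file, not a window
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

dates = pd.date_range(start='2021-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(100).cumsum(), index=dates)

fig, ax = plt.subplots()
ts.plot(ax=ax, title='Random walk over time')  # pandas delegates to matplotlib
fig.savefig('timeseries.png')
```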
These are some of the fundamental concepts and techniques related to working with time series
data in Python. Pandas and other libraries offer extensive functionality for analyzing,
manipulating, and visualizing time series data, allowing you to extract valuable insights and
make informed decisions based on temporal patterns and trends.
Date and Time
Working with dates and times in Python involves utilizing the datetime module, which provides
classes and functions for manipulating, formatting, and performing calculations with date and
time values. Here's an overview of how to work with date and time in Python:
1.Importing the datetime Module:
python code
import datetime
2.Creating Date and Time Objects:
The datetime module provides several classes for representing dates and times. Some
commonly used classes include:
python code
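A sketch of the main datetime classes (the specific dates are illustrative):

```python
import datetime

d = datetime.date(2023, 5, 9)                   # a calendar date
t = datetime.time(12, 30, 0)                    # a time of day
dt = datetime.datetime(2023, 5, 9, 12, 30, 0)   # date and time combined
delta = datetime.timedelta(days=7)              # a duration
print(d, t, dt, dt + delta)
```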
3.Getting the Current Date and Time:
You can obtain the current date and time using the datetime.now() function:
python code
current_datetime = datetime.datetime.now()
4.Formatting Dates and Times:
You can format dates and times using the strftime() method, which allows you to specify a
format string to represent the desired format:
python code
formatted_date = date.strftime('%Y-%m-%d')
formatted_time = time.strftime('%H:%M:%S')
formatted_datetime = datetime_obj.strftime('%Y-%m-%d %H:%M:%S')
5.Parsing Strings into Dates and Times:
You can parse strings that represent dates or times into datetime objects using the strptime()
function, specifying the format of the input string:
python code
date_str = '2023-05-09'
parsed_date = datetime.datetime.strptime(date_str, '%Y-%m-%d')
time_str = '12:30:00'
parsed_time = datetime.datetime.strptime(time_str, '%H:%M:%S')
6.Date and Time Arithmetic:
You can perform arithmetic operations on date and time objects, such as adding or subtracting
time intervals or calculating time differences:
python code
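A sketch of date and time arithmetic with timedelta:

```python
import datetime

dt = datetime.datetime(2023, 5, 9, 12, 0, 0)

# Add and subtract intervals with timedelta
next_week = dt + datetime.timedelta(weeks=1)
an_hour_ago = dt - datetime.timedelta(hours=1)

# Subtracting two datetimes yields a timedelta
gap = next_week - dt
print(next_week, an_hour_ago, gap.days)
```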
7.Timezone Handling:
If you need to work with timezones, the pytz module provides support for handling timezones
in Python.
python code
import pytz
# Set a timezone for a datetime object
tz = pytz.timezone('America/New_York')
datetime_obj = datetime_obj.astimezone(tz)
These are some of the basic operations and functionalities for working with date and time in
Python using the datetime module. Python's datetime module provides a comprehensive set of
tools for working with dates, times, and timezones, allowing you to handle various date and
time-related tasks in your Python programs.
Data Types and Tools
In Python, data types are used to define the type of data that a variable can hold. Different data
types have different properties and methods associated with them. Here are some commonly
used data types and tools in Python:
Dictionary (dict): Collection of key-value pairs, enclosed in curly braces {}, such as
{'name': 'John', 'age': 25}.
Set (set): Unordered collection of unique elements, enclosed in curly braces {}, such as
{1, 2, 3}.
FrozenSet (frozenset): Immutable version of a set.
5.Boolean Data Type:
Boolean (bool): Represents the truth values True and False.
6.Libraries and Tools:
NumPy: A powerful library for numerical computing in Python, providing support for
large, multi-dimensional arrays and mathematical functions.
Pandas: A library for data manipulation and analysis, providing data structures like
DataFrame for handling structured data.
Matplotlib: A plotting library for creating static, animated, and interactive
visualizations in Python.
SciPy: A library for scientific and technical computing, providing functions for
optimization, integration, linear algebra, and more.
Scikit-learn: A machine learning library that provides tools for data mining, analysis,
and building predictive models.
Jupyter Notebook: An interactive computing environment that allows you to create
and share documents containing code, visualizations, and explanatory text.
These are just a few examples of data types and tools in Python. Python offers a rich ecosystem
of libraries and tools for various data analysis, manipulation, and visualization tasks, allowing
you to effectively work with different types of data and perform advanced data analysis tasks.
Time Series Basics
Time series refers to a sequence of data points collected and recorded at specific time intervals.
In the context of data analysis and forecasting, time series data is commonly used to analyze
patterns, trends, and seasonality over time. Here are some key concepts and techniques related
to time series analysis:
1.Time Series Data Representation:
In Python, time series data is typically represented using pandas, a powerful library for
data manipulation and analysis. The primary data structure for time series in pandas is
the Series object, which is a one-dimensional labeled array capable of holding any data
type with associated time indices.
The time indices can be specified as dates, timestamps, or numeric values representing
time intervals.
2.Time Series Visualization:
Visualizing time series data is important for gaining insights and identifying patterns.
The matplotlib library provides various functions for creating line plots, scatter plots,
bar plots, and other visualizations to represent time series data.
Additional libraries like seaborn and plotly offer more advanced plotting options and
interactive visualizations for time series data.
3.Time Series Decomposition:
Time series data often exhibits components such as trend, seasonality, and noise.
Decomposing a time series helps separate these components for analysis and
forecasting. The statsmodels library in Python provides methods for decomposing time
series using techniques like moving averages, exponential smoothing, and seasonal
decomposition of time series (STL).
4.Time Series Analysis:
Time series analysis involves studying the statistical properties, patterns, and
dependencies within a time series. Techniques such as autocorrelation analysis,
stationarity testing, and spectral analysis can be applied to understand the underlying
characteristics of the data.
Python libraries like statsmodels, scipy, and numpy offer functions for performing
various time series analysis tasks, including autocorrelation functions, periodogram
analysis, and statistical tests for stationarity.
5.Time Series Handling with pandas:
Python's pandas library offers numerous tools and functions for handling time series
data. It provides capabilities for resampling, aggregating, and transforming time series
data, handling missing values, and handling time zone conversions.
Pandas also supports time-based indexing, allowing you to slice and select data based
on time intervals.
Time series analysis and forecasting play a crucial role in various domains, including finance,
economics, weather forecasting, sales forecasting, and more. Python, with its rich ecosystem
of libraries and tools, provides a comprehensive environment for working with time series data,
performing analysis, visualization, modeling, and forecasting tasks.
Date Ranges, Frequencies, and Shifting
In time series analysis, working with date ranges, frequencies, and shifting data is essential for
manipulating and analyzing time-based data. Here's an explanation of these concepts in Python
using the pandas library:
1.Date Ranges:
A date range represents a sequence of dates over a specified period. In pandas, you can generate
date ranges using the pd.date_range() function. It allows you to specify the start date, end
date, and frequency of the range.
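A sketch of pd.date_range() used both with a fixed number of periods and with an end date:

```python
import pandas as pd

# 10 consecutive days starting 2022-01-01
daily = pd.date_range(start='2022-01-01', periods=10, freq='D')

# Every month-end between two dates
monthly = pd.date_range(start='2022-01-01', end='2022-12-31', freq='M')
print(len(daily), len(monthly))
```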
2.Frequencies:
Frequencies define the intervals at which observations occur in a time series. In pandas,
frequencies are represented using frequency strings or offsets. The freq parameter in
pandas functions accepts these frequency strings.
Common frequency strings include 'D' for daily, 'W' for weekly, 'M' for monthly, 'Q'
for quarterly, 'A' for annually, and more. You can also specify custom frequencies.
Example: date_range = pd.date_range(start='2022-01-01', periods=12, freq='M')
generates a monthly date range for 12 months starting from January 2022.
3.Shifting Data:
Shifting data involves moving the values of a time series forward or backward in time.
This can be useful for calculating time-based differences or comparing values at
different time periods.
In pandas, you can shift a time series using the shift() method. Positive values shift the
data forward, while negative values shift it backward.
Example: shifted_series = series.shift(1) shifts the values of a series one step forward.
4.Rolling Windows:
Rolling windows allow you to calculate aggregated statistics over a sliding window of
time. This is useful for smoothing data, calculating moving averages, or identifying
trends.
In pandas, you can create a rolling window using the rolling() method. You can specify
the window size and apply various aggregation functions like mean, sum, min, max,
etc.
Example: rolling_mean = series.rolling(window=3).mean() calculates the rolling
mean over a window of size 3.
These concepts provide the foundation for working with time series data in Python. They allow
you to create date ranges, specify frequencies for data intervals, and manipulate time-based
data by shifting and aggregating values. Using pandas, you can easily handle and analyze time
series data, perform calculations, and extract meaningful insights.
Time Zone Handling
Handling time zones is an important aspect of working with time series data, especially when
dealing with data from different regions or when performing analysis across different time
zones. In Python, the pytz and dateutil libraries, along with the capabilities of pandas, provide
functionality for working with time zones. Here's an explanation of time zone handling in
Python:
1.Time Zone Localization:
Time zone localization involves assigning a specific time zone to a datetime object.
This is important when the original data does not have time zone information or when
converting data to a different time zone.
The pytz library provides a comprehensive database of time zones, and you can use the
pytz.timezone() function to specify a time zone. The tz_localize() method in pandas is
used to localize a datetime object to a specific time zone.
Example: localized_datetime = datetime.tz_localize(pytz.timezone('America/New_York'))
assigns the 'America/New_York' time zone to a datetime object.
2.Time Zone Conversion:
Time zone conversion involves converting datetime objects from one time zone to
another. This is useful when you want to compare or combine data from different time
zones.
The tz_convert() method in pandas is used to convert datetime objects from one time
zone to another. It automatically adjusts the datetime values to reflect the new time
zone.
Example: converted_datetime =
localized_datetime.tz_convert(pytz.timezone('Asia/Tokyo')) converts a localized
datetime object from the 'America/New_York' time zone to the 'Asia/Tokyo' time
zone.
3.Time Zone-aware Timestamps:
In pandas, the Timestamp object can be made time zone-aware by using the tz
parameter. Time zone-aware timestamps allow for easy manipulation and comparison
of dates and times across different time zones.
Example: aware_timestamp = pd.Timestamp('2022-01-01 12:00', tz='Europe/Paris')
creates a time zone-aware timestamp for the specified datetime in the 'Europe/Paris'
time zone.
4.Time Zone Offsets:
The dateutil library provides functions to handle time zone offsets. The
dateutil.relativedelta class can be used to perform arithmetic operations with time zone-
aware datetime objects, allowing for adjustments based on specific time zone offsets.
Example: adjusted_datetime = datetime + relativedelta(hours=2) adds 2 hours to a
time zone-aware datetime object.
By using these libraries and techniques, you can effectively handle time zones in Python.
Whether it's localizing datetime objects to a specific time zone, converting between time zones,
or performing operations with time zone offsets, Python provides the necessary tools to work
with time series data in different time zones accurately and efficiently.
Periods and Period Arithmetic
Periods in Python represent a fixed-length span of time, such as a day, month, or year. The
pandas library provides the Period class to work with periods and perform period arithmetic.
Here's an explanation of periods and period arithmetic in Python:
1.Creating Periods:
Periods can be created using the pd.Period() function by specifying a date or time string and a
frequency code. The frequency code determines the length of the period, such as 'D' for daily,
'M' for monthly, 'Y' for yearly, and so on.
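For instance (the dates are illustrative):

```python
import pandas as pd

p_month = pd.Period('2022-01', freq='M')   # the whole of January 2022
p_day = pd.Period('2022-01-15', freq='D')  # a single day

print(p_month)                # 2022-01
print(p_month.start_time)     # first instant of January 2022
print(p_month.days_in_month)  # 31
```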
2.Period Arithmetic:
Period arithmetic allows you to perform mathematical operations on periods, such as addition,
subtraction, and comparison. The arithmetic operations respect the defined frequency and
adjust the periods accordingly.
Example:
period1 = pd.Period('2022-01', freq='M')
period2 = pd.Period('2022-03', freq='M')
period_diff = period2 - period1 calculates the difference between two periods, giving
the offset (here, two months) between them.
period_sum = period1 + 2 adds 2 to the original period, resulting in a new period that
is two months later.
3.Period Index:
Periods can be used as an index in a pandas Series or DataFrame, allowing for efficient
indexing and slicing based on periods. The pd.PeriodIndex class is used to create an index of
periods.
Example:
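A small sketch with made-up monthly values:

```python
import pandas as pd

# Monthly PeriodIndex and a Series indexed by it
pidx = pd.period_range('2022-01', periods=4, freq='M')
s = pd.Series([10, 20, 30, 40], index=pidx)

print(s['2022-02'])            # 20
print(s['2022-02':'2022-04'])  # slice of three monthly periods
```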
4.Frequency Conversion:
Periods can be converted to a different frequency using the asfreq() method. This allows
you to change the length of the period while preserving the start or end timestamp.
Example: new_period = period.asfreq('Y') converts the original monthly period to a
yearly period.
Periods and period arithmetic provide a convenient way to work with fixed-length spans of
time in Python. Whether it's creating periods, performing arithmetic operations, or using
periods as an index, the pandas library offers robust functionality to handle time-based data at
different frequencies accurately.
Resampling and frequency conversion are essential techniques for working with time series
data in Python. The pandas library provides robust functionality to perform resampling and
frequency conversion operations. Here's an explanation of how to perform resampling and
frequency conversion in Python:
1.Resampling:
Resampling involves changing the frequency of your time series data. You can
upsample the data to a higher frequency or downsample it to a lower frequency.
The resample() method in pandas is used to perform resampling. It takes a frequency
string as an argument to specify the new frequency.
You can also specify an aggregation function to summarize the data within each new
frequency interval, such as sum(), mean(), max(), etc.
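For instance, downsampling a synthetic daily series to monthly totals (the data is made up; newer pandas versions may prefer the alias 'ME' over 'M'):

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2022-01-01', periods=90, freq='D')
daily = pd.Series(np.ones(90), index=idx)

# Downsample daily -> monthly, summing the values within each month
monthly = daily.resample('M').sum()
print(monthly)  # 31, 28, 31 for Jan, Feb, Mar 2022
```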
2.Frequency Conversion:
Frequency conversion involves converting your time series data from one frequency to another.
It allows you to align the data to a different frequency or standardize it.
The asfreq() method in pandas is used to perform frequency conversion. It takes a frequency
string as an argument to specify the desired frequency.
The method handles the appropriate alignment or interpolation of data points based on the
specified frequency.
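For instance (synthetic monthly values):

```python
import pandas as pd

idx = pd.date_range('2022-01-01', periods=3, freq='MS')  # month starts
monthly = pd.Series([1.0, 2.0, 3.0], index=idx)

# Convert to daily frequency; newly created dates hold NaN unless filled
daily = monthly.asfreq('D')
daily_filled = monthly.asfreq('D', method='ffill')
```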
Both resample() and asfreq() methods accept additional parameters to control the behavior of
the operation.
In older versions of pandas, the how parameter specified the aggregation function for
resampling (e.g., 'sum', 'mean', 'max'); in current versions you instead chain an
aggregation method such as sum(), mean(), or max() onto the resample() result.
Likewise, missing values introduced during upsampling are handled by chaining ffill()
(forward fill) or bfill() (backward fill) onto the resampled object, rather than through
the older fill_method parameter.
Example:
python code
# Resample the data, summing values within each new frequency interval
monthly_data = df.resample('M').sum()
Resampling and frequency conversion allow you to manipulate and analyze time series data at
different frequencies. Whether you need to change the frequency, aggregate the data, or align
it with other time series, pandas provides a comprehensive set of tools to perform these
operations effectively.
Time series plotting is an essential part of analyzing and visualizing time-based data in Python.
The pandas library, in combination with matplotlib, provides powerful tools for creating
insightful time series plots. Here's an explanation of how to plot time series data in Python:
python code
import pandas as pd
import matplotlib.pyplot as plt
Load the time series data into a pandas DataFrame, ensuring that the date or time
column is of the correct data type.
Set the date or time column as the index of the DataFrame to enable time-based
indexing and plotting.
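A minimal sketch of these steps with a synthetic dataset (in practice the data would come from a file, e.g. via pd.read_csv with parse_dates; the column names here are made up):

```python
import pandas as pd
import numpy as np

# Synthetic daily time series data
df = pd.DataFrame({
    'date': pd.date_range('2022-01-01', periods=100, freq='D'),
    'value': np.arange(100, dtype=float),
})

# Set the datetime column as the index for time-based indexing and plotting
df = df.set_index('date')
```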
Use the plot() method of the DataFrame to create basic line plots of the time series data.
Customize the plot by specifying the plot type, labels, title, gridlines, and other options.
Adjust the figure size using plt.figure(figsize=(width, height)) to control the dimensions
of the plot.
Apply different plot styles using plt.style.use('style_name'), such as 'seaborn', 'ggplot',
or custom styles.
Add legends, change line colors, specify line styles, or add markers to the plot to
enhance readability.
python code
# Plot the time series data with customized options
df.plot(color='blue', linestyle='-', linewidth=2, marker='o', markersize=5, label='Data')
# Add a legend
plt.legend()
Time series plotting in Python allows you to visualize trends, patterns, and anomalies in your
time-based data. By leveraging the capabilities of pandas and matplotlib, you can create
informative and visually appealing plots that help you gain insights into your time series data.
Moving window functions, also known as rolling or sliding window functions, are a class of
operations commonly used in time series analysis and data smoothing. These functions
compute an aggregate value over a fixed-size window of consecutive data points as it slides
through the time series. The window "rolls" or "slides" over the data, updating the aggregate
value at each step. The pandas library in Python provides convenient methods to perform
moving window operations. Here's an explanation of moving window functions in Python:
python code
# Create a rolling window of a fixed size over a numeric Series
rolling_window = series.rolling(window=3)
# Compute the mean value over the rolling window
mean_values = rolling_window.mean()
2.Expanding Windows:
An expanding window grows from the start of the series to include every observation seen
so far, rather than keeping a fixed size. It is created with the expanding() method, for
example expanding_window = series.expanding().
3.Aggregation Functions:
Various aggregation functions can be applied to the moving windows, such as mean, sum, min,
max, standard deviation, etc.
These functions are applied to the window using the mean(), sum(), min(), max(), std(), etc.,
methods of the rolling or expanding window objects.
Example:
python code
# Compute the maximum value over the rolling window
max_values = rolling_window.max()
# Compute the standard deviation over the expanding window
std_values = expanding_window.std()
Moving window functions are useful for calculating rolling averages, smoothing time series
data, detecting trends or outliers, and performing various other analyses on time-based data.
By specifying the window size and choosing an appropriate aggregation function, you can
derive meaningful insights from your time series data using moving window operations in
Python.
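A small sketch tying these pieces together (the values are made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])

# 3-point rolling mean: NaN until the window has filled
roll_mean = s.rolling(window=3).mean()
print(roll_mean.tolist())  # [nan, nan, 2.0, 3.0, 4.0, 5.0]

# Expanding sum: aggregates everything seen so far
exp_sum = s.expanding().sum()
print(exp_sum.tolist())    # [1.0, 3.0, 6.0, 10.0, 15.0, 21.0]
```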
Natural Language Processing (NLP) is a field of study that focuses on the interaction between
computers and human language. It involves techniques and algorithms that enable computers
to understand, interpret, and generate human language in a meaningful way. Python provides
several powerful libraries and tools for NLP, making it a popular choice among developers.
Here's an overview of NLP in Python:
1.NLTK:
NLTK (Natural Language Toolkit) is a widely used library for NLP in Python. It provides
various functionalities for text processing, tokenization, stemming, tagging, parsing, and more.
It also includes a large collection of corpora, lexical resources, and models for different NLP
tasks.
python code
import nltk
# Required resources (e.g. 'punkt') must be downloaded once via nltk.download()
# Tokenization
text = "This is an example sentence."
tokens = nltk.word_tokenize(text)
# Part-of-speech tagging
tagged = nltk.pos_tag(tokens)
# Stemming
stemmer = nltk.stem.PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]
2.spaCy:
spaCy is a modern NLP library that offers high-performance and efficient tools for NLP tasks
such as tokenization, part-of-speech tagging, named entity recognition, dependency parsing,
and more.
It is known for its speed, accuracy, and ease of use, making it suitable for large-scale NLP
applications.
python code
import spacy
# Load the English language model
nlp = spacy.load('en_core_web_sm')
# Tokenization
doc = nlp("This is an example sentence.")
tokens = [token.text for token in doc]
# Dependency parsing
for token in doc:
    print(token.text, token.dep_, token.head.text)
3.TextBlob:
TextBlob is a user-friendly library built on top of NLTK, providing a simple API for common
NLP tasks such as sentiment analysis, part-of-speech tagging, noun phrase extraction,
translation, and more.
It also offers a straightforward interface for working with textual data and performing basic
text processing operations.
These are just a few examples of the libraries available for NLP in Python. Other popular
libraries include Gensim for topic modeling, scikit-learn for machine learning-based NLP
tasks, and Transformers for advanced deep learning models such as BERT and GPT. With
these libraries, you can perform a wide range of NLP tasks, analyze textual data, and extract
valuable insights from text using Python.
Image Processing
Image processing is a field of study that involves manipulating digital images to enhance their
quality, extract useful information, or perform specific tasks. Python provides various libraries
and tools for image processing, making it a popular choice among developers. Here's an
overview of image processing in Python:
1.OpenCV:
OpenCV is a widely used library for computer vision and image processing in Python.
It provides a comprehensive set of functions and algorithms for image manipulation, filtering,
feature detection, object recognition, and more.
python code
import cv2
# Read an image from file
image = cv2.imread('image.jpg')
# Convert the image to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
2.PIL (Pillow):
PIL is a library for opening, manipulating, and saving many different image file formats
in Python.
It provides functions for basic image processing tasks such as resizing, cropping, rotating, and
converting image formats.
python code
from PIL import Image
image = Image.open('image.jpg')  # open an image file
resized = image.resize((200, 200))  # basic processing, e.g. resizing
3.scikit-image:
scikit-image is a library that provides a collection of algorithms and functions for image
processing tasks in Python.
It offers various functionalities for image filtering, segmentation, morphology, and feature
extraction.
Example:
python code
import skimage.io
import skimage.filters
These are just a few examples of the libraries available for image processing in Python. Other
notable libraries include scikit-learn for machine learning-based image analysis, matplotlib and
seaborn for image visualization, and TensorFlow and PyTorch for deep learning-based image
processing tasks. With these libraries, you can perform a wide range of image processing tasks,
analyze and manipulate images, and develop computer vision applications using Python.
Machine Learning K Nearest Neighbors Algorithm for Classification
The k-nearest neighbors (KNN) algorithm is a simple yet powerful machine learning algorithm
used for both classification and regression tasks. In this explanation, we will focus on using the
KNN algorithm for classification.
1.Training:
During the training phase, the algorithm simply stores the labeled data points in
memory.
Each data point consists of a set of features (input variables) and a corresponding class
label (output variable).
2.Prediction:
When a new unlabeled data point is given, the KNN algorithm predicts its class label
based on its similarity to the labeled data points.
The algorithm measures the similarity using a distance metric (e.g., Euclidean distance).
It considers the k nearest neighbors (data points with the smallest distances) to the new
data point.
3.Voting:
For classification, the KNN algorithm employs majority voting among the k nearest
neighbors to determine the class label of the new data point.
Each neighbor's class label contributes one vote, and the majority class label is assigned
to the new data point.
4.Choosing k:
The value of k controls the bias-variance trade-off: a small k makes the prediction
sensitive to noise, while a large k smooths the decision boundary and can blur class
distinctions. In practice, several (typically odd) values of k are compared on validation
data and the best-performing one is chosen.
Here's an example of how to implement the KNN algorithm for classification using the
scikit-learn library in Python:
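A minimal sketch along these lines, using scikit-learn's built-in iris dataset (the choices of k=5 and a 70/30 split are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Feature matrix X and class labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train the classifier with k = 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```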
In this example, X represents the feature matrix (input variables) and y represents the
corresponding class labels. The train_test_split function is used to split the data into training
and testing sets. The fit method is used to train the KNN classifier, and the predict method is
used to make predictions on the testing data. Finally, the accuracy of the model is evaluated
using the accuracy_score function.
Remember to preprocess and normalize the data as needed before applying the KNN algorithm.
Additionally, feature scaling and handling categorical variables might be necessary for certain
datasets.
The KNN algorithm is relatively simple to understand and implement, making it a good starting
point for classification tasks. However, it is important to choose an appropriate value for k and
handle the curse of dimensionality when working with high-dimensional data.
Clustering
Clustering is an unsupervised machine learning technique used to group similar data points
together based on their characteristics or patterns. It is often used for exploratory data analysis,
pattern recognition, and data segmentation. The goal of clustering is to discover inherent
structures or clusters in the data without any predefined class labels.
There are various clustering algorithms available, but we will focus on two commonly used
algorithms: K-means clustering and hierarchical clustering.
1.K-means Clustering:
K-means clustering is an iterative algorithm that partitions the data into k clusters,
where k is a predefined number chosen by the user.
The algorithm works by initially randomly selecting k centroids (representative points)
in the feature space.
It assigns each data point to the nearest centroid based on a distance metric (usually
Euclidean distance).
After assigning all the data points, the algorithm updates the centroids by calculating
the mean of the points in each cluster.
This process is repeated until the centroids no longer change significantly or a
maximum number of iterations is reached.
2.Hierarchical Clustering:
Hierarchical clustering builds a hierarchy of clusters, either agglomeratively
(bottom-up, repeatedly merging the closest clusters) or divisively (top-down,
repeatedly splitting clusters); the result is often visualized as a dendrogram.
The distance metric can be Euclidean distance, Manhattan distance, or other similarity
measures.
measures.
The linkage criterion determines how the distance between clusters is calculated, such
as complete linkage, single linkage, or average linkage.
Here's an example of how to perform K-means clustering using the scikit-learn library in
Python:
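A minimal sketch with two obvious, made-up clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # cluster label for each data point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers
```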
In this example, X represents the feature matrix (input variables). The n_clusters parameter
specifies the number of clusters to create. The fit method is used to fit the clustering model to
the data, and the labels_ attribute provides the cluster labels for each data point. The
cluster_centers_ attribute gives the coordinates of the cluster centers.
Clustering is an iterative process, and the choice of the number of clusters (k) is crucial. You
can use various evaluation metrics, such as the silhouette score or elbow method, to determine
the optimal number of clusters.
It's important to note that clustering is an unsupervised learning technique, meaning it does not
require labeled data. However, it is often used as a preprocessing step for other tasks, such as
anomaly detection, customer segmentation, or recommendation systems.
Unit 6: Visualization of Data with Python 10 Hours
Using Matplotlib Create line plots, area plots, histograms, bar charts, pie charts,
box plots and scatter plots and bubble plots. Advanced visualization tools such as
waffle charts, word clouds, seaborn and Folium for visualizing geospatial data.
Creating choropleth maps
Using Matplotlib Create line plots, area plots, histograms, bar charts, pie
charts, box plots and scatter plots and bubble plots.
Line plots
Matplotlib is a popular data visualization library in Python that provides a wide range of tools
for creating various types of plots, including line plots. Line plots are commonly used to
visualize the relationship between two variables and show how the data changes over a
continuous range.
To create line plots using Matplotlib, you need to follow these basic steps:
python code
import matplotlib.pyplot as plt
Create lists or arrays to store the x-axis values and y-axis values.
Ensure that both lists have the same length and the values are in the correct order.
python code
plt.plot(x, y)
Pass the x-axis values as the first argument and the y-axis values as the second
argument.
Matplotlib will automatically connect the data points with lines.
python code
plt.show()
Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
Set a title for the plot using the title() function.
Customize the line style, color, and marker using optional parameters in the plot()
function.
Add a legend to distinguish multiple lines in the plot using the legend() function.
Matplotlib provides many more customization options to enhance the appearance of your line
plot. You can control the line style, thickness, color, marker style, and much more. You can
also add grid lines, annotations, and text to provide additional information.
python code
import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()
This code will create a line plot with the given x-axis and y-axis values. You can modify the
code based on your data and requirements to create more complex line plots with additional
customization.
Area plots
Area plots, also known as stacked area plots, are used to represent the cumulative magnitude
of different variables over a continuous range. They are useful for visualizing the composition
or distribution of multiple variables and showcasing their cumulative impact.
To create area plots using Matplotlib in Python, you can follow these steps:
python code
Create lists or arrays to store the x-axis values and y-axis values for each variable.
Make sure the length of the x-axis values is the same for all variables.
The y-axis values should represent the cumulative magnitude or proportion for each
variable at each point on the x-axis.
python code
plt.stackplot(x, y1, y2, y3, labels=['Variable 1', 'Variable 2', 'Variable 3'])
4.Customize the plot (optional):
Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
Set a title for the plot using the title() function.
Customize the colors, transparency, and other visual aspects of the area plot using
optional parameters in the stackplot() function.
Add a legend using the legend() function to differentiate between different variables.
Python code
plt.show()
Matplotlib provides additional customization options to enhance the appearance of your area
plot. You can control the colors, transparency, line styles, and markers of each variable. You
can also add grid lines, annotations, and text to provide additional information.
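Putting the steps together with made-up data:

```python
import matplotlib.pyplot as plt

# Example data: three variables over the same x range
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3, 4, 5]
y2 = [2, 2, 3, 3, 4]
y3 = [1, 1, 2, 2, 3]

plt.stackplot(x, y1, y2, y3, labels=['Variable 1', 'Variable 2', 'Variable 3'])
plt.xlabel('X-axis')
plt.ylabel('Cumulative value')
plt.title('Area Plot')
plt.legend(loc='upper left')
plt.show()
```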
This code will create an area plot with the given x-axis and y-axis values for three variables.
Each variable is represented by a different color, and a legend is added to identify each variable.
You can modify the code based on your data and requirements to create more complex area
plots with additional customization.
Histograms
Histograms are used to visualize the distribution of a continuous variable. They provide a
graphical representation of the frequency or count of values falling within specific intervals or
bins. Histograms help in understanding the shape, central tendency, and spread of the data.
To create histograms in Python, you can use various libraries such as Matplotlib, Seaborn, or
Pandas. Here, I will explain how to create histograms using Matplotlib.
python code
Create a list or array containing the values of the variable you want to plot.
Python code
plt.hist(data, bins=10)
4.Customize the plot (optional):
Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
Set a title for the plot using the title() function.
Adjust the appearance of the histogram, such as the color, transparency, and edge color,
using optional parameters in the hist() function.
python code
plt.show()
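Putting the steps together with made-up data (five bins, to keep the example small):

```python
import matplotlib.pyplot as plt

# Example data
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 9, 10]

# hist() returns the counts per bin and the bin edges
counts, bin_edges, _ = plt.hist(data, bins=5)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
```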
This code will create a histogram with the given data, dividing it into five bins. The x-axis
represents the value, and the y-axis represents the frequency. You can modify the code based
on your data and requirements to create more complex histograms with additional
customization.
Bar charts
Bar charts are a common visualization tool used to represent categorical data using rectangular
bars. They are particularly useful for displaying the frequency or count of different categories
or comparing values across different groups.
To create bar charts in Python, you can use various libraries such as Matplotlib, Seaborn, or
Plotly. Here, I will explain how to create bar charts using Matplotlib, which is a popular plotting
library.
Python code
Create a list or array containing the categories or labels for the x-axis.
Create a corresponding list or array containing the values or counts for each category.
Python code
plt.bar(x, height)
Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
Set a title for the plot using the title() function.
Adjust the appearance of the bars, such as the color, width, and edge color, using
optional parameters in the bar() function.
Python code
plt.show()
Matplotlib provides various customization options to enhance the appearance of your bar chart.
You can adjust the bar width, add error bars, annotate the bars with values, change the color
scheme, and more.
Python code
import matplotlib.pyplot as plt

# Example data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 8]

plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Count')
plt.title('Bar Chart')
plt.show()
This code will create a bar chart with the given categories and values. Each category is
represented by a bar, and the height of the bar represents the count. You can modify the code
based on your data and requirements to create more complex bar charts with additional
customization.
Pie charts
Pie charts are a popular way to represent categorical data, showing the proportion or percentage
of each category relative to the whole. They are particularly useful for visualizing data with a
small number of categories or comparing the relative sizes of different categories.
To create pie charts in Python, you can use various libraries such as Matplotlib, Plotly, or
Seaborn. Here, I will explain how to create pie charts using Matplotlib, which is a widely used
plotting library.
Python code
plt.pie(sizes, labels=labels)
plt.show()
Matplotlib provides additional customization options to enhance the appearance of your pie
chart. You can explode or highlight specific slices, add percentage values, adjust the text
properties, and more.
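Putting the steps together with made-up data:

```python
import matplotlib.pyplot as plt

# Example data: proportions of the whole
labels = ['Apples', 'Bananas', 'Cherries', 'Dates']
sizes = [35, 25, 25, 15]

# autopct displays the percentage value on each slice
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()
```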
This code will create a pie chart with the given labels and sizes. Each category is represented
by a slice, and the size of the slice represents the proportion or percentage. The autopct
parameter is used to display the percentage values on the chart. You can modify the code based
on your data and requirements to create more complex pie charts with additional customization.
Box plots
Box plots, also known as box-and-whisker plots, are a useful visualization tool to display the
distribution of a continuous variable across different categories or groups. They provide a
summary of key statistical measures such as the median, quartiles, and potential outliers.
To create box plots in Python, you can use various libraries such as Matplotlib, Seaborn, or
Plotly. Here, I will explain how to create box plots using Matplotlib, which is a commonly used
plotting library.
Python code
Organize your data into separate groups or categories, each containing a list or array of
values.
Optionally, provide labels for each group if you want to display them on the plot.
Python code
plt.boxplot(data, labels=labels)
Python code
plt.show()
Matplotlib provides additional customization options to enhance the appearance of your box
plot. You can show or hide specific elements such as outliers, caps, or median lines, change
the orientation of the plot, add grid lines, and more.
Here's a simple example of creating a box plot using Matplotlib:
Python code
import matplotlib.pyplot as plt

# Example data: three groups of values
data = [[2, 4, 5, 7, 9], [1, 3, 3, 6, 8], [4, 5, 6, 8, 12]]
labels = ['Group 1', 'Group 2', 'Group 3']

plt.boxplot(data, labels=labels)
plt.title('Box Plot')
plt.ylabel('Values')
plt.show()
This code will create a box plot with three groups. Each group is represented by a box, with
the central line inside the box representing the median. The whiskers extend to the minimum
and maximum values, and any potential outliers are indicated by individual points. You can
modify the code based on your data and requirements to create more complex box plots with
additional customization.
Scatter plots
Scatter plots are used to visualize the relationship between two continuous variables. They
show the individual data points as dots on a two-dimensional coordinate system, with one
variable plotted on the x-axis and the other variable plotted on the y-axis. Scatter plots help to
identify patterns, trends, or correlations between the two variables.
To create scatter plots in Python, you can use various plotting libraries such as Matplotlib,
Seaborn, or Plotly. Here, I will explain how to create scatter plots using Matplotlib, which is a
commonly used plotting library.
Here are the steps to create a scatter plot using Matplotlib:
Python code
Organize your data into two arrays or lists, one for the x-values and one for the y-values.
Python code
plt.scatter(x, y)
Python code
plt.show()
Matplotlib provides additional customization options to enhance the appearance of your scatter
plot. You can add regression lines, error bars, annotations, or other plot elements to provide
more context or insights.
Python code
import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 7, 6]

plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.show()
This code will create a scatter plot with the given x-values and y-values. Each data point is
represented by a dot on the plot. You can modify the code based on your data and requirements
to create more complex scatter plots with additional customization.
Bubble plots
Bubble plots, also known as bubble charts, are a variation of scatter plots where the size of the
markers (bubbles) represents a third variable. They are useful for visualizing three-dimensional
data, where the x-axis and y-axis represent two continuous variables, and the size of the bubbles
represents the magnitude or frequency of another variable.
To create bubble plots in Python, you can use various plotting libraries such as Matplotlib or
Plotly. Here, I will explain how to create bubble plots using Matplotlib, which is a commonly
used plotting library.
1.Import the necessary libraries:
Python code
Organize your data into three arrays or lists: one for the x-values, one for the y-values,
and one for the bubble sizes.
Python code
plt.scatter(x, y, s=sizes)
Python code
plt.show()
Matplotlib provides additional customization options to enhance the appearance of your bubble
plot. You can use different marker shapes, colors, or color maps to represent additional
variables or categories.
Here's a simple example of creating a bubble plot using Matplotlib:
Python code
import matplotlib.pyplot as plt

# Example data: bubble sizes are in points squared
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 7, 6]
sizes = [50, 200, 500, 100, 300]

plt.scatter(x, y, s=sizes)
plt.title('Bubble Plot')
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.show()
This code will create a bubble plot with the given x-values, y-values, and bubble sizes. Each
data point is represented by a bubble on the plot, and the size of the bubble represents the
corresponding size value. You can modify the code based on your data and requirements to
create more complex bubble plots with additional customization.
Advanced visualization tools such as waffle charts, word clouds, seaborn and
Folium for visualizing geospatial data.
Waffle charts
Waffle charts are a type of visualization that represents proportions or percentages using square
tiles, where each tile represents a specific portion of the whole. Waffle charts can be a visually
appealing and intuitive way to convey information about categorical data.
In Python, you can create waffle charts using the pywaffle library. Here's a step-by-step
explanation of how to create a waffle chart:
1.Install the pywaffle library (e.g., pip install pywaffle).
2.Import the necessary libraries:
Python code
Create a dictionary or a pandas Series that represents the categories and their
corresponding values.
Ensure that the values represent proportions or percentages of the whole.
You can customize the chart by specifying parameters such as the number of rows, columns,
colors, and figure size.
Python code
from pywaffle import Waffle
import matplotlib.pyplot as plt

# Example data
data = {'Category A': 30,
'Category B': 20,
'Category C': 50}

fig = plt.figure(FigureClass=Waffle, rows=5, columns=10, values=data,
colors=['#4C72B0', '#DD8452', '#55A868'])
plt.show()
In this example, the waffle chart will have 5 rows and 10 columns, representing a total of 50
tiles. The values from the data dictionary will determine the number of tiles assigned to each
category, and the colors parameter sets the colors for each category.
You can further customize the chart by adding a title, adjusting the figure size, or modifying
the legend.
Waffle charts can be a great way to visually compare proportions or percentages across
different categories. They offer a unique and engaging visualization that can enhance data
communication and storytelling.
Word clouds
Word clouds are a popular visualization technique used to represent the frequency or
importance of words in a given text corpus. They provide a visual summary of the most
common words in the text, with the size of each word indicating its frequency or importance.
Python offers several libraries to create word clouds, including wordcloud and matplotlib.
Here's a step-by-step explanation of how to create a word cloud using the wordcloud library in
Python:
Python code
pip install wordcloud
Python code
from wordcloud import WordCloud
import matplotlib.pyplot as plt
Clean and preprocess your text data, removing any irrelevant words or characters.
Convert your text data into a string or a list of words.
4.Generate the word cloud:
Python code
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Example data
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

wordcloud = WordCloud(background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
In this example, the WordCloud object generates the word cloud based on the provided text
data. You can further customize the word cloud by specifying parameters such as the
background color, word colors, font size, and maximum number of words displayed.
You can also use additional methods and functions provided by the wordcloud library to
enhance the appearance of the word cloud, such as masking the word cloud to a specific shape
or generating word clouds from word frequency dictionaries.
Word clouds are a visually appealing way to explore and summarize textual data. They can be
useful for tasks such as analyzing customer reviews, identifying important keywords in a
document, or visually representing word frequencies in a corpus.
Seaborn
Seaborn is a powerful data visualization library in Python that is built on top of Matplotlib. It
provides a high-level interface for creating visually appealing and informative statistical
graphics. Seaborn simplifies the process of creating complex visualizations by offering a wide
range of plot types and built-in statistical functionalities.
Seaborn is widely used in the data science community for its ability to create visually appealing
and informative visualizations with minimal code. It complements the functionality of
Matplotlib and provides a higher-level interface for creating complex plots while incorporating
statistical analysis.
Certainly! Here's an example that demonstrates the usage of Seaborn to create a scatter plot:
Python code
import seaborn as sns
import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [3, 5, 2, 6, 1]

sns.scatterplot(x=x, y=y)
plt.title('Scatter Plot')
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.show()
In this example, we import the necessary libraries, including Seaborn and Matplotlib. We
define two lists x and y as our data points. We then use the sns.scatterplot() function from
Seaborn to create a scatter plot, passing in the x and y data. Finally, we customize the plot by
adding a title and axis labels using Matplotlib, and display the plot using plt.show().
Seaborn offers many more functionalities for customizing and enhancing the appearance of the
scatter plot. You can further customize the color, marker style, size, and other visual aspects of
the plot using Seaborn's additional parameters and options. Additionally, Seaborn provides
various statistical functionalities to add regression lines, confidence intervals, or perform
additional data analysis within the scatter plot.
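For intuition about what such a regression line represents: Seaborn's regplot() overlays an ordinary least-squares fit, whose one-variable form can be computed directly. The sketch below shows only the textbook formulas applied to the same x and y lists as the scatter-plot example, not Seaborn's internals:

```python
# Ordinary least-squares fit for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x = [1, 2, 3, 4, 5]
y = [3, 5, 2, 6, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
var_x = sum((xi - mean_x) ** 2 for xi in x)

slope = cov_xy / var_x
intercept = mean_y - slope * mean_x
print(round(slope, 6), round(intercept, 6))  # → -0.3 4.3
```

This is the line that sns.regplot(x=x, y=y) would draw through the same points.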
Folium is a Python library used for visualizing geospatial data on interactive maps. It leverages
the Leaflet.js library, which is a popular JavaScript library for creating interactive maps. Folium
allows you to create maps directly in Python, making it convenient for data analysis and
visualization tasks.
1. Map Creation: Folium provides a simple and intuitive way to create maps by
specifying the initial center location and zoom level. You can choose from various tile
providers, such as OpenStreetMap, Mapbox, and Stamen, to set the base map style.
2. Markers: Folium allows you to add markers to the map to represent specific locations.
You can customize the markers by setting their position, icon, color, and popup
messages. Markers are commonly used to plot points of interest or to represent data
points on the map.
3. Polygons and Polylines: Folium supports the drawing of polygons and polylines on
the map. Polygons are used to highlight areas or create boundaries, while polylines are
used to draw lines between specific points. This functionality is useful for visualizing
regions, routes, or trajectories.
4. Choropleth Maps: Folium provides the capability to create choropleth maps, where
areas are shaded or colored based on a specific attribute or value. Choropleth maps are
commonly used to visualize spatial patterns or thematic data, such as population density
or economic indicators.
5. Heatmaps: Folium allows you to create heatmaps, which visualize the density or
intensity of data points on the map. Heatmaps are useful for identifying hotspots or
areas of high activity based on the concentration of data.
6. Interactive Features: Folium supports interactive features like tooltips and popups.
Tooltips provide additional information when hovering over markers, while popups
display more detailed information when markers are clicked. This interactivity
enhances the user experience and enables further exploration of the geospatial data.
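The density aggregation behind a heatmap (point 5 above) can be sketched without any mapping library: bin point coordinates into grid cells and count the points per cell. The coordinates and cell size below are illustrative assumptions, not Folium's internal algorithm:

```python
from collections import Counter

# Illustrative (latitude, longitude) points; one location appears twice
points = [(28.61, 77.21), (28.61, 77.21), (28.62, 77.20), (40.71, -74.01)]

# Assign each point to a 0.01-degree grid cell
cell_size = 0.01
def cell(lat, lon):
    return (round(lat / cell_size), round(lon / cell_size))

density = Counter(cell(lat, lon) for lat, lon in points)

# The densest cell is the hotspot a heatmap would emphasize
hotspot, count = density.most_common(1)[0]
print(count)  # → 2
```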
Folium provides a versatile and flexible framework for visualizing geospatial data in Python.
It integrates well with popular data analysis libraries like Pandas and NumPy, allowing you to
easily combine geospatial data with other data sources for comprehensive analysis and
visualization.
Here's an example that demonstrates how to use Folium to create a basic map and plot markers
on it:
Python code
import folium

# Create a map centered on an example location (latitude, longitude)
m = folium.Map(location=[28.6139, 77.2090], zoom_start=12)

# Add a marker with a popup message
folium.Marker(location=[28.6139, 77.2090], popup='Example location').add_to(m)

# In a Jupyter notebook, evaluating m displays the map
m
In this example, we start by importing the folium library. We create a Map object by specifying
the initial center location and zoom level. Next, we add markers to the map using the Marker
class, specifying the latitude and longitude coordinates and an optional popup message for each
marker. Finally, evaluating m in a Jupyter notebook displays the map; in a script, the map can
instead be saved with m.save('map.html') and opened in a browser.
Folium provides many options to customize the map, such as setting the tile style, adding
overlays like polygons or lines, and applying different color schemes. It also supports different
tile providers like OpenStreetMap, Mapbox, and Stamen. Additionally, Folium allows you to
incorporate interactive features like tooltips, popups, and click events on the markers or other
map elements.
With Folium, you can easily create interactive maps to visualize geospatial data, plot markers,
draw polygons, and add other geospatial overlays to enhance your visualizations.
To create choropleth maps in Python, you can use various libraries such as GeoPandas, Plotly,
or Matplotlib. Here, I will explain how to create choropleth maps using GeoPandas, which is a
powerful library for working with geospatial data.
1. Import the required libraries:
Python code
import geopandas as gpd
import matplotlib.pyplot as plt
2. Read the shapefile or GeoJSON file containing the geographic boundaries and attribute
data:
Python code
data = gpd.read_file('path/to/shapefile.shp')
Replace 'path/to/shapefile.shp' with the actual file path of your shapefile or GeoJSON
file.
3. Inspect the data:
Use the head() function, e.g. data.head(), to preview the attribute data in the
GeoDataFrame. Verify that the data includes the necessary information for creating the
choropleth map, such as a column with the values to be mapped.
4. Create the choropleth map:
Python code
data.plot(column='column_name', cmap='color_map', linewidth=0.8, edgecolor='0.8',
legend=True)
Replace 'column_name' with the name of the column containing the values to be
mapped. Specify the desired color map ('color_map') to represent the data values.
Matplotlib provides various color maps, such as 'viridis', 'magma', 'coolwarm', etc.
Adjust the linewidth, edge color, and legend properties according to your preferences.
5. Display the map:
Python code
plt.show()
GeoPandas also provides additional functionality for manipulating and analyzing geospatial
data. You can perform spatial joins, overlays, or spatial queries to enhance your analysis and
visualization.
Putting the steps together, a complete example:
Python code
import geopandas as gpd
import matplotlib.pyplot as plt

# Read shapefile
data = gpd.read_file('path/to/shapefile.shp')

# Create choropleth map (here 'population' is an example column name)
data.plot(column='population', cmap='viridis', linewidth=0.8, edgecolor='0.8', legend=True)
plt.show()
-------------THANK-YOU-------------