R & Python notes

The document provides a comprehensive overview of data science using R and Python, covering topics such as basic concepts, data structures, and visualization techniques in both programming languages. It includes detailed units on descriptive analysis, data aggregation, and advanced visualization tools, along with practical examples and code snippets. The content is designed to equip learners with the necessary skills to perform data analysis and visualization effectively.

Data Science using R and Python

Unit 1: Introduction to R

Basic Concept in R, Data Structure, Import of Data. Graphic Concept in R: Graphic System,
Graphic Parameter Settings, Margin Settings for Figures and Graphics, Multiple Charts, More
Complex Assembly and Layout, Font Embedding, Output with cairo pdf, Unicode in figures,
Colour settings, R packages and functions related to visualization.

Unit 2: Descriptive Analysis using R

Computing an overall summary of a variable and an entire data frame, summary() function,
sapply() function, stat.desc() function, Case of missing values, Descriptive statistics by groups,
Simple frequency distribution: one categorical variable, Two-way contingency table: Two
categorical variables, Multiway tables: More than two categorical variables.

Unit 3: Visualization of Data in R

Bar Chart Simple, Bar Chart with Multiple Response Questions, Column Chart with two-line
labeling, Column chart with 45° labeling, Profile Plot, Dot Chart for 3 variables, Pie Chart and
Radial Diagram, Chart Tables, Distributions: Histogram overlay, Box Plots for group,
Pyramids with multiple colors, Pyramid: emphasis on the outer and inner area, Pyramid with
added line, Aggregated Pyramids, Simple Lorenz curve.

Unit 4: Introduction to Python

Jupyter Notebook, Python Functions, Python Types and Sequences, Python More on Strings,
Reading and Writing CSV files, Advanced Python Objects, map(), Numpy, Pandas, Series
Data Structure, Querying a Series, The DataFrame Data Structure, DataFrame Indexing and
Loading, Querying a DataFrame, Indexing Dataframes, Merging Dataframes

Unit 5: Data Aggregation, Processing and Group Operations

Time Series, Date and Time, Data Types and Tools, Time Series Basics, Date Ranges,
Frequencies, and Shifting, Time Zone Handling, Periods and Period Arithmetic, Resampling
and Frequency Conversion, Time Series Plotting, Moving Window Functions, Natural
Language Processing, Image Processing, Machine Learning K Nearest Neighbors Algorithm
for Classification, Clustering

Unit 6: Visualization of Data with Python

Using Matplotlib Create line plots, area plots, histograms, bar charts, pie charts, box plots and
scatter plots and bubble plots. Advanced visualization tools such as waffle charts, word clouds,
seaborn and Folium for visualizing geospatial data. Creating choropleth maps

Unit 1: Introduction to R

Basic Concept in R, Data Structure, Import of Data. Graphic Concept in R:


Graphic System, Graphic Parameter Settings, Margin Settings for Figures and
Graphics, Multiple Charts, More Complex Assembly and Layout, Font
Embedding, Output with cairo pdf, Unicode in figures, Colour settings, R
packages and functions related to visualization.

Basic Concepts of R

Overview
R is a statistical programming language that provides different categories of functionality in
libraries (also called packages). For applying statistical analysis, one often needs sample data.
R ships with many real-life built-in sample datasets that can be used for analysing the statistical
computations and algorithms. To develop these computations, one needs to know regular
programming constructs like variables, data types, operators, loops, etc.

Most of the programming constructs that are available in R are also available in T-SQL. Our
intention is not to learn R in full, but to learn R constructs that enable us to consume the unique
R libraries and data processing / computation mechanisms that are not available in T-SQL. In
this lesson, we will be learning the basic concepts of R, just sufficient for us to apply
R functions and packages against a SQL Server data repository.

R version, packages and datasets


We already learned in the last lesson how we can check the version of R server with which the
database engine is communicating. It’s necessary to know the version of R you are working
with, as that can be considered the basis of what is supported by a particular version of R server.
Using sp_execute_external_script, with a simple R property “R.version”, we can check the
details of the R version as shown below. The print function prints the output on the SSMS
message console. If this command is executed on the R console, it would print the same output
as shown below. In this lesson, our focus is developing the fundamentals of R. We will discuss
the details of sp_execute_external_script in the next lesson. Until then, consider this procedure
as an execution wrapper.

The next step is to explore the different default libraries available in Microsoft R Open server.
You can explore them from here. You can load any given library by using the library function.
We will look at an example of the use of this function very shortly.

After exploring the list of packages available in R, the next step is to explore the list of datasets
that you can use. You can explore as well as download a list of datasets classified by packages
from here.
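As a quick illustration, here is a minimal sketch following the same sp_execute_external_script wrapper pattern used in the examples below (iris is one of R's built-in sample datasets):

--Example: Loading a package and a built-in dataset
execute sp_execute_external_script
@language = N'R',
@script = N'
library(datasets)     # load the built-in datasets package
print(head(iris))     # print the first rows of the iris sample dataset
'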

Variables, Comments and Printing Data


In R, a variable is created by using the assignment operator "<-". The data type of the variable
is determined by the type of data stored in it. Code can be commented in R using the # character.
Let's understand these concepts with an example.

--Example: Variables
execute sp_execute_external_script
@language = N'R',
@script = N'

var1 <- "Siddharth"


Var1 <- "Sid"
var2 <- 100
var3 <- 50.5
var4 <- TRUE

print(var1)
print(Var1)
print(var2 + var3)
print(var4)

print(class(var1))
print(class(var2))
print(class(var4))
Executing the above code prints the variable values and their classes. Below are the points you
can derive from the example:

● Variables can be created using the "<-" (assignment) operator.
● Variables are case-sensitive: Var1 and var1 are considered different variables.
● The data type of a variable is determined by the type of data stored in it.
● You can inspect the value of a variable using the print() function.
● The class() function can be used to determine the data type of a variable, which is
classified into three major types: character, numeric and logical.
● There are other data structure types too, but we will be limiting our discussion to these
three basic types.

Arithmetic, Operators, Loops


The below table shows a list of arithmetic and logical operators in R. It’s not an exhaustive list,
but covers major operators that you may use when you start learning R.

Operator   Description
+          Addition
-          Subtraction
*          Multiplication
/          Division
^          Exponentiation
%%         Modulus
<          Less than
<=         Less than or equal to
>          Greater than
>=         Greater than or equal to
==         Exactly equal to
!=         Not equal to
!          NOT
|          OR
&          AND

Though these operators should be easy to understand, below is a basic example of how you
may use these operators.

Here we have used these operators on actual values. You can use these operators in the same
way on variables too.
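For instance, here is a small sketch applying a few of these operators to values and variables (the numbers are arbitrary):

# Arithmetic operators on values
print(7 + 3)            # 10
print(7 %% 3)           # 1, the remainder of 7 / 3
print(2 ^ 4)            # 16

# Comparison and logical operators on variables
a <- 10
b <- 5
print(a > b)            # TRUE
print(a == b)           # FALSE
print(a > b & b > 0)    # TRUE, both conditions hold
print(!(a < b))         # TRUE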

There is a high possibility that we may have to loop through data when applying statistical
computations, so we need to learn at least one looping technique in R. Below is a simple
example of a while loop. In this example, we assign the value 0 to the variable i, print the
value of i inside the loop, and increment it. We also place a condition that breaks out of the
loop with the "break" statement once the value of i reaches 3.
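A sketch of the loop just described:

i <- 0
while (TRUE) {
  print(i)       # print the current value of i
  i <- i + 1     # increment i
  if (i == 3) {
    break        # exit the loop once i reaches 3
  }
}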

Graphic Concept in R
1. Graphic System
The graphic system in R is a powerful tool for creating high-quality graphics and visualisations.
It is based on the grid graphics system, which allows for the creation of complex graphics by
breaking them down into smaller components.

The grid graphics system is built around two main types of objects: viewports and grobs. A
viewport is a rectangular area of the plotting region that can contain one or more grobs, which
are graphical objects such as lines, text, or shapes.

Viewports can be nested inside each other to create more complex layouts. For example, a
viewport might contain a grid of smaller viewports, each of which contains one or more grobs.

The grid package provides a set of functions for creating and manipulating viewports and grobs.
Some of the key functions include:

grid.newpage(): Creates a new plotting page.


viewport(): Creates a new viewport with a specified size and position.
pushViewport(): Pushes a new viewport onto the viewport stack.
popViewport(): Removes the current viewport from the viewport stack.
grid.rect(), grid.lines(), grid.text(): Functions for creating grobs such as rectangles, lines, and
text.
By combining these functions, you can create complex layouts and graphical elements in R.
For example, you might create a grid of plots, each with its own set of axes and labels, or create
a complex graphic that combines multiple data sources and visualisations.

Overall, the grid graphics system in R provides a flexible and powerful tool for creating a wide
range of graphics and visualisations.

Here is an example of how the grid graphics system can be used to create a simple plot:
# Example program
library(grid)

# Create a new plot


grid.newpage()

# Set up the plot area


vp <- viewport(width=0.8, height=0.8, x=0.5, y=0.5, just=c("centre", "centre"))
pushViewport(vp)

# Draw the plot


grid.rect(gp=gpar(col="black", fill="white"))
grid.lines(x=c(0.2, 0.8), y=c(0.2, 0.8), gp=gpar(col="red", lwd=2))

# Add labels
grid.text("X", x=0.9, y=0.5, gp=gpar(col="black", fontsize=20))
grid.text("Y", x=0.5, y=0.9, gp=gpar(col="black", fontsize=20))

In this example, we first load the grid library, which provides the functions for creating grid-
based graphics. We then create a new plot using the grid.newpage() function.

Next, we set up the plot area using the viewport() function, which specifies the size and position
of the plot. In this case, we set the width and height to 0.8, and position the plot at the centre
of the page.

We then draw a rectangle and a line using the grid.rect() and grid.lines() functions, respectively.
We specify the colour and line width of the line using the gpar() function, which creates a
graphical parameter object.

Finally, we add labels to the x and y axes using the grid.text() function, again specifying the
colour and font size using the gpar() function.

This is just a simple example, but the grid graphics system can be used to create much more
complex and sophisticated graphics in R.

2. Graphic Parameter Settings


R provides a wide range of graphical parameters that can be used to modify the appearance of
graphics. These parameters can be used to change the color, size, shape, and other attributes of
graphical elements such as lines, points, and text.

Graphical parameters are typically set using the par() function, which takes a list of parameter-
value pairs as its argument. For example, the following code sets the line width to 2 and the
line colour to red:

Example code:
par(lwd=2, col="red")

Some of the most commonly used graphical parameters in R include:

col: The colour of lines, points, and text.

lwd: The width of lines.
pch: The symbol used for points (e.g., circles, squares, etc.).
cex: The size of points and text.
font: The style of text (1 = plain, 2 = bold, 3 = italic); the font family itself is set via
the family parameter.

These parameters can be set globally using the par() function, or they can be set on a per-
element basis using functions like points(), lines(), and text(). For example, the following code
sets the colour and size of individual points:

Example code:
x <- 1:10
y <- x^2
plot(x, y)
points(x, y, col="blue", cex=2)

In this example, we first plot a simple line plot using the plot() function. We then add individual points
to the plot using the points() function, specifying the color and size of the points using the col and cex
parameters.
Overall, graphical parameter settings provide a powerful way to modify the appearance of graphics in
R, allowing you to create visually appealing and informative visualisations.

3. Margin Settings for Figures and Graphics


Margin settings in R allow you to control the spacing around the edges of a plot or graphic.
This can be useful for ensuring that your plot or graphic is properly centred and aligned, or for
creating space to add additional elements such as a title or legend.

The margin settings in R are controlled by four graphical parameters: mar, mai, oma, and
mgp. Each of these parameters controls a different aspect of the plot margins:

mar: This parameter controls the size of the figure margins, in lines of text, in the order
bottom, left, top, right. The default value is c(5, 4, 4, 2) + 0.1, which means the bottom
margin is about 5 lines tall, the left and top margins about 4 lines, and the right margin
about 2 lines.

mai: This parameter controls the same figure margins expressed in inches rather than lines,
again in the order bottom, left, top, right; setting mai updates mar and vice versa.

oma: This parameter controls the size of the outer margins of the entire graphic device, in lines
of text. The default value is c(0, 0, 0, 0), which means that there is no outer margin.

mgp: This parameter controls the margin lines on which the axis elements are drawn. The
default value is c(3, 1, 0), which means that the axis title is drawn on margin line 3, the
axis labels on line 1, and the axis line itself on line 0.

To change the margin settings of a plot or graphic in R, you can modify these parameters using
the par() function. For example, to increase the size of the outer margins of a plot, you could
use the following code:

example code
par(mar=c(8, 6, 6, 4))

This code sets the bottom margin to 8 lines, the left margin to 6 lines, the top margin to 6
lines, and the right margin to 4 lines. Similarly, to set the margins in inches rather than
lines, you could use the following code:

example code
par(mai=c(1, 1, 1, 1))

This code sets the figure margins to 1 inch on all sides. By adjusting these margin settings, you
can control the spacing around your plot or graphic and ensure that it looks great in your final
output.

4. Multiple Charts
In R, you can create multiple charts or plots within a single graphic device by using functions
like par() and layout().

One way to create multiple charts within a single graphic device is to use the par() function to
set the layout of the plots. The par() function can be used to specify the number of rows and
columns of plots, as well as the size and spacing of each plot. For example, the following code
creates a graphic device with two plots arranged in a 1x2 grid:

Example code
par(mfrow=c(1, 2))
plot(1:10, rnorm(10))
plot(rnorm(100))

In this code, we first use the par() function to set the number of rows and columns to 1x2, which
creates a grid with one row and two columns. We then create two plots using the plot() function,
and they are automatically arranged in the grid.

Another way to create multiple plots is to use the layout() function. The layout() function allows
you to specify a grid of plots using a matrix of numbers, where each number represents the size
of the corresponding plot. For example, the following code creates a graphic device with two
plots arranged in a 2x1 grid:

example code
layout(matrix(c(1,2), nrow=2))
plot(1:10, rnorm(10))
plot(rnorm(100))

In this code, we use the layout() function to specify a grid with two rows and one column, and
then create two plots using the plot() function. The first plot is assigned to the first cell in the
grid, and the second plot is assigned to the second cell.

Overall, creating multiple charts or plots within a single graphic device in R is a useful way to
compare data or visualisations side-by-side, and can help to create more informative and
visually appealing visualisations.

5. More Complex Assembly and Layout


When creating complex visualisations in R, it can be helpful to have more fine-grained control
over the layout and placement of multiple charts or plots. R provides several tools for achieving
this, including the gridExtra package and the ggplot2 package.

The gridExtra package provides a set of functions for arranging multiple charts or plots in a
grid-like layout. For example, the grid.arrange() function can be used to arrange multiple charts
or plots in a grid, while the arrangeGrob() function can be used to arrange multiple graphical
objects (such as plots, text, and images) in a grid. These functions allow you to control the
spacing and alignment of the grid cells, as well as the overall size and aspect ratio of the grid.

The ggplot2 package also provides a powerful system for creating complex visualisations in R.
In ggplot2, you can create plots using a layered grammar of graphics, where each layer
represents a different component of the plot (such as the data, the aesthetics, and the geometry).
This allows you to easily add and customise different components of the plot, such as adding
multiple layers of data or adjusting the layout and spacing of the plot elements.

For example, the following code creates a complex visualisation using ggplot2 to display
multiple layers of data on a single plot:

example code
library(ggplot2)
library(dplyr)
data(mpg)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(method = "lm") +
facet_wrap(~manufacturer, nrow = 3) +
theme(legend.position = "none")

In this code, we first load the “ggplot2” and “dplyr” libraries, and then load the mpg dataset
that comes with “ggplot2”. We then use the “ggplot()” function to create a base plot with the
“displ” and hwy variables mapped to the x and y axes, respectively. We then add multiple
layers to the plot, including a point layer with color mapped to the ‘class’ variable, a smoothed
line layer using a linear regression method, and a faceted layer that displays separate panels for
each manufacturer in the dataset. Finally, we use the ‘theme()’ function to remove the legend
from the plot.

Overall, by using tools like ‘gridExtra’ and ‘ggplot2’, you can create more complex and
sophisticated visualisations in R that can help to better communicate your data and insights.

6. Font Embedding
When creating visualisations in R, it is sometimes necessary to use custom fonts to achieve a
desired style or aesthetic. However, when sharing your visualisations with others, it is
important to ensure that the fonts are embedded in the graphics to ensure that they are displayed
correctly on other systems.

R provides several options for embedding fonts in graphics, including using the extrafont
package or the Cairo graphics device. The extrafont package allows you to install and use
custom fonts in R, and also provides functions for embedding fonts in graphics. The Cairo
graphics device, on the other hand, provides a high-quality graphics output that can be used to
embed fonts in graphics in a platform-independent way.

For example, the following code demonstrates how to use the Cairo graphics device to create
a plot with a custom font and embed the font in the output:
example code
library(ggplot2)
library(extrafont)
library(Cairo)

# Import the custom font into R's font database (a one-time step).
# font_import() scans directories for font files; this sketch assumes
# myfont.ttf sits in the current working directory.
font_import(paths = ".", pattern = "myfont", prompt = FALSE)
loadfonts()

# Set up plot with the custom font; the family name must match the name
# the font registers under, assumed here to be "MyFont"
p <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  ggtitle("My Plot") +
  theme(plot.title = element_text(family = "MyFont", size = 20))

# Create output with the Cairo graphics device (width and height in inches)
CairoPDF("myplot.pdf", width = 6, height = 4)
print(p)
dev.off()

In this code, we first load the ggplot2, extrafont and Cairo libraries, and then use the
font_import() function to import the custom font file (myfont.ttf) into R. We then use the
loadfonts() function to register the font with R's graphics devices.

Next, we create a ggplot2 plot (p) that uses the custom font in the plot title by specifying the
font family as "MyFont". We then use the CairoPDF() function to create a PDF output file
(myplot.pdf) with the Cairo graphics device, passing the width and height of the output in
inches.

Finally, we use the print() function to print the plot to the Cairo graphics device, and then use
the dev.off() function to close the graphics device and save the output to the file.

Overall, embedding fonts in R graphics can help to ensure that your visualizations are displayed
correctly on other systems and can help to maintain a consistent style or aesthetic across your
work.

7. Output with cairo pdf


R provides various graphics devices for creating graphical output, including pdf, png, jpeg, and
more. However, one of the most popular graphics devices is Cairo, which provides high-
quality, anti-aliased graphics output that is platform-independent.

To create a PDF output with the Cairo graphics device, you can use the CairoPDF() function,
which takes several arguments including the output file name and the width and height of the
output in inches (PDF is a vector format, so no raster resolution needs to be set). Here's an
example:

example code
library(ggplot2)
library(Cairo)

# Create plot
p <- ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_point() +
ggtitle("My Plot")

# Output plot to PDF with Cairo


CairoPDF("myplot.pdf", width = 6, height = 4)
print(p)
dev.off()

This code creates a simple scatter plot using ggplot2, and then outputs the plot to a PDF file
(myplot.pdf) using the CairoPDF() function. The output file will be 6 inches wide by 4 inches
tall.

8. Unicode in figures
Unicode is a standard for encoding characters and symbols from various writing systems,
including Latin, Cyrillic, Arabic, Chinese, and more. In R, Unicode characters can be used in
graphics to add symbols or text in various languages.

To use Unicode in R graphics, you can use the expression() function or the bquote() function
to create expressions that include Unicode characters. Here's an example:

example code
library(ggplot2)

# Create plot with Greek symbols in the title using plotmath

p <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  ggtitle(expression(paste("My Unicode Plot: ", mu, " = 2", sigma^2)))

# Display plot
print(p)

In this code, we create a scatter plot using ggplot2, and then add a plot title that includes
Greek symbols via the expression() function and R's plotmath syntax. The title includes the
Greek letter "mu", an equals sign, and the Greek letter "sigma" with a superscript "2". When
the plot is displayed, the symbols are rendered correctly. Unicode escape sequences such as
"\u03C3" can also be embedded directly in ordinary label strings.

9. Colour settings
Colour settings are an important aspect of creating effective and visually appealing
visualisations in R. R provides a wide range of built-in colour palettes, as well as functions for
creating custom colour palettes.

For example, the ggplot2 package provides several built-in colour palettes that can be used to
colourize plots, such as scale_color_brewer() and scale_color_gradient(). Here's an example of
how to use the scale_color_brewer() function to colourize a scatter plot:

Example code
library(ggplot2)

# Create plot with color scale


p <- ggplot(mtcars, aes(x = mpg, y = wt, color = factor(cyl))) +
geom_point() +
ggtitle("My Colored Plot") +
scale_color_brewer(palette = "Set1")

# Display plot
print(p)

In this code, we create a scatter plot using ggplot2 and colour the points by the number of
cylinders using the colour aesthetic. We then use the scale_color_brewer() function to apply a
colour palette from the ColorBrewer library to the plot. The palette argument specifies which
ColorBrewer palette to use (here, "Set1").
R packages and functions related to visualisation


R provides a wide range of packages and functions for data visualisation. Here are some
popular packages and functions:

ggplot2: ggplot2 is a widely used package for creating elegant and customizable data
visualisations. It uses a grammar of graphics approach, which allows you to specify the
components of a plot separately (e.g., data, aesthetics, geometric objects, and statistical
transformations) and then combine them into a final plot.

lattice: lattice is another popular package for creating data visualisations, particularly for
multivariate data. It provides a range of high-level plotting functions for creating trellis plots,
which display multiple panels of data arranged in a grid.
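For instance, a minimal lattice sketch using the built-in mtcars dataset:

library(lattice)

# One scatter plot panel of mpg against wt per number of cylinders
xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       layout = c(3, 1),    # three panels arranged in one row
       xlab = "Weight", ylab = "Miles per gallon")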

plotly: plotly is an interactive visualisation library that allows you to create interactive web-
based plots in R. It provides a wide range of chart types, including scatter plots, line charts, bar
charts, heatmaps, and more.

ggvis: ggvis is a package for creating interactive visualisations using ggplot2 syntax. It uses
reactive programming to enable linked brushing and filtering, which allows you to dynamically
update visualisations based on user input.

leaflet: leaflet is a package for creating interactive maps in R. It provides a wide range of
options for customising maps, including base maps, markers, pop ups, and overlays.

dygraphs: dygraphs is a package for creating interactive time series plots in R. It provides a
range of options for customising time series plots, including zooming, panning, and
highlighting.

cowplot: cowplot is a package for creating complex plots by combining multiple plots together
into a single figure. It provides functions for arranging and annotating plots, as well as for
customising the appearance of the final figure.

viridis: viridis is a package for creating visually appealing color maps in R. It provides a range
of colour maps that are designed to be perceptually uniform and easy to interpret, as well as
functions for customising the appearance of colour maps.

patchwork: patchwork is a package for creating complex plots by combining multiple plots
together into a single figure. It provides a flexible grammar for arranging and annotating plots,
as well as for customising the appearance of the final figure.

gganimate: gganimate is a package for creating animated plots using ggplot2 syntax. It
provides a range of options for customising the appearance and behaviour of animated plots,
as well as for controlling the animation speed and direction.

Unit 2: Descriptive Analysis using R

Computing an overall summary of a variable and an entire data frame, summary()


function, sapply() function, stat.desc() function, Case of missing values,
Descriptive statistics by groups, Simple frequency distribution: one categorical
variable, Two-way contingency table: Two categorical variables, Multiway
tables: More than two categorical variables.

Overall summary of a variable and an entire data frame

This article explains how to compute the main descriptive statistics in R and how to present
them graphically. To learn more about the reasoning behind each descriptive statistics, how
to compute them by hand and how to interpret them, read the article “Descriptive statistics
by hand”.

To briefly recap what has been said in that article, descriptive statistics (in the broad sense
of the term) is a branch of statistics aiming at summarising, describing and presenting a
series of values or a dataset. Descriptive statistics is often the first step and an important
part in any statistical analysis. It allows us to check the quality of the data and it helps to
“understand” the data by having a clear overview of it. If well presented, descriptive
statistics is already a good starting point for further analyses. There exist many measures
to summarise a dataset. They are divided into two types:

1. location measures and


2. dispersion measures

Location measures give an understanding about the central tendency of the data, whereas
dispersion measures give an understanding about the spread of the data. In this article, we
focus only on the implementation in R of the most common descriptive statistics and their
visualizations (when deemed appropriate).

Description                             R function
Mean                                    mean()
Standard deviation                      sd()
Variance                                var()
Minimum                                 min()
Maximum                                 max()
Median                                  median()
Range of values (minimum and maximum)   range()
Sample quantiles                        quantile()
Generic function                        summary()
Interquartile range                     IQR()
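For example, applied to an arbitrary numeric vector:

x <- c(4, 8, 15, 16, 23, 42)

mean(x)        # arithmetic mean
sd(x)          # standard deviation
range(x)       # minimum and maximum
quantile(x)    # sample quantiles (0%, 25%, 50%, 75%, 100%)
IQR(x)         # interquartile range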

summary() function

The summary() function in R is a powerful tool for generating descriptive statistics of a given
object, such as a vector, data frame, or statistical model. It provides a concise summary of the
central tendency, dispersion, and distribution of the data. Here's an overview of how the
summary() function works:

Syntax:
summary(object)

Parameters:

object: The R object for which you want to generate the summary statistics, such as a vector,
data frame, or statistical model.

Usage:

The summary() function automatically detects the type of the object and provides an
appropriate summary based on its class. Here are a few examples:

Summary of a numeric vector:

# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
summary(numeric_vector)

Output:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00    2.00    3.00    3.00    4.00    5.00

Summary of a data frame:

# Data frame
data_frame <- data.frame(Age = c(25, 30, 35, 40, 45),
Height = c(165, 170, 175, 180, 185),
Weight = c(60, 65, 70, 75, 80))
summary(data_frame)

Output:

     Age           Height         Weight
Min.   :25.0   Min.   :165   Min.   :60.0
1st Qu.:30.0   1st Qu.:170   1st Qu.:65.0
Median :35.0   Median :175   Median :70.0
Mean   :35.0   Mean   :175   Mean   :70.0
3rd Qu.:40.0   3rd Qu.:180   3rd Qu.:75.0
Max.   :45.0   Max.   :185   Max.   :80.0

sapply() function

The sapply() function in R is used to apply a given function to each element of a vector or list
and returns a simplified version of the result. It is a convenient way to apply functions to
multiple elements simultaneously and obtain the output in a compact format.

Syntax:

sapply(X, FUN, ...)

Parameters:

X: A vector or list.

FUN: The function to be applied to each element of X.

...: Additional arguments to be passed to the function FUN.

Usage:

Here's an example of using sapply() to apply the mean() function to each column of a data
frame:

R codes

# Data frame
data <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))
# Applying mean() function to each column
result <- sapply(data, mean)
# Output
result

Output:

A B C
2 5 8

In this example, the mean() function is applied to each column of the data data frame using
sapply(). The result is a vector containing the means of each column.

stat.desc() function:

The stat.desc() function is not a built-in function in R; it is provided by the "pastecs"
package (the "psych" package offers a similar describe() function). It provides a
comprehensive summary of descriptive statistics for a numeric vector or a data frame,
including measures such as mean, median, standard deviation, minimum, maximum,
skewness, and kurtosis.

Syntax:

stat.desc(x, basic = TRUE, desc = TRUE, norm = FALSE, p = 0.95)

Parameters:

x: A numeric vector or a data frame.


basic: Logical value indicating whether to include basic descriptive statistics (mean, median,
etc.). Default is TRUE.

desc: Logical value indicating whether to include additional descriptive statistics (variance,
skewness, etc.). Default is TRUE.
norm: Logical value indicating whether to include tests of normality. Default is FALSE.
p: Confidence level for the confidence intervals. Default is 0.95.

Usage:

Here's an example using the stat.desc() function from the "pastecs" package to compute
descriptive statistics for a numeric vector:

R codes

# Install and load the "pastecs" package

install.packages("pastecs")
library(pastecs)
# Numeric vector
vector <- c(1, 2, 3, 4, 5)
# Compute descriptive statistics
result <- stat.desc(vector)
# Output
result

Output:

     nbr.val     nbr.null       nbr.na          min          max        range          sum
        5.00         0.00         0.00         1.00         5.00         4.00        15.00
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
        3.00         3.00         0.71         1.96         2.50         1.58         0.53

(Values rounded to two decimals; skewness and kurtosis are reported only when norm = TRUE.)

Case of missing values


In data analysis, missing values refer to the absence of data for certain observations or
variables. Dealing with missing values is crucial as they can impact the accuracy and reliability
of statistical analyses and modeling. Here are some important concepts and techniques for
handling missing values in R:

1. Types of missingness:
● Missing Completely at Random (MCAR): The missingness is unrelated to any other
variables or the observed data. It is a random occurrence.
● Missing at Random (MAR): The missingness is related to other observed variables but
not to the missing values themselves.
● Missing Not at Random (MNAR): The missingness is related to the missing values
themselves, which are typically related to unobserved factors or reasons.
2. Identifying missing values:

● The is.na() function in R can be used to identify missing values. It returns a logical
vector with TRUE for missing values and FALSE for non-missing values.
● Other functions like complete.cases() and anyNA() can also be used to check for
missing values in vectors or data frames.

3. Handling missing values:

● Removing missing values: If missing values are minimal and do not significantly affect
the analysis, you can remove the observations or variables with missing values using
functions like na.omit() or subsetting techniques.
● Imputing missing values: Imputation involves estimating missing values based on
observed data. Common imputation methods include mean imputation, median
imputation, hot-deck imputation, regression imputation, and multiple imputation.
● Creating a missing indicator: Instead of imputing or removing missing values, you can
create a binary indicator variable that denotes the presence or absence of missing values
for each observation or variable.

4. Package and function options for missing value handling:

● R packages like "mice," "Amelia," and "missForest" provide comprehensive tools for
imputing missing values using various algorithms.
● Many R functions have built-in options to handle missing values. For example, the
na.rm argument in functions like mean(), median(), and sum() can be set to TRUE to
exclude missing values from calculations.
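A short sketch tying these pieces together (the vector is arbitrary):

# A numeric vector with one missing value
x <- c(2, 4, NA, 8)

is.na(x)                 # FALSE FALSE TRUE FALSE: flags the missing entry
anyNA(x)                 # TRUE: at least one value is missing

mean(x)                  # NA: missing values propagate by default
mean(x, na.rm = TRUE)    # 4.67: exclude the NA from the calculation

na.omit(x)               # 2 4 8: drop the missing values entirely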

5. Sensitivity analysis:
● It is important to consider the potential impact of missing values on the validity of the
analysis. Conducting sensitivity analyses by comparing results with and without
imputation can help assess the robustness of the findings.
Handling missing values requires careful consideration and should be based on the specific
dataset and research question. It is important to understand the nature of missingness, select
appropriate techniques, and interpret the results appropriately to ensure the accuracy and
validity of data analyses in R.

Descriptive statistics by groups


When conducting data analysis in R, it is often necessary to compute descriptive statistics for
different groups within your data. This allows you to summarize and compare the
characteristics and distribution of variables across different categories or levels of a grouping
variable. Here are some important concepts and techniques for computing descriptive statistics
by groups in R:

1. Grouping variable:
A grouping variable is a categorical variable that defines the groups for which you want to
compute descriptive statistics. It divides the data into distinct subsets based on its different
levels or categories.

2. Split-Apply-Combine strategy:
The most common approach to compute descriptive statistics by groups in R is the split-apply-
combine strategy. It involves splitting the data into groups, applying a summary function to
each group, and then combining the results.

3. Base R functions:
● aggregate(): This function allows you to compute summary statistics by group using a
formula syntax. It takes a formula specifying the variable(s) to summarise and the
grouping variable(s), along with the summary function to apply.

● by(): The by() function applies a function to subsets of a data frame split by a factor
variable. It takes the data frame, the grouping variable(s), and the function to apply as
arguments.
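For example, a minimal sketch of both functions on a toy data frame:

# Toy data: a score measured across two groups
df <- data.frame(group = c("A", "A", "B", "B", "B"),
                 score = c(10, 14, 7, 9, 11))

# Mean score per group via the formula interface
aggregate(score ~ group, data = df, FUN = mean)

# Summary of each group's scores
by(df$score, df$group, summary)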

4. dplyr package:
The dplyr package provides a set of functions for data manipulation and transformation. It
offers an intuitive syntax for computing descriptive statistics by groups. Key functions include:
● group_by(): This function groups the data by one or more variables.
● summarise(): It applies summary functions to the grouped data to compute descriptive
statistics.
● %>% (pipe operator): It allows you to chain operations together, making the code
more readable and concise.
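A matching dplyr sketch (reusing the toy df from above, assuming dplyr is installed):

library(dplyr)

df %>%
  group_by(group) %>%                   # split the data by group
  summarise(mean_score = mean(score),   # summary statistics per group
            sd_score = sd(score),
            n = n())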

5. summaryBy function (in the doBy package):


The doBy package provides the summaryBy() function, which allows you to compute
descriptive statistics by groups using a formula syntax. It takes a formula specifying the
variable(s) to summarize and the grouping variable(s), along with the summary function(s) to
apply.

These approaches enable you to compute various descriptive statistics, such as means, medians,
standard deviations, quantiles, counts, or proportions, for different groups within your data. By
examining these statistics, you can gain insights into the distribution and characteristics of
variables across different groups, facilitating comparisons and further analysis.

Simple frequency distribution: one categorical variable


A simple frequency distribution in R refers to the process of calculating the counts or
frequencies of each category in a categorical variable. It helps to understand the distribution
and occurrence of different categories within the variable. Here's an overview of the steps
involved in creating a simple frequency distribution for a categorical variable in R:

1. Categorical Variable:
● A categorical variable is a variable that represents different categories or groups.
● It can be represented as a character, factor, or integer variable in R.

2. Creating the Frequency Distribution:

● The table() function in R is commonly used to create a frequency distribution for a


categorical variable.
● The table() function takes a vector or data frame column as input and returns a table
object with the frequencies of each category.
Example:
R code
# Create a vector with categorical data
categories <- c("A", "B", "B", "C", "A", "A", "B", "C", "C", "C")

# Compute the frequency distribution using table()


frequency <- table(categories)

# Print the frequency distribution


print(frequency)

Interpreting the Frequency Distribution:


● The resulting frequency distribution table displays the unique categories as the table
rows and the corresponding frequency counts in the table cells.
● You can analyse the distribution and frequencies to understand the prevalence and
occurrence of each category.
Example Output:
categories
A B C
3 3 4

3. Visualising the Frequency Distribution:


 Visual representations, such as bar plots or pie charts, can provide a clearer view of
the frequency distribution.
 In R, you can use functions like barplot() or pie() to create visualizations based on
the frequency distribution table.

Example Bar Plot:

R code
# Create a bar plot of the frequency distribution
barplot(frequency)

The bar plot will display bars of different heights corresponding to the frequency counts of
each category.

By creating a simple frequency distribution, you gain insights into the distribution and relative
frequencies of categories within a categorical variable, which can be helpful for exploratory
data analysis and understanding the characteristics of the data.

Two-way contingency table: Two categorical variables


A two-way contingency table, also known as a cross-tabulation or a two-dimensional frequency
table, is used to analyze the relationship between two categorical variables. It shows the joint
distribution of the categories of both variables. Here's an overview of the theory and steps
involved in creating a two-way contingency table in R:

Categorical Variables:
● A two-way contingency table involves two categorical variables.
● Each variable consists of categories or levels that represent different groups or
attributes.

Creating the Contingency Table:


The table() function in R is commonly used to create a contingency table for two categorical
variables.
You provide the two categorical variables as arguments to the table() function.
Example:
# Create two vectors with categorical data
variable1 <- c("A", "B", "B", "C", "A", "A", "B", "C", "C", "C")
variable2 <- c("X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "X")

# Compute the contingency table using table()


contingency <- table(variable1, variable2)

# Print the contingency table
print(contingency)

Interpreting the Contingency Table:


● The resulting contingency table displays the joint frequencies or counts of each
combination of categories from both variables.
● The rows of the table represent the categories of the first variable, while the columns
represent the categories of the second variable.
Example Output:
         variable2
variable1 X Y
        A 2 1
        B 2 1
        C 2 2

Analyzing the Relationship:


You can examine the frequencies in the contingency table to understand the relationship
between the two categorical variables.
Look for patterns, associations, or dependencies between the categories of the variables.

Visualizing the Contingency Table:


● Visualisations like stacked bar plots or heatmaps can provide a graphical representation
of the contingency table.
● R packages like ggplot2 or heatmap() can be used to create visualisations based on the
contingency table.
Example Stacked Bar Plot:
# Create a stacked bar plot of the contingency table
barplot(contingency, legend = TRUE)

The stacked bar plot will display one bar for each category of the second variable (the table
columns), with the bar height showing the total frequency and each bar subdivided into
segments for the categories of the first variable. Set beside = TRUE for grouped bars instead.

By creating a two-way contingency table, you gain insights into the relationship and association
between two categorical variables, enabling further analysis and understanding of the data.

Multiway tables: More than two categorical variables


A multiway table, also known as a multi-dimensional contingency table or a cross-tabulation
involving more than two categorical variables, allows for the analysis of relationships among
multiple categorical variables simultaneously. It provides a comprehensive view of the joint
distribution of categories across all variables. Here's an overview of the theory and steps
involved in creating a multiway table in R:

Categorical Variables:

● A multiway table involves three or more categorical variables.


● Each variable consists of categories or levels representing different groups or attributes.

Creating the Multiway Table:

● The table() function in R can be extended to create multiway tables by including


additional categorical variables as arguments.
● You provide the multiple categorical variables separated by commas within the table()
function.

Example:
# Create three vectors with categorical data
variable1 <- c("A", "B", "B", "C", "A", "A", "B", "C", "C", "C")
variable2 <- c("X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "X")
variable3 <- c("P", "Q", "Q", "P", "R", "R", "R", "Q", "P", "Q")

# Compute the multiway table using table()


multiway_table <- table(variable1, variable2, variable3)

# Print the multiway table


print(multiway_table)

Interpreting the Multiway Table:
The resulting multiway table displays the joint frequencies or counts for all combinations of
categories across all variables.
The dimensions of the table correspond to the categories of each variable.
Example Output:
, , variable3 = P

         variable2
variable1 X Y
        A 1 0
        B 0 0
        C 1 1

, , variable3 = Q

         variable2
variable1 X Y
        A 0 0
        B 1 1
        C 1 1

, , variable3 = R

         variable2
variable1 X Y
        A 1 1
        B 1 0
        C 0 0

Analysing the Relationship:


Examine the frequencies in the multiway table to understand the relationship and dependencies
among the categorical variables.
Identify patterns, associations, or dependencies across the categories of all variables.

Visualising the Multiway Table:
Visualisations for multiway tables can become complex due to the involvement of multiple
variables.
Techniques such as mosaic plots, heatmaps, or stacked bar plots can be used to visualize the
joint distribution across multiple categorical variables.

Example Mosaic Plot:


# Load the vcd package for mosaic plot
install.packages("vcd")
library(vcd)

# Create a mosaic plot of the multiway table


mosaic(multiway_table)

The mosaic plot represents the joint distribution of categories across multiple variables using
areas proportional to the frequencies.

By creating a multiway table, you gain a comprehensive understanding of the joint distribution
and relationships among multiple categorical variables. It enables you to explore and analyse
complex associations within your data.

Unit 3: Visualisation of Data in R

Bar Chart Simple, Bar Chart with Multiple Response Questions, Column Chart
with two-line labeling, Column chart with 45° labeling, Profile Plot, Dot Chart
for 3 variables, Pie Chart and Radial Diagram, Chart Tables, Distributions:
Histogram overlay, Box Plots for group, Pyramids with multiple colors, Pyramid:
emphasis on the outer and inner area, Pyramid with added line, Aggregated
Pyramids, Simple Lorenz curve

Bar Chart Simple


To create a simple bar chart in R, you can use the barplot() function. A bar chart is a common
type of plot used to visualize categorical data by representing the frequencies or values of
different categories using bars. Here's an example of how to create a basic bar chart in R:

R- code
# Create a vector with categorical data
categories <- c("A", "B", "C", "D", "E")

# Create a vector with corresponding frequencies or values


frequencies <- c(10, 15, 7, 12, 9)

# Create a basic bar chart using barplot()


barplot(frequencies, names.arg = categories, xlab = "Categories", ylab = "Frequencies")

In the code above, we first define a vector categories containing the different categories, and a
vector frequencies containing the corresponding frequencies or values for each category. We
then use the barplot() function to create the bar chart. The frequencies vector is passed as the
first argument, and the names.arg parameter is used to label the x-axis with the categories.

You can further customize the bar chart by modifying additional parameters. For example, you
can set the title of the chart using the main parameter, adjust the colors of the bars using the col
parameter, add grid lines using the grid parameter, and more. Refer to the documentation of
the barplot() function for a full list of available customization options.

By creating a simple bar chart, you can visually compare the frequencies or values of different
categories, making it easier to understand the distribution and relative magnitudes of your data.

Bar Chart with Multiple Response Questions


To create a bar chart with multiple response questions in R, you can count how many
respondents selected each option and plot those counts with the barplot() function. Here's an
example:

R - code
# Create a data frame with the responses
data <- data.frame(
Respondent = c(1:10),
Option1 = c(1, 1, 0, 1, 0, 1, 0, 0, 1, 1),
Option2 = c(1, 0, 1, 0, 1, 1, 0, 1, 0, 1),
Option3 = c(0, 1, 1, 0, 1, 0, 1, 1, 0, 0)
)

# Compute the frequencies of each option


option_frequencies <- colSums(data[, -1])

# Create a stacked bar chart using barplot()


barplot(option_frequencies, beside = TRUE, col = rainbow(ncol(data) - 1),
names.arg = colnames(data)[-1], xlab = "Options", ylab = "Frequency")

In this example, we have a dataframe data with the respondent ID in the first column
(Respondent) and subsequent columns (Option1, Option2, Option3) representing the different
options. Each cell indicates whether the respondent selected that option (1) or not (0).

We use the colSums() function to calculate the frequencies of each option. By applying it to
data[, -1], we exclude the respondent ID column from the calculation. The result is stored in
the option_frequencies vector.

Next, we create a bar chart using the barplot() function. Because option_frequencies is a
single vector, each option is drawn as its own bar (for a true stacked chart you would pass a
matrix and set beside = FALSE). The col parameter assigns colors to the bars, and names.arg
labels the x-axis with the option names. The xlab and ylab parameters set the labels for the
x-axis and y-axis, respectively.

The resulting bar chart displays one bar per option, with the bar heights showing how many
respondents selected each option. The different colors differentiate the options, allowing for
easy comparison of frequencies across the categories.

Column Chart with two- line labeling


To create a column chart with two-line labeling in R, you can use the barplot() function along
with custom axis labels. Here's an example that demonstrates how to create a column chart
with two-line labeling:

R- code
# Create a vector with categories
categories <- c("Category 1", "Category 2", "Category 3")

# Create a vector with corresponding values

values <- c(10, 15, 7)

# Create a column chart, suppressing the default labels;
# barplot() returns the x positions of the bar midpoints

mids <- barplot(values, names.arg = rep("", length(categories)),
                xlab = "Categories", ylab = "Values",
                main = "Column Chart with Two-Line Labeling")

# Customize the axis labels with two lines ("\n" starts the second line)

axis(1, at = mids, tick = FALSE, padj = 0.5,
     labels = c("Category 1\n(Line 2)", "Category 2\n(Line 2)", "Category 3\n(Line 2)"))

In this example, we define a vector categories representing the different categories and a vector
values containing the corresponding values for each category.

We create a column chart using the barplot() function, suppressing the default single-line
labels and capturing the bar midpoints that barplot() returns. The xlab and ylab parameters set
the labels for the x-axis and y-axis, respectively, and the main parameter sets the title of the
chart.

To add two-line labeling on the x-axis, we use the axis() function. The at parameter places the
labels at the bar midpoints, and labels supplies one label per bar, each containing a line break
(\n) that splits the label into two lines. Setting tick = FALSE suppresses the tick marks, and
padj nudges the labels downward so that both lines remain visible.

By customizing the axis labels with two lines, you can provide more detailed information or
add line breaks to improve the readability of the labeling in your column chart.

Column chart with 45° labeling


To create a column chart with 45-degree labeling in R, you can suppress the default axis labels
and draw rotated labels with the text() function and its srt (string rotation) parameter. Here's
an example:

R-code
# Create a vector with categories
categories <- c("Category 1", "Category 2", "Category 3")

# Create a vector with corresponding values

values <- c(10, 15, 7)

# Create a column chart without default x-axis labels;
# barplot() returns the x positions of the bar midpoints

mids <- barplot(values, ylab = "Values",
                main = "Column Chart with 45-Degree Labeling")

# Draw the category labels rotated by 45 degrees below the axis

text(x = mids, y = par("usr")[3] - 0.5, labels = categories,
     srt = 45, adj = 1, xpd = TRUE, cex = 0.8)

In this example, we define a vector categories representing the different categories and a vector
values containing the corresponding values for each category.

We create a column chart using the barplot() function and capture the bar midpoints that it
returns. The ylab parameter sets the label for the y-axis, and the main parameter sets the title
of the chart.

To add 45-degree labeling on the x-axis, we use the text() function. The x argument places each
label at a bar midpoint, y positions it just below the plot region, and srt = 45 rotates the text
by 45 degrees. Setting adj = 1 right-aligns each label against its bar, xpd = TRUE allows drawing
in the margin, and cex adjusts the size of the labels.

By rotating the axis labels by 45 degrees, you can fit longer category names or improve the
readability of the labeling in your column chart. Adjust the cex parameter to control the size
of the labels according to your preferences.

Profile Plot
A profile plot is a visualization technique used to display the change in a continuous variable
across different levels of one or more categorical variables. In R, you can create profile plots
using various packages, such as ggplot2 or lattice.

Here's an example of how to create a profile plot using the ggplot2 package in R:

R - code
# Load the ggplot2 package
library(ggplot2)

# Create a sample dataset


data <- data.frame(
  Category = rep(c("A", "B", "C"), each = 3),
  # Level is a factor so the x-axis keeps the Low -> Medium -> High order
  Level = factor(rep(c("Low", "Medium", "High"), times = 3),
                 levels = c("Low", "Medium", "High")),
  Value = c(10, 15, 12, 8, 11, 7, 9, 13, 16)
)
# Create a profile plot using ggplot2
ggplot(data, aes(x = Level, y = Value, group = Category, color = Category)) +
geom_line() +
geom_point() +
labs(x = "Level", y = "Value", title = "Profile Plot") +
theme_minimal()

In this example, we have a dataset data with three variables: Category, Level, and Value. The
Category variable represents the different groups or categories, the Level variable represents
the levels within each category, and the Value variable represents the continuous variable of
interest.

We use the ggplot() function to create a plot object and specify the dataset. The aes() function
defines the aesthetic mappings, where we map Level to the x-axis (x), Value to the y-axis (y),
Category to the grouping (group), and Category to the color aesthetic.

We then add geom_line() and geom_point() layers to draw the lines and points for each
category. The labs() function is used to set the labels for the x-axis, y-axis, and plot title.
Finally, we apply the theme_minimal() theme to style the plot.

The resulting plot will display a line for each category, showing the change in the continuous
variable (Value) across the different levels (Low, Medium, High). The points represent the
actual data points for each level and category, while the lines connect them to visualize the
overall trend or profile.

Dot Chart for 3 variables


A dot chart, also known as a dot plot or Cleveland dot plot, is a simple yet effective way to
visualize data points along a single axis. In R, you can create a dot chart for three variables
using the stripchart() function. Here's an example:

R - code
# Create a sample dataset
data <- data.frame(
Category = c("A", "B", "C"),
Variable1 = c(10, 15, 8),
Variable2 = c(12, 9, 14),
Variable3 = c(7, 13, 16)
)

# Create a dot chart using stripchart()


stripchart(
data[, -1], # Select the variables to plot (excluding Category)
method = "stack", # Stack the dots for overlapping points
pch = 19, # Set the point shape to a solid circle
col = c("red", "green", "blue"), # Set different colors for each variable
xlim = c(0, max(data[, -1])), # Set the x-axis limits
xlab = "Value", # Label for the x-axis
ylab = "Category", # Label for the y-axis
main = "Dot Chart for Three Variables" # Title of the plot
)

In this example, we have a dataset data with four columns: Category and three variables
(Variable1, Variable2, Variable3). Each row represents a different category (A, B, C), and the
variables hold the corresponding values.

We use the stripchart() function to create a dot chart. The first argument, data[, -1], selects the
variables to plot (excluding the Category column). The method parameter is set to "stack" to
stack the dots for overlapping points. The pch parameter sets the point shape to a solid circle,
and the col parameter assigns different colors to each variable.

We set the x-axis limits (xlim) based on the maximum value of the variables. The xlab and ylab
parameters label the x-axis and y-axis, respectively. Finally, the main parameter sets the title
of the plot.
Because stripchart() treats each column of the data frame as a separate group, the resulting
chart displays one row of dots per variable, with each dot representing that variable's value
for one category. The different colors differentiate the variables, allowing
for easy comparison between categories and variables.
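
If you prefer the classic Cleveland layout, base R's dotchart() accepts a matrix directly; here is a minimal sketch reusing the data frame defined above:

R - code
# Convert the numeric columns to a matrix with category row names
m <- as.matrix(data[, -1])
rownames(m) <- data$Category

# One group of dots per variable, one labeled dot per category
dotchart(m, xlab = "Value", main = "Cleveland Dot Chart for Three Variables")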

Pie Chart and Radial Diagram


In R, you can create a pie chart and a radial diagram to visualize data.
To create a pie chart, you can use the pie() function. Here's an example:
R - code
# Create a vector with category names
categories <- c("Category 1", "Category 2", "Category 3")

# Create a vector with corresponding values


values <- c(40, 30, 20)

# Create a pie chart


pie(values, labels = categories, main = "Pie Chart")

In this example, we define a vector categories representing the category names and a vector
values containing the corresponding values for each category.

We use the pie() function to create the pie chart, passing the values vector as the first argument
and the labels parameter to assign the category names as labels. The main parameter sets the
title of the chart.

To create a radial diagram, you can use the radial.plot() function from the plotrix
package.
Here's an example: R - code
# Install and load the plotrix package
install.packages("plotrix")
library(plotrix)

# Create a vector with category names


categories <- c("Category 1", "Category 2", "Category 3")

# Create a vector with corresponding values


values <- c(40, 30, 20)

# Create a radial diagram


radial.plot(values, labels = categories, line.col = "black", main = "Radial Diagram")

In this example, we first install and load the plotrix package, which provides the radial.plot()
function.
We define the categories vector with the category names and the values vector with the
corresponding values.
We use the radial.plot() function to create the radial diagram, passing the values vector as the
first argument and the labels parameter to assign the category names as labels. The line.col
parameter sets the color of the lines connecting the points, and the main parameter sets the title
of the diagram.

Both the pie chart and radial diagram provide a visual representation of data, with the pie chart
showing proportions of a whole and the radial diagram displaying values on a circular axis.
Choose the appropriate visualization based on the nature and purpose of your data.

Chart Tables
In R, you can create chart tables, which are tabular representations of data with additional
formatting and visual elements. There are several packages available in R that provide
functions to create chart tables, including gt, flextable, and kableExtra. Here's an example using
the gt package:

R - code
# Install and load the gt package
install.packages("gt")
library(gt)

# Create a sample dataset


data <- data.frame(
Category = c("Category 1", "Category 2", "Category 3"),
Value1 = c(10, 15, 8),
Value2 = c(12, 9, 14),
Value3 = c(7, 13, 16)
)

# Create a gt table
table <- gt(data)

# Format the table


table <- table %>%
tab_header(title = "Chart Table") %>%
fmt_number(columns = 2:4, decimals = 1) %>%
tab_style(style = cell_text(weight = "bold"),
locations = cells_body(columns = 1, rows = 1))

# Render the table


gt::gtsave(table, "chart_table.png")

In this example, we create a sample dataset data with four columns: Category, Value1, Value2,
and Value3.

We use the gt() function to create a gt table object from the data.
To format the table, we chain multiple functions using the %>% operator (re-exported by gt
from the magrittr package). The tab_header() function sets the title of the table. The
fmt_number() function formats numeric columns with a specified number of decimal places;
here, columns 2 to 4 are formatted with 1 decimal place, and related helpers such as
fmt_currency() and fmt_percent() provide currency or percentage formatting. Text styling,
such as making specific cells bold, is applied with tab_style(), which pairs a style created by
cell_text() with a location helper such as cells_body().

Finally, we save the table as an image file using the gtsave() function from the gt package (exporting to PNG additionally requires the webshot2 package).

This is just one example of creating a chart table in R using the gt package. You can explore
other packages like flextable and kableExtra for additional features and customization options
to create chart tables that suit your specific needs.

Distributions: Histogram overlay

A histogram overlay is a useful visualization technique that allows you to compare
the distributions of multiple variables by overlaying their histograms on a single plot. In R, you
can achieve this using the ggplot2 package. Let's go through an example step by step:

R - code
# Load the ggplot2 package
library(ggplot2)

# Create a random dataset


data <- data.frame(
Variable1 = rnorm(1000, mean = 50, sd = 10),
Variable2 = rnorm(1000, mean = 70, sd = 15)
)

In this example, we first load the ggplot2 package, which provides functions for creating data
visualizations. Then, we create a random dataset called data with two variables: Variable1 and
Variable2. We generate 1000 random values for each variable using the rnorm() function,
where rnorm() creates random numbers following a normal distribution. The mean and
standard deviation are specified to control the characteristics of the distributions.

R - code
# Reshape the data from wide to long format (requires the tidyr package)
library(tidyr)
data_long <- pivot_longer(data, cols = everything(),
names_to = "variable", values_to = "value")

# Create a histogram overlay using ggplot2
ggplot(data_long, aes(x = value, fill = variable)) +
geom_histogram(alpha = 0.5, bins = 30, color = "black", position = "identity") +
labs(x = "Value", y = "Frequency", title = "Histogram Overlay") +
theme_minimal()

In this code snippet, we first reshape the data from wide to long format with pivot_longer()
from the tidyr package, so that all measurements sit in a single value column and the
originating column name is recorded in a variable column. We then use the ggplot() function
to create a plot object with data_long as the data source. The aes() function (short for
aesthetics) maps value to the x-axis (x) and assigns variable to the fill aesthetic, which is
used to differentiate the two histograms.

The geom_histogram() function is used to create the histograms. We set alpha = 0.5 to make
the histograms semi-transparent, and position = "identity" draws them on top of one another
instead of stacking them, so both distributions remain visible. The bins parameter controls the
number of bins in the histograms, determining the granularity of the distribution
representation, and the color parameter sets the outline color of the histograms.

We use the labs() function to set the x-axis label, y-axis label, and plot title. In this case, the x-
axis is labeled as "Value", the y-axis as "Frequency", and the title as "Histogram Overlay".
Finally, we apply the theme_minimal() theme to style the plot with a clean and minimalistic
appearance.

When you run this code, you will obtain a histogram overlay plot showing the distributions of
Variable1 and Variable2 overlaid on a single chart. The transparency and distinct colors help
to visualize and compare the two distributions. You can modify the code to suit your specific
data and visualization requirements, such as adjusting the number of bins, colors, or plot labels.

Box Plots for group


To create box plots for groups in R, you can use the ggplot2 package. Box plots are useful for
visualizing the distribution of a continuous variable across different groups or categories.
Here's an example:

R - code
# Load the ggplot2 package
library(ggplot2)

# Create a sample dataset


data <- data.frame(
Group = rep(c("Group 1", "Group 2", "Group 3"), each = 100),
Value = c(rnorm(100, mean = 50, sd = 10),
rnorm(100, mean = 70, sd = 15),
rnorm(100, mean = 60, sd = 12))
)

# Create a box plot by group using ggplot2
ggplot(data, aes(x = Group, y = Value)) +
geom_boxplot() +
labs(x = "Group", y = "Value", title = "Box Plot by Group") +
theme_minimal()

In this example, we create a sample dataset called data with two columns: Group and Value.
Each group has 100 observations, and the Value column contains continuous variable values.

Using the ggplot() function, we create a plot object and specify the dataset data as the data
source. The aes() function maps the Group variable to the x-axis (x) and the Value variable to
the y-axis (y).

We then use geom_boxplot() to create the box plots. For each group, this function visualizes
the median, the lower and upper quartiles (the box), and whiskers extending to the most
extreme values within 1.5 times the interquartile range; points beyond the whiskers are drawn
individually as outliers.

The labs() function is used to set the x-axis label, y-axis label, and plot title. In this example,
the x-axis is labeled as "Group", the y-axis as "Value", and the title as "Box Plot by Group".

Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.

When you run this code, you will obtain a box plot showing the distribution of the Value
variable for each group. Each box plot represents the range and quartiles of the variable values
within each group, allowing for easy comparison of distributions between groups. You can
further customize the plot by modifying the labels, adding color or grouping options, or
adjusting the plot theme to suit your specific requirements.

Pyramids with multiple colors


To create pyramids with multiple colors in R, you can use the geom_bar() function from the
ggplot2 package. Here's an example:

R - code
library(ggplot2)
# Create data for the pyramids

categories <- c('Category 1', 'Category 2', 'Category 3')
values <- c(10, 20, 30)

# Create a data frame


data <- data.frame(
Category = categories,
Value = values
)

# Create the pyramid plot


ggplot(data, aes(x = 1, y = Value, fill = Category)) +
geom_bar(stat = 'identity', width = 1, color = 'black') +
coord_flip() +
scale_fill_manual(values = c('red', 'green', 'blue')) +
labs(x = '', y = 'Value', title = 'Pyramid with Multiple Colors') +
theme_minimal()

In this example, we first load the ggplot2 package.

We create the data for the pyramids with the categories and corresponding values.

Using ggplot(), we specify the data frame as the data source and map the Value variable to the
y-axis (y), and the Category variable to the fill aesthetic.

We use geom_bar() with stat = 'identity' to create the pyramids. The width = 1 argument sets
the width of the bars, and the color = 'black' argument adds a black border around each bar.

We use coord_flip() to swap the x and y axes, so the stacked segments are drawn horizontally, giving the chart its layered, pyramid-like appearance.

The scale_fill_manual() function allows us to manually specify the colors for each category. In
this example, we assign the colors 'red', 'green', and 'blue' to the categories.

We use labs() to set the x-axis label to an empty string, the y-axis label to 'Value', and the plot
title to 'Pyramid with Multiple Colors'.

Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.

When you run this code, you will obtain pyramids with multiple colors. Each category is
represented by a segment of the pyramid, and the colors assigned to each category are specified
using the scale_fill_manual() function. You can modify the categories, values, colors, and other
parameters to create pyramids with multiple colors that suit your specific data and visualization
requirements.

Pyramid: emphasis on the outer and inner area

To create a pyramid chart in R with emphasis on the outer and inner areas, you can use the
plotrix package. This package provides functions to create specialized plots, including pyramid
charts. Here's an example:

R - code
library(plotrix)

# Create data for the pyramid


categories <- c('Category 1', 'Category 2', 'Category 3')
values <- c(10, 20, 30)

# Calculate the cumulative values


cumulative_values <- cumsum(values)

# Create the pyramid chart


pyramid.plot(lx = cumulative_values, rx = cumulative_values,
labels = categories, top.labels = values,
main = 'Pyramid Chart with Emphasis on Outer and Inner Area',
lxcol = c('lightblue', 'lightgreen', 'lightpink'),
rxcol = c('lightblue', 'lightgreen', 'lightpink'))

In this example, we first load the plotrix package.


We create the data for the pyramid with the categories and corresponding values.
Next, we calculate the cumulative values using cumsum() function. This is done to emphasize
the outer and inner areas of the pyramid.

We use the pyramid.plot() function from the plotrix package to create the pyramid chart. The
function draws a left side (lx) and a right side (rx); here we pass the cumulative values to both
sides so that the chart is symmetric. The category labels are given as labels, and the raw values
are displayed above the plot via top.labels. The lxcol and rxcol arguments set the segment
colors for the left and right sides; in this example, we use 'lightblue', 'lightgreen', and
'lightpink' for the outer and inner areas of the pyramid.

We use the main argument to set the main title of the chart to 'Pyramid Chart with Emphasis
on Outer and Inner Area'.

When you run this code, you will obtain a pyramid chart with emphasis on the outer and inner
areas. The colors and cumulative values help draw attention to the relative sizes of the
segments. You can modify the categories, values, colors, and other parameters to create a
pyramid chart that suits your specific data and visualization requirements.

Pyramid with added line


To create a pyramid chart with an added line in R, you can use the plotrix package along with
the base R plotting functions. Here's an example:

R - code
library(plotrix)

# Create data for the pyramid


categories <- c('Category 1', 'Category 2', 'Category 3')
values <- c(10, 20, 30)

# Calculate the cumulative values


cumulative_values <- cumsum(values)

# Create the pyramid chart


pyramid.plot(lx = cumulative_values, rx = cumulative_values,
labels = categories, main = 'Pyramid Chart with Added Line',
lxcol = 'lightblue', rxcol = 'lightblue')

# Add a horizontal line


abline(h = 0, lwd = 2, col = 'red')

In this example, we first load the plotrix package.


We create the data for the pyramid with the categories and corresponding values.
Next, we calculate the cumulative values using cumsum() function.

We use the pyramid.plot() function from the plotrix package to create the pyramid chart. We
pass the cumulative values to both the lx (left) and rx (right) sides, supply the category labels
as labels, and set the segment colors with the lxcol and rxcol arguments.

After creating the pyramid chart, we add a horizontal line using the abline() function. The h
argument specifies the y-coordinate of the line, lwd sets the line width, and col defines the line
color. In this example, we use a red line.

When you run this code, you will obtain a pyramid chart with an added horizontal line. The
line can be used to indicate a specific value or a reference point within the chart. You can
modify the categories, values, colors, line properties, and other parameters to customize the
pyramid chart with the added line to suit your specific data and visualization requirements.

Aggregated Pyramids

To create aggregated pyramids in R, you can use the ggplot2 package. Aggregated pyramids
allow you to compare two sets of data side by side, each represented as a pyramid. Here's an
example:

R - code
library(ggplot2)

# Create data for the pyramids


categories <- c('Category 1', 'Category 2', 'Category 3')
values1 <- c(10, 20, 30)
values2 <- c(15, 25, 35)

# Combine the data into a single data frame


data <- data.frame(
Group = rep(c('Group 1', 'Group 2'), each = length(categories)),
Category = rep(categories, times = 2),
Value = c(values1, values2)
)

# Create the aggregated pyramids
ggplot(data, aes(x = Category, y = ifelse(Group == 'Group 1', -Value, Value),
fill = Group)) +
geom_bar(stat = 'identity', position = 'identity') +
coord_flip() +
scale_fill_manual(values = c('green', 'blue')) +
labs(x = '', y = 'Value', title = 'Aggregated Pyramids') +
theme_minimal()

In this example, we first load the ggplot2 package.

We create the data for the pyramids with the categories and values for two groups: Group 1
and Group 2. We combine the data into a single data frame.

Using ggplot(), we specify the data frame as the data source and map the Category variable to
the x-axis (x), the signed Value variable to the y-axis (y), and the Group variable to the fill aesthetic.

We use geom_bar() with stat = 'identity' to create the pyramids. The position = 'identity'
argument draws each bar at its actual (signed) value; since every Category and Group pair has
exactly one bar, nothing overlaps or stacks.

To create the aggregated effect, we use ifelse() within the aes() mapping to assign negative
values to Group 1 and positive values to Group 2, effectively mirroring the pyramids.

We use coord_flip() to flip the x and y axes, resulting in horizontal pyramids.

The scale_fill_manual() function allows us to manually specify the fill colors. In this example,
Group 1 is drawn in green and Group 2 in blue.

We use labs() to set the x-axis label to an empty string, the y-axis label to 'Value', and the plot
title to 'Aggregated Pyramids'.

Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.

When you run this code, you will obtain aggregated pyramids showing the comparison between
two groups. Each group occupies one side of the chart, with the categories aligned along the
shared axis. You can modify the categories, values, colors, and other parameters to create
aggregated pyramids that match your specific data and visualization requirements.

Simple Lorenz curve


To create a simple Lorenz curve in R, you can use the ggplot2 package. The Lorenz curve is a
graphical representation of income or wealth inequality. Here's an example:

R - code
library(ggplot2)

# Create data for the Lorenz curve


cumulative_perc <- seq(0, 100, 10) # 11 points, from 0% to 100%
cumulative_share <- c(0, 2, 5, 9, 14, 20, 28, 38, 50, 68, 100) # 11 matching shares

# Create a data frame


data <- data.frame(
Percentile = cumulative_perc,
Share = cumulative_share
)

# Create the Lorenz curve plot


ggplot(data, aes(x = Percentile, y = Share)) +
geom_step() +
labs(x = "Cumulative Percentile", y = "Cumulative Share", title = "Lorenz Curve") +
theme_minimal()

In this example, we first load the ggplot2 package.

We create the data for the Lorenz curve with the cumulative percentiles (cumulative_perc) and
cumulative shares (cumulative_share). These values represent the cumulative percentages of
the population and their corresponding cumulative shares of wealth or income, respectively.

Next, we create a data frame using the data.frame() function, specifying the percentiles as the
Percentile variable and the shares as the Share variable.

Using ggplot(), we specify the data frame as the data source and map the Percentile variable to
the x-axis (x) and the Share variable to the y-axis (y).

We use geom_step() to create the Lorenz curve, which connects the points with horizontal and
vertical lines to represent the cumulative distribution of the share.

The labs() function is used to set the x-axis label to "Cumulative Percentile", the y-axis label
to "Cumulative Share", and the plot title to "Lorenz Curve".

Finally, we apply the theme_minimal() theme to style the plot with a minimalistic appearance.

When you run this code, you will obtain a simple Lorenz curve plot. The curve represents the
cumulative distribution of the share, with the x-axis representing the cumulative percentiles
and the y-axis representing the cumulative share. You can modify the data values, labels, and
plot aesthetics to create a Lorenz curve that reflects your specific data and visualization
requirements.

Unit 4: Introduction to Python
Jupyter Notebook, Python Functions, Python Types and Sequences, Python More
on Strings, Reading and Writing CSV files, Advanced Python Objects, map(),
Numpy, Pandas, Series Data Structure, Querying a Series, The DataFrame Data
Structure, DataFrame Indexing and Loading, Querying a DataFrame, Indexing
Dataframes, Merging Dataframes

Jupyter Notebook

Jupyter Notebook is an interactive coding environment that allows you to create and share
documents containing live code, visualizations, explanatory text, and more. It is particularly
popular among Python users, although it supports multiple programming languages.

Here are some key aspects and features of Jupyter Notebook when used with Python:

1. Notebook Structure: Jupyter Notebook is organized into cells, where each cell can
contain either code or markdown text. Code cells are where you write and execute
Python code, while markdown cells allow you to add formatted text, headings, lists,
and images to provide explanations and documentation.
2. Code Execution: You can execute code cells individually or all at once. When a code
cell is executed, the Python interpreter runs the code and displays the output below the
cell. This allows you to iteratively develop and test your code in a step-by-step manner.
3. Kernel: Jupyter Notebook uses a kernel, which is responsible for executing code in a
specific programming language. For Python, the IPython kernel is used by default,
providing additional features such as tab completion, object introspection, and rich
media display.
4. Data Exploration and Visualization: Jupyter Notebook integrates seamlessly with
popular Python libraries for data manipulation and visualization, such as Pandas,
NumPy, Matplotlib, Seaborn, and Plotly. You can easily load, analyze, and visualize
data within the notebook using these libraries.
5. Rich Media Display: Jupyter Notebook allows you to display various types of media
directly in the notebook, including images, audio, and video. You can even embed
interactive visualizations and widgets to create dynamic and engaging content.

6. Notebook Extensions: Jupyter Notebook offers a wide range of extensions that
enhance its functionality and customization. These extensions provide additional
features like code linting, code folding, table of contents, and more, making your coding
experience more efficient and enjoyable.
7. Collaboration and Sharing: Jupyter Notebook facilitates collaboration by allowing
you to share your notebooks with others. You can share notebooks as standalone files,
publish them on platforms like GitHub, or use services like Jupyter Notebook Viewer
to share notebooks online. This enables others to run your code, view your
visualizations, and understand your analysis.

Jupyter Notebook provides an interactive and flexible environment for Python programming,
data analysis, and scientific computing. It promotes a reproducible workflow by combining
code, documentation, and visualizations in a single document. With its rich features and broad
community support, Jupyter Notebook has become a popular choice among Python users for
data exploration, prototyping, and sharing computational research.

Python Functions

In Python, a function is a block of reusable code that performs a specific task or set of tasks.
Functions provide modularity, code organization, and code reusability in Python programs.
They allow you to encapsulate a piece of code into a named block, which can be called and
executed multiple times throughout the program. Here are some key aspects and features of
functions in Python:

1. Function Definition: To create a function in Python, you use the def keyword followed
by the function name, parentheses (), and a colon :. You can also specify parameters
within the parentheses if the function needs input values. The function definition block
is indented below the def statement.
2. Function Parameters: Parameters are placeholders for values that can be passed to a
function. They define the input requirements of the function. You can have zero or more
parameters in a function definition. Parameters can have default values, making them
optional when calling the function.
3. Function Body: The body of the function contains the statements that define the task
the function performs. It is indented under the function definition. You can include any
valid Python code within the function body, such as variable declarations, calculations,
conditionals, loops, and other function calls.
4. Return Statement: Functions can optionally return a value using the return statement.
The return statement specifies the value or values that the function should provide as
output. When the return statement is encountered, the function execution terminates,
and the specified value(s) are returned to the caller.
5. Function Call: To execute a function, you call it by using its name followed by
parentheses (). If the function has parameters, you pass the corresponding arguments
within the parentheses. The function call evaluates the code within the function body,
and any return value is available for further use.
6. Scope: Functions have their own scope, which means that variables defined within a
function are only accessible within that function. Similarly, variables defined outside
the function have a global scope and can be accessed from any part of the program.
7. Function Documentation: You can add documentation to functions using docstrings,
which are triple-quoted strings placed immediately after the function definition.
Docstrings provide information about the purpose of the function, its parameters, return
values, and any other relevant details. They are used to generate documentation and
provide help for users of the function.

Functions play a vital role in structuring Python programs and promoting code reusability.
They help in breaking down complex tasks into smaller, more manageable parts. By
encapsulating code within functions, you can improve the readability, maintainability, and
efficiency of your Python code.
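
The following minimal sketch pulls these pieces together; the function name and values are purely illustrative:

Python - code

def greet(name, punctuation="!"):
    """Return a greeting for the given name.

    name is required; punctuation is optional and defaults to '!'.
    """
    message = "Hello, " + name + punctuation
    return message

# Call the function with and without the optional argument
print(greet("Alice")) # Output: Hello, Alice!
print(greet("Bob", "?")) # Output: Hello, Bob?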

Python Types and Sequences

In Python, types refer to the classification of data objects. Python has built-in types that define
the characteristics and behavior of objects. Sequences, on the other hand, are a specific type of
data structure that holds an ordered collection of elements. Let's dive into more detail about
Python types and sequences:

Python Types:

1. Numeric Types: Python includes numeric types such as integers (int), floating-point
numbers (float), and complex numbers (complex).

2. Boolean Type: The boolean type (bool) represents either True or False values, which
are used for logical operations and conditional statements.
3. Strings: The string type (str) represents sequences of characters enclosed in single or
double quotes. Strings are immutable, meaning they cannot be modified once created.
4. Lists: Lists (list) are ordered collections of objects enclosed in square brackets []. They
can contain objects of different types and are mutable, allowing you to modify, add, or
remove elements.
5. Tuples: Tuples (tuple) are similar to lists but are immutable, meaning they cannot be
modified once created. They are typically used to represent fixed collections of
elements.
6. Sets: Sets (set) are unordered collections of unique elements. They do not allow
duplicate values and provide operations like union, intersection, and difference.
7. Dictionaries: Dictionaries (dict) are key-value pairs enclosed in curly braces {}. They
allow you to store and retrieve values based on unique keys, providing efficient lookup
operations.

Python Sequences:

1. Lists: Lists are mutable sequences that can hold objects of any type. They maintain the
order of elements and allow indexing and slicing operations.
2. Tuples: Tuples are immutable sequences similar to lists. They are useful for
representing fixed collections of elements and can be accessed using indexing and
slicing.
3. Strings: Strings are sequences of characters. They can be indexed and sliced like other
sequences and provide various string manipulation methods.
4. Ranges: Ranges (range) represent a sequence of numbers and are commonly used in
loops for iterating a specific number of times.

Sequences in Python share common characteristics, such as indexing (accessing elements by
position), slicing (extracting sub-sequences), and various methods for manipulation and
iteration.
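
The short sketch below illustrates these types and sequence operations (all values are made up):

Python - code

numbers = [10, 20, 30, 40] # list (mutable sequence)
point = (3, 4) # tuple (immutable sequence)
unique = {1, 2, 2, 3} # set -> {1, 2, 3}
ages = {'Alice': 25, 'Bob': 30} # dictionary (key-value pairs)

print(numbers[0]) # indexing: 10
print(numbers[1:3]) # slicing: [20, 30]
numbers.append(50) # lists are mutable
print(ages['Bob']) # lookup by key: 30
print(list(range(3))) # range as a sequence: [0, 1, 2]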

Understanding Python types and sequences is crucial for effective programming as they
provide the foundation for storing, manipulating, and processing data in Python programs.

Python More on Strings

Strings are an important data type in Python that represent sequences of characters. They are
immutable, meaning that once a string is created, it cannot be modified. Here are some
additional concepts and operations related to strings in Python:

String Creation: Strings can be created using single quotes ('), double quotes ("), or triple
quotes (''' or """). Triple quotes are used for multiline strings.

Python - code

single_quote = 'Hello'
double_quote = "World"
multiline = '''This is a
multiline string'''

String Concatenation: Strings can be concatenated using the + operator.

Python - code

greeting = "Hello" + " " + "World"

String Length: The len() function returns the length of a string, which is the number of
characters in the string.

Python - code

message = "Hello, World!"


length = len(message) # length = 13

String Indexing: Individual characters within a string can be accessed using index positions.
Indexing starts from 0 for the first character and goes up to length - 1 for the last character.

Python - code

message = "Hello, World!"


first_char = message[0] # first_char = 'H'
third_char = message[2] # third_char = 'l'
last_char = message[-1] # last_char = '!'

String Slicing: Substrings can be extracted from a string using slicing. The syntax for slicing
is start_index:end_index. The resulting substring includes characters from start_index up to,
but not including, end_index.

Python - code

message = "Hello, World!"


substring1 = message[0:5] # substring1 = 'Hello'
substring2 = message[7:12] # substring2 = 'World'
substring3 = message[:5] # substring3 = 'Hello'
substring4 = message[7:] # substring4 = 'World!'

String Methods: Python provides numerous built-in methods for string manipulation, such as
converting cases, replacing characters, splitting and joining strings, finding substrings, and
more.

Python - code

message = "Hello, World!"


uppercase = message.upper() # uppercase = 'HELLO, WORLD!'
lowercase = message.lower() # lowercase = 'hello, world!'
replaced = message.replace('o', 'a') # replaced = 'Hella, Warld!'
words = message.split(', ') # words = ['Hello', 'World!']
joined = '-'.join(words) # joined = 'Hello-World!'

These are just a few examples of string operations in Python. Strings are versatile and
commonly used in many programming tasks, such as text processing, data manipulation, and
input/output operations. Python provides a rich set of string methods and functionalities to
handle and manipulate strings efficiently.

Reading and Writing CSV files

Reading and writing CSV (Comma-Separated Values) files is a common task in data analysis
and manipulation. CSV files are a plain text format used to store tabular data, where each line
represents a row and the values within each line are separated by a delimiter, typically a comma.

Reading CSV Files:

To read data from a CSV file, you can follow these steps (a short example follows the list):

● Import the csv module in Python.
● Open the CSV file using the open() function, specifying the file path and the mode as
'r' for reading.
● Create a CSV reader object using the reader() function from the csv module, passing
the opened file as the argument.
● Iterate over the rows in the CSV file using a loop. The reader object acts as an iterator,
returning each row as a list of strings.
● Process the data in each row as needed. You can access individual values by indexing
the row list.
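
Here is a minimal sketch of these steps (the file name data.csv and its columns are illustrative and assumed to exist):

Python - code

import csv

# Open the CSV file for reading; newline='' is recommended for the csv module
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader) # first row, e.g. the column names
    for row in reader:
        # Each row is a list of strings; index it to get individual values
        print(row[0], row[1])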

Writing CSV Files:

To write data to a CSV file, you can follow these steps (again, a short example follows the list):

● Import the csv module in Python.


● Open a new file or overwrite an existing file using the open() function, specifying the
file path and the mode as 'w' for writing.
● Create a CSV writer object using the writer() function from the csv module, passing
the opened file as the argument.
● Write rows to the CSV file using the writerow() method of the writer object. Each call
to writerow() writes a single row, which is typically a list or tuple of values.
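
And a matching sketch for writing (the file name output.csv is illustrative):

Python - code

import csv

# Open the file for writing; newline='' avoids blank lines on some platforms
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Age']) # header row
    writer.writerow(['Alice', 25]) # one data row per call
    writer.writerow(['Bob', 30])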

Additional Considerations:

● You can specify different delimiters for CSV files, such as a semicolon or tab, using
the delimiter parameter in the reader() or writer() function.
● If a value in a CSV file contains the delimiter character itself or special characters like
newline or quotes, it is typically enclosed in quotes. The csv module handles these cases
automatically.
● The csv module provides various options for quoting, handling empty values, and
specifying the newline character. Refer to the official Python documentation for more
details on these options.

Overall, reading and writing CSV files in Python is straightforward with the help of the csv
module. It allows you to easily handle tabular data in a format that is widely supported across
different applications.

Advanced Python Objects

Advanced Python objects refer to the concepts and techniques used to create more complex
and specialized objects in Python programming. These concepts build upon the fundamentals
of object-oriented programming in Python and provide additional functionality and flexibility.
Here are some advanced Python object topics; a short sketch illustrating a few of them follows the list:

1. Inheritance: Inheritance is a fundamental concept in object-oriented programming that
allows a class to inherit attributes and methods from another class. It enables code reuse
and facilitates the creation of specialized classes based on existing ones. By using
inheritance, you can create a hierarchy of classes, with each level inheriting and
extending the functionality of the parent class.
2. Polymorphism: Polymorphism allows objects of different classes to be used
interchangeably, providing a consistent interface to different types of objects. It enables
you to write code that can work with objects of various types without explicitly
checking their types. Polymorphism is often achieved through method overriding and
method overloading.
3. Encapsulation: Encapsulation is the process of bundling data and the methods that
operate on that data within a class. It helps in organizing code, improving code
maintainability, and controlling access to data by defining public and private attributes
and methods. Encapsulation is often implemented using access modifiers like public,
private, and protected.
4. Abstract Classes and Interfaces: Abstract classes and interfaces provide a way to
define common behavior and method signatures that subclasses or implementing
classes must adhere to. Abstract classes cannot be instantiated and serve as a blueprint
for other classes, while interfaces define a contract that implementing classes must
follow. These concepts allow you to create classes with common characteristics and
enforce specific behaviors.
5. Generators: Generators are a type of iterable that generates values on the fly, as
opposed to storing all the values in memory at once. They are implemented using the
yield keyword and allow you to create memory-efficient and lazy-evaluated sequences.
Generators are particularly useful when dealing with large datasets or infinite
sequences.

6. Decorators: Decorators are a powerful feature in Python that allow you to modify the
behavior of functions or classes without changing their source code. Decorators are
functions that wrap around other functions or classes and provide additional
functionality. They are commonly used for adding logging, caching, or authentication
to existing functions or classes.
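
As a brief illustration of inheritance, method overriding (polymorphism), and generators, here is a minimal sketch; the class and function names are invented for the example:

Python - code

class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return "..."

class Dog(Animal): # inheritance: Dog extends Animal
    def speak(self): # polymorphism: overriding the parent method
        return self.name + " says Woof"

def countdown(n): # generator: yields values lazily with yield
    while n > 0:
        yield n
        n -= 1

print(Dog("Rex").speak()) # Output: Rex says Woof
print(list(countdown(3))) # Output: [3, 2, 1]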

These are just a few examples of advanced Python object concepts. By understanding and
applying these concepts, you can write more flexible, modular, and reusable code in Python.

map() function

The map() function in Python is used to apply a given function to each element of an iterable
(e.g., a list, tuple, or string) and returns a new iterable with the transformed values. The map()
function takes two arguments: the function to be applied and the iterable. Here's an example:

Python - code

# Example 1: Applying a function to a list


numbers = [1, 2, 3, 4, 5]

# Define a function to square a number


def square(x): return x**2

# Apply the square function to each element using map()


squared_numbers = map(square, numbers)

# Convert the result to a list


squared_numbers = list(squared_numbers)
print(squared_numbers)
# Output: [1, 4, 9, 16, 25]
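
map() is also commonly used with a lambda, and it accepts multiple iterables that are consumed in parallel; a small illustrative sketch:

Python - code

# Add corresponding elements of two lists
sums = list(map(lambda x, y: x + y, [1, 2, 3], [10, 20, 30]))
print(sums) # Output: [11, 22, 33]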

Numpy

Numpy is a Python library that provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions to operate on these arrays
efficiently. It is widely used in scientific computing and data analysis tasks. Numpy provides
a high-performance multidimensional array object called ndarray and various functions for
array manipulation, mathematical operations, linear algebra, and more. Here's an example:

Python - code

import numpy as np
# Create a 1-dimensional array
a = np.array([1, 2, 3, 4, 5])

# Perform mathematical operations on the array


mean = np.mean(a)
std_dev = np.std(a)

print("Mean:", mean)
print("Standard Deviation:", std_dev)

Pandas

Pandas is a Python library built on top of Numpy that provides high-level data manipulation
and analysis tools. It introduces two main data structures: Series (1-dimensional labeled array)
and DataFrame (2-dimensional labeled data structure). Pandas offers various functions and
methods to handle missing data, clean and transform data, perform grouping and aggregation,
merge and join datasets, and more. Here's an example:

Python - code

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Country': ['USA', 'Canada', 'UK']}
df = pd.DataFrame(data)

# Perform operations on the DataFrame


mean_age = df['Age'].mean()
oldest_person = df['Age'].max()

print("Mean Age:", mean_age)


print("Oldest Person's Age:", oldest_person)

In summary, map() is a built-in function in Python used to apply a function to each element of
an iterable, Numpy is a library for numerical computing with support for multi-dimensional
arrays, and Pandas is a library for data manipulation and analysis, providing data structures and
functions for efficient handling of tabular data.

Series Data Structure

The Series data structure is a fundamental component of the pandas library in Python. It
represents a one-dimensional labeled array that can hold any data type. The Series is similar to
a column in a spreadsheet or a single column of data in a table. It consists of two main
components: the data and the index.

The syntax to create a Series object in pandas is as follows:

Python - code
import pandas as pd
series = pd.Series(data, index)

data: The data can be a list, numpy array, dictionary, or scalar value. It represents the actual
values in the Series.

index (optional): The index provides labels to access the data elements. If not specified, a
default integer index will be assigned.

Here's an example to illustrate the creation of a Series:

Python - code

import pandas as pd
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

Output:

0 10
1 20
2 30
3 40
4 50
dtype: int64

In the example above, we created a Series using a list of numbers. The resulting Series has an
integer index (0 to 4) and displays the corresponding values.

Series objects have several key properties and methods that allow for easy data manipulation
and analysis. Some common operations include indexing, slicing, arithmetic operations, and
applying functions element-wise. For example:

Python - code
# Accessing elements by index
print(series[2]) # Output: 30
# Slicing the Series
print(series[1:4]) # Output: 20, 30, 40
# Arithmetic operations
print(series * 2) # Output: 20, 40, 60, 80, 100
# Applying a function element-wise
print(series.apply(lambda x: x**2)) # Output: 100, 400, 900, 1600, 2500

The Series data structure is an essential tool for handling and manipulating one-dimensional
labeled data in pandas. It provides a convenient way to store, access, and perform operations
on data, making it an integral part of data analysis workflows.

Querying a Series

Querying a Series in pandas involves accessing and retrieving specific elements or subsets of
data from the Series based on certain conditions. Pandas provides several methods and
techniques for querying Series data.

1.Index-based Selection:

Single element: Use square brackets and provide the index label or position to access a single
element.

python code

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(series['b']) # Output: 20
Multiple elements: Use a list of index labels or positions to retrieve multiple elements.

python code

print(series[['a', 'c', 'e']]) # Output: a 10, c 30, e 50

Slicing: Use slicing notation to select a range of elements based on index positions.

python code

print(series[1:4]) # Output: b 20, c 30, d 40

2.Conditional Selection:

Boolean indexing: Use a boolean expression to filter elements based on a condition.

python code

print(series[series > 30]) # Output: d 40, e 50

Using logical operators: Combine multiple conditions using logical operators (e.g., & for
AND, | for OR).

python code

print(series[(series > 20) & (series < 50)]) # Output: c 30, d 40

3.Label-based Selection:

.loc[] indexer: Use the .loc[] indexer to access elements or subsets based on index labels.

Python code

print(series.loc['c']) # Output: 30

.loc[] with boolean indexing: Combine label-based selection with boolean indexing to filter
based on conditions.

python code

print(series.loc[series > 30]) # Output: d 40, e 50

These are some common methods for querying a Series in pandas. By using these techniques,
you can easily retrieve specific elements or subsets of data based on index labels, positions, or
conditions.

The DataFrame Data Structure

The DataFrame data structure is a fundamental component of the pandas library in Python. It
provides a flexible and efficient way to store and manipulate structured, two-dimensional data.
The DataFrame is similar to a table in a relational database or a spreadsheet in that it organizes
data in rows and columns.

Key features of the DataFrame data structure:

1. Columns: A DataFrame consists of columns that represent variables or features. Each


column has a unique label, which is used to access and manipulate the data within that
column. Columns can have different data types, such as numeric, string, boolean, or
datetime.
2. Rows: Rows in a DataFrame correspond to individual records or observations. Each
row is identified by an index, which is typically a numeric sequence but can also be
customized to use other values, such as dates, strings, or a combination of multiple
columns.
3. Tabular Structure: The DataFrame is organized in a tabular structure, where rows and
columns intersect to form cells. This structure allows for efficient manipulation and
analysis of data using various operations, such as indexing, slicing, filtering, sorting,
and aggregating.

Creating a DataFrame:

You can create a DataFrame in pandas using various methods, such as reading data from files
(e.g., CSV, Excel), converting from other data structures (e.g., lists, dictionaries), or generating
data programmatically. Here's an example of creating a DataFrame from a dictionary:

python code
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 28, 32],
'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)

Output:

Name Age City
0 John 25 New York
1 Alice 28 Paris
2 Bob 32 London

In the example above, we created a DataFrame using a dictionary where each key represents a
column name, and the corresponding value is a list of data for that column. The resulting
DataFrame has three columns: 'Name', 'Age', and 'City', with the associated data.

Working with DataFrames:

DataFrames provide a wide range of functionalities for data manipulation and analysis. Some
common operations include indexing and slicing, filtering, merging and joining, reshaping,
grouping, and statistical calculations. DataFrames also offer built-in methods for handling
missing data, handling duplicates, and handling outliers.

Here are a few examples of common DataFrame operations:

python code

# Accessing columns
print(df['Name'])
# Slicing rows
print(df[1:3])
# Filtering rows based on a condition
print(df[df['Age'] > 25])
# Merging DataFrames

df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
'Salary': [5000, 6000, 4000]})
merged_df = pd.merge(df, df2, on='Name')
# Grouping and aggregation (select the numeric column to average)
grouped_df = df.groupby('City')['Age'].mean()
# Statistical calculations
print(df['Age'].mean())
print(df['Age'].max())

These examples demonstrate a few of the many operations that can be performed on
DataFrames in pandas. The DataFrame data structure provides a powerful tool for data analysis
and manipulation in Python, making it a popular choice for working with structured data.

DataFrame Indexing and Loading

DataFrame Indexing and Loading in Python:

1.Indexing a DataFrame

Indexing allows you to select specific rows and columns from a DataFrame. There are several
ways to index a DataFrame in Python:

Using square brackets: You can use square brackets [] to access columns or a specific subset
of rows based on conditions.

python code
# Access a single column
df['column_name']
# Access multiple columns
df[['column1', 'column2']]
# Access rows based on conditions
df[df['column'] > 10]

Using loc and iloc: The .loc[] and .iloc[] indexers provide more advanced indexing
capabilities. .loc[] allows you to access rows and columns using labels, while .iloc[] uses
integer-based indexing.

python code

# Access rows and columns by label


df.loc[row_label, column_label]
# Access rows and columns by integer position
df.iloc[row_index, column_index]

Using boolean indexing: Boolean indexing allows you to select rows based on a condition
using a Boolean expression.

python code

df[condition]

2.Loading a DataFrame

Pandas provides various methods to load data into a DataFrame from different sources, such
as CSV files, Excel files, databases, or even from an existing Python data structure.

CSV files: Use the pd.read_csv() function to load data from a CSV file into a DataFrame.

python code

import pandas as pd
df = pd.read_csv('data.csv')

Excel files: Use the pd.read_excel() function to load data from an Excel file into a DataFrame.

python code

import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

Databases: Use the appropriate database connector (e.g., pymysql, psycopg2) to establish a
connection to the database and then use pd.read_sql() function to load data from a database
query into a DataFrame.

python code

import pandas as pd
import pymysql
# Establish a connection
connection = pymysql.connect(host='localhost', user='username', password='password',
database='database_name')
# Load data from a query into a DataFrame
query = "SELECT * FROM table_name"
df = pd.read_sql(query, connection)

Other sources: Pandas also provides functions to load data from various other sources, such
as JSON, HTML, and more. You can explore the pandas documentation for more details on
loading data from different sources.

These are some basic concepts of DataFrame indexing and loading in Python using the pandas
library. With these techniques, you can select specific data from a DataFrame based on your
requirements and load data from various sources to perform data analysis and manipulation.

Querying a DataFrame

Querying a DataFrame in Python refers to the process of extracting specific data or subsets of
data from a DataFrame based on certain conditions or criteria. Pandas, a popular data
manipulation library in Python, provides various methods to query DataFrames effectively.
Here are some common ways to query a DataFrame:

Basic Indexing: You can use basic indexing with square brackets [] to extract specific columns
or rows from a DataFrame. For example:

python code

# Access a single column


df['column_name']
# Access multiple columns
df[['column1', 'column2']]
# Access specific rows
df[start_index:end_index]

Boolean Indexing: Boolean indexing allows you to filter rows based on specific conditions
using logical operators such as ==, >, <, >=, <=, and !=. For example:

python code

# Filter rows based on a condition


df[df['column'] > 10]

# Combine multiple conditions using logical operators


df[(df['column1'] > 5) & (df['column2'] == 'value')]

loc and iloc: The .loc[] and .iloc[] indexers provide more advanced querying capabilities. .loc[]
allows you to access rows and columns by label, while .iloc[] uses integer-based indexing. For
example:

python code

# Access rows and columns by label


df.loc[row_label, column_label]
# Access rows and columns by integer position
df.iloc[row_index, column_index]

Query Method: Pandas provides the .query() method to query a DataFrame using a more
concise and expressive syntax. It allows you to write queries using a string-based syntax that
resembles SQL. For example:

python code

# Query using the query method
df.query('column1 > 5 and column2 == "value"')

GroupBy: The groupby() function allows you to group the DataFrame by one or more columns
and perform aggregation or apply functions to the grouped data. For example:

python code

# Group by a column and calculate the mean of another column


df.groupby('group_column')['value_column'].mean()

These are just a few examples of how you can query a DataFrame in Python using pandas. The
library provides many more functionalities, such as filtering rows based on string matching,
handling missing values, combining multiple DataFrames, and more. Pandas documentation is
a valuable resource for learning about all the available querying methods and their parameters.

Here's a complete worked example that demonstrates how to query a DataFrame in Python
using pandas:

python code

import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Eve'],
'Age': [25, 28, 32, 30, 27],
'City': ['New York', 'Paris', 'London', 'Paris', 'Tokyo'],
'Salary': [5000, 6000, 4000, 5500, 4500]
}
df = pd.DataFrame(data)
# Querying based on conditions
# Filter rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print("Filtered DataFrame:")
print(filtered_df)

# Querying based on multiple conditions
# Filter rows where Age is greater than 25 and Salary is greater than 5000
filtered_df = df[(df['Age'] > 25) & (df['Salary'] > 5000)]
print("Filtered DataFrame with multiple conditions:")
print(filtered_df)

# Querying using the query method


# Filter rows where City is Paris and Salary is less than 6000
filtered_df = df.query('City == "Paris" and Salary < 6000')
print("Filtered DataFrame using query method:")
print(filtered_df)

# Grouping and aggregation


# Calculate the average salary for each city
grouped_df = df.groupby('City')['Salary'].mean()
print("Grouped DataFrame with average salary:")
print(grouped_df)

In this example, we create a DataFrame with columns 'Name', 'Age', 'City', and 'Salary'. Then
we demonstrate different querying techniques:

● Filtering rows based on a condition: We filter rows where the 'Age' column is greater
than 25 and store the result in the variable 'filtered_df'.
● Filtering rows based on multiple conditions: We filter rows where the 'Age' is greater
than 25 and the 'Salary' is greater than 5000.
● Querying using the query method: We use the query method to filter rows where the
'City' is 'Paris' and the 'Salary' is less than 6000.
● Grouping and aggregation: We group the DataFrame by 'City' and calculate the average
salary for each city using the groupby() function.

Each query demonstrates a different way to extract specific data from the DataFrame based on
conditions or grouping. You can modify these examples to suit your own DataFrame and query
requirements.

Indexing Dataframes

Indexing in Python DataFrames refers to accessing and manipulating data based on row and
column labels or positions. Pandas provides various indexing methods to retrieve specific data
from DataFrames. Here are some common indexing techniques:

1.Column Indexing:

Using square brackets []: You can access a single column or multiple columns by specifying
their column names inside square brackets. For example:

python code

df['column_name'] # Access a single column

df[['column1', 'column2']] # Access multiple columns

2.Row Indexing:

Using loc and iloc: The .loc[] and .iloc[] indexers are used to access rows based on their labels
or integer positions, respectively.

python code

df.loc[row_label] # Access a single row by label


df.loc[start_label:end_label] # Access multiple rows by label range
df.iloc[row_index] # Access a single row by integer position
df.iloc[start_index:end_index] # Access multiple rows by integer position range

3.Conditional Indexing:

Boolean indexing: You can use Boolean expressions to filter rows based on specific
conditions.

Python code

df[condition] # Filter rows based on a condition

4.Indexing with Multiple Conditions:

Combining conditions: You can use logical operators such as & (and) and | (or) to combine
multiple conditions.

python code

df[(condition1) & (condition2)] # Filter rows based on multiple conditions

5.Indexing with both Rows and Columns:

Using loc and iloc together: You can combine row and column indexing using the .loc[] or
.iloc[] indexers.

python code

df.loc[row_label, 'column_name'] # Access a specific cell value by label


df.iloc[row_index, column_index] # Access a specific cell value by integer position
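
Putting these techniques together, here is a small self-contained example (the data values are invented):

python code

import pandas as pd

df = pd.DataFrame(
    {'Name': ['John', 'Alice', 'Bob'],
     'Age': [25, 28, 32],
     'City': ['New York', 'Paris', 'London']},
    index=['r1', 'r2', 'r3'])

print(df['Name']) # column indexing
print(df.loc['r2']) # row by label
print(df.iloc[0:2]) # rows by integer position
print(df[df['Age'] > 26]) # boolean indexing
print(df.loc['r3', 'City']) # single cell by label: London
print(df.iloc[0, 1]) # single cell by position: 25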

These are some of the commonly used indexing techniques in Python DataFrames. They allow
you to retrieve specific data based on labels or positions, filter rows based on conditions, and
access individual cells or subsets of data.

Merging Dataframes

Merging DataFrames in Python involves combining multiple DataFrames based on common


columns or indices. The Pandas library provides the merge() function to perform different types
of merges, including inner, outer, left, and right merges. Here's an overview of how to merge
DataFrames in Python:

1.Inner Merge:

An inner merge combines only the rows that have matching values in both DataFrames. It
retains only the common records between the DataFrames.

python code

merged_df = pd.merge(df1, df2, on='common_column')

2.Outer Merge:

An outer merge combines all the rows from both DataFrames and fills missing values with
NaN where there is no match.

python code

merged_df = pd.merge(df1, df2, on='common_column', how='outer')

3.Left Merge:

A left merge includes all the rows from the left DataFrame and the matching rows from the
right DataFrame. Missing values are filled with NaN where there is no match.

python code

merged_df = pd.merge(df1, df2, on='common_column', how='left')

4.Right Merge:

A right merge includes all the rows from the right DataFrame and the matching rows from the
left DataFrame. Missing values are filled with NaN where there is no match.

python code

merged_df = pd.merge(df1, df2, on='common_column', how='right')

5.Merging on Multiple Columns:

You can merge DataFrames based on multiple columns by passing a list of column names to
the on parameter.

python code

merged_df = pd.merge(df1, df2, on=['column1', 'column2'])

6.Merging on Non-matching Columns:

If the column names are different in both DataFrames, you can use the left_on and right_on
parameters to specify the column names from each DataFrame to merge on.

python code

merged_df = pd.merge(df1, df2, left_on='column1', right_on='column2')
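
To make these merge types concrete, here is a small runnable example of an inner and a left merge (the frames and values are invented):

python code

import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                    'Age': [25, 28, 32]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                    'Salary': [6000, 4000, 4500]})

inner = pd.merge(df1, df2, on='Name') # keeps Alice and Bob only
left = pd.merge(df1, df2, on='Name', how='left') # keeps John, with NaN Salary

print(inner)
print(left)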

These are the basic techniques for merging DataFrames in Python using the Pandas library.
They allow you to combine data from different sources based on common columns or indices.
You can choose the appropriate merge type based on your data requirements.

Unit 5: Data Aggregation, processing and Group Operations

Time Series, Date and Time, Data Types and Tools, Time Series Basics, Date
Ranges, Frequencies, and Shifting, Time Zone Handling, Periods and Period
Arithmetic, Resampling and Frequency Conversion, Time Series Plotting,
Moving Window Functions, Natural Language Processing, Image Processing,
Machine Learning K Nearest Neighbors Algorithm for Classification, Clustering

Time Series

Time series data refers to a sequence of data points collected and recorded over time, where
each data point is associated with a specific timestamp or time interval. Time series data is
commonly encountered in various domains such as finance, economics, weather forecasting,
stock market analysis, and more.

In Python, the Pandas library provides powerful tools and data structures for working with time
series data. Here are some key concepts and techniques related to time series in Python:

1.DateTime Index:

The DateTime Index is a specialized Pandas index object that allows for indexing and slicing
of time series data based on dates and times. It provides convenient methods for working with
time-related data.

python code

import pandas as pd
import numpy as np

dates = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
df = pd.DataFrame({'date': dates, 'value': np.arange(len(dates))})  # example values
df.set_index('date', inplace=True)

2.Resampling:

Resampling involves changing the frequency of the time series data. It can be used to convert
higher frequency data to lower frequency (downsampling) or lower frequency data to higher
frequency (upsampling). Common frequency aliases include 'D' for daily, 'M' for monthly, 'Y'
for yearly, etc.

python code

df_resampled = df.resample('M').sum() # Resample to monthly frequency

3.Time-based Indexing and Slicing:

The DateTime Index allows for intuitive indexing and slicing of time series data based on
specific dates, date ranges, or time intervals.

python code

df.loc['2021-01-01'] # Access data for a specific date


df.loc['2021-01-01':'2021-01-31'] # Access data for a date range
df.loc['2021-01'] # Access data for a specific month

4.Time Shifting:

Time shifting involves moving the entire time series data forward or backward in time. It can
be useful for calculating time differences, lagging or leading indicators, or aligning data from
different time periods.

python code

df_shifted = df.shift(1) # Shift the data one step forward

5.Rolling Window Functions:

Rolling window functions compute statistics over a sliding window of consecutive data points.
They are useful for calculating moving averages, rolling sums, or other time-dependent
calculations.

python code

df_rolling_mean = df.rolling(window=7).mean() # Calculate a 7-day rolling average

6.Time Series Visualization:

Python libraries such as Matplotlib and Seaborn provide tools for visualizing time series data.
Line plots, area plots, bar plots, and scatter plots can be used to visualize trends, patterns, and
anomalies in the data over time.

python code

import matplotlib.pyplot as plt


plt.plot(df.index, df['value'])
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series Plot')
plt.show()

These are some of the fundamental concepts and techniques related to working with time series
data in Python. Pandas and other libraries offer extensive functionality for analyzing,
manipulating, and visualizing time series data, allowing you to extract valuable insights and
make informed decisions based on temporal patterns and trends.

Date and Time

Working with dates and times in Python involves utilizing the datetime module, which provides
classes and functions for manipulating, formatting, and performing calculations with date and
time values. Here's an overview of how to work with date and time in Python:

1.Importing the datetime Module:

Start by importing the datetime module:

python code

import datetime

2.Creating Date and Time Objects:

The datetime module provides several classes for representing dates and times. Some
commonly used classes include:

 datetime.date for representing dates (year, month, day)


 datetime.time for representing times (hour, minute, second, microsecond)
 datetime.datetime for representing both date and time

python code

# Create a date object


date = datetime.date(2023, 5, 9)

# Create a time object


time = datetime.time(12, 30, 0)

# Create a datetime object


datetime_obj = datetime.datetime(2023, 5, 9, 12, 30, 0)

3.Current Date and Time:

You can obtain the current date and time using the datetime.now() function:

python code

current_datetime = datetime.datetime.now()

4.Formatting Dates and Times:

You can format dates and times using the strftime() method, which allows you to specify a
format string to represent the desired format:

python code
formatted_date = date.strftime('%Y-%m-%d')
formatted_time = time.strftime('%H:%M:%S')
formatted_datetime = datetime_obj.strftime('%Y-%m-%d %H:%M:%S')

5.Parsing Strings into Dates and Times:

You can parse strings that represent dates or times into datetime objects using the strptime()
function, specifying the format of the input string:

python code

date_str = '2023-05-09'
parsed_date = datetime.datetime.strptime(date_str, '%Y-%m-%d')

time_str = '12:30:00'
parsed_time = datetime.datetime.strptime(time_str, '%H:%M:%S')

6.Date and Time Arithmetic:

You can perform arithmetic operations on date and time objects, such as adding or subtracting
time intervals or calculating time differences:

python code

# Add 1 day to a date


new_date = date + datetime.timedelta(days=1)

# Calculate the difference between two dates (example dates)


date1 = datetime.date(2023, 5, 1)
date2 = datetime.date(2023, 5, 9)
date_diff = date2 - date1  # a timedelta of 8 days

7.Timezone Handling:

If you need to work with timezones, the pytz module provides support for handling timezones
in Python.

python code

import pytz
# Set a timezone for a datetime object
tz = pytz.timezone('America/New_York')
datetime_obj = datetime_obj.astimezone(tz)

These are some of the basic operations and functionalities for working with date and time in
Python using the datetime module. Python's datetime module provides a comprehensive set of
tools for working with dates, times, and timezones, allowing you to handle various date and
time-related tasks in your Python programs.

Data Types and Tools

In Python, data types are used to define the type of data that a variable can hold. Different data
types have different properties and methods associated with them. Here are some commonly
used data types and tools in Python:

1.Numeric Data Types:

 Integers (int): Whole numbers without decimal points, such as 1, 2, -3.


 Floating-point numbers (float): Real numbers with decimal points, such as 3.14, -0.5.
 Complex numbers (complex): Numbers with a real and imaginary part, such as 1 + 2j.

2.Sequence Data Types:

 Strings (str): Ordered collection of characters, such as "hello", 'world'.


 Lists (list): Ordered collection of elements, enclosed in square brackets [], such as [1,
2, 3].
 Tuples (tuple): Immutable ordered collection of elements, enclosed in parentheses (),
such as (1, 2, 3).

3.Mapping Data Type:

 Dictionary (dict): Collection of key-value pairs, enclosed in curly braces {}, such as
{'name': 'John', 'age': 25}.

4.Set Data Types:

 Set (set): Unordered collection of unique elements, enclosed in curly braces {}, such as
{1, 2, 3}.
 FrozenSet (frozenset): Immutable version of a set.

5.Boolean Data Type:

 Boolean (bool): Represents the truth values True or False.

6.Date and Time Data Types:

 Date (date): Represents a date without a time component.


 Time (time): Represents a time without a date component.
 DateTime (datetime): Represents both date and time.
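
A quick sketch illustrating the built-in types listed above:

python code

import datetime

n = 42                                    # int
pi = 3.14                                 # float
z = 1 + 2j                                # complex
s = "hello"                               # str
lst = [1, 2, 3]                           # list
tpl = (1, 2, 3)                           # tuple
d = {'name': 'John', 'age': 25}           # dict
st = {1, 2, 3}                            # set
fs = frozenset([1, 2, 3])                 # frozenset
flag = True                               # bool
today = datetime.date(2023, 5, 9)         # date
now = datetime.datetime(2023, 5, 9, 12)   # datetime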

7.Tools for Data Analysis and Manipulation:

 NumPy: A powerful library for numerical computing in Python, providing support for
large, multi-dimensional arrays and mathematical functions.
 Pandas: A library for data manipulation and analysis, providing data structures like
DataFrame for handling structured data.
 Matplotlib: A plotting library for creating static, animated, and interactive
visualizations in Python.
 SciPy: A library for scientific and technical computing, providing functions for
optimization, integration, linear algebra, and more.
 Scikit-learn: A machine learning library that provides tools for data mining, analysis,
and building predictive models.
 Jupyter Notebook: An interactive computing environment that allows you to create
and share documents containing code, visualizations, and explanatory text.

These are just a few examples of data types and tools in Python. Python offers a rich ecosystem
of libraries and tools for various data analysis, manipulation, and visualization tasks, allowing
you to effectively work with different types of data and perform advanced data analysis tasks.

Time Series Basics

Time series refers to a sequence of data points collected and recorded at specific time intervals.
In the context of data analysis and forecasting, time series data is commonly used to analyze
patterns, trends, and seasonality over time. Here are some key concepts and techniques related
to time series analysis:

1.Time Series Data Representation:

 In Python, time series data is typically represented using pandas, a powerful library for
data manipulation and analysis. The primary data structure for time series in pandas is
the Series object, which is a one-dimensional labeled array capable of holding any data
type with associated time indices.
 The time indices can be specified as dates, timestamps, or numeric values representing
time intervals.
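
As a minimal sketch, a time-indexed Series can be built and queried like this (the values are made up):

python code

import pandas as pd

idx = pd.date_range('2022-01-01', periods=5, freq='D')
ts = pd.Series([10, 12, 9, 15, 11], index=idx)
print(ts.loc['2022-01-03'])   # label-based access by date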

2.Time Series Visualization:

 Visualizing time series data is important for gaining insights and identifying patterns.
The matplotlib library provides various functions for creating line plots, scatter plots,
bar plots, and other visualizations to represent time series data.
 Additional libraries like seaborn and plotly offer more advanced plotting options and
interactive visualizations for time series data.

3.Time Series Decomposition:

 Time series data often exhibits components such as trend, seasonality, and noise.
Decomposing a time series helps separate these components for analysis and
forecasting. The statsmodels library in Python provides methods for decomposing time
series using techniques like moving averages, exponential smoothing, and seasonal
decomposition of time series (STL).
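
As a hedged sketch, the seasonal_decompose() function from statsmodels splits a series into trend, seasonal, and residual components; the monthly series below is synthetic, built only to give the function something to decompose:

python code

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with an upward trend and yearly seasonality
idx = pd.date_range('2018-01-01', periods=48, freq='M')
values = np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
ts = pd.Series(values, index=idx)

result = seasonal_decompose(ts, model='additive', period=12)
trend, seasonal, resid = result.trend, result.seasonal, result.resid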

4.Time Series Analysis:

 Time series analysis involves studying the statistical properties, patterns, and
dependencies within a time series. Techniques such as autocorrelation analysis,
stationarity testing, and spectral analysis can be applied to understand the underlying
characteristics of the data.
 Python libraries like statsmodels, scipy, and numpy offer functions for performing
various time series analysis tasks, including autocorrelation functions, periodogram
analysis, and statistical tests for stationarity.

5.Time Series Forecasting:

 Forecasting future values of a time series is a common application of time series
analysis. Methods like moving averages, exponential smoothing, ARIMA
(Autoregressive Integrated Moving Average), and machine learning algorithms can be
used for time series forecasting.
 Python libraries such as statsmodels, scikit-learn, and prophet provide functionalities
for implementing these forecasting techniques and evaluating their performance.
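
As a minimal sketch, an ARIMA model from statsmodels can be fitted and used to forecast a few steps ahead; the series here is synthetic, and the order (1, 1, 1) is an arbitrary choice for illustration:

python code

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily random-walk series
idx = pd.date_range('2022-01-01', periods=60, freq='D')
ts = pd.Series(np.random.randn(60).cumsum(), index=idx)

model = ARIMA(ts, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=3))   # forecast the next 3 days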

6.Handling Time Series Data:

 Python's pandas library offers numerous tools and functions for handling time series
data. It provides capabilities for resampling, aggregating, and transforming time series
data, handling missing values, and handling time zone conversions.
 Pandas also supports time-based indexing, allowing you to slice and select data based
on time intervals.

Time series analysis and forecasting play a crucial role in various domains, including finance,
economics, weather forecasting, sales forecasting, and more. Python, with its rich ecosystem
of libraries and tools, provides a comprehensive environment for working with time series data,
performing analysis, visualization, modeling, and forecasting tasks.

Date Ranges, Frequencies, and Shifting

In time series analysis, working with date ranges, frequencies, and shifting data is essential for
manipulating and analyzing time-based data. Here's an explanation of these concepts in Python
using the pandas library:

1.Date Ranges:

A date range represents a sequence of dates over a specified period. In pandas, you can generate
date ranges using the pd.date_range() function. It allows you to specify the start date, end
date, and frequency of the range.

Example: date_range = pd.date_range(start='2022-01-01', end='2022-12-31', freq='D') generates a daily date range for the year 2022.

2.Frequencies:

 Frequencies define the intervals at which observations occur in a time series. In pandas,
frequencies are represented using frequency strings or offsets. The freq parameter in
pandas functions accepts these frequency strings.
 Common frequency strings include 'D' for daily, 'W' for weekly, 'M' for monthly, 'Q'
for quarterly, 'A' for annually, and more. You can also specify custom frequencies.
 Example: date_range = pd.date_range(start='2022-01-01', periods=12, freq='M')
generates a monthly date range for 12 months starting from January 2022.

3.Shifting Data:

 Shifting data involves moving the values of a time series forward or backward in time.
This can be useful for calculating time-based differences or comparing values at
different time periods.
 In pandas, you can shift a time series using the shift() method. Positive values shift the
data forward, while negative values shift it backward.
 Example: shifted_series = series.shift(1) shifts the values of a series one step forward.

4.Rolling Windows:

 Rolling windows allow you to calculate aggregated statistics over a sliding window of
time. This is useful for smoothing data, calculating moving averages, or identifying
trends.
 In pandas, you can create a rolling window using the rolling() method. You can specify
the window size and apply various aggregation functions like mean, sum, min, max,
etc.
 Example: rolling_mean = series.rolling(window=3).mean() calculates the rolling
mean over a window of size 3.

These concepts provide the foundation for working with time series data in Python. They allow
you to create date ranges, specify frequencies for data intervals, and manipulate time-based
data by shifting and aggregating values. Using pandas, you can easily handle and analyze time
series data, perform calculations, and extract meaningful insights.
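
A short sketch combining these operations on a made-up daily series:

python code

import numpy as np
import pandas as pd

idx = pd.date_range(start='2022-01-01', end='2022-01-31', freq='D')
series = pd.Series(np.arange(len(idx)), index=idx)

shifted = series.shift(1)                        # values moved one step forward
rolling_mean = series.rolling(window=3).mean()   # 3-day moving average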

Time Zone Handling

Handling time zones is an important aspect of working with time series data, especially when
dealing with data from different regions or when performing analysis across different time
zones. In Python, the pytz and dateutil libraries, along with the capabilities of pandas, provide
functionality for working with time zones. Here's an explanation of time zone handling in
Python:

1.Time Zone Localization:

 Time zone localization involves assigning a specific time zone to a datetime object.
This is important when the original data does not have time zone information or when
converting data to a different time zone.
 The pytz library provides a comprehensive database of time zones, and you can use the
pytz.timezone() function to specify a time zone. The tz_localize() method in pandas is
used to localize a datetime object to a specific time zone.
 Example: localized_ts = ts.tz_localize('America/New_York') assigns the 'America/New_York' time zone to a naive pandas Timestamp ts.

2.Time Zone Conversion:

 Time zone conversion involves converting datetime objects from one time zone to
another. This is useful when you want to compare or combine data from different time
zones.
 The tz_convert() method in pandas is used to convert datetime objects from one time
zone to another. It automatically adjusts the datetime values to reflect the new time
zone.
 Example: converted_ts = localized_ts.tz_convert('Asia/Tokyo') converts the localized timestamp from the 'America/New_York' time zone to the 'Asia/Tokyo' time zone.

3.Time Zone-aware Timestamps:

 In pandas, the Timestamp object can be made time zone-aware by using the tz
parameter. Time zone-aware timestamps allow for easy manipulation and comparison
of dates and times across different time zones.
 Example: aware_timestamp = pd.Timestamp('2022-01-01 12:00', tz='Europe/Paris')
creates a time zone-aware timestamp for the specified datetime in the 'Europe/Paris'
time zone.

4.Handling Time Zone Offset:

 The dateutil library provides functions to handle time zone offsets. The
dateutil.relativedelta class can be used to perform arithmetic operations with time zone-
aware datetime objects, allowing for adjustments based on specific time zone offsets.
 Example: adjusted_datetime = aware_datetime + relativedelta(hours=2) adds 2 hours to a time zone-aware datetime object aware_datetime.

By using these libraries and techniques, you can effectively handle time zones in Python.
Whether it's localizing datetime objects to a specific time zone, converting between time zones,
or performing operations with time zone offsets, Python provides the necessary tools to work
with time series data in different time zones accurately and efficiently.
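
A minimal sketch of the localize-then-convert workflow using a pandas Timestamp:

python code

import pandas as pd

ts = pd.Timestamp('2022-01-01 12:00')             # naive timestamp
localized = ts.tz_localize('America/New_York')    # attach a time zone
converted = localized.tz_convert('Asia/Tokyo')    # same instant, new time zone
print(localized, converted)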

Periods and Period Arithmetic

Periods in Python represent a fixed-length span of time, such as a day, month, or year. The
pandas library provides the Period class to work with periods and perform period arithmetic.
Here's an explanation of periods and period arithmetic in Python:

1.Creating Periods:

Periods can be created using the pd.Period() function by specifying a date or time string and a
frequency code. The frequency code determines the length of the period, such as 'D' for daily,
'M' for monthly, 'Y' for yearly, and so on.

Example: period = pd.Period('2022-01', freq='M') creates a monthly period for January 2022.

2.Period Arithmetic:

Period arithmetic allows you to perform mathematical operations on periods, such as addition,
subtraction, and comparison. The arithmetic operations respect the defined frequency and
adjust the periods accordingly.

Example:
 period1 = pd.Period('2022-01', freq='M')
 period2 = pd.Period('2022-03', freq='M')
 period_diff = period2 - period1 calculates the difference between two periods, resulting in an offset representing the number of months between them.
 period_sum = period1 + 2 adds 2 periods to the original period, resulting in a new
period that is two months later.

3.Period Index:

Periods can be used as an index in a pandas Series or DataFrame, allowing for efficient
indexing and slicing based on periods. The pd.PeriodIndex class is used to create an index of
periods.

Example:

 periods = pd.PeriodIndex(['2022-01', '2022-02', '2022-03'], freq='M') creates a period index for the specified periods.
 series = pd.Series([10, 20, 30], index=periods) creates a Series with the period index.

4.Frequency Conversion:

 Periods can be converted to a different frequency using the asfreq() method. This allows
you to change the length of the period while preserving the start or end timestamp.
 Example: new_period = period.asfreq('Y') converts the original monthly period to a
yearly period.

Periods and period arithmetic provide a convenient way to work with fixed-length spans of
time in Python. Whether it's creating periods, performing arithmetic operations, or using
periods as an index, the pandas library offers robust functionality to handle time-based data at
different frequencies accurately.
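
A short runnable sketch of these period operations:

python code

import pandas as pd

p1 = pd.Period('2022-01', freq='M')
p2 = pd.Period('2022-03', freq='M')

print(p2 - p1)          # offset of 2 months between the periods
print(p1 + 2)           # Period('2022-03', 'M')
print(p1.asfreq('Y'))   # the same point in time at yearly frequency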

Resampling and Frequency Conversion

Resampling and frequency conversion are essential techniques for working with time series
data in Python. The pandas library provides robust functionality to perform resampling and
frequency conversion operations. Here's an explanation of how to perform resampling and
frequency conversion in Python:

1.Resampling:

 Resampling involves changing the frequency of your time series data. You can
upsample the data to a higher frequency or downsample it to a lower frequency.
 The resample() method in pandas is used to perform resampling. It takes a frequency
string as an argument to specify the new frequency.
 You can also specify an aggregation function to summarize the data within each new
frequency interval, such as sum(), mean(), max(), etc.

Example:

python code

# Upsample the data to a higher frequency (e.g., daily to hourly)


hourly_data = df.resample('H').mean()

# Downsample the data to a lower frequency (e.g., daily to monthly)


monthly_data = df.resample('M').sum()

2.Frequency Conversion:

Frequency conversion involves converting your time series data from one frequency to another.
It allows you to align the data to a different frequency or standardize it.

The asfreq() method in pandas is used to perform frequency conversion. It takes a frequency
string as an argument to specify the desired frequency.

The method handles the appropriate alignment or interpolation of data points based on the
specified frequency.

Example:

python code

# Convert the data to a different frequency (e.g., daily to weekly)


weekly_data = df.asfreq('W')

# Convert the data to a different frequency (e.g., daily to yearly)


yearly_data = df.asfreq('Y')

3.Resampling and Frequency Conversion Parameters:

Both resample() and asfreq() methods accept additional parameters to control the behavior of
the operation.

The aggregation function for resampling is specified by chaining a method such as sum(), mean(), or max() after resample(); the older how parameter is deprecated in modern pandas.

Missing values introduced during upsampling can be handled by chaining ffill() (forward fill) or bfill() (backward fill) after resample(); asfreq() also accepts a fill_value parameter.

Example:

python code

# Resample the data, summing values within each new frequency interval
monthly_data = df.resample('M').sum()

# Upsample the data, forward fill missing values


daily_data = df.resample('D').ffill()

Resampling and frequency conversion allow you to manipulate and analyze time series data at
different frequencies. Whether you need to change the frequency, aggregate the data, or align
it with other time series, pandas provides a comprehensive set of tools to perform these
operations effectively.

Time Series Plotting

Time series plotting is an essential part of analyzing and visualizing time-based data in Python.
The pandas library, in combination with matplotlib, provides powerful tools for creating
insightful time series plots. Here's an explanation of how to plot time series data in Python:

1.Importing the necessary libraries:

python code

import pandas as pd
import matplotlib.pyplot as plt

2.Loading and preparing the time series data:

 Load the time series data into a pandas DataFrame, ensuring that the date or time
column is of the correct data type.
 Set the date or time column as the index of the DataFrame to enable time-based
indexing and plotting.

python code

# Load the time series data from a CSV file


df = pd.read_csv('data.csv')

# Convert the date column to datetime data type


df['date'] = pd.to_datetime(df['date'])

# Set the date column as the index of the DataFrame


df.set_index('date', inplace=True)

3.Plotting time series data:

 Use the plot() method of the DataFrame to create basic line plots of the time series data.

 Customize the plot by specifying the plot type, labels, title, gridlines, and other options.

python code

# Plot the time series data


df.plot()

# Customize the plot


plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series Plot')
plt.grid(True)

# Display the plot


plt.show()

4.Additional customization options:

 Adjust the figure size using plt.figure(figsize=(width, height)) to control the dimensions
of the plot.
 Apply different plot styles using plt.style.use('style_name'), such as 'seaborn', 'ggplot',
or custom styles.
 Add legends, change line colors, specify line styles, or add markers to the plot to
enhance readability.

python code

# Set the figure size


plt.figure(figsize=(10, 6))

# Apply a specific plot style


plt.style.use('ggplot')

# Plot the time series data with customized options
df.plot(color='blue', linestyle='-', linewidth=2, marker='o', markersize=5, label='Data')

# Add a legend
plt.legend()

# Display the plot


plt.show()

Time series plotting in Python allows you to visualize trends, patterns, and anomalies in your
time-based data. By leveraging the capabilities of pandas and matplotlib, you can create
informative and visually appealing plots that help you gain insights into your time series data.

Moving Window Functions

Moving window functions, also known as rolling or sliding window functions, are a class of
operations commonly used in time series analysis and data smoothing. These functions
compute an aggregate value over a fixed-size window of consecutive data points as it slides
through the time series. The window "rolls" or "slides" over the data, updating the aggregate
value at each step. The pandas library in Python provides convenient methods to perform
moving window operations. Here's an explanation of moving window functions in Python:

1.Rolling Window Operations:

 Rolling window operations compute aggregate values over a fixed-size window of
consecutive data points.
 The rolling() method in pandas is used to create a rolling window object from a time
series.
 You can specify the window size using the window parameter; min_periods and center control the minimum number of observations required and the window alignment. The aggregation itself is chosen by chaining a method such as mean() or sum().

python code

# Create a rolling window object with a window size of 3


rolling_window = df.rolling(window=3)

# Compute the mean value over the rolling window
mean_values = rolling_window.mean()

2.Expanding Window Operations:

 Expanding window operations compute aggregate values over an expanding window
that grows with each data point.
 The expanding() method in pandas is used to create an expanding window object from
a time series.
 The min_periods parameter sets the minimum number of observations required; the aggregation itself is chosen by chaining a method such as sum() or mean().

Example:

python code

# Create an expanding window object


expanding_window = df.expanding()

# Compute the sum of values over the expanding window


sum_values = expanding_window.sum()

3.Aggregation Functions:

Various aggregation functions can be applied to the moving windows, such as mean, sum, min,
max, standard deviation, etc.

These functions are applied to the window using the mean(), sum(), min(), max(), std(), etc.,
methods of the rolling or expanding window objects.

Example:

python code
# Compute the maximum value over the rolling window
max_values = rolling_window.max()

# Compute the standard deviation over the expanding window
std_values = expanding_window.std()

Moving window functions are useful for calculating rolling averages, smoothing time series
data, detecting trends or outliers, and performing various other analyses on time-based data.
By specifying the window size and choosing an appropriate aggregation function, you can
derive meaningful insights from your time series data using moving window operations in
Python.
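
A self-contained sketch contrasting a rolling window with an expanding window on a synthetic series:

python code

import numpy as np
import pandas as pd

idx = pd.date_range('2022-01-01', periods=10, freq='D')
series = pd.Series(np.random.randn(10).cumsum(), index=idx)

rolling_mean = series.rolling(window=3, min_periods=1).mean()   # 3-day moving average
expanding_sum = series.expanding(min_periods=1).sum()           # running total
print(pd.DataFrame({'raw': series, 'roll3': rolling_mean, 'cumsum': expanding_sum}))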

Natural Language Processing

Natural Language Processing (NLP) is a field of study that focuses on the interaction between
computers and human language. It involves techniques and algorithms that enable computers
to understand, interpret, and generate human language in a meaningful way. Python provides
several powerful libraries and tools for NLP, making it a popular choice among developers.
Here's an overview of NLP in Python:

1.NLTK (Natural Language Toolkit):

NLTK is a widely used library for NLP in Python. It provides various functionalities for text
processing, tokenization, stemming, tagging, parsing, and more.

It also includes a large collection of corpora, lexical resources, and models for different NLP
tasks.

python code

import nltk

# First-time setup (downloads the models used by the tokenizer and tagger):
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

# Tokenization
text = "This is an example sentence."
tokens = nltk.word_tokenize(text)

# Part-of-speech tagging
tagged = nltk.pos_tag(tokens)

# Stemming
stemmer = nltk.stem.PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]

2.spaCy:

spaCy is a modern NLP library that offers high-performance and efficient tools for NLP tasks
such as tokenization, part-of-speech tagging, named entity recognition, dependency parsing,
and more.

It is known for its speed, accuracy, and ease of use, making it suitable for large-scale NLP
applications.

python code

import spacy
# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Tokenization
doc = nlp("This is an example sentence.")
tokens = [token.text for token in doc]

# Named entity recognition


entities = [(entity.text, entity.label_) for entity in doc.ents]

# Dependency parsing
for token in doc:
    print(token.text, token.dep_, token.head.text)

3.TextBlob:

TextBlob is a user-friendly library built on top of NLTK, providing a simple API for common
NLP tasks such as sentiment analysis, part-of-speech tagging, noun phrase extraction,
translation, and more.

It also offers a straightforward interface for working with textual data and performing basic
text processing operations.

Example:

python code

from textblob import TextBlob


# Sentiment analysis
text = "This is a great movie!"
blob = TextBlob(text)
sentiment = blob.sentiment

# Noun phrase extraction


nouns = blob.noun_phrases

These are just a few examples of the libraries available for NLP in Python. Other popular
libraries include Gensim for topic modeling, scikit-learn for machine learning-based NLP
tasks, and Transformers for advanced deep learning models such as BERT and GPT. With
these libraries, you can perform a wide range of NLP tasks, analyze textual data, and extract
valuable insights from text using Python.

Image Processing

Image processing is a field of study that involves manipulating digital images to enhance their
quality, extract useful information, or perform specific tasks. Python provides various libraries
and tools for image processing, making it a popular choice among developers. Here's an
overview of image processing in Python:

1.OpenCV (Open Source Computer Vision Library):

OpenCV is a widely used library for computer vision and image processing in Python.

It provides a comprehensive set of functions and algorithms for image manipulation, filtering,
feature detection, object recognition, and more.

python code

import cv2
# Read an image from file
image = cv2.imread('image.jpg')

# Convert the image to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Apply a Gaussian blur to the image


blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# Detect edges in the image using Canny edge detection


edges = cv2.Canny(blurred, 100, 200)

2.PIL (Python Imaging Library):

PIL is a library for opening, manipulating, and saving many different image file formats in
Python.

It provides functions for basic image processing tasks such as resizing, cropping, rotating, and
converting image formats.

python code
from PIL import Image

# Open an image file


image = Image.open('image.jpg')

# Resize the image


resized_image = image.resize((800, 600))

# Rotate the image


rotated_image = image.rotate(90)

# Convert the image to grayscale


grayscale_image = image.convert('L')

3.scikit-image:

scikit-image is a library that provides a collection of algorithms and functions for image
processing tasks in Python.

It offers various functionalities for image filtering, segmentation, morphology, and feature
extraction.

Example:

python code

import skimage.io
import skimage.filters
import skimage.morphology
import skimage.color

# Read an image from file and convert it to grayscale
image = skimage.io.imread('image.jpg')
gray = skimage.color.rgb2gray(image)

# Apply a Gaussian blur to the grayscale image
blurred = skimage.filters.gaussian(gray, sigma=2)

# Perform Otsu thresholding to obtain a binary image
thresholded = gray > skimage.filters.threshold_otsu(gray)

# Apply a morphological opening to the binary image
opened = skimage.morphology.opening(thresholded)

These are just a few examples of the libraries available for image processing in Python. Other
notable libraries include scikit-learn for machine learning-based image analysis, matplotlib and
seaborn for image visualization, and TensorFlow and PyTorch for deep learning-based image
processing tasks. With these libraries, you can perform a wide range of image processing tasks,
analyze and manipulate images, and develop computer vision applications using Python.

Machine Learning K Nearest Neighbors Algorithm for Classification

The k-nearest neighbors (KNN) algorithm is a simple yet powerful machine learning algorithm
used for both classification and regression tasks. In this explanation, we will focus on using the
KNN algorithm for classification.

Here's how the KNN algorithm works for classification:

1.Training:

 During the training phase, the algorithm simply stores the labeled data points in
memory.
 Each data point consists of a set of features (input variables) and a corresponding class
label (output variable).

2.Prediction:

 When a new unlabeled data point is given, the KNN algorithm predicts its class label
based on its similarity to the labeled data points.
 The algorithm measures the similarity using a distance metric (e.g., Euclidean distance).
 It considers the k nearest neighbors (data points with the smallest distances) to the new
data point.

3.Voting:

 For classification, the KNN algorithm employs majority voting among the k nearest
neighbors to determine the class label of the new data point.
 Each neighbor's class label contributes one vote, and the majority class label is assigned
to the new data point.

4.Choosing k:

 The value of k, the number of neighbors to consider, is a hyperparameter of the KNN
algorithm.
 It can be chosen based on the problem at hand and the characteristics of the dataset.
 A smaller k value makes the model more sensitive to noise, while a larger k value makes
it more biased.

 Here's an example of how to implement the KNN algorithm for classification using the
scikit-learn library in Python:

python code

from sklearn.neighbors import KNeighborsClassifier


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a KNN classifier object


knn = KNeighborsClassifier(n_neighbors=5)

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the classifier using the training data


knn.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = knn.predict(X_test)

# Evaluate the accuracy of the model


accuracy = accuracy_score(y_test, y_pred)

In this example, X represents the feature matrix (input variables) and y represents the
corresponding class labels. The train_test_split function is used to split the data into training
and testing sets. The fit method is used to train the KNN classifier, and the predict method is
used to make predictions on the testing data. Finally, the accuracy of the model is evaluated
using the accuracy_score function.

Remember to preprocess and normalize the data as needed before applying the KNN algorithm.
Additionally, feature scaling and handling categorical variables might be necessary for certain
datasets.

The KNN algorithm is relatively simple to understand and implement, making it a good starting
point for classification tasks. However, it is important to choose an appropriate value for k and
handle the curse of dimensionality when working with high-dimensional data.

Clustering

Clustering is an unsupervised machine learning technique used to group similar data points
together based on their characteristics or patterns. It is often used for exploratory data analysis,
pattern recognition, and data segmentation. The goal of clustering is to discover inherent
structures or clusters in the data without any predefined class labels.

There are various clustering algorithms available, but we will focus on two commonly used
algorithms: K-means clustering and hierarchical clustering.

1.K-means Clustering:

 K-means clustering is an iterative algorithm that partitions the data into k clusters,
where k is a predefined number chosen by the user.
 The algorithm works by initially randomly selecting k centroids (representative points)
in the feature space.
 It assigns each data point to the nearest centroid based on a distance metric (usually
Euclidean distance).
 After assigning all the data points, the algorithm updates the centroids by calculating
the mean of the points in each cluster.
 This process is repeated until the centroids no longer change significantly or a
maximum number of iterations is reached.

2.Hierarchical Clustering:

 Hierarchical clustering is a bottom-up (agglomerative) or top-down (divisive) approach
that creates a hierarchy of clusters.
 It starts with each data point as an individual cluster and then merges or divides clusters
based on their similarity.
 The algorithm iteratively combines or splits clusters based on a distance metric and a
linkage criterion.

 The distance metric can be Euclidean distance, Manhattan distance, or other similarity
measures.
 The linkage criterion determines how the distance between clusters is calculated, such
as complete linkage, single linkage, or average linkage.
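
While the worked example below uses K-means, here is a hedged sketch of agglomerative (hierarchical) clustering using scikit-learn's AgglomerativeClustering; the two-dimensional data is random and purely illustrative:

python code

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(20, 2)   # made-up feature matrix

agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)
print(labels)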

Here's an example of how to perform K-means clustering using the scikit-learn library in
Python:

python code

from sklearn.cluster import KMeans

# Create a K-means clustering object


kmeans = KMeans(n_clusters=3)

# Fit the clustering model to the data


kmeans.fit(X)

# Get the cluster labels for each data point


labels = kmeans.labels_

# Get the cluster centers (centroids)


centroids = kmeans.cluster_centers_

In this example, X represents the feature matrix (input variables). The n_clusters parameter
specifies the number of clusters to create. The fit method is used to fit the clustering model to
the data, and the labels_ attribute provides the cluster labels for each data point. The
cluster_centers_ attribute gives the coordinates of the cluster centers.

Clustering is an iterative process, and the choice of the number of clusters (k) is crucial. You
can use various evaluation metrics, such as the silhouette score or elbow method, to determine
the optimal number of clusters.
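
As a hedged sketch, the silhouette score from scikit-learn can be compared across several candidate values of k; the data here is random and purely illustrative:

python code

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(100, 2)   # made-up feature matrix

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher means better-separated clusters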

It's important to note that clustering is an unsupervised learning technique, meaning it does not
require labeled data. However, it is often used as a preprocessing step for other tasks, such as
anomaly detection, customer segmentation, or recommendation systems.

Unit 6: Visualization of Data with Python 10 Hours
Using Matplotlib Create line plots, area plots, histograms, bar charts, pie charts,
box plots and scatter plots and bubble plots. Advanced visualization tools such as
waffle charts, word clouds, seaborn and Folium for visualizing geospatial data.
Creating choropleth maps

Using Matplotlib Create line plots, area plots, histograms, bar charts, pie
charts, box plots and scatter plots and bubble plots.

Line plots

Matplotlib is a popular data visualization library in Python that provides a wide range of tools
for creating various types of plots, including line plots. Line plots are commonly used to
visualize the relationship between two variables and show how the data changes over a
continuous range.

To create line plots using Matplotlib, you need to follow these basic steps:

1.Import the necessary libraries:

python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Create lists or arrays to store the x-axis values and y-axis values.
 Ensure that both lists have the same length and the values are in the correct order.

3.Create the line plot:

python code

plt.plot(x, y)

 The plot() function is used to create the line plot.

 Pass the x-axis values as the first argument and the y-axis values as the second
argument.
 Matplotlib will automatically connect the data points with lines.

4.Display the plot:

python code

plt.show()

 The show() function is used to display the plot on the screen.

5.Customize the plot (optional):

 Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
 Set a title for the plot using the title() function.
 Customize the line style, color, and marker using optional parameters in the plot()
function.
 Add a legend to distinguish multiple lines in the plot using the legend() function.

Matplotlib provides many more customization options to enhance the appearance of your line
plot. You can control the line style, thickness, color, marker style, and much more. You can
also add grid lines, annotations, and text to provide additional information.

Here's a simple example of creating a line plot using Matplotlib:

python code

import matplotlib.pyplot as plt


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()

This code will create a line plot with the given x-axis and y-axis values. You can modify the
code based on your data and requirements to create more complex line plots with additional
customization.

Area plots

Area plots, also known as stacked area plots, are used to represent the cumulative magnitude
of different variables over a continuous range. They are useful for visualizing the composition
or distribution of multiple variables and showcasing their cumulative impact.

To create area plots using Matplotlib in Python, you can follow these steps:

1.Import the necessary libraries:

python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Create lists or arrays to store the x-axis values and y-axis values for each variable.
 Make sure the length of the x-axis values is the same for all variables.
 The y-axis values should represent the cumulative magnitude or proportion for each
variable at each point on the x-axis.

3.Create the area plot:

python code

plt.stackplot(x, y1, y2, y3, labels=['Variable 1', 'Variable 2', 'Variable 3'])

 The stackplot() function is used to create the area plot.


 Pass the x-axis values as the first argument.
 Pass the y-axis values for each variable as separate arguments.
 You can provide labels for each variable using the labels parameter.

4.Customize the plot (optional):

 Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
 Set a title for the plot using the title() function.
 Customize the colors, transparency, and other visual aspects of the area plot using
optional parameters in the stackplot() function.
 Add a legend using the legend() function to differentiate between different variables.

5.Display the plot:

Python code

plt.show()

Matplotlib provides additional customization options to enhance the appearance of your area
plot. You can control the colors, transparency, line styles, and markers of each variable. You
can also add grid lines, annotations, and text to provide additional information.

Here's a simple example of creating an area plot using Matplotlib:

Python code

import matplotlib.pyplot as plt


x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3, 4, 5]
y2 = [2, 4, 6, 8, 10]
y3 = [3, 6, 9, 12, 15]
plt.stackplot(x, y1, y2, y3, labels=['Variable 1', 'Variable 2', 'Variable 3'])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Area Plot')
plt.legend()
plt.show()

This code will create an area plot with the given x-axis and y-axis values for three variables.
Each variable is represented by a different color, and a legend is added to identify each variable.
You can modify the code based on your data and requirements to create more complex area
plots with additional customization.

Histograms,

Histograms are used to visualize the distribution of a continuous variable. They provide a
graphical representation of the frequency or count of values falling within specific intervals or
bins. Histograms help in understanding the shape, central tendency, and spread of the data.

To create histograms in Python, you can use various libraries such as Matplotlib, Seaborn, or
Pandas. Here, I will explain how to create histograms using Matplotlib.

Here are the steps to create a histogram using Matplotlib:

1.Import the necessary libraries:

python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Create a list or array containing the values of the variable you want to plot.

3.Create the histogram:

Python code

plt.hist(data, bins=10)

 The hist() function is used to create the histogram.


 Pass the data as the first argument.
 Specify the number of bins using the bins parameter. You can adjust the number of bins
to control the level of detail in the histogram.

4.Customize the plot (optional):

 Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
 Set a title for the plot using the title() function.
 Adjust the appearance of the histogram, such as the color, transparency, and edge color,
using optional parameters in the hist() function.

5.Display the plot:

python code

plt.show()

Matplotlib provides additional customization options to enhance the appearance of your
histogram. You can control the bin size, range, density, cumulative distribution, and more.

Here's a simple example of creating a histogram using Matplotlib:

Python code

import matplotlib.pyplot as plt


data = [1, 3, 2, 1, 4, 3, 2, 3, 4, 3, 2, 1, 3, 4, 2, 1, 3, 4]
plt.hist(data, bins=5)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

This code will create a histogram with the given data, dividing it into five bins. The x-axis
represents the value, and the y-axis represents the frequency. You can modify the code based
on your data and requirements to create more complex histograms with additional
customization.

Bar charts

Bar charts are a common visualization tool used to represent categorical data using rectangular
bars. They are particularly useful for displaying the frequency or count of different categories
or comparing values across different groups.

To create bar charts in Python, you can use various libraries such as Matplotlib, Seaborn, or
Plotly. Here, I will explain how to create bar charts using Matplotlib, which is a popular plotting
library.

Here are the steps to create a bar chart using Matplotlib:

1.Import the necessary libraries:

Python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Create a list or array containing the categories or labels for the x-axis.
 Create a corresponding list or array containing the values or counts for each category.

3.Create the bar chart:

Python code

plt.bar(x, height)

 The bar() function is used to create the bar chart.


 Pass the categories or labels as the first argument (x-axis).
 Pass the values or counts as the second argument (height of the bars).

4.Customize the plot (optional):

 Add labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
 Set a title for the plot using the title() function.

 Adjust the appearance of the bars, such as the color, width, and edge color, using
optional parameters in the bar() function.

5.Display the plot:

Python code

plt.show()

Matplotlib provides various customization options to enhance the appearance of your bar chart.
You can adjust the bar width, add error bars, annotate the bars with values, change the color
scheme, and more.

Here's a simple example of creating a bar chart using Matplotlib:

Python code

import matplotlib.pyplot as plt


categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [10, 15, 7, 12]

plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Count')
plt.title('Bar Chart')
plt.show()

This code will create a bar chart with the given categories and values. Each category is
represented by a bar, and the height of the bar represents the count. You can modify the code
based on your data and requirements to create more complex bar charts with additional
customization.

Pie charts

Pie charts are a popular way to represent categorical data, showing the proportion or percentage
of each category relative to the whole. They are particularly useful for visualizing data with a
small number of categories or comparing the relative sizes of different categories.

To create pie charts in Python, you can use various libraries such as Matplotlib, Plotly, or
Seaborn. Here, I will explain how to create pie charts using Matplotlib, which is a widely used
plotting library.

Here are the steps to create a pie chart using Matplotlib:

1.Import the necessary libraries:

Python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Create a list or array containing the labels for each category.


 Create a corresponding list or array containing the values or sizes for each category.

3.Create the pie chart:

Python code

plt.pie(sizes, labels=labels)

 The pie() function is used to create the pie chart.


 Pass the values or sizes as the first argument.
 Specify the labels for each category using the labels parameter.

4.Customize the plot (optional):

 Add a title to the plot using the title() function.


 Adjust the appearance of the pie chart, such as the colors, shadow, and start angle, using
optional parameters in the pie() function.
 Add a legend to the plot using the legend() function.

5.Display the plot:


Python code

plt.show()

Matplotlib provides additional customization options to enhance the appearance of your pie
chart. You can explode or highlight specific slices, add percentage values, adjust the text
properties, and more.

Here's a simple example of creating a pie chart using Matplotlib:

Python code

import matplotlib.pyplot as plt


labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [30, 20, 15, 35]

plt.pie(sizes, labels=labels, autopct='%1.1f%%')


plt.title('Pie Chart')
plt.show()

This code will create a pie chart with the given labels and sizes. Each category is represented
by a slice, and the size of the slice represents the proportion or percentage. The autopct
parameter is used to display the percentage values on the chart. You can modify the code based
on your data and requirements to create more complex pie charts with additional customization.

Box plots

Box plots, also known as box-and-whisker plots, are a useful visualization tool to display the
distribution of a continuous variable across different categories or groups. They provide a
summary of key statistical measures such as the median, quartiles, and potential outliers.

To create box plots in Python, you can use various libraries such as Matplotlib, Seaborn, or
Plotly. Here, I will explain how to create box plots using Matplotlib, which is a commonly used
plotting library.

Here are the steps to create a box plot using Matplotlib:

1.Import the necessary libraries:

Python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Organize your data into separate groups or categories, each containing a list or array of
values.
 Optionally, provide labels for each group if you want to display them on the plot.

3.Create the box plot:

Python code

plt.boxplot(data, labels=labels)

 The boxplot() function is used to create the box plot.


 Pass the data as the first argument.
 Specify the labels for each group using the labels parameter if desired.

4.Customize the plot (optional):

 Add a title to the plot using the title() function.


 Adjust the appearance of the box plot, such as the color, linewidth, and whisker style,
using optional parameters in the boxplot() function.
 Customize the axis labels and ticks using the appropriate functions (xlabel(), ylabel(),
xticks(), yticks()).

5.Display the plot:

Python code

plt.show()

Matplotlib provides additional customization options to enhance the appearance of your box
plot. You can show or hide specific elements such as outliers, caps, or median lines, change
the orientation of the plot, add grid lines, and more.

Here's a simple example of creating a box plot using Matplotlib:

Python code

import matplotlib.pyplot as plt

group1 = [10, 15, 20, 25, 30]


group2 = [12, 18, 22, 28, 35]
group3 = [8, 16, 24, 32, 40]

data = [group1, group2, group3]


labels = ['Group 1', 'Group 2', 'Group 3']

plt.boxplot(data, labels=labels)
plt.title('Box Plot')
plt.ylabel('Values')
plt.show()

This code will create a box plot with three groups. Each group is represented by a box, with
the central line inside the box representing the median. The whiskers extend to the minimum
and maximum values, and any potential outliers are indicated by individual points. You can
modify the code based on your data and requirements to create more complex box plots with
additional customization.

Scatter plots

Scatter plots are used to visualize the relationship between two continuous variables. They
show the individual data points as dots on a two-dimensional coordinate system, with one
variable plotted on the x-axis and the other variable plotted on the y-axis. Scatter plots help to
identify patterns, trends, or correlations between the two variables.

To create scatter plots in Python, you can use various plotting libraries such as Matplotlib,
Seaborn, or Plotly. Here, I will explain how to create scatter plots using Matplotlib, which is a
commonly used plotting library.

Here are the steps to create a scatter plot using Matplotlib:

1.Import the necessary libraries:

Python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Organize your data into two arrays or lists, one for the x-values and one for the y-values.

3.Create the scatter plot:

Python code

plt.scatter(x, y)

 The scatter() function is used to create the scatter plot.


 Pass the x-values as the first argument and the y-values as the second argument.

4.Customize the plot (optional):

 Add a title to the plot using the title() function.


 Label the x-axis and y-axis using the xlabel() and ylabel() functions.
 Adjust the appearance of the scatter plot, such as the marker style, size, color, or
transparency, using optional parameters in the scatter() function.
 Add a legend or colorbar if necessary.

5.Display the plot:

Python code

plt.show()

Matplotlib provides additional customization options to enhance the appearance of your scatter
plot. You can add regression lines, error bars, annotations, or other plot elements to provide
more context or insights.

Here's a simple example of creating a scatter plot using Matplotlib:

Python code

import matplotlib.pyplot as plt


x = [1, 2, 3, 4, 5]
y = [10, 15, 12, 18, 20]

plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.show()

This code will create a scatter plot with the given x-values and y-values. Each data point is
represented by a dot on the plot. You can modify the code based on your data and requirements
to create more complex scatter plots with additional customization.

Bubble plots

Bubble plots, also known as bubble charts, are a variation of scatter plots where the size of the
markers (bubbles) represents a third variable. They are useful for visualizing three-dimensional
data, where the x-axis and y-axis represent two continuous variables, and the size of the bubbles
represents the magnitude or frequency of another variable.

To create bubble plots in Python, you can use various plotting libraries such as Matplotlib or
Plotly. Here, I will explain how to create bubble plots using Matplotlib, which is a commonly
used plotting library.

Here are the steps to create a bubble plot using Matplotlib:

1.Import the necessary libraries:

Python code

import matplotlib.pyplot as plt

2.Prepare your data:

 Organize your data into three arrays or lists: one for the x-values, one for the y-values,
and one for the bubble sizes.

3.Create the bubble plot:

Python code

plt.scatter(x, y, s=sizes)

 The scatter() function is used to create the bubble plot.


 Pass the x-values as the first argument, the y-values as the second argument, and the
bubble sizes as the s parameter.

4.Customize the plot (optional):

 Add a title to the plot using the title() function.


 Label the x-axis and y-axis using the xlabel() and ylabel() functions.
 Adjust the appearance of the bubble plot, such as the color, transparency, or edge
properties of the markers, using optional parameters in the scatter() function.
 Add a legend or colorbar if necessary.

5.Display the plot:

Python code

plt.show()

Matplotlib provides additional customization options to enhance the appearance of your bubble
plot. You can use different marker shapes, colors, or color maps to represent additional
variables or categories.

Here's a simple example of creating a bubble plot using Matplotlib:

Python code

import matplotlib.pyplot as plt


x = [1, 2, 3, 4, 5]
y = [10, 15, 12, 18, 20]
sizes = [30, 50, 80, 20, 70]

plt.scatter(x, y, s=sizes)
plt.title('Bubble Plot')
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.show()

This code will create a bubble plot with the given x-values, y-values, and bubble sizes. Each
data point is represented by a bubble on the plot, and the size of the bubble represents the
corresponding size value. You can modify the code based on your data and requirements to
create more complex bubble plots with additional customization.

Advanced visualization tools such as waffle charts, word clouds, Seaborn, and Folium for visualizing geospatial data.

Waffle charts

Waffle charts are a type of visualization that represents proportions or percentages using square
tiles, where each tile represents a specific portion of the whole. Waffle charts can be a visually
appealing and intuitive way to convey information about categorical data.

In Python, you can create waffle charts using the pywaffle library. Here's a step-by-step
explanation of how to create a waffle chart:

1.Install the pywaffle library by running the following command in a terminal or command prompt (not inside the Python interpreter):

pip install pywaffle

2.Import the necessary libraries:

Python code

from pywaffle import Waffle


import matplotlib.pyplot as plt

3.Prepare your data:

 Create a dictionary or a pandas Series that represents the categories and their
corresponding values.
 Ensure that the values represent proportions or percentages of the whole.

4.Create the waffle chart:

Create a Matplotlib figure with FigureClass=Waffle and provide the values and categories.

You can customize the chart by specifying parameters such as the number of rows, columns,
colors, and figure size.

Python code

# Example data
data = {'Category A': 30,
'Category B': 20,
'Category C': 50}

# Create waffle chart
fig = plt.figure(FigureClass=Waffle, rows=5, columns=10, values=data,
                 colors=['#008080', '#FFA500', '#800080'])

# Customize the chart


plt.title('Waffle Chart')
plt.show()

In this example, the waffle chart will have 5 rows and 10 columns, for a total of 50 tiles. The values in the data dictionary (which sum to 100) are scaled proportionally across those 50 tiles, so each tile represents two units; the colors parameter sets the color for each category.

You can further customize the chart by adding a title, adjusting the figure size, or modifying
the legend.

Waffle charts can be a great way to visually compare proportions or percentages across
different categories. They offer a unique and engaging visualization that can enhance data
communication and storytelling.
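
As a sketch of that customization (the legend and figsize parameters follow recent pywaffle releases, so verify against your installed version), the example below adds a legend and a wider figure:

Python code

from pywaffle import Waffle
import matplotlib.pyplot as plt

data = {'Category A': 30, 'Category B': 20, 'Category C': 50}

fig = plt.figure(FigureClass=Waffle,
                 rows=5, columns=10,
                 values=data,
                 colors=['#008080', '#FFA500', '#800080'],
                 legend={'loc': 'lower center', 'ncol': 3},  # legend placed below the tiles
                 figsize=(8, 4))
plt.title('Waffle Chart with Legend')
plt.show()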

Word clouds

Word clouds are a popular visualization technique used to represent the frequency or
importance of words in a given text corpus. They provide a visual summary of the most
common words in the text, with the size of each word indicating its frequency or importance.
Python offers several libraries to create word clouds, including wordcloud and matplotlib.

Here's a step-by-step explanation of how to create a word cloud using the wordcloud library in
Python:

1.Install the wordcloud library by running the following command in a terminal or command prompt:

pip install wordcloud

2.Import the necessary libraries:

Python code
from wordcloud import WordCloud
import matplotlib.pyplot as plt

3.Prepare your data:

 Clean and preprocess your text data, removing any irrelevant words or characters.
 Convert your text data into a string or a list of words.

4.Generate the word cloud:

 Create a WordCloud object and provide the necessary parameters.


 Pass the text data to the generate() method of the WordCloud object.

Python code

# Example data
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua."

# Generate word cloud


wordcloud = WordCloud().generate(text)

# Display the word cloud


plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In this example, the WordCloud object generates the word cloud based on the provided text
data. You can further customize the word cloud by specifying parameters such as the
background color, word colors, font size, and maximum number of words displayed.

You can also use additional methods and functions provided by the wordcloud library to
enhance the appearance of the word cloud, such as masking the word cloud to a specific shape
or generating word clouds from word frequency dictionaries.
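
For instance, the generate_from_frequencies() method builds a word cloud directly from a word-count dictionary, and constructor parameters control the appearance; the counts below are invented:

Python code

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Invented word counts
frequencies = {'python': 50, 'data': 35, 'visualization': 20, 'cloud': 10}

# White background, capped word count, Matplotlib color map for the words
wordcloud = WordCloud(background_color='white',
                      max_words=100,
                      colormap='viridis').generate_from_frequencies(frequencies)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()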

Word clouds are a visually appealing way to explore and summarize textual data. They can be
useful for tasks such as analyzing customer reviews, identifying important keywords in a
document, or visually representing word frequencies in a corpus.

Seaborn

Seaborn is a powerful data visualization library in Python that is built on top of Matplotlib. It provides a high-level interface for creating visually appealing and informative statistical graphics. Seaborn simplifies the process of creating complex visualizations by offering a wide
range of plot types and built-in statistical functionalities.

Here are some key features and capabilities of Seaborn:

1. Enhanced Aesthetics: Seaborn comes with a set of aesthetically pleasing default themes and color palettes that improve the overall appearance of your plots. It allows you to create visually appealing visualizations with just a few lines of code.
2. High-Level Plotting Functions: Seaborn provides high-level plotting functions that
simplify the creation of various types of statistical plots. Some of the commonly used
plots include scatter plots, line plots, bar plots, box plots, violin plots, heatmaps, and
histograms.
3. Statistical Aggregation and Summarization: Seaborn integrates statistical
aggregations and summarization directly into its plotting functions. This allows you to
quickly visualize the distribution of data, calculate summary statistics, and explore
relationships between variables.
4. Categorical Data Visualization: Seaborn offers specialized functions for visualizing
categorical data. It provides options for creating grouped bar plots, box plots, violin
plots, and categorical scatter plots to analyze relationships between categorical
variables.
5. Statistical Estimation: Seaborn incorporates statistical estimation and inference into
its visualizations. It includes functionality for adding error bars, confidence intervals,
and visualizing statistical models such as regression lines and kernel density estimates.
6. Multi-Plot Grids: Seaborn allows you to create multi-plot grids that facilitate the
comparison of multiple variables or subsets of data. These grids can be customized to
display various types of plots, such as histograms, scatter plots, or box plots, across
different facets of the data.

Seaborn is widely used in the data science community for its ability to create visually appealing
and informative visualizations with minimal code. It complements the functionality of
Matplotlib and provides a higher-level interface for creating complex plots while incorporating
statistical analysis.
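
As a quick, hedged illustration of the categorical plotting functions described above, the sketch below draws a grouped box plot from Seaborn's built-in 'tips' dataset (load_dataset() downloads it on first use, so an internet connection may be required):

Python code

import seaborn as sns
import matplotlib.pyplot as plt

# Built-in example dataset (downloaded on first use)
tips = sns.load_dataset('tips')

# Grouped box plot: total bill per day, split by smoker status
sns.boxplot(data=tips, x='day', y='total_bill', hue='smoker')
plt.title('Total Bill by Day and Smoker Status')
plt.show()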

Here's an example that demonstrates the usage of Seaborn to create a scatter plot:

Python code

import seaborn as sns


import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [3, 5, 2, 6, 1]

# Create scatter plot using Seaborn (recent versions require keyword arguments)
sns.scatterplot(x=x, y=y)

# Set plot title and labels


plt.title('Scatter Plot Example')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Display the plot


plt.show()

In this example, we import the necessary libraries, including Seaborn and Matplotlib. We define two lists x and y as our data points. We then use the sns.scatterplot() function from Seaborn to create a scatter plot, passing the x and y data as keyword arguments. Finally, we customize the plot by adding a title and axis labels using Matplotlib, and display the plot using plt.show().

Seaborn offers many more functionalities for customizing and enhancing the appearance of the
scatter plot. You can further customize the color, marker style, size, and other visual aspects of
the plot using Seaborn's additional parameters and options. Additionally, Seaborn provides
various statistical functionalities to add regression lines, confidence intervals, or perform
additional data analysis within the scatter plot.
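
For instance, sns.regplot() fits and draws a regression line with a confidence band. The sketch below reuses the same made-up x and y lists:

Python code

import seaborn as sns
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [3, 5, 2, 6, 1]

# Scatter plot with a fitted regression line and 95% confidence band
sns.regplot(x=x, y=y, ci=95)
plt.title('Scatter Plot with Regression Line')
plt.show()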

Folium for visualizing geospatial data.

Folium is a Python library used for visualizing geospatial data on interactive maps. It leverages
the Leaflet.js library, which is a popular JavaScript library for creating interactive maps. Folium allows you to create maps directly in Python, making it convenient for data analysis and visualization tasks.

Here are some key concepts and features of Folium:

1. Map Creation: Folium provides a simple and intuitive way to create maps by
specifying the initial center location and zoom level. You can choose from various tile
providers, such as OpenStreetMap, Mapbox, and Stamen, to set the base map style.
2. Markers: Folium allows you to add markers to the map to represent specific locations.
You can customize the markers by setting their position, icon, color, and popup
messages. Markers are commonly used to plot points of interest or to represent data
points on the map.
3. Polygons and Polylines: Folium supports the drawing of polygons and polylines on
the map. Polygons are used to highlight areas or create boundaries, while polylines are
used to draw lines between specific points. This functionality is useful for visualizing
regions, routes, or trajectories.
4. Choropleth Maps: Folium provides the capability to create choropleth maps, where
areas are shaded or colored based on a specific attribute or value. Choropleth maps are
commonly used to visualize spatial patterns or thematic data, such as population density
or economic indicators.
5. Heatmaps: Folium allows you to create heatmaps, which visualize the density or
intensity of data points on the map. Heatmaps are useful for identifying hotspots or
areas of high activity based on the concentration of data.
6. Interactive Features: Folium supports interactive features like tooltips and popups.
Tooltips provide additional information when hovering over markers, while popups
display more detailed information when markers are clicked. This interactivity
enhances the user experience and enables further exploration of the geospatial data.

Folium provides a versatile and flexible framework for visualizing geospatial data in Python.
It integrates well with popular data analysis libraries like Pandas and NumPy, allowing you to
easily combine geospatial data with other data sources for comprehensive analysis and
visualization.


Here's an example that demonstrates how to use Folium to create a basic map and plot markers
on it:

Python code

import folium

# Create a map object


m = folium.Map(location=[51.5074, -0.1278], zoom_start=12)

# Add markers to the map


folium.Marker([51.5074, -0.1278], popup='London').add_to(m)
folium.Marker([48.8566, 2.3522], popup='Paris').add_to(m)
folium.Marker([40.7128, -74.0060], popup='New York').add_to(m)

# Display the map (renders inline in a Jupyter notebook)
m

In this example, we start by importing the folium library. We create a Map object by specifying the initial center location and zoom level. Next, we add markers to the map using the Marker class, specifying the latitude and longitude coordinates and an optional popup message for each marker. Finally, evaluating m renders the map inline in a Jupyter notebook; in a standalone script, save it with m.save('map.html') and open the file in a browser instead.

Folium provides many options to customize the map, such as setting the tile style, adding
overlays like polygons or lines, and applying different color schemes. It also supports different
tile providers like OpenStreetMap, Mapbox, and Stamen. Additionally, Folium allows you to
incorporate interactive features like tooltips, popups, and click events on the markers or other
map elements.
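
As a hedged sketch of those overlay and interactivity options, the example below draws a line between two cities and highlights one of them with a circle marker; the coordinates are illustrative:

Python code

import folium

m = folium.Map(location=[51.5074, -0.1278], zoom_start=5)

# Draw a line between London and Paris
folium.PolyLine(locations=[[51.5074, -0.1278], [48.8566, 2.3522]],
                color='blue', weight=3).add_to(m)

# Highlight London with a circle marker that shows a tooltip on hover
folium.CircleMarker(location=[51.5074, -0.1278], radius=10,
                    color='crimson', fill=True, tooltip='London').add_to(m)

# Save the map for viewing in a browser
m.save('overlays.html')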

With Folium, you can easily create interactive maps to visualize geospatial data, plot markers,
draw polygons, and add other geospatial overlays to enhance your visualizations.
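
Folium's own choropleth support, mentioned earlier, is exposed through the folium.Choropleth class. The sketch below is illustrative only: the GeoJSON file, CSV file, and column names are hypothetical placeholders:

Python code

import folium
import pandas as pd

m = folium.Map(location=[37.8, -96.0], zoom_start=4)

# Hypothetical inputs: region boundaries and a table of values per region
region_data = pd.read_csv('region_values.csv')  # columns: 'region_id', 'value'

folium.Choropleth(
    geo_data='regions.geojson',          # hypothetical GeoJSON of boundaries
    data=region_data,
    columns=['region_id', 'value'],      # key column and value column
    key_on='feature.id',                 # GeoJSON property matched against region_id
    fill_color='YlGn',
    legend_name='Value per region'
).add_to(m)

m.save('choropleth.html')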

Creating choropleth maps

To create choropleth maps in Python, you can use various libraries such as GeoPandas, Plotly,
or Matplotlib. Here, I will explain how to create choropleth maps using GeoPandas, which is a
powerful library for working with geospatial data.

Here are the steps to create choropleth maps using GeoPandas:

1.Import the necessary libraries:

Python code

import geopandas as gpd


import matplotlib.pyplot as plt

2.Read the shapefile or GeoJSON file containing the geographic boundaries and attribute
data:

Python code

data = gpd.read_file('path/to/shapefile.shp')

 Replace 'path/to/shapefile.shp' with the actual file path of your shapefile or GeoJSON
file.

3.Explore the data:

 Use the head() function to preview the attribute data in the GeoDataFrame.
 Verify that the data includes the necessary information for creating the choropleth map,
such as a column with the values to be mapped.

4.Create the choropleth map:

Python code

data.plot(column='column_name', cmap='color_map', linewidth=0.8,
          edgecolor='0.8', legend=True)

 Replace 'column_name' with the name of the column containing the values to be
mapped.
 Specify the desired color map ('color_map') to represent the data values. Matplotlib
provides various color maps, such as 'viridis', 'magma', 'coolwarm', etc.
 Adjust the linewidth, edge color, and legend properties according to your preferences.

5.Customize the map:

 Add a title to the map using the plt.title() function.
 Customize the legend or colorbar via the legend_kwds parameter of the plot() function, e.g. legend_kwds={'label': 'Population'}.
 Adjust the figure size, color, or other plot properties using Matplotlib functions.

6.Display the map:

Python code

plt.show()

GeoPandas also provides additional functionality for manipulating and analyzing geospatial
data. You can perform spatial joins, overlays, or spatial queries to enhance your analysis and
visualization.
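
For example, a spatial join attaches the attributes of containing polygons to point features. The sketch below uses hypothetical file paths, and note that older GeoPandas versions use the op= keyword instead of predicate=:

Python code

import geopandas as gpd

# Hypothetical layers: point locations and region polygons
points = gpd.read_file('path/to/points.shp')
regions = gpd.read_file('path/to/regions.shp')

# Attach each point to the region polygon that contains it
joined = gpd.sjoin(points, regions, how='inner', predicate='within')
print(joined.head())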

Here's a simple example of creating a choropleth map using GeoPandas:

Python code

import geopandas as gpd


import matplotlib.pyplot as plt

# Read shapefile
data = gpd.read_file('path/to/shapefile.shp')

# Create choropleth map with a labelled colorbar
data.plot(column='population', cmap='viridis', linewidth=0.8, edgecolor='0.8',
          legend=True, legend_kwds={'label': 'Population'})

# Customize the map
plt.title('Population Choropleth Map')

# Display the map


plt.show()

This code will create a choropleth map using the 'population' column from the shapefile. The color map 'viridis' is used to represent the population values, and a colorbar labelled 'Population' is added to the map. Adjust the code based on your data and requirements to create choropleth maps for different variables or regions.
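
A common refinement is to bin the values into discrete classes instead of using a continuous color ramp. GeoPandas supports this through the scheme parameter, which relies on the optional mapclassify package; the sketch below assumes the same hypothetical shapefile:

Python code

import geopandas as gpd
import matplotlib.pyplot as plt

data = gpd.read_file('path/to/shapefile.shp')

# Quantile classification into 5 bins (requires: pip install mapclassify)
data.plot(column='population', cmap='viridis', scheme='quantiles', k=5,
          legend=True, linewidth=0.8, edgecolor='0.8')
plt.title('Population by Quintile')
plt.show()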

-------------THANK-YOU-------------
