Data Visualization
Presented by,
Dr. Dhakshayani J
Assistant Professor
Department of Computer Science & Engineering
Indian Institute of Information Technology Kottayam
E-mail: [email protected]
What will we cover today?
▪ Motivation
▪ Useful Python Libraries
▪ Types of Plots
▪ Exploratory data analysis
Why visualization?
Visualization Objectives
▪ Record information
▪ Analyze data to support reasoning
▪ Confirm hypotheses
▪ Communicate ideas to others
Why Visualize
▪ To record information
▪ To identify trends and patterns
▪ To point out interesting things
▪ To communicate information in a better way
▪ To analyze data
“I can store numbers and other objects in a Python list and do all sorts
of computations and manipulations through list comprehensions, for-loops,
etc. What do I need a NumPy array for?”
Differences between lists and ndarrays
• The key difference between an array and a list is that arrays are
designed to handle vectorised operations, while Python lists are not.
• That means that when you apply a function to an array, it is performed
on every item in the array rather than on the array object as a whole.
• Let’s suppose you want to add the number 2 to every item in the list.
The intuitive way to do this is something like this:
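The slide's original snippet was not preserved. A minimal sketch of the attempt, using illustrative values, might look like this; adding an integer to a list raises a TypeError, so a list needs an explicit loop or comprehension instead.

```python
# Illustrative values; the data used on the original slide is not preserved.
numbers = [0, 1, 2, 3, 4]

try:
    result = numbers + 2  # list + int is not defined for Python lists
except TypeError as err:
    result = None
    print("TypeError:", err)

# With a plain list, each item has to be handled explicitly.
added = [n + 2 for n in numbers]
print(added)  # [2, 3, 4, 5, 6]
```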
• That was not possible with a list, but you can do that on an array:
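As a sketch of the array version (again with illustrative values, since the original snippet is missing), the addition is applied to every element at once:

```python
import numpy as np

# Illustrative values; not the data from the original slide.
arr = np.array([0, 1, 2, 3, 4])

# The scalar is added to each element (a vectorised operation).
arr_plus_two = arr + 2
print(arr_plus_two)  # [2 3 4 5 6]
```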
• Note that once a NumPy array is created, its size is fixed.
• To add elements, you have to create a new array.
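One way to illustrate this, assuming np.append is an acceptable choice (the slides do not name a function), is that it returns a new, larger array and leaves the original untouched:

```python
import numpy as np

arr = np.array([1, 2, 3])

# np.append does not grow arr in place; it builds and returns a new array.
bigger = np.append(arr, [4, 5])

print(arr.size, bigger.size)  # 3 5
```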
Create a 2D array from a list of lists
• You can pass a list of lists to create a matrix-like 2D array.
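A minimal sketch with illustrative values (the original snippet is missing):

```python
import numpy as np

# Each inner list becomes one row of the array.
list2d = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr2d = np.array(list2d)

print(arr2d)
print(arr2d.shape)  # (3, 3)
```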
The dtype argument
• You can specify the data type by setting the dtype argument.
• Some of the most commonly used NumPy dtypes are: float, int, bool, str,
and object.
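For instance, a sketch that requests floating-point storage for integer input (values are illustrative):

```python
import numpy as np

# dtype='float' asks NumPy to store the values as floating-point numbers.
arr_float = np.array([[0, 1, 2], [3, 4, 5]], dtype='float')

print(arr_float)
print(arr_float.dtype)  # float64
```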
The astype method
• You can also convert an array to a different data type using the astype method.
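A short sketch with illustrative values; note that converting floats to integers truncates the fractional part rather than rounding:

```python
import numpy as np

arr_float = np.array([1.7, 2.2, 3.9])

# astype returns a new array; arr_float itself is unchanged.
arr_int = arr_float.astype('int')

print(arr_int)  # [1 2 3]
```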
• Remember that, unlike lists, all items in an array have to be of the same
type.
dtype=‘object’
• However, if you are uncertain about what data type your array will
hold, or if you want to hold characters and numbers in the same
array, you can set the dtype as 'object'.
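A minimal sketch of a mixed-content array, with illustrative values:

```python
import numpy as np

# With dtype='object', each element keeps its own Python type.
arr_obj = np.array([1, 'a', 2.5], dtype='object')

print(arr_obj)
print(arr_obj.dtype)  # object
```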
The tolist() method
• You can always convert an array into a list using the tolist() method.
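For example, with an illustrative 2D array, the result is a nested Python list:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# tolist() converts the array (and its elements) to plain Python objects.
as_list = arr.tolist()

print(as_list)  # [[1, 2], [3, 4]]
```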
Inspecting a NumPy array
• NumPy arrays expose a range of attributes, such as shape, dtype, size
and ndim, that let you inspect different aspects of an array:
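A sketch of the most common inspection attributes, on an illustrative array:

```python
import numpy as np

arr2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype='float')

print(arr2d.shape)  # (2, 4)  rows and columns
print(arr2d.ndim)   # 2       number of dimensions
print(arr2d.size)   # 8       total number of items
print(arr2d.dtype)  # float64 data type of the items
```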
Extracting specific items from an array
• You can extract portions of the array using indices, much like when
you’re working with lists.
• Unlike lists, however, arrays can optionally accept as many indices in
the square brackets as there are dimensions.
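As an illustration (the values are made up), a 2D array takes a row index and a column index inside one pair of brackets:

```python
import numpy as np

arr2d = np.array([[1, 2, 3, 4],
                  [3, 4, 5, 6],
                  [5, 6, 7, 8]])

# First two rows and first two columns.
top_left = arr2d[:2, :2]
print(top_left)

# Single item: row 1, column 2.
single = arr2d[1, 2]
print(single)  # 5
```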
Boolean indexing
• A Boolean index array has the same shape as the array to be filtered,
but it contains only True and False values.
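A small sketch with illustrative values: a comparison produces the Boolean array, which then selects the matching items.

```python
import numpy as np

arr = np.array([1, 5, 2, 8, 3])

# The comparison is evaluated element-wise and yields a Boolean array.
mask = arr > 2
print(mask)  # [False  True False  True  True]

# Only the positions where mask is True are kept.
filtered = arr[mask]
print(filtered)  # [5 8 3]
```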
Pandas
• Pandas, like NumPy, is one of the most popular Python libraries for
data analysis.
• It is a high-level abstraction over the lower-level NumPy, which is
implemented largely in C.
• Pandas provides high-performance, easy-to-use data structures and
data analysis tools.
• There are two main structures used by pandas: data frames and
series.
Indices in a pandas series
• A pandas series is similar to a list, but differs in the fact that a series
associates a label with each element. This makes it look like a dictionary.
• If an index is not explicitly provided by the user, pandas creates a RangeIndex
ranging from 0 to N-1.
• Each series object also has a data type.
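A minimal sketch of a series created without an explicit index (illustrative values):

```python
import pandas as pd

series = pd.Series([5, 6, 7, 8, 9, 10])

print(series)
print(series.index)  # RangeIndex(start=0, stop=6, step=1)
print(series.dtype)  # the data type shared by all elements
```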
• As you may suspect by this point, a series has ways to extract all of
the values in the series, as well as individual elements by index.
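For example, with the same kind of illustrative series, .values returns the underlying data and square brackets look up one element by its index label:

```python
import pandas as pd

series = pd.Series([5, 6, 7, 8, 9, 10])

# All values as a NumPy array.
values = series.values
print(values)  # [ 5  6  7  8  9 10]

# One element, looked up by its index label.
fifth = series[4]
print(fifth)  # 9
```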
• It is easy to retrieve several elements of a series by their indices or
make group assignments.
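A sketch of both operations, assuming a series with hypothetical string labels:

```python
import pandas as pd

# The labels 'a' to 'f' are illustrative.
series2 = pd.Series([5, 6, 7, 8, 9, 10],
                    index=['a', 'b', 'c', 'd', 'e', 'f'])

# Retrieve several elements by passing a list of labels.
subset = series2[['a', 'b', 'f']]
print(subset)

# Group assignment: set the same labels to one value.
series2[['a', 'b', 'f']] = 0
print(series2)
```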
Pandas data frame
• Simplistically, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Rows consist of elements inside series.
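As a sketch, a data frame can be built from a dictionary whose keys become column names. The column names and numbers below are placeholders, not real statistics and not the data from the original slide.

```python
import pandas as pd

# Placeholder data: each key is a column, each list holds that column's rows.
df = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'population': [10, 20, 30],
    'area': [100, 200, 300],
})

print(df)

# Each column is itself a series.
print(isinstance(df['population'], pd.Series))  # True
```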
• You can also create a data frame from a list.
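A minimal sketch using a list of rows; the column names and values are placeholders:

```python
import pandas as pd

# Each inner list is one row; column names are supplied separately.
rows = [['A', 10], ['B', 20], ['C', 30]]
df = pd.DataFrame(rows, columns=['country', 'population'])

print(df)
```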
• You can ascertain the type of a column with the type() function.
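For instance, on a placeholder data frame, type() confirms that a single column is a series:

```python
import pandas as pd

df = pd.DataFrame({'country': ['A', 'B'], 'population': [10, 20]})

# Selecting one column with square brackets returns a Series.
column = df['country']
print(type(column))
```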
• A pandas data frame object has two indices: a column index and a row
index.
• Again, if you do not provide one, Pandas will create a RangeIndex from 0
to N-1.
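Both indices can be inspected directly; a sketch on placeholder data:

```python
import pandas as pd

df = pd.DataFrame({'country': ['A', 'B', 'C'], 'population': [10, 20, 30]})

print(df.columns)  # column index: 'country', 'population'
print(df.index)    # row index: RangeIndex(start=0, stop=3, step=1)
```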
• There are numerous ways to provide row indices explicitly.
• For example, you could provide an index when creating a data frame:
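A sketch of passing the index at creation time; the labels 'AA', 'BB' and 'CC' are hypothetical codes:

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['A', 'B', 'C'], 'population': [10, 20, 30]},
    index=['AA', 'BB', 'CC'],  # hypothetical row labels
)

print(df)
```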
• Or you could assign the index later, after the data frame has been created.
• Here, I also named the index ‘country code’.
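The original snippet is not preserved; one way this might look, with hypothetical labels, is:

```python
import pandas as pd

df = pd.DataFrame({'country': ['A', 'B', 'C'], 'population': [10, 20, 30]})

# Replace the default RangeIndex with explicit labels, then name the index.
df.index = ['AA', 'BB', 'CC']
df.index.name = 'country code'

print(df)
```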
• Row access using index can be performed in several ways.
• First, you could use .loc[] and provide an index label.
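A sketch of label-based access with .loc, on placeholder data. The positional counterpart .iloc is shown as an assumption about what the second missing snippet covered; the slides do not name it.

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['A', 'B', 'C'], 'population': [10, 20, 30]},
    index=['AA', 'BB', 'CC'],
)

# Access a row by its index label.
row_by_label = df.loc['BB']
print(row_by_label)

# Access the same row by its integer position (assumed second approach).
row_by_position = df.iloc[1]
print(row_by_position)
```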
• Particular rows and columns can be selected in the same way.
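For example, .loc accepts row labels and column names together (placeholder data):

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['A', 'B', 'C'],
     'population': [10, 20, 30],
     'area': [100, 200, 300]},
    index=['AA', 'BB', 'CC'],
)

# Rows 'AA' and 'BB', column 'population' only.
subset = df.loc[['AA', 'BB'], ['population']]
print(subset)
```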
Filtering
• Filtering is performed using so-called Boolean arrays.
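A minimal sketch on placeholder data: a comparison on a column yields a Boolean series, which is then used to keep only the matching rows.

```python
import pandas as pd

df = pd.DataFrame({'country': ['A', 'B', 'C'], 'population': [10, 20, 30]})

# Boolean series: True where the condition holds.
mask = df['population'] > 15
print(mask.tolist())  # [False, True, True]

# Keep only the rows where mask is True.
filtered = df[mask]
print(filtered)
```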
Deleting columns
• You can delete a column using the drop() method.
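A sketch on placeholder data; by default drop() returns a new data frame and leaves the original unchanged:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'population': [10, 20, 30],
    'area': [100, 200, 300],
})

# Remove the 'area' column; df itself still contains it.
smaller = df.drop(columns=['area'])

print(smaller.columns.tolist())  # ['country', 'population']
print(df.columns.tolist())       # ['country', 'population', 'area']
```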
Reading from and writing to a file
• Pandas supports many popular file formats including CSV, XML, HTML,
Excel, SQL, JSON, etc.
• Out of all of these, CSV is the file format that you will work with the
most.
• You can read in the data from a CSV file using the read_csv() function.
• Similarly, you can write a data frame to a csv file with the to_csv()
function.
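A minimal round-trip sketch: the file name countries.csv and the data are placeholders, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'country': ['A', 'B'], 'population': [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'countries.csv')  # placeholder file name

    # index=False skips writing the row index as an extra column.
    df.to_csv(path, index=False)

    # Read the file back into a new data frame.
    loaded = pd.read_csv(path)

print(loaded)
```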
• Pandas has the capacity to do much more than what we have covered
here, such as grouping data and even data visualisation.
• However, as with NumPy, we don’t have enough time to cover every
aspect of pandas here.
Matplotlib
Basic types of plots
• Line plot
• Bar plot
• Scatter plot
• Histogram plot
• Box plot
A box plot summarizes a set of data. Its shape shows how the data is distributed and highlights any
outliers. Because more than one box plot can be drawn per graph, it is a useful way to compare
different sets of data. Box plots can be displayed alongside a number line, horizontally or vertically.
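A minimal sketch of two box plots on one graph, using made-up data in which one value (30) lies far from the rest and is drawn as an outlier. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up data; 30 lies far outside the bulk of group_a.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
group_b = [4, 5, 5, 6, 6, 7, 7, 8]

fig, ax = plt.subplots()

# One box per data set, drawn side by side for comparison.
result = ax.boxplot([group_a, group_b])
ax.set_xticklabels(["group A", "group B"])
ax.set_ylabel("value")
```

In an interactive session, plt.show() displays the figure, and fig.savefig() writes it to a file.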
Line plot
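The slide's figure is not preserved. As a sketch with made-up data, a line plot connects ordered data points with line segments:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up data: y is the square of x.
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()
line, = ax.plot(x, y, marker="o")  # plot returns a list of line objects
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("Line plot")
```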
Bar plot
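The slide's figure is not preserved. As a sketch with made-up categories and counts, a bar plot draws one bar per category, with the bar height giving the value:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up categories and counts.
categories = ["A", "B", "C"]
counts = [3, 7, 5]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)  # one bar per category
ax.set_xlabel("category")
ax.set_ylabel("count")
ax.set_title("Bar plot")
```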
Thank you