
DVP First Module

Uploaded by

padma

Data Visualization Using Python

Code: BCS358D
Course: Data Visualization with Python
Credits: 1
CIE: 50 Marks
L:T:P - 1:0:0
SEE: 50 Marks
SEE Hour: 1
Total Marks: 100
Module-1
1. Why Is Data Visualization Important?
2. Why Do Modern Businesses Need Data Visualization?
3. The Future of Data Visualization
4. How Data Visualization Is Used for Business Decision-Making
5. Introducing Data Visualization Techniques
Module-2
• Data Gathering and Cleaning
• Cleaning Data
• Checking for Missing Values
• Handling the Missing Values
• Reading and Cleaning CSV Data
• Merging and Integrating Data
Module-3
• Data Exploring and Analysis: Series Data Structures, Data Frame Data Structures,
Data Analysis, Statistical Analysis, Data Grouping, Iterating Through Groups,
Aggregations, Transformations, Filtration
• Data Visualization: Direct Plotting, Seaborn Plotting System, Matplotlib Plot
TEXT BOOK

Data Analysis and Visualization Using Python, Dr. Ossama Embarak, 2018
Why Is Data Visualization Important
A picture is worth a thousand words

Humans can understand data better through pictures than by reading
numbers in rows and columns.

Accordingly, if the data is presented in a graphical format, people are
better able to find correlations and raise important questions.
Data visualization
Data visualization is the process of interpreting data and
presenting it in a pictorial or graphical format.
Data visualization helps the business to achieve numerous goals.

– Converting the business data into interactive graphs for
dynamic interpretation to serve the business goals
– Transforming data into visually appealing, interactive
dashboards of various data sources to serve the business
with the insights
– Creating more attractive and informative dashboards of
various graphical data representations
– Making appropriate decisions by drilling into the data
and finding the insights
– Figuring out the patterns, trends, and correlations in the
data being analyzed to determine where they must
improve their operational processes and thereby grow
their business
– Giving a fuller picture of the data under analysis
– Organizing and presenting massive data intuitively to
present important findings from the data
– Making better, quick, and informed decisions with data
visualization
Why Do Modern Businesses Need Data
Visualization?
Data visualization helps companies to analyze their different processes so that
management can focus on the areas for improvement to generate more
revenue and improve productivity.

It brings business intelligence to life.

It applies a creative approach to understanding the hidden information within
the business data.

It provides a better and faster way to identify patterns,
trends, and correlations in the data sets that would remain
undetected with just text.
How Data Visualization Is Used for
Business Decision-Making
Data visualization is a real asset for any business to help make real-time business
decisions.

It visualizes extracted information into logical and meaningful parts and helps
users avoid information overload by keeping things simple, relevant, and clear.

There are many ways in which visualizations help a business to improve its
decision-making.
• Faster Responses

Quick response to customers’ or users’ requirements is important for any
company to retain its clients, as well as to keep their loyalty.

With the massive amount of data collected daily via social networks or via
companies’ systems, it becomes incredibly useful to put interpretations
of the collected data into the hands of managers and decision-makers so they
can quickly identify issues and improve response times.

• Simplicity

It is impossible to make efficient decisions based on large amounts of raw
data.

• Easier Pattern Visualization

Data visualization provides easier approaches to identifying upcoming
trends and patterns within data sets and hence enables businesses to
make efficient decisions and prepare strategies in advance.
Introducing Data Visualization
Techniques
Data visualization aims to understand data by extracting and graphing
information to show patterns, spot trends, and identify outliers.
There are two basic types of data visualization.

• Exploration helps to extract information from the collected data.
• Explanation demonstrates the extracted information.

There are many types of 2D data visualizations, such as temporal,
multidimensional, hierarchical, and network.
Loading Libraries

Some libraries are bundled with Python, while others must be
downloaded and installed separately.
For instance, you can install Matplotlib using pip as follows:
python -m pip install -U pip setuptools
python -m pip install matplotlib
Popular Libraries for Data Visualization
in Python
The Python language provides numerous data visualization libraries for
plotting data.
The most commonly used data visualization libraries are Pygal, Altair, VisPy,
PyQtGraph, Matplotlib, Bokeh, Seaborn, Plotly, and ggplot.
Tabular Data and Data Formats

Data is available in different forms.

It can be unstructured data, semi-structured data, or structured data.
Python provides different structures to maintain and manipulate data,
such as variables, lists, dictionaries, and tuples, while the pandas library
adds series and data frames (the panel structure mentioned in older
material has been removed from recent pandas releases).
pandas
It is a package useful for data analysis and manipulation.
Pandas provides an easy way to create, manipulate, and wrangle
data.
Pandas provides powerful and easy-to-use data structures, as well
as the means to quickly perform operations on these structures.
A Pandas Series

A series is a one-dimensional labeled array capable of
holding data of any type (integer, string, float, Python
objects, etc.).

Using the Series() constructor, we can simply turn a list, tuple,
or dictionary into a Series.

The row labels of the Series are referred to as
the index. A pandas Series can have only one column.
dtype: Return the dtype.

ndim: Return the Number of dimensions

size: Return the number of elements.

name: Return the name of the Series.

index: The index of the series

head(): Return the first n rows.

tail(): Return the last n rows.
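
The attributes and methods listed above can be tried on a small Series. The following is a minimal sketch; the values and the name "marks" are illustrative assumptions, not taken from the slides.

```python
import pandas as pd

# Hypothetical sample data; the name "marks" is only for illustration
s = pd.Series([10, 20, 40, 80, 100], name="marks")

print(s.dtype)    # data type of the values
print(s.name)     # name given to the Series
print(s.head(2))  # first two rows
print(s.tail(2))  # last two rows
```

Calling head() or tail() without an argument returns five rows, which for this Series is every row.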


import pandas as pd

# Data to be stored in the Pandas Series
data = [10, 20, 40, 80, 100]

# Create a Series using the Series() method
s = pd.Series(data)

# Display the Series
print("Series: \n", s)

# Dimensions
print("\nSeries Dimensions: ", s.ndim)
• size
• The pandas.Series.size attribute returns the number of
elements in the pandas Series.
import pandas as pd

# Data to be stored in the Pandas Series
data = [10, 20, 40, 80, 100]

# Create a Series using the Series() method
s = pd.Series(data)

# Display the Series
print("Series: \n", s)

# Return the number of elements in the Series
print("\nSeries Size: ", s.size)
import pandas as pd

# Data to be stored in the Pandas Series
data = [10, 20, 40, 80, 100]

# Create a Series with custom index labels
s = pd.Series(data, index=["RowA", "RowB", "RowC", "RowD", "RowE"])

# Display the Series
print("Series (with custom index labels): \n", s)

# Return the index of the Series
print("\nSeries Index: ", s.index)
import pandas as pd
data = pd.Series([145, 142, 38, 13], name='counts')
print(data)
The syntax for the Series() method, which is used to construct pandas series
objects, is as follows:

Syntax:

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False)
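Series: A Series is a one-dimensional, array-like structure with
homogeneous data (a single dtype), which can be used to handle and manipulate data.

What makes it special is its index, which labels every value and supports
label-based lookup and automatic alignment between Series. The index of a
Series can be replaced as a whole by assigning a new set of labels.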
It has two parts:

1. Data part (an array of actual data)
2. Associated index with data (associated array of indexes or data labels)
We can say that a Series is a labeled one-dimensional array
which can hold any type of data.

✓ The data of a Series is mutable, meaning its values can be changed.
✓ The size of a Series is immutable, meaning its length
cannot be changed.
✓ A Series may be considered a data structure with two
arrays, of which one works as the index (labels) and the
other holds the actual data.
✓ Row labels in a Series are called the index.
Syntax to create a Series
<Series Object> = pandas.Series(data, index=idx)   (index is optional)

where data may be a Python sequence (such as a list), an ndarray, a scalar value, or a
Python dictionary.
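
The four kinds of input named above can be sketched as follows. The variable names and sample values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

from_list = pd.Series([10, 20, 30])                 # Python list
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))   # NumPy ndarray
from_scalar = pd.Series(7, index=["a", "b", "c"])   # scalar repeated for every label
from_dict = pd.Series({"x": 1, "y": 2})             # dictionary keys become the labels

print(from_list)
print(from_array)
print(from_scalar)
print(from_dict)
```

Note that a scalar needs an explicit index so that pandas knows how many times to repeat the value.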
A Pandas Data Frame

A data frame is a two-dimensional data structure. In other words, data is
aligned in a tabular fashion in rows and columns.
The following example creates a table
with two columns and three rows of data.
import pandas as pd

data = [['Ahmed', 35], ['Ali', 17], ['Omar', 25]]

DataFrame1 = pd.DataFrame(data, columns=['Name', 'Age'])

print(DataFrame1)
To retrieve data from a data frame starting from index 1 up to the end
of rows.
DataFrame1[1:]
We can create a data frame using a dictionary.

import pandas as pd

data = {'Name': ['Ahmed', 'Ali', 'Omar', 'Salwa'], 'Age': [35, 17, 25, 30]}

dataframe2 = pd.DataFrame(data, index=[100, 101, 102, 103])

print(dataframe2)
We can select only the first two rows in a data frame.

dataframe2[:2]

We can select only the Name column in a data frame.

dataframe2['Name']
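
As a minimal sketch of the selections above, the same data frame can be sliced by row and by column. The loc and iloc lines are additions not shown on the slides; they select a single row by label and by position respectively.

```python
import pandas as pd

data = {'Name': ['Ahmed', 'Ali', 'Omar', 'Salwa'], 'Age': [35, 17, 25, 30]}
df = pd.DataFrame(data, index=[100, 101, 102, 103])

first_two = df[:2]        # slice by position: rows labeled 100 and 101
names = df['Name']        # one column, returned as a Series
by_label = df.loc[102]    # one row selected by its index label
by_position = df.iloc[0]  # one row selected by integer position

print(first_two)
print(names)
print(by_label)
print(by_position)
```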
# Create a variable with an integer value.
a = 100
print("The type of variable having value", a, " is ", type(a))

# Create a variable with a float value.
b = 10.2345
print("The type of variable having value", b, " is ", type(b))

# Create a variable with a complex value.
c = 100 + 3j
print("The type of variable having value", c, " is ", type(c))
Python Lists
Lists are used to store multiple items in a single variable.

Lists are one of 4 built-in data types in Python used to store
collections of data; the other 3 are Tuple, Set,
and Dictionary, all with different qualities and usage.

Lists are created using square brackets:

mylist = ["NIE", "JCE", "BMS"]

thislist = ["abc", "def", "ghi"]

print(thislist)

List Items
List items are ordered, changeable, and allow duplicate values.

List items are indexed: the first item has index [0], the second item has
index [1], and so on.
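
The list properties described above (ordered, changeable, duplicates allowed, indexed from 0) can be checked with a short sketch. The replacement and appended values are hypothetical.

```python
colleges = ["NIE", "JCE", "BMS", "NIE"]  # duplicate values are allowed

print(colleges[0])       # first item, index 0
print(colleges[1])       # second item, index 1

colleges[1] = "PES"      # hypothetical value; list items can be changed
colleges.append("RVCE")  # hypothetical value; new items go to the end, order is kept
print(colleges)
```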
dictionary
Dict = {"Name": "Gayle", "Age": 25}
PANDAS
Python's pandas library is used for data analysis.

Pandas makes importing, analyzing, and visualizing data much easier.

It builds on packages like NumPy and matplotlib to give you a single,
convenient place to do most of your data analysis and visualization
work.
pandas
Import the package:

import pandas as pd
import is the keyword used to load a package; as pd gives pandas
the short alias pd used in the examples.
Two main data structures

1. Series

2. DataFrame:
• A Series is a one-dimensional labeled array capable of
holding any data type (integers, strings, floating
point numbers, Python objects, etc.).

• Unlike a Python list, a Series has a single data type (dtype) for all of
its values; mixed values are stored under the general object dtype.

• We can easily convert a list, tuple, or dictionary into a Series using
the Series() constructor.

• The row labels of a Series are called the index. A Series cannot contain
multiple columns.
It has the following parameters:

1. data: It can be any list, dictionary, or scalar value.

2. index: The labels for the values. It must be of the same
length as data. Labels are usually unique, although duplicates are allowed.

If we do not pass any index, a default of 0 to n-1 (as in np.arange(n)) will be used.

3. dtype: It refers to the data type of the Series.

4. copy: It is used for copying the data.
In Python, we are used to working with lists as such:

num = [1, 7, 2]

The Series data structure in pandas is the labeled counterpart of a list in Python.

It is a one-dimensional data structure and is displayed as a single column.

A Python list can be converted into a series in Pandas like so:

num = [1, 7, 2]

n = pd.Series(num)

print(n)
We can also create a Series from dict.

If a dictionary is passed as input and the index is not
specified, then the dictionary keys are used to construct
the index. Older pandas versions sorted the keys; current versions keep the
order in which the keys were inserted.

If an index is passed, then the values corresponding to each label in the index
are extracted from the dictionary; a label that is not a key in the dictionary gets NaN.
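
A minimal sketch of both cases follows. The dictionary values are illustrative assumptions, and 'Lisbon' is deliberately left out of the dictionary to show what happens to a label with no matching key.

```python
import pandas as pd

# Hypothetical sample values (populations in millions)
populations = {'Chicago': 2.71, 'Kolkata': 14.85, 'Toronto': 2.93}

no_index = pd.Series(populations)  # labels are the dictionary keys
with_index = pd.Series(populations, index=['Toronto', 'Lisbon'])  # only these labels are kept

print(no_index)
print(with_index)  # 'Lisbon' has no entry in the dictionary, so its value is NaN
```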
cities = ['Kolkata', 'Chicago', 'Toronto', 'Lisbon']
populations = [14.85, 2.71, 2.93, 0.51]
city_series = pd.Series(populations, index=cities)
city_series.index
Methods of Series
head(n) Returns the first n members of the series.
If the value for n is not passed, then by default n takes 5 and the first
five members are displayed.

count() Returns the number of non-NaN values in the Series.

tail(n) Returns the last n members of the series.
If the value for n is not passed, then by default n takes 5 and the last
five members are displayed.
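
The three methods can be compared on one Series that contains missing values. This is a sketch with made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing (NaN) entries
s = pd.Series([1, 2, np.nan, 4, 5, 6, 7, np.nan])

print(s.head())   # first 5 values, because n defaults to 5
print(s.tail(3))  # last 3 values
print(s.count())  # number of non-NaN values
```

Here size would report 8 elements, while count() reports 6 because the two NaN entries are skipped.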
seriesA = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])

seriesB = pd.Series([10,20,-10,-50,100], index = ['z', 'y', 'a', 'c', 'e'])

seriesA + seriesB
import pandas as pd

data1 = [1, 2, 3, 4, 5, 6]
data2 = [2, 3, 4, 5, 6, 7]
a = pd.Series(data1)
b = pd.Series(data2)

# Element-wise addition of the two Series
print(a.add(b))
seriesA.add(seriesB, fill_value=0)

Subtraction of two Series

seriesA - seriesB

seriesA.sub(seriesB, fill_value=1000)
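
The effect of fill_value is easiest to see next to the plain operator. The sketch below reuses seriesA and seriesB from above; the result variable names are assumptions.

```python
import pandas as pd

seriesA = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
seriesB = pd.Series([10, 20, -10, -50, 100], index=['z', 'y', 'a', 'c', 'e'])

# Values are matched by label, not by position
plain_sum = seriesA + seriesB                        # a label present in only one Series gives NaN
filled_sum = seriesA.add(seriesB, fill_value=0)      # the missing side is treated as 0
filled_diff = seriesA.sub(seriesB, fill_value=1000)  # the missing side is treated as 1000

print(plain_sum)
print(filled_sum)
print(filled_diff)
```

For example, label 'b' exists only in seriesA, so the plain sum gives NaN there, add with fill_value=0 gives 2, and sub with fill_value=1000 gives 2 - 1000 = -998.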
DataFrame:
• pandas.DataFrame(data=None, index=None, columns=None,
dtype=None, copy=False)

• Tabular format similar to Excel

• Two-dimensional, potentially heterogeneous tabular data
structure with labeled axes (rows and columns)

• Has both a row index and a column index

• Can be thought of as a dict-like container for
Series objects
pandas DataFrame is a two-dimensional array.
• Data is aligned in rows and columns.
The general format: pandas.DataFrame(data, index, columns, dtype)
data: It can be an ndarray, series, list, dict, constants, or another DataFrame.

index: It is for row labels, by default 0...n-1 (np.arange(n))

columns: It is for column labels, by default 0...n-1 (np.arange(n))

dtype: data type of each column

• Heterogeneous data
• Size Mutable
• Data Mutable
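
The three properties above can be illustrated with a small data frame. This is a sketch; the 'Grade' column and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anu', 'Sia'], 'Marks': [19, 25]})

df.loc[0, 'Marks'] = 21   # data mutable: change a value in place
df['Grade'] = ['B', 'A']  # size mutable: add a column (hypothetical values)

print(df)
print(df.dtypes)          # heterogeneous: each column keeps its own data type
```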
Creating DataFrame using list

import pandas as pd
D = pd.DataFrame([[10,20], [30,40]])
print(D)
Creating DataFrame with row index and
column label
import pandas as pd
data = [[10, 20], [30, 40]]
D = pd.DataFrame(data, columns=['col1', 'col2'], index=['row1', 'row2'])
print(D)
Creating DataFrame using dictionary

import pandas as pd
data = {'Name': ['Anu', 'Sia'],'Marks':[19,25]}
D = pd.DataFrame(data, index = [1,2])
Creating DataFrame from dictionary of
Series
import pandas as pd
d = {'one': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print(df)
Creating DataFrame from list of
dictionary
import pandas as pd
# Keys missing from a dictionary (here 'a' in the first row) become NaN
data = [{'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
Writing DataFrame to csv file
import pandas as pd
data = {'Name': ['Anu', 'Sia'],'Marks':[19,25]}
D = pd.DataFrame(data, index = [1,2])
print(D)

# Raw string so the backslash in the Windows path is taken literally
D.to_csv(r"E:\stu.csv")
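
To confirm what was written, the file can be read back. The sketch below is an assumption-laden variant: it writes to a temporary folder rather than a fixed drive path, and it uses pd.read_csv with index_col=0, which is not shown on the slides.

```python
import os
import tempfile

import pandas as pd

data = {'Name': ['Anu', 'Sia'], 'Marks': [19, 25]}
D = pd.DataFrame(data, index=[1, 2])

# Hypothetical location: a temporary folder instead of a fixed drive path
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "stu.csv")
    D.to_csv(path)                             # the row index is written as the first column
    restored = pd.read_csv(path, index_col=0)  # read back, using that column as the index

print(restored)
```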
DataFrame attributes

• index
• columns
• axes
• dtypes
• size
• shape
• ndim
• empty
• T
• values
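
Some of the attributes in this list, namely axes, dtypes, empty, and values, can be inspected with the following sketch. The sample data frame is an assumption reused from the earlier Name and Marks examples.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anu', 'Sia'], 'Marks': [19, 25]}, index=[1, 2])

print(df.axes)    # list holding the row index and the column index
print(df.dtypes)  # data type of each column
print(df.empty)   # True only when the DataFrame has no elements
print(df.values)  # the data as a NumPy array
```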
index
There are two types of index in a DataFrame: the row index and the
column index.
The index attribute is used to display the row labels of a data frame
object.
The row labels can be integers (0, 1, 2, 3, …) or names.

Syntax: dataframe_name.index
columns

• This attribute is used to fetch the label values for columns present in a
particular data frame.

• Syntax: dataframe_name.columns
import pandas as pd

# Nested dictionary: outer keys become columns, inner keys become row labels
dept_data = {"Sales": {'Name': 'Shyam', 'Age': 23, 'Gender': 'Male'},
             "Marketing": {'Name': 'Neha', 'Age': 22, 'Gender': 'Female'}}

# Creating a data frame object
data_frame = pd.DataFrame(dept_data)

# Printing this data frame on the output screen
print(data_frame)

# Using the columns attribute of this data frame
print(data_frame.columns)
import pandas as pd

# Dataset
data = {
    'Student': ["Amit", "John", "Jacob", "David", "Steve"],
    'Rank': [1, 4, 3, 5, 2],
    'Marks': [95, 70, 80, 60, 90]
}

# Create a DataFrame using the DataFrame() method with index
res = pd.DataFrame(data, index=['RowA', 'RowB', 'RowC', 'RowD', 'RowE'])

# Display the Records
print("Student Records\n\n", res)

# Number of Dimensions in the DataFrame
print("\nNumber of Dimensions:\n", res.ndim)
size
The pandas.DataFrame.size is used to return the number of elements
in the DataFrame.
import pandas as pd

# Dataset
data = {
    'Student': ["Amit", "John", "Jacob", "David", "Steve"],
    'Rank': [1, 4, 3, 5, 2],
    'Marks': [95, 70, 80, 60, 90]
}

# Create a DataFrame using the DataFrame() method with index
res = pd.DataFrame(data, index=['RowA', 'RowB', 'RowC', 'RowD', 'RowE'])

# Display the Records
print("Student Records\n\n", res)

# Number of elements in the DataFrame
print("\nNumber of Elements:\n", res.size)
shape
• The pandas.DataFrame.shape is used to return the dimensionality of the
DataFrame as a tuple of the form (number of rows, number of columns).
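
A short sketch of shape on the student data set used in this module; unpacking the tuple into rows and cols is an illustrative addition.

```python
import pandas as pd

# Dataset
data = {
    'Student': ["Amit", "John", "Jacob", "David", "Steve"],
    'Rank': [1, 4, 3, 5, 2],
    'Marks': [95, 70, 80, 60, 90]
}
res = pd.DataFrame(data, index=['RowA', 'RowB', 'RowC', 'RowD', 'RowE'])

# shape is a tuple, so it can be unpacked into row and column counts
rows, cols = res.shape
print("Shape:", res.shape)
print("Rows:", rows, "Columns:", cols)
```

With five students and three columns the shape is (5, 3), and rows * cols equals the value reported by size.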
T

• The pandas.DataFrame.T is used to transpose the rows and columns.

import pandas as pd

# Dataset
data = {
    'Student': ["Amit", "John", "Jacob", "David", "Steve"],
    'Rank': [1, 4, 3, 5, 2],
    'Marks': [95, 70, 80, 60, 90]
}

# Create a DataFrame using the DataFrame() method with index
res = pd.DataFrame(data, index=['RowA', 'RowB', 'RowC', 'RowD', 'RowE'])

# Display the Records
print("Student Records\n\n", res)

# Return the Transpose
print("\nTranspose:\n", res.T)

head()

The pandas.DataFrame.head() is used to return the first n rows (5 by default).

tail()

The pandas.DataFrame.tail() is used to return the last n rows (5 by default).
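
Both methods can be sketched on the student data set used in this module.

```python
import pandas as pd

# Dataset
data = {
    'Student': ["Amit", "John", "Jacob", "David", "Steve"],
    'Rank': [1, 4, 3, 5, 2],
    'Marks': [95, 70, 80, 60, 90]
}
res = pd.DataFrame(data, index=['RowA', 'RowB', 'RowC', 'RowD', 'RowE'])

print(res.head(2))  # first 2 rows: RowA and RowB
print(res.tail(2))  # last 2 rows: RowD and RowE
print(res.head())   # n defaults to 5, so all five rows are returned here
```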
