Reading Data
In this lecture
File formats
Commonly used file formats
Read data from
◦ .csv format
◦ .xlsx format
◦ .txt format
Python for Data Science 2
File format
Standard way in which data is collected and stored
Most commonly used format for storing data is the
spreadsheet format where data is stored in rows
and columns
◦ Each row is called a record
◦ Each column in a spreadsheet holds data
belonging to same data type
Commonly used spreadsheet formats are comma-separated
values (CSV) and Excel sheets
Other formats include plain text, JSON, HTML,
MP3, MP4, etc.
Comma separated values
Spreadsheet format
Format ‘.csv’
Each value within a record is separated by a comma
Files where values are separated
using a tab are called tab-separated
values
.csv files can be opened with Notepad
or Microsoft Excel
Excel spreadsheets
Spreadsheet format
Part of Microsoft Office
Format ‘.xlsx’
Text format
Consists of plain text or records
Format ‘.txt’
Importing Data
Importing data into Spyder
Importing necessary libraries
‘os’ library to change the working directory
‘pandas’ library to work with dataframes
Changing the working directory
Comma separated values
Importing data
Blank cells read as ‘nan’
Comma separated values
Removing the extra id column by passing index_col=0
Treating ‘??’ and ‘# # #’ as missing values
Comma separated values
Junk values can be converted to missing values by passing
them as a list to the parameter ‘na_values’
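The import steps above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the CSV content, the column names, and the junk strings '??' and '###' are assumptions, and the data is read from an in-memory string so that the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """id,SepalLength,Species
1,5.1,setosa
2,??,setosa
3,,virginica
4,###,versicolor
"""

# Default read: blank cells become NaN, but '??' and '###' stay as strings.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index instead of adding a new one;
# na_values lists the junk strings that should be treated as missing.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["??", "###"])

print(raw)
print(clean)
```

With a real file, the first argument would be the file name instead of the StringIO object.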
Excel spreadsheets
Importing data
Remove index column, replace ‘??’ and ‘# # #’ as missing
values, and specify the sheet name
Text format
Importing data
All columns read and stored in a single column of dataframe
In order to avoid this, provide a delimiter to the parameter
‘sep’ or ‘delimiter’
Text format
Default delimiter is tab represented by ‘\t’
Tab delimiter might not always work
Text format
Other commonly used delimiters are commas and blanks
In this case using a comma as a delimiter also gives the
earlier output
Next, a blank is used as the delimiter
Text format
Remove index column and replace ‘??’ and ‘# # #’ as missing
values
Instead of using read_table(), read_csv() can also be
used to read .txt files
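A small sketch of the delimiter behaviour described for text files. The file contents and column names are hypothetical and are read from in-memory strings rather than from .txt files.

```python
import io
import pandas as pd

# Hypothetical text data: one tab-separated, one blank-separated.
txt_tab = "id\tAge\tPrice\n1\t23\t13500\n2\t??\t13750\n"
txt_space = "id Age Price\n1 23 13500\n2 ?? 13750\n"

# read_table() splits on tabs by default.
by_tab = pd.read_table(io.StringIO(txt_tab))

# A blank-separated file read with the default tab delimiter
# ends up with every value in a single column.
one_col = pd.read_table(io.StringIO(txt_space))

# Passing sep=" " (or delimiter=" ") separates the columns.
by_space = pd.read_table(
    io.StringIO(txt_space), sep=" ", index_col=0, na_values=["??"]
)

# read_csv() gives the same result when the same delimiter is supplied.
same = pd.read_csv(
    io.StringIO(txt_space), sep=" ", index_col=0, na_values=["??"]
)

print(one_col.shape, by_space.shape)
```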
Pandas Dataframes
Part I
In this lecture
Introduction to pandas
Importing data into Spyder
Creating copy of original data
Attributes of data
Indexing and selecting data
Introduction to Pandas
Provides high-performance, easy-to-use
data structures and analysis tools for the
Python programming language
Open-source Python library providing high-
performance data manipulation and analysis
tools built on its powerful data structures
The name pandas is derived from the term
Panel Data, an econometrics term for
multidimensional data
Pandas
Pandas deals with dataframes
Name: Dataframe
Dimension: 2
Description: two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes
(rows and columns)
Importing data into Spyder
Importing necessary libraries
‘os’ library to change the working directory
‘pandas’ library to work with dataframes
‘numpy’ library to perform numeric operations
Changing the working directory
Importing data into Spyder
Importing data
o By passing index_col=0, the first column becomes the index column
Creating copy of original data
In Python, there are two ways to create copies
o Shallow copy
o Deep copy
Shallow copy
o It only creates a new variable that shares the reference of
the original object
o Any changes made to a copy of the object will be reflected in
the original object as well
Deep copy
o A copy of the object is made in another object with no
reference to the original
o Any changes made to a copy of the object will not be
reflected in the original object
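A minimal sketch of the two kinds of copy using DataFrame.copy(). The dataframe contents are hypothetical. Only the deep-copy behaviour is demonstrated by modification, because whether a change to a shallow copy shows up in the original depends on the pandas version (with Copy-on-Write enabled it does not).

```python
import pandas as pd

# Hypothetical data standing in for the imported dataframe.
cars_data = pd.DataFrame({"Price": [13500, 13750, 13950]})

# deep=False: a new object that refers to the original's data.
shallow = cars_data.copy(deep=False)

# deep=True (the default): a new object with its own copy of the data.
deep = cars_data.copy(deep=True)

# Changing the deep copy leaves the original untouched.
deep.loc[0, "Price"] = 0

print(cars_data.loc[0, "Price"])
print(deep.loc[0, "Price"])
```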
Attributes of data
DataFrame.index
➢ To get the index (row labels) of the dataframe
Attributes of data
DataFrame.columns
➢ To get the column labels of the dataframe
Attributes of data
DataFrame.size
➢ To get the total number of elements from the
dataframe
DataFrame.shape
➢ To get the dimensionality of the dataframe
1436 rows & 10 columns
Attributes of data
DataFrame.memory_usage([index, deep])
➢ The memory usage of each column in bytes
DataFrame.ndim
➢ The number of axes / array dimensions
A two-dimensional array stores data in a
format consisting of rows and columns
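The attributes listed above can be tried on a small dataframe. The data below is hypothetical; the lecture's dataset has 1436 rows and 10 columns.

```python
import pandas as pd

# Hypothetical three-row dataframe.
cars_data = pd.DataFrame(
    {"Price": [13500, 13750, 13950], "FuelType": ["Diesel", "Petrol", "CNG"]}
)

print(cars_data.index)           # row labels
print(cars_data.columns)         # column labels
print(cars_data.size)            # total number of elements: 6
print(cars_data.shape)           # (rows, columns): (3, 2)
print(cars_data.ndim)            # number of axes: 2
print(cars_data.memory_usage())  # bytes used by the index and each column
```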
Indexing and selecting data
• Python slicing operator ‘[ ]’ and attribute/
dot operator ‘.’ are used for indexing
• Provides quick and easy access to pandas
data structures
Indexing and selecting data
DataFrame.head([n])
➢ The function head returns the first n rows from the dataframe
By default, the head() returns first 5 rows
Indexing and selecting data
DataFrame.tail([n])
➢ The function tail returns the last n rows from the dataframe, based on position
✓ It is useful for quickly verifying data
✓ Ex: after sorting or appending rows
Indexing and selecting data
• To access a scalar value, the fastest way
is to use the at and iat accessors
○ at provides label-based scalar lookups
○ iat provides integer-based lookups
Indexing and selecting data
To access a group of rows and columns by
label(s) .loc[ ] can be used
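A short sketch of head, tail, at, iat and .loc on a hypothetical dataframe. The index labels 10 to 15 are chosen only to make the difference between label-based and position-based access visible.

```python
import pandas as pd

# Hypothetical data with non-default row labels.
cars_data = pd.DataFrame(
    {
        "Price": [13500, 13750, 13950, 14950, 13750, 12950],
        "Age": [23, 23, 24, 26, 30, 32],
    },
    index=[10, 11, 12, 13, 14, 15],
)

first_five = cars_data.head()    # first 5 rows by default
last_two = cars_data.tail(2)     # last 2 rows

by_label = cars_data.at[12, "Age"]   # label-based scalar lookup
by_position = cars_data.iat[2, 1]    # integer-position scalar lookup

# .loc selects by label; a label slice includes both end points.
subset = cars_data.loc[11:13, ["Price"]]

print(by_label, by_position)
print(subset)
```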
Pandas Dataframes
Part II
Pandas Dataframes - Recap
In the previous lecture, we covered
Introduction to pandas
Importing data into Spyder
Creating copy of original data
Attributes of data
Indexing and selecting data
In this lecture
Data types
◦ Numeric
◦ Character
Checking data types of each column
Count of unique data types
Selecting data based on data types
Concise summary of dataframe
Checking format of each column
Getting unique elements of each column
Data types
The way information gets stored in a dataframe or
a python object affects the analysis and outputs of
calculations
There are two main types of data
◦ numeric and character types
Numeric data types include integers and floats
◦ For example: integer – 10, float – 10.53
Strings are known as objects in pandas, which can
store values that contain numbers and / or
characters
◦ For example: ‘category1’
Numeric types
Pandas and base Python use different names for data types
◦ Python int corresponds to pandas int64: integer values
◦ Python float corresponds to pandas float64: numeric values with decimals
◦ ‘64’ simply refers to the memory allocated to store data in each cell which
effectively relates to how many digits it can store in each “cell”
◦ 64 bits is equivalent to 8 bytes
◦ Allocating space ahead of time allows computers to optimize storage and
processing efficiency
Character types
Difference between category & object
category
◦ A string variable consisting of only a few different values.
Converting such a string variable to a categorical variable will
save some memory
◦ A categorical variable takes on a limited, fixed number of
possible values
object
◦ The column will be assigned the object data type when it has
mixed types (numbers and strings). A string column that contains
‘nan’ (blank cells) is also read as object, whereas a numeric
column with blank cells is read as float64
◦ For strings, the length is not fixed
Checking data types of each column
dtypes returns a series with the data type of
each column
Syntax: DataFrame.dtypes
Count of unique data types
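get_dtype_counts() returns counts of
unique data types in the dataframe
Syntax: DataFrame.get_dtype_counts()
Note: get_dtype_counts() was deprecated in pandas 0.25 and
removed in pandas 1.0; in newer versions use
DataFrame.dtypes.value_counts()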
Selecting data based on data types
pandas.DataFrame.select_dtypes() returns a
subset of the columns from dataframe based on the column
dtypes
Syntax: DataFrame.select_dtypes(include=None,
exclude=None)
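The data type checks above can be sketched on a small, hypothetical dataframe. The selection uses include="number" and exclude="number" as one possible choice of arguments; the lecture's own arguments are not shown in the slides.

```python
import pandas as pd

# Hypothetical data with integer, float and string columns.
cars_data = pd.DataFrame(
    {
        "Price": [13500, 13750],
        "Weight": [1165.0, 1170.0],
        "FuelType": ["Diesel", "Petrol"],
    }
)

# Data type of each column.
print(cars_data.dtypes)

# Count of columns per data type (replacement for get_dtype_counts()).
print(cars_data.dtypes.value_counts())

# Subsets of columns chosen by data type.
numeric_only = cars_data.select_dtypes(include="number")
non_numeric = cars_data.select_dtypes(exclude="number")

print(numeric_only.columns.tolist())
print(non_numeric.columns.tolist())
```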
Concise summary of dataframe
info() returns a concise summary of a
dataframe
data type of index
data type of columns
count of non-null values
memory usage
Syntax: DataFrame.info()
Checking format of each column
By using info(), we can see
‘KM’ has been read as object instead of integer
‘HP’ has been read as object instead of integer
‘MetColor’ and ‘Automatic’ have been read as
float64 and int64 respectively since they have values 0/1
Ideally, ‘Doors’ should have been read as int64 since it
has values 2, 3, 4, 5. But it has been read as object
Missing values are present in a few variables
Let’s find out the reason!
Unique elements of columns
unique() is used to find the unique
elements of a column
Syntax: numpy.unique(array)
‘KM’ contains a special character;
hence, it has been read as object instead of int64
Unique elements of columns
Variable ‘HP’ :
‘HP’ contains a special character;
hence, it has been read as object instead of int64
Variable ‘MetColor’ :
‘MetColor’ has been read as float64 since it has values 0. & 1.
Unique elements of columns
Variable ‘Automatic’ :
‘Automatic’ has been read as int64 since it has values 0 & 1
Variable ‘Doors’ :
‘Doors’ has been read as object instead of int64 because of
values ‘five’ ‘four’ ‘three’ which are strings
Summary
Data types
◦ Numeric
◦ Character
Checked data types of each column
Count of unique data types
Selected data based on data types
Concise summary of dataframe
Checked format of each column
Got unique elements of each column
Pandas Dataframes
Part III
Pandas Dataframes - Recap
In the previous lecture, we covered
Data types
◦ Numeric
◦ Character
Checking data types of each column
Count of unique data types
Selecting data based on data types
Concise summary of dataframe
Checking format of each column
Getting unique elements of each column
In this lecture
Importing data
Concise summary of dataframe
Converting variable’s data types
Category vs Object data type
Cleaning column ‘Doors’
Getting count of missing values
Importing data
We need to know how missing values are
represented in the dataset in order to make
reasonable decisions
The missing values exist in the form of ‘nan’
◦ pandas, by default, replaces blank values with ‘nan’
Now, import the data again, taking into account the other forms
of missing values in the dataframe
Concise summary of dataframe
Summary - before replacing special characters with nan
Summary - after replacing special characters with nan
Converting variable’s data types
astype() method is used to explicitly convert
data types from one to another
Syntax: DataFrame.astype(dtype)
Converting ‘MetColor’ , ‘Automatic’ to object data type:
category vs object data type
nbytes is an attribute (not a method) that gives the total bytes
consumed by the elements of the column
Syntax: ndarray.nbytes
If ‘FuelType’ is of object data type,
If ‘FuelType’ is of category data type,
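A sketch of the memory comparison, assuming a hypothetical ‘FuelType’ column with three distinct values repeated many times. For an object column, nbytes counts only the references to the strings, not the strings themselves, so the real saving can be larger than the figures printed here suggest.

```python
import pandas as pd

# Hypothetical column: 300 values drawn from 3 fuel types.
fuel = pd.Series(["Petrol", "Diesel", "CNG"] * 100)

as_object = fuel.astype("object")
as_category = fuel.astype("category")

# nbytes is accessed without parentheses.
print(as_object.nbytes)
print(as_category.nbytes)
```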
Re-checking the data type of variables
Re-checking the data type of variables after all the
conversions
Cleaning column ‘Doors’
Checking unique values of variable ‘Doors’ :
Try out !
numpy.where()
replace() is used to replace a value with the desired
value
Syntax: DataFrame.replace([to_replace,
value, …])
Converting ‘Doors’ data type
Converting ‘Doors’ to int64:
cars_data['Doors']=cars_data['Doors'].astype('int64')
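The cleaning of ‘Doors’ can be sketched end to end as follows. The sample values are hypothetical, and mapping the words to digit strings before converting is one possible approach, not necessarily the one used in the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical 'Doors' column mixing digit strings and words.
cars_data = pd.DataFrame(
    {"Doors": pd.Series(["3", "five", "4", "three", "four", "2"], dtype=object)}
)

# Inspect the unique values to spot the strings that need cleaning.
print(np.unique(cars_data["Doors"]))

# Replace the words with their digit strings.
cars_data["Doors"] = cars_data["Doors"].replace(
    {"three": "3", "four": "4", "five": "5"}
)

# Every value is now numeric text, so the column can be converted.
cars_data["Doors"] = cars_data["Doors"].astype("int64")

print(cars_data["Doors"].tolist())
```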
To detect missing values
To check the count of missing values present in each
column, DataFrame.isnull().sum() is used
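A minimal sketch of counting missing values per column, using hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns.
cars_data = pd.DataFrame(
    {"Age": [23.0, np.nan, 24.0], "FuelType": ["Diesel", None, None]}
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing = cars_data.isnull().sum()
print(missing)
```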
Summary
Imported data
Concise summary of dataframe
Converted variable’s data types
Category vs Object data type
Cleaned column ‘Doors’
Got count of missing values
Control structures & Functions
In this lecture
Control structures
◦ If elif family
◦ For
◦ While
Functions
Control Structures in Python
Execute certain commands only
when certain condition(s) is (are)
satisfied (if-then-else)
Execute certain commands
repeatedly and use a certain logic to
stop the iteration (for, while loops)
If else family of constructs
if, if-else and if-elif-else are a
family of constructs where:
◦ A condition is first checked, if it is
satisfied then operations are
performed
◦ If condition is not satisfied, code exits
construct or moves on to other
options
If else family of constructs
Task: if construct
Command:
    if expression:
        statements
Task: if-else construct
Command:
    if expression:
        statements
    else:
        statements
Task: if-elif-else construct
Command:
    if expression1:
        statements
    elif expression2:
        statements
    else:
        statements
For loop
Execute certain commands repeatedly and use a certain
logic to stop the iteration (for loop)
Execute multiple commands repeatedly as per the specified
logic (nested for loop)
Task: for
Command:
    for iter in sequence:
        statements
while loop
A while loop is used when a
set of commands are to be
executed depending on a
specific condition
Task: while
Command:
    while (condition is satisfied):
        statements
Example: if else and for loops
• We will create 3 bins from the ‘Price’
variable using If Else and For Loops
• The binned values will be stored as classes
in a new column, ‘Price Class’
• Hence, inserting a new column
Example: if else and for loops
• A for loop is implemented and the observations are
separated into three categories:
o Price
• up to 8450
• between 8450 and 11950
• greater than 11950
• The classes have been stored in a new column ‘Price Class’
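The binning described above can be sketched with a for loop and an if-elif-else construct. The prices, the class labels "Low", "Medium" and "High", and the column name ‘Price_Class’ are assumptions made for illustration; only the cut-offs 8450 and 11950 come from the slide.

```python
import pandas as pd

# Hypothetical prices covering all three bins.
cars_data = pd.DataFrame({"Price": [7500, 8450, 9900, 11950, 15000]})

# Insert an empty column to hold the classes.
cars_data.insert(1, "Price_Class", "")

for i in range(len(cars_data)):
    price = cars_data.at[i, "Price"]
    if price <= 8450:
        cars_data.at[i, "Price_Class"] = "Low"
    elif price > 11950:
        cars_data.at[i, "Price_Class"] = "High"
    else:
        cars_data.at[i, "Price_Class"] = "Medium"

print(cars_data)
```

The loop relies on the default integer index (0, 1, 2, ...), so the counter i can be used as the row label.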
Example: while loop
• A while loop is used whenever you want to execute
statements until a specific condition is violated
• Here a while loop is used over the length of the
column ‘Price_Class’ and an if-else construct is used to bin
the values and store them as classes
Example: while loop
• Series.value_counts() returns series
containing count of unique values
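The same binning written with a while loop, followed by value_counts(). As before, the data and the class labels are hypothetical.

```python
import pandas as pd

# Hypothetical prices covering all three bins.
cars_data = pd.DataFrame({"Price": [7500, 8450, 9900, 11950, 15000]})
cars_data["Price_Class"] = ""

i = 0
# Keep looping until every row of the column has been visited.
while i < len(cars_data["Price_Class"]):
    price = cars_data.at[i, "Price"]
    if price <= 8450:
        cars_data.at[i, "Price_Class"] = "Low"
    elif price > 11950:
        cars_data.at[i, "Price_Class"] = "High"
    else:
        cars_data.at[i, "Price_Class"] = "Medium"
    i += 1

# Count how many observations fall into each class.
counts = cars_data["Price_Class"].value_counts()
print(counts)
```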
Functions in Python
• A function accepts input arguments and produces
an output by executing valid commands present in
the function
• Function names and file names need not be the
same
• A file can have one or more function definitions
• Functions are created using the command def and
a colon, with the statements to be executed
indented as a block:
    def function_name(parameters):
        statements
• Since statements are not demarcated explicitly, it
is essential to follow correct indentation practices
Example: functions
• Converting the Age variable from months to
years by defining a function
• The converted values will be stored in a
new column, ‘Age_Converted’
• Hence, inserting a new column
Example: functions
• Here, a function c_convert has been defined
• The function takes arguments and returns one value
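One way such a function could look. The name c_convert and the column ‘Age_Converted’ come from the slides; the function body, the sample ages, and the rounding to one decimal are assumptions.

```python
import pandas as pd


def c_convert(val):
    """Convert an age given in months to years."""
    val_converted = val / 12
    return val_converted


# Hypothetical ages in months.
cars_data = pd.DataFrame({"Age": [24, 30, 36]})

# The function works element-wise when a whole column is passed in.
cars_data["Age_Converted"] = c_convert(cars_data["Age"])
cars_data["Age_Converted"] = cars_data["Age_Converted"].round(1)

print(cars_data)
```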
Function with multiple inputs and outputs
• Functions in Python take
multiple input objects but return
only one object as output
• However lists, tuples or
dictionaries can be used to
return multiple outputs as
required
Example: function with
multiple inputs and outputs
• Converting the Age variable from months
to years and getting kilometers (KM) run
per month
• The converted values of kilometer will be
stored in a new column,‘km_per_month’
• Hence, inserting a new column
Example: function with multiple
inputs and outputs
• A multiple input multiple output function c_convert
has been defined
• The function takes in two inputs
• The output is returned in the form of a list
Example: function with multiple inputs and outputs
• Here, Age and KM columns of the data set are input to the
function
• The outputs are assigned to ‘Age_Converted’ and
‘km_per_month’
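A sketch of a function with two inputs that returns two outputs in a list. The names c_convert, ‘Age_Converted’ and ‘km_per_month’ come from the slides; the function body and the sample data are assumptions.

```python
import pandas as pd


def c_convert(val1, val2):
    """Return [age in years, kilometres run per month]."""
    val_converted = val1 / 12   # months to years
    ratio = val2 / val1         # kilometres per month of age
    return [val_converted, ratio]


# Hypothetical ages (months) and kilometres run.
cars_data = pd.DataFrame({"Age": [24, 30], "KM": [48000, 60000]})

# The two list elements are unpacked into two new columns.
cars_data["Age_Converted"], cars_data["km_per_month"] = c_convert(
    cars_data["Age"], cars_data["KM"]
)

print(cars_data)
```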
Summary
Control structures
◦ If elif family
◦ For
◦ While
Functions
Exploratory data analysis
In this lecture
Frequency tables
Two-way tables
Two-way table - joint probability
Two-way table - marginal probability
Two-way table - conditional probability
Correlation
Importing data into Spyder
Importing necessary libraries
‘os’ library to change the working directory
‘pandas’ library to work with dataframes
Changing the working directory
Importing data into Spyder
Importing data
Creating copy of original data
Frequency tables
pandas.crosstab()
• To compute a simple cross-tabulation of one, two (or more)
factors
• By default computes a frequency table of the factors
Size of data
1436 – original data
1336 – after dropping nan values
Most of the cars have petrol as fuel type
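A frequency table of a single factor can be sketched as below. The data is hypothetical, and passing columns="count" is a common idiom for a one-way table rather than something confirmed by the slides.

```python
import pandas as pd

# Hypothetical fuel types, including one missing value.
cars_data = pd.DataFrame(
    {"FuelType": ["Petrol", "Diesel", "Petrol", None, "CNG", "Petrol"]}
)

# One-way frequency table; the row with the missing value is left out.
freq = pd.crosstab(index=cars_data["FuelType"], columns="count", dropna=True)
print(freq)
```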
Two-way tables
pandas.crosstab()
• To look at the frequency distribution of gearbox types
with respect to different fuel types of the cars
Automatic
0 - Manual gearbox
1 - Automatic gearbox
Two-way table - joint probability
pandas.crosstab()
• Joint probability is the likelihood of two events
happening at the same time (the events need not be independent)
Two-way table - marginal probability
pandas.crosstab()
• Marginal probability is the probability of the occurrence of
a single event
The probability of a car having a manual
gearbox, whatever the fuel type
(CNG, Diesel or Petrol), is 0.95
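Joint and marginal probabilities can be obtained from the two-way table with the normalize and margins parameters of pandas.crosstab(). The eight rows below are hypothetical, so the numbers differ from the 0.95 quoted above.

```python
import pandas as pd

# Hypothetical data: 0 = manual gearbox, 1 = automatic gearbox.
cars_data = pd.DataFrame(
    {
        "Automatic": [0, 0, 0, 1, 0, 0, 1, 0],
        "FuelType": ["Petrol", "Petrol", "Diesel", "Petrol",
                     "CNG", "Petrol", "Diesel", "Petrol"],
    }
)

# Joint probability: each cell count divided by the grand total.
joint = pd.crosstab(
    index=cars_data["Automatic"], columns=cars_data["FuelType"], normalize=True
)

# Marginal probability: margins=True adds the row and column totals ("All").
marginal = pd.crosstab(
    index=cars_data["Automatic"],
    columns=cars_data["FuelType"],
    margins=True,
    normalize=True,
)

print(joint)
print(marginal)
```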
Two-way table - conditional probability
pandas.crosstab()
• Conditional probability is the probability of an event ( A ),
given that another event ( B ) has already occurred
• Given the type of gear box, probability of different fuel type
Row sum = 1
Two-way table - conditional probability
pandas.crosstab()
• Conditional probability is the probability of an event ( A ),
given that another event ( B ) has already occurred
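Conditional probabilities correspond to normalizing the two-way table by rows or by columns. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data: 0 = manual gearbox, 1 = automatic gearbox.
cars_data = pd.DataFrame(
    {
        "Automatic": [0, 0, 0, 1, 0, 0, 1, 0],
        "FuelType": ["Petrol", "Petrol", "Diesel", "Petrol",
                     "CNG", "Petrol", "Diesel", "Petrol"],
    }
)

# Given the gearbox type, probability of each fuel type (each row sums to 1).
given_gearbox = pd.crosstab(
    index=cars_data["Automatic"],
    columns=cars_data["FuelType"],
    normalize="index",
)

# Given the fuel type, probability of each gearbox type (each column sums to 1).
given_fuel = pd.crosstab(
    index=cars_data["Automatic"],
    columns=cars_data["FuelType"],
    normalize="columns",
)

print(given_gearbox)
print(given_fuel)
```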
Column sum = 1
Correlation
Correlation: the strength of association
between two variables
Visual representation of correlation:
Scatter plots
[Figure: three scatter plots showing a positive trend, a negative trend, and little or no correlation]
Correlation
DataFrame.corr(self, method='pearson')
• To compute pairwise correlation of columns excluding NA/null
values
• Excluding the categorical variables to find the Pearson’s
correlation
• Let’s check the number of variables available under numerical_data
Correlation
DataFrame.corr(self, method='pearson')
• Correlation between numerical variables
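A sketch of the correlation step on hypothetical data. The name numerical_data comes from the slides; selecting the numeric columns with include="number" is an assumption about how the categorical variables are excluded.

```python
import pandas as pd

# Hypothetical data in which price falls linearly with age.
cars_data = pd.DataFrame(
    {
        "Price": [13500, 12000, 10500, 9000],
        "Age": [20, 30, 40, 50],
        "FuelType": ["Diesel", "Petrol", "Petrol", "CNG"],
    }
)

# Keep only the numerical variables.
numerical_data = cars_data.select_dtypes(include="number")
print(numerical_data.shape)

# Pairwise Pearson correlation between the numerical variables.
corr_matrix = numerical_data.corr(method="pearson")
print(corr_matrix)
```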
Summary
Frequency tables
Two-way tables
Two-way table - joint probability
Two-way table - marginal probability
Two-way table - conditional probability
Correlation
Data visualization
Part I
In this lecture
We will learn how to create basic plots using matplotlib library
• Scatter plot
• Histogram
• Bar plot
Data Visualization
• Data visualization allows us to quickly interpret the data
and adjust different variables to see their effect
• Technology is increasingly making it easier for us to do so
Why visualize data?
o Observe the patterns
o Identify extreme values that could be anomalies
o Easy interpretation
Popular plotting libraries in Python
Python offers multiple graphing libraries that offer diverse
features
• matplotlib: to create 2D graphs and plots
• pandas visualization: easy-to-use interface, built on
Matplotlib
• seaborn: provides a high-level interface
for drawing attractive and
informative statistical graphics
• ggplot: based on R’s ggplot2, uses the
Grammar of Graphics
• plotly: can create interactive plots
Matplotlib
• Matplotlib is a 2D plotting library which
produces good quality figures
• Although it has its origins in emulating the
MATLAB graphics commands, it is independent
of MATLAB
• It makes heavy use of NumPy and other
extension code to provide good performance
even for large arrays
Scatter plot
Scatter Plot
What is a scatter plot?
• A scatter plot is a set of points that represents
the values obtained for two different variables,
plotted on the horizontal and vertical axes
When to use scatter plots?
• Scatter plots are used to convey the relationship
between two numerical variables
• Scatter plots are sometimes called correlation
plots because they show how two variables are
correlated
Importing data into Spyder
Importing necessary libraries
‘pandas’ library to work with dataframes
‘numpy’ library to do numerical operations
‘matplotlib’ library to do visualization
Importing data into Spyder
Importing data
Removing missing values from the dataframe
Scatter plot
[Code annotations: x and y coordinates of the points]
Scatter plot
The price of the car decreases as age of the car increases
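A scatter plot of this kind can be sketched with matplotlib as follows. The data, colour, title and axis labels are assumptions. The "Agg" backend is selected only so that the sketch runs without a display; in Spyder that line can be dropped and plt.show() called instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for running without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages (months) and prices.
cars_data = pd.DataFrame(
    {"Age": [23, 30, 41, 52, 60, 68],
     "Price": [13500, 12950, 11200, 9900, 8750, 7900]}
)

# x and y coordinates of the points, drawn in red.
plt.scatter(cars_data["Age"], cars_data["Price"], c="red")
plt.title("Scatter plot of Price vs Age of the cars")
plt.xlabel("Age (months)")
plt.ylabel("Price")

ax = plt.gca()
```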
Histogram
Histogram
What is a histogram?
• It is a graphical representation of data using
bars of different heights
• Histogram groups numbers into ranges and
the height of each bar depicts the frequency
of each range or bin
When to use histograms?
• To represent the frequency distribution of
numerical variables
Histogram
[Code annotation: x, the values to be binned]
Histogram with default arguments
Histogram
The frequency distribution of kilometres travelled shows that
most of the cars have travelled between 50000 and 100000 km,
and only a few cars have travelled farther
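A histogram sketch with matplotlib. The kilometre values, the colours and the number of bins are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for running without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical kilometre readings.
cars_data = pd.DataFrame(
    {"KM": [46986, 72937, 41711, 48000, 38500, 61000, 94612, 75889, 19700, 130000]}
)

# hist() returns the bin counts, the bin edges and the drawn bars.
n, bin_edges, patches = plt.hist(
    cars_data["KM"], color="green", edgecolor="white", bins=5
)
plt.title("Histogram of Kilometre")
plt.xlabel("Kilometre")
plt.ylabel("Frequency")
```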
Bar plot
Bar plot
What is a bar plot?
• A bar plot is a plot that presents categorical
data with rectangular bars with lengths
proportional to the counts that they
represent
When to use bar plot?
• To represent the frequency distribution of
categorical variables
• A bar diagram makes it easy to compare sets
of data between different groups
Bar plot
[Code annotations: x, and the height of the bars]
Bar plot
Frequency distribution of fuel type
Bar plot
[Code annotations: x, the height of the bars, the labels of the xticks, and the location of the xticks]
Bar plot
Bar plot of fuel type shows that most of the cars have petrol as
fuel type
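A bar plot sketch with matplotlib, including the location and labels of the xticks. The data and colours are hypothetical; the bar heights are computed here with value_counts(), which may differ from how the lecture obtains them.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for running without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical fuel types.
cars_data = pd.DataFrame(
    {"FuelType": ["Petrol", "Petrol", "Diesel", "Petrol", "CNG", "Petrol", "Diesel"]}
)

counts = cars_data["FuelType"].value_counts()  # height of the bars
index = np.arange(len(counts))                 # x positions of the bars

bars = plt.bar(index, counts.values, color=["red", "blue", "cyan"])
plt.title("Bar plot of fuel types")
plt.xlabel("Fuel Types")
plt.ylabel("Frequency")

# First argument: location of the xticks; second: their labels.
plt.xticks(index, list(counts.index), rotation=90)
```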
Summary
We have learnt how to create basic plots using matplotlib library
• Scatter plot
• Histogram
• Bar plot
Data visualization
Part II
In the previous lecture
We learnt how to create basic plots using matplotlib library
• Scatter plot
• Histogram
• Bar plot
In this lecture
We will learn how to create basic plots using seaborn library:
• Scatter plot
• Histogram
• Bar plot
• Box and whiskers plot
• Pairwise plots
Seaborn
• Seaborn is a Python data visualization library
based on matplotlib
• It provides a high-level interface for drawing
attractive and informative statistical graphics
Scatter plot
Importing libraries
Importing necessary libraries
‘pandas’ library to work with dataframes
‘numpy’ library to do numerical operations
‘matplotlib’ library to do visualization
‘seaborn’ library to do visualization
Importing data into Spyder
Importing data
Removing missing values from the dataframe
Scatter plot
Scatter plot of Price vs Age with default arguments
o By default, fit_reg = True
o It estimates and plots a regression
model relating the x and y variables
Scatter plot
Scatter plot of Price vs Age without the regression fit line
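The two seaborn scatter plots (with and without the regression fit line) can be sketched with sns.regplot(), which has the fit_reg parameter mentioned above. Whether the lecture uses regplot() or another seaborn function is not visible in the slides, and the data and marker are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for running without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical ages (months) and prices.
cars_data = pd.DataFrame(
    {"Age": [23, 30, 41, 52, 60, 68],
     "Price": [13500, 12950, 11200, 9900, 8750, 7900]}
)

# Default arguments: fit_reg=True, so a regression line is drawn.
plt.figure()
ax_fit = sns.regplot(x="Age", y="Price", data=cars_data)

# fit_reg=False draws only the points; marker customizes their appearance.
plt.figure()
ax_plain = sns.regplot(x="Age", y="Price", data=cars_data, fit_reg=False, marker="*")
```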
Scatter plot
Scatter plot of Price vs Age by customizing the appearance of markers
Scatter plot
Scatter plot of Price vs Age by FuelType
Using the hue parameter, another variable is included to show the fuel
type categories with different colors
Scatter plot
Scatter plot of Price vs Age by FuelType
Similarly, customize the appearance of the markers
using
o transparency
o shape
o size
Histogram
Histogram
Histogram with default kernel density estimate
Histogram
Histogram without kernel density estimate
Histogram
Histogram with fixed no. of bins
Bar plot
Bar plot
Frequency distribution of fuel type of the cars
Grouped bar plot
Grouped bar plot of FuelType and Automatic
Box and whiskers plot
Box and whiskers plot – numerical variable
Box and whiskers plot of Price to visually interpret the
five-number summary
Box and whiskers plot
Box and whiskers plot for numerical vs categorical variable
Price of the cars for various fuel types
Grouped box and whiskers plot
Grouped box and whiskers plot of Price vs FuelType and Automatic
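The box and whiskers plots above can be sketched with sns.boxplot(). The data is hypothetical; the column names follow the slides.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for running without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: 0 = manual gearbox, 1 = automatic gearbox.
cars_data = pd.DataFrame(
    {
        "Price": [13500, 12950, 11200, 9900, 8750, 7900, 10500, 9500],
        "FuelType": ["Diesel", "Diesel", "Petrol", "Petrol",
                     "Petrol", "CNG", "CNG", "Petrol"],
        "Automatic": [0, 1, 0, 0, 1, 0, 0, 1],
    }
)

# Single numerical variable: five-number summary of Price.
plt.figure()
ax_single = sns.boxplot(y=cars_data["Price"])

# Numerical vs categorical variable: Price for each fuel type.
plt.figure()
ax_by_fuel = sns.boxplot(x="FuelType", y="Price", data=cars_data)

# Grouped: Price vs FuelType, split further by Automatic.
plt.figure()
ax_grouped = sns.boxplot(x="FuelType", y="Price", hue="Automatic", data=cars_data)
```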
Box-whiskers plot and Histogram
Let’s plot box-whiskers plot and histogram on the same window
Split the plotting window into 2 parts
Box-whiskers plot and Histogram
Now, create the two plots
Pairwise plots
It is used to plot pairwise relationships in a dataset
Creates scatterplots for joint relationships and histograms for
univariate distributions
Code:
sns.pairplot(cars_data, kind="scatter", hue="FuelType")
plt.show()
Summary
We have learnt how to create basic plots using seaborn library:
• Scatter plot
• Histogram
• Bar plot
o Grouped bar plot
• Box and whiskers plot
o Grouped box and whiskers plot
• Pairwise plots