Unit 2 Python Library for Data Wrangling
2.2 Numpy
• NumPy, short for Numerical Python, is the core library for scientific computing in
Python. It has been designed specifically for performing basic and advanced array
operations. It primarily supports multi-dimensional arrays and vectors for complex
arithmetic operations.
• A library is a collection of files (called modules) that contain functions for use by
other programs. A Python library is a reusable chunk of code that you may want to
include in your programs.
• Popular Python libraries include NumPy, SciPy, Pandas and Scikit-Learn. Python
visualization libraries include matplotlib and Seaborn.
• NumPy has risen to become one of the most popular Python science libraries and just
secured a round of grant funding.
• NumPy's multidimensional array can perform very large calculations much more
easily and efficiently than using the Python standard data types.
• To get started, NumPy has many resources on their website, including
documentation and tutorials.
• The library offers many handy features for performing operations on n-dimensional
arrays and matrices in Python. It helps to process arrays that store values of the same
data type and makes performing mathematical operations on arrays easier. In fact, the
vectorization of mathematical operations on the NumPy array type increases
performance and accelerates execution time.
• NumPy provides a high-performance multidimensional array object and tools for
working with these arrays.
• NumPy is the fundamental package needed for scientific computing with Python. It
contains:
a) A powerful N-dimensional array object
b) Basic linear algebra functions
c) Basic Fourier transforms
d) Sophisticated random number capabilities
e) Tools for integrating Fortran code
f) Tools for integrating C/C++ code.
• NumPy is an extension package to Python for array programming. It provides "closer
to the hardware" optimization, which in Python means C implementation.
2.3 Basics of Numpy Arrays
• A NumPy array is a powerful N-dimensional array object which is in the form of rows
and columns. We can initialize NumPy arrays from nested Python lists and access its
elements. A NumPy array is a collection of elements that have the same data type.
• A one-dimensional NumPy array can be thought of as a vector, a two-dimensional
array as a matrix (i.e., a set of vectors), and a three-dimensional array as a tensor (i.e.,
a set of matrices).
• To define an array manually, we can use the np.array() function.
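A minimal sketch (the values here are arbitrary) creating one-, two- and three-dimensional arrays with np.array() :
import numpy as np
v = np.array([1, 2, 3]) # 1-D array : a vector
m = np.array([[1, 2], [3, 4]]) # 2-D array : a matrix
t = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) # 3-D array : a tensor
print(v.ndim, m.ndim, t.ndim) # prints : 1 2 3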
• Basic array manipulations are as follows (a short sketch of items 3 - 5 follows this list) :
1. Attributes of arrays : Determining the size, shape, memory consumption, and data
types of arrays.
2. Indexing of arrays : Getting and setting the values of individual array elements.
3. Slicing of arrays : Getting and setting smaller subarrays within a larger array.
4. Reshaping of arrays : Changing the shape of a given array.
5. Joining and splitting of arrays : Combining multiple arrays into one, and splitting
one array into many.
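A brief sketch of manipulations 3 - 5 (the array contents are arbitrary) :
import numpy as np
a = np.arange(10) # array([0, 1, ..., 9])
sub = a[2:5] # slicing : elements at indexes 2, 3, 4
grid = a.reshape((2, 5)) # reshaping : 2 rows x 5 columns
joined = np.concatenate([a, a]) # joining : one array of 20 elements
left, right = np.split(a, [5]) # splitting : first 5 and last 5 elements
print(sub, grid.shape, joined.size, left, right)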
a) Attributes of array
• In Python, arrays from the NumPy library, called N-dimensional arrays or the
ndarray, are used as the primary data structure for representing data.
• The main data structure in NumPy is the ndarray, which is a shorthand name for N-
dimensional array. When working with NumPy, data in an ndarray is simply referred to
as an array. It is a fixed-sized array in memory that contains data of the same type,
such as integers or floating point values.
• The data type supported by an array can be accessed via the "dtype" attribute on the
array. The dimensions of an array can be accessed via the "shape" attribute that
returns a tuple describing the length of each dimension.
• Array attributes are essential to find out the shape, dimension, item size etc.
• ndarray.shape : This attribute gives the array dimensions as a tuple and can also be
used to resize the array. Each array has attributes ndim (the number of
dimensions), shape (the size of each dimension), and size (the total size of the array).
• ndarray.size: The total number of elements of the array. This is equal to the product
of the elements of the array's shape.
• ndarray.dtype : An object describing the data type of the elements in the array.
Recall that NumPy's ndarrays are homogeneous : they can only possess elements of a
uniform data type.
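A short sketch (the values are arbitrary) printing these attributes :
import numpy as np
x = np.arange(12).reshape((3, 4)) # a 3 x 4 array of integers
print(x.ndim) # 2 : the number of dimensions
print(x.shape) # (3, 4) : the size of each dimension
print(x.size) # 12 : the total number of elements
print(x.dtype) # e.g. int64 (the exact integer type is platform-dependent)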
b) Indexing of arrays
• Array indexing always refers to the use of square brackets ("[ ]") to index the elements
of the array. In order to access a single element of an array, we can refer to its index.
• Fig. 4.4.1 shows the indexing of a one-dimensional ndarray.
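For example (a minimal sketch with arbitrary values) :
import numpy as np
a = np.array([10, 20, 30, 40])
print(a[0]) # 10 : the first element
print(a[-1]) # 40 : negative indexes count from the end
b = np.array([[1, 2], [3, 4]])
print(b[1, 0]) # 3 : row 1, column 0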
Boolean array:
• A boolean array is a NumPy array with boolean (True/False) values. Such an array
can be obtained by applying a logical operator to another NumPy array :
import numpy as np
a = np.reshape(np.arange(16), (4, 4)) # create a 4x4 array of integers
print(a)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
large_values = (a > 10) # test which elements of a are greater than 10
print(large_values)
[[False False False False]
[False False False False]
[False False False True]
[ True True True True]]
even_values = (a % 2 == 0) # test which elements of a are even
print(even_values)
[[ True False True False]
[ True False True False]
[ True False True False]
[ True False True False]]
Logical operations on boolean arrays
• Boolean arrays can be combined using logical operators :
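For example, continuing the arrays built above, the element-wise operators & (and), | (or) and ~ (not) combine boolean arrays (a short sketch) :
import numpy as np
a = np.reshape(np.arange(16), (4, 4))
large_values = (a > 10)
even_values = (a % 2 == 0)
print(large_values & even_values) # True where the element is large AND even
print(large_values | even_values) # True where the element is large OR even
print(~large_values) # True where the element is NOT large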
Drop duplicates
df.drop_duplicates()
• Drop duplicates in the first name column, but take the last observation in the
duplicated set
df.drop_duplicates(['first_name'], keep='last')
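A self-contained sketch (the DataFrame df and its first_name column are illustrative assumptions) :
import pandas as pd
df = pd.DataFrame({'first_name': ['Ann', 'Ann', 'Bob'], 'age': [25, 26, 30]})
print(df.drop_duplicates()) # drops rows that are identical in every column
print(df.drop_duplicates(['first_name'], keep='last')) # keeps the last 'Ann' row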
Creating a Data Map and Data Plan
• A data map gives an overview of a dataset. A data map is used for finding potential
problems in data, such as redundant variables, possible errors, missing values and
variable transformations.
• Try creating a Python script that converts a Python dictionary into a Pandas
DataFrame, then print the DataFrame to screen.
import pandas as pd
scottish_hills={'Ben Nevis': (1345, 56.79685, -5.003508),
'Ben Macdui': (1309, 57.070453, -3.668262),
'Braeriach': (1296, 57.078628, -3.728024),
'Cairn Toul': (1291, 57.054611, -3.71042),
'Sgòr an Lochain Uaine': (1258, 57.057999, -3.725416)}
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)
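Note that each dictionary key becomes a column of the DataFrame. If the hills should instead appear as rows, one option (the column names here are our own choice) is pd.DataFrame.from_dict with orient='index' :
dataframe = pd.DataFrame.from_dict(scottish_hills, orient='index',
columns=['Height', 'Latitude', 'Longitude'])
print(dataframe)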
Manipulating and Creating Categorical Variables
• A categorical variable is one that takes a specific value from a limited selection of
values. The number of values is usually fixed.
• Categorical features can only take on a limited, and usually fixed, number of possible
values. For example, if a dataset is about information related to users, then you will
typically find features like country, gender, age group, etc. Alternatively, if the data you
are working with is related to products, you will find features like product type,
manufacturer, seller and so on.
• The following method creates a categorical variable and then uses it to check whether
some data falls within the specified set of values :
import pandas as pd
cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')
cycle_data = pd.Series(pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
categories=cycle_colors, ordered=False))
find_entries = pd.isnull(cycle_data)
print(cycle_colors)
print()
print(cycle_data)
print()
print(find_entries[find_entries == True])
• Here cycle_colors is a categorical variable. It contains the values Blue, Red and Green
as colors.
Renaming Levels and Combining Levels
• Data frame variable names are typically used many times when wrangling data. Good
names for these variables make it easier to write and read wrangling programs.
• Categorical data has a categories and an ordered property, which list its possible
values and whether the ordering matters or not.
• Renaming categories is done by assigning new values to the Series.cat.categories
property or by using the Categorical.rename_categories() method :
In [41]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [42]: s
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
In [44]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]
In [45]: s
Out[45]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
In [46]: s.cat.rename_categories([1, 2, 3])
Out[46]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]
Dealing with Date and Time Values
• Dates are often provided in different formats and must be converted into a single
format as datetime objects before analysis.
• Python provides two methods of formatting date and time :
1. str() : It turns a datetime value into a string without any formatting.
2. strftime() : It defines how the user wants the datetime value to appear after
conversion.
1. Using pandas.to_datetime() with a date
import pandas as pd
# input in dd.mm.yyyy format
date = ['21.07.2020']
# output in yyyy-mm-dd format
print(pd.to_datetime(date))
2. Using pandas.to_datetime() with a date and time
import pandas as pd
# date (dd.mm.yyyy) and time (H:MM:SS)
date = ['21.07.2020 11:31:01 AM']
# output in yyyy-mm-dd HH:MM:SS
print(pd.to_datetime(date))
• We can convert a string to datetime using strptime() function. This function is
available in datetime and time modules to parse a string to datetime and time objects
respectively.
• Python strptime() is a class method in datetime class. Its syntax is :
datetime.strptime(date_string, format)
• Both the arguments are mandatory and should be strings :
import datetime
format = "%a %b %d %H:%M:%S %Y"
today = datetime.datetime.today()
print('ISO :', today)
s = today.strftime(format)
print('strftime :', s)
d = datetime.datetime.strptime(s, format)
print('strptime :', d.strftime(format))
$ python datetime_datetime_strptime.py
ISO : 2013-02-21 06:35:45.707450
strftime: Thu Feb 21 06:35:45 2013
strptime: Thu Feb 21 06:35:45 2013
• Time Zones: Within datetime, time zones are represented by subclasses of tzinfo.
Since tzinfo is an abstract base class, you need to define a subclass and provide
appropriate implementations for a few methods to make it useful.
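The standard library ships one ready-made concrete tzinfo subclass, datetime.timezone, for fixed UTC offsets; a minimal sketch :
from datetime import datetime, timezone, timedelta
utc_now = datetime.now(timezone.utc) # timezone-aware current time in UTC
ist = timezone(timedelta(hours=5, minutes=30)) # a fixed +05:30 offset
print(utc_now.astimezone(ist)) # the same instant expressed in that offset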
2.11 Missing Data
• Data can have missing values for a number of reasons such as observations that
were not recorded and data corruption. Handling missing data is important as many
machine learning algorithms do not support data with missing values.
• You can load the dataset as a Pandas DataFrame and print summary statistics on
each attribute.
# load and summarize the dataset
from pandas import read_csv
# load the dataset
dataset = read_csv('csv file name', header=None)
# summarize the dataset
print(dataset.describe())
• In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as
NaN. Values with a NaN value are ignored by operations like sum, count, etc.
• Use the isnull() method to detect missing values. The Pandas DataFrame provides a
function isnull() that returns a new dataframe of the same size as the calling
dataframe, containing only True and False : True wherever the original dataframe
holds NaN, and False at all other places.
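A short sketch (the column names and data are illustrative) :
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [25, np.nan, 30], 'city': ['Pune', 'Mumbai', None]})
print(df.isnull()) # True wherever a value is missing
print(df.isnull().sum()) # count of missing values per column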
Encoding missingness:
• The fillna() function is used to fill NA/NaN values using the specified method.
• Syntax :
DataFrame.fillna(value=None, method=None, axis=None, inplace=False,
limit=None, downcast=None, **kwargs)
Where
1. value: It is a value that is used to fill the null values.
2. method: A method that is used to fill the null values.
3. axis: It takes int or string value for rows/columns.
4. inplace: If True, the null values are filled in place, modifying the calling dataframe.
5. limit: It is an integer value that specifies the maximum number of consecutive
forward/backward NaN value fills.
6. downcast: It takes a dict that specifies what to downcast, e.g. float64 to int64.
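A minimal fillna() sketch over the same kind of illustrative data :
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [25, np.nan, 30]})
print(df.fillna(value=0)) # replace every NaN with a constant
print(df.fillna(method='ffill')) # forward-fill each NaN from the previous row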
2.12 Hierarchical Indexing
• Hierarchical indexing is a method of creating structured group relationships in data.
• A MultiIndex or Hierarchical index comes in when our DataFrame has more than two
dimensions. As we already know, a Series is a one-dimensional labelled NumPy array
and a DataFrame is usually a two-dimensional table whose columns are Series. In
some instances, in order to carry out some sophisticated data analysis and
manipulation, our data is presented in higher dimensions.
• A MultiIndex adds at least one more dimension to the data. A Hierarchical Index as
the name suggests is ordering more than one item in terms of their ranking.
• To create a DataFrame with player ratings of a few players from the Fifa 19 dataset :
In [1]: import pandas as pd
In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',
'MF', 'MF', 'MF', 'CF', 'CF', 'CF'],
'Name': ['De Gea', 'Coutois', 'Allison', 'Van Dijk',
'Ramos', 'Godin', 'Hazard', 'Kante', 'De Bruyne', 'Ronaldo',
'Messi', 'Neymar'],
'Overall': ['91', '88', '89', '89', '91', '90', '91', '90', '92', '94', '93', '92'],
'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd', '2nd', '3rd', '1st', '1st', '2nd', '3rd']}
In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name', 'Overall', 'Rank'])
In [4]: fifa19
Out[4]:
• From the above DataFrame, we notice that the index is the default Pandas index; the
columns 'Position' and 'Rank' both have values or objects that are repeated. This could
sometimes pose a problem for us when we want to analyse the data. What we would
like to do is to use meaningful indexes that uniquely identify each row and makes it
easier to get a sense of the data we are working with. This is where MultiIndex or
Hierarchical Indexing comes in.
• We do this by using the set_index() method. For Hierarchical indexing, we use
set_index() method for passing a list to represent how we want the rows to be identified
uniquely.
In [5]: fifa19.set_index(['Position', 'Rank'], drop=False)
In [6]: fifa19
Out[6]:
• We can see from the code above that we have set our new indexes to 'Position' and
'Rank', but there is a replication of these columns. This is because we passed
drop=False, which keeps the columns where they are. The default, however, is
drop=True, so without indicating drop=False the two columns will be set as the indexes
and the original columns deleted automatically.
In [7]: fifa19.set_index(['Position', 'Rank'])
Out[7]:
                    Name  Overall
Position Rank
GK       1st      De Gea       91
         3rd     Coutois       88
         2nd     Allison       89
DF       3rd    Van Dijk       89
         1st       Ramos       91
         2nd       Godin       90
MF       2nd      Hazard       91
         3rd       Kante       90
         1st   De Bruyne       92
CF       1st     Ronaldo       94
         2nd       Messi       93
         3rd      Neymar       92
• We use set_index() with an ordered list of column labels to make the new indexes. To
verify that we have indeed set our DataFrame to a hierarchical index, we call the .index
attribute.
In [8]: fifa19 = fifa19.set_index(['Position', 'Rank'])
In [9]: fifa19.index
Out[9]: MultiIndex(levels = [['CF', 'DF', 'GK', 'MF'],
['1st', '2nd', '3rd']],
codes = [[2, 2, 2, 1, 1, 1, 3, 3, 3, 0, 0, 0],
[0, 2, 1, 2, 0, 1, 1, 2, 0, 0, 1, 2]],
names = ['Position', 'Rank'])
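With the hierarchical index in place, rows can be selected level by level with .loc; a brief sketch continuing the fifa19 example :
print(fifa19.loc['GK']) # all rows whose outer index level is 'GK'
print(fifa19.loc[('GK', '1st')]) # the single row indexed ('GK', '1st')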
• The date column can be parsed using the extremely handy dateutil library.
import pandas as pd
import dateutil.parser
# Load data from csv file (DataFrame.from_csv was removed from recent pandas versions)
data = pd.read_csv('phone_data.csv')
# Convert date from string to datetime
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
• Once the data has been loaded into Python, Pandas makes the calculation of different
statistics very simple. For example, mean, max, min, standard deviations and more for
columns are easily calculable:
# How many rows are there in the dataset?
data['item'].count()
Out[38]: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out[39]: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out[41]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out[42]: 9
groupby() function :
• groupby() essentially splits the data into different groups depending on a variable of
the user's choice.
• The groupby() function returns a GroupBy object, which essentially describes how the
rows of the original data set have been split. The GroupBy object's groups variable is a
dictionary whose keys are the computed unique groups and whose corresponding
values are the axis labels belonging to each group.
• Functions like max(), min(), mean(), first(), last() can be quickly applied to the
GroupBy object to obtain summary statistics for each group.
• The GroupBy object supports column indexing in the same way as the DataFrame
and returns a modified GroupBy object.
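A short sketch, assuming the phone_data columns used above (month, item, duration) :
# Split the rows by month, then summarize each group
grouped = data.groupby('month')
print(list(grouped.groups.keys())) # the computed unique groups
print(grouped['duration'].sum()) # total call/data duration per month
print(data.groupby(['month', 'item'])['duration'].mean()) # grouping on two keys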