Practice 1

The document provides an overview of data preparation using the pandas library in Python, emphasizing its importance for accurate data analysis. It covers key steps such as loading, cleaning, and transforming data, as well as handling missing values with methods like dropna() and fillna(). Additionally, it discusses various techniques for filling missing data, including forward fill and using statistical measures like mean and median.


Data Preparation with pandas

• We will see some basic concepts and steps for data preparation.
• This is the step when you pre-process raw data into a form that can
be easily and accurately analyzed.
• Proper data preparation allows for efficient analysis - it can eliminate
errors and inaccuracies that could have occurred during the data
gathering process and can thus help in removing some bias resulting
from poor data quality.
• Therefore, a lot of an analyst's time is spent on this vital step.
• Loading data, cleaning data (removing unnecessary data or erroneous
data), transforming data formats, and rearranging data are the various
steps involved in the data preparation step.
• Here you will work with Python's Pandas library for data preparation.
Pandas
• Pandas is a software library written for Python.
• It is widely used in the data science community because it offers
powerful, expressive, and flexible data structures that make data
manipulation and analysis easy, and it is freely available.
• To use the pandas library, you need to first import it. Just type this in
your python console:
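The import itself is not reproduced in this copy of the slides; it is simply:

```python
# Import the pandas library under the conventional alias "pd"
import pandas as pd

# Every pandas function and class can now be reached via the "pd." prefix
print(pd.__version__)
```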

• This code imports the pandas library in Python and assigns it the alias "pd".
• This allows the user to access the functions and methods provided by the pandas library in their code by using
the prefix "pd." before the function or method name.
Loading Data
• If you have a .csv file, you can easily load it up in your system using
the .read_csv() function in pandas.
• You can then type:
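The snippet is missing here; it presumably resembled the following sketch. The file name my_data.csv is a placeholder (not from the original), and the example writes a tiny CSV first so that it is self-contained:

```python
import pandas as pd

# Write a small CSV so the example can run on its own
# ("my_data.csv" is a placeholder file name)
with open('my_data.csv', 'w') as f:
    f.write('a,b\n1,2\n3,4\n')

# Load the .csv file into a DataFrame
df = pd.read_csv('my_data.csv')
print(df)
```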

• However, in this section, you shall be working with some smaller,
easy-to-understand DataFrames and Series created on the fly.
Missing Data
Handling Missing Data
• This is a widespread problem in the data analysis world.
• Missing data can arise in the dataset due to multiple reasons:
• the data for the specific field was not added by the user/data collection
application,
• data was lost while transferring manually, a programming error, etc.
• It is sometimes essential to understand the cause because this will
influence how you deal with such data.
• Let's focus on what to do if you have missing data...
• For numerical data, pandas uses a floating point value NaN (Not a
Number) to represent missing data.
• It is a special value defined in the NumPy library, so we will need
to import it as well.
• NaN is the default missing value marker for reasons of computational
speed and convenience.
• This is a sentinel value, in the sense that it is a dummy data or flag
value that can be easily detected and worked with using functions in
pandas.
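The code the next bullets describe is not shown in this copy; based on the output discussed later (values 0.0 through 8.0 with a NaN at index 6), it was presumably along these lines:

```python
import numpy as np
import pandas as pd

# A Series of 10 elements with a NaN at index 6
data = pd.Series([0, 1, 2, 3, 4, 5, np.nan, 6, 7, 8])

# Check which index contains a null value
print(data.isnull())
```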
• This code imports the NumPy library and creates a
Pandas series called "data" with 10 elements, including
a NaN (not a number) value.
• The "isnull()" method is then called on the "data"
series to check which index contains a null value.
• The output will be a boolean series with "True" values
where there are null values and "False" values where
there are not.

• The output is a Pandas Series object that contains
boolean values.
• The index of the Series ranges from 0 to 9,
and each value in the Series is either True
or False.
• In this particular example, only the value at
index 6 is True, while all other values are
False.
• Above, we used the function isnull() which returns a boolean true or false value.
• True, when the data at that particular index is actually missing or NaN.
• The opposite of this is the notnull() function.

# To check where the dataset does not contain null value - opposite of isnull()
data.notnull()

• This code uses the notnull() function to check where the dataset does not contain null values.
• The notnull() function returns a boolean value of True for each element in the dataset that is not null and
False for each element that is null.
• In this code, data is assumed to be a Pandas DataFrame or Series object.
• The notnull() function is called on this object, which returns a new DataFrame or Series object with the
same shape as the original object, but with boolean values indicating whether each element is null or not.
• Overall, this code is a simple way to check for null values in a dataset and can be useful for data cleaning
and preprocessing tasks.
• This code snippet is a pandas Series object that contains boolean
values.
• The index of the Series ranges from 0 to 9, and each value in the
Series is either True or False.
• The Series contains mostly True values, with only one False value at
index 6.
• Furthermore, we can use the dropna() function to filter out missing
data and to remove the null (missing) value and see only the non-null
values.
• However, the NaN value is not really deleted and can still be found in
the original dataset.
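A sketch of the call being described (the Series is recreated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

data = pd.Series([0, 1, 2, 3, 4, 5, np.nan, 6, 7, 8])

# The value at index 6 will not appear, since it is null
print(data.dropna())

# The original Series still contains the NaN
print(data)
```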

• This piece of code uses the dropna() method to remove any entries from the data
Series that contain null (NaN) values.
• The method returns a new Series with the null values removed.
• In this case, the original Series is not modified, as the result is not assigned
to a variable.
• The comment above the code indicates that the entry at index 6 will not be
included in the new Series because it contains a null value.
• This piece of code is displaying a Pandas Series object that contains a sequence of
floating-point numbers.
• The numbers are indexed by integers starting from 0 and increasing by 1 for each
subsequent element.
• The dtype attribute indicates that the data type of the elements in the Series is float64.

• This piece of code is a Pandas Series object that contains 10
elements, each with an index and a corresponding floating-point
value.
• The index values range from 0 to 9, and the values are 0.0, 1.0,
2.0, 3.0, 4.0, 5.0, NaN (Not a Number), 6.0, 7.0, and 8.0.
• The dtype of the Series object is float64, which means that all
the values are floating-point numbers.
• What you can do to really "drop" or delete the NaN value is either
store the new dataset (without NaN) so that the original data Series is
not tampered with, or apply the drop in place.
• The inplace argument has a default value of False.
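A sketch of the first option, storing the non-null data in a new variable:

```python
import numpy as np
import pandas as pd

data = pd.Series([0, 1, 2, 3, 4, 5, np.nan, 6, 7, 8])

# Store the non-null data separately; the original Series is untouched
not_null_data = data.dropna()
print(not_null_data)
```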

• The code snippet drops all the entries from the Series 'data' that contain
missing values (NaN) and assigns the result to the variable
'not_null_data'.
• The 'dropna()' method is used to remove the entries with missing values.
• The second line of code simply prints the resulting Series
'not_null_data'.
• This piece of code is simply displaying a Pandas Series object that contains a sequence
of floating-point numbers.
• The numbers are indexed by integers starting from 0 and increasing by 1 for each
subsequent element.
• The dtype attribute indicates that the data type of the elements in the Series is
float64.
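And a sketch of the second option, the in-place drop:

```python
import numpy as np
import pandas as pd

data = pd.Series([0, 1, 2, 3, 4, 5, np.nan, 6, 7, 8])

# Drop the NaN in place -- this modifies data itself
data.dropna(inplace=True)
print(data)
```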

• The code snippet drops the entry with index 6 from the Series 'data' because it contains a NaN
(Not a Number) value.
• The 'dropna' method is used to remove any entries with missing values.
• The 'inplace=True' parameter is used to modify the original Series 'data' instead of creating a
new one.
• Finally, the updated Series is printed using the 'data' variable.
• This piece of code is displaying a Pandas Series object that contains a sequence of
floating-point numbers.
• The numbers are indexed by integers starting from 0 and increasing by 1 for each
subsequent element.
• The dtype attribute indicates that the data type of the elements in the Series is
float64.
• However, DataFrames can be more complex and have 2 dimensions,
meaning they contain rows and columns.
# Creating a dataframe with 4 rows and 4 columns (4*4 matrix)
data_dim = pd.DataFrame([[1,2,3,np.nan],[4,5,np.nan,np.nan],[7,np.nan,np.nan,np.nan],[np.nan,np.nan,np.nan,np.nan]])
data_dim

• This piece of code creates a pandas DataFrame called data_dim with 4 rows and 4 columns, where each
element in the DataFrame is either a number or a missing value represented by np.nan.
• The DataFrame is created by passing a list of lists to the pd.DataFrame() function.
• Each inner list represents a row in the DataFrame, and the elements in the list represent the values in each
column of that row.
• In this case, the first row has the values 1, 2, 3, and a missing value, the second row has the values 4, 5, and
two missing values, the third row has the value 7 and three missing values, and the fourth row has four
missing values.
• Now let's say you only want to drop rows or columns that are all null or only
those that contain a certain amount of null values.
• Check out the various scenarios; remember that the drop is not happening
in place, so the real dataset is not tampered with.
• Pay attention to the arguments passed to the dropna() function to determine
how you drop the missing data.
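A sketch of the default call (data_dim recreated from the earlier snippet):

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Default dropna(): drop every row that contains at least one NaN
print(data_dim.dropna())
```

Since every row of this particular DataFrame contains at least one NaN, the result is an empty DataFrame.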
• This piece of code drops all rows containing NaN (Not a Number) values from the
DataFrame data_dim.
• The dropna() method is called on the DataFrame data_dim, which returns a new DataFrame with
all rows containing NaN values removed.
• However, it's important to note that this method does not modify the original DataFrame, so if
you want to update the original DataFrame, you need to assign the result of dropna() back to
data_dim.
• For example, data_dim = data_dim.dropna() will update the data_dim DataFrame
with the new DataFrame that has NaN values removed.
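A sketch of the how='all' variant described next:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Drop only the rows in which every value is NaN (the last row here)
print(data_dim.dropna(how='all'))
```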
• The code snippet drops all rows and columns that contain entirely NaN (Not a Number) values from the DataFrame
called data_dim.
• The dropna() method is called on the DataFrame data_dim with the parameter how='all'.
• This parameter specifies that only rows or columns containing entirely NaN values should be dropped.
• The method returns a new DataFrame with the specified rows and columns dropped.
• However, the original DataFrame data_dim is not modified unless the inplace=True parameter is specified
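The column-wise version of the same idea can be sketched as:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# axis=1 operates on columns: drop columns in which every value is NaN
print(data_dim.dropna(axis=1, how='all'))
```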

• This piece of code drops only the columns in the DataFrame data_dim that contain entirely NaN (Not a Number)
values.
• The axis parameter is set to 1, which signifies columns, and the how parameter is set to 'all', which means that only
columns containing all NaN values will be dropped.
• The dropna() method returns a new DataFrame with the specified columns dropped.
• However, it does not modify the original DataFrame data_dim.
• If you want to modify the original DataFrame, you need to assign the result of the dropna() method back to data_dim.
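A sketch of the thresh variant on columns:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Keep only columns that have at least 2 non-null values
print(data_dim.dropna(axis=1, thresh=2))
```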
• This piece of code drops all columns from the DataFrame data_dim that have
more than 2 NaN (Not a Number) values.
• The dropna() method is used to remove missing values from a DataFrame.
• The axis parameter is set to 1, which means that the operation is performed on
columns.
• The thresh parameter is set to 2, which means that any column with fewer than 2
non-missing values will be dropped.
• The resulting DataFrame will keep only the columns with at least 2 non-null values.
• Note that this code does not modify the original DataFrame, but instead returns
a new DataFrame with the specified columns removed.
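And the same thresh logic applied to rows:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Keep only rows that have at least 2 non-null values
print(data_dim.dropna(thresh=2))
```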
• This piece of code drops all rows from the DataFrame data_dim that have more
than 2 NaN (Not a Number) values.
• The dropna() method is used to remove missing values from a DataFrame.
• The thresh parameter specifies the minimum number of non-null values that a row
or column must have in order to be kept.
• In this case, thresh = 2 means that any row with less than 2 non-null values will be
dropped.
• Note that this code does not modify the original DataFrame data_dim, but instead
returns a new DataFrame with the specified rows removed.
• If you want to modify the original DataFrame, you can use the inplace=True
parameter.

Summary:
• identify and drop missing values - whether to simply see the resultant dataset or do an inplace deletion.

In many cases, simply dropping the null values is not a feasible option,
and you might want to fill in the missing data with some other value.
Let's see how you can do that:
Filling in Missing Data
To replace or rather "fill in" the null data, you can use the fillna() function. For example, let's try to use the same
dataset as above and try to fill in the NaN values with 0.
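A sketch of that fill, with the DataFrame recreated so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Replace every NaN with 0
data_dim_fill = data_dim.fillna(0)
print(data_dim_fill)
```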

• The piece of code then displays the resulting DataFrame, with the NaN values filled in with 0.
• And like with dropna() you can also do many other things depending
on the kind of argument you pass. Also a reminder that passing the
inplace = True argument will make the change to the original dataset.
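The per-column fill described next can be sketched as:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Fill each column's NaNs with a different value, keyed by column label
data_dim_fill = data_dim.fillna({0: 0, 1: 8, 2: 9, 3: 10})
print(data_dim_fill)
```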

• The code snippet is using the fillna() method to fill missing values in a pandas DataFrame called data_dim.
• The fillna() method is being passed a dictionary as an argument, where the keys represent the column indices
and the values represent the values to be used to fill the missing values in each column.
• In this case, the dictionary is {0: 0, 1: 8, 2: 9, 3: 10}, which means that missing values in column 0 will be filled
with 0, missing values in column 1 will be filled with 8, missing values in column 2 will be filled with 9, and
missing values in column 3 will be filled with 10.
• The resulting DataFrame with the filled values is then printed using the data_dim_fill variable.
You can pass a method argument to the fillna() function that automatically propagates non-null values
forward (ffill or pad) or backward (bfill or backfill).
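The slides use fillna(method='ffill'); recent pandas versions deprecate that argument in favour of the equivalent .ffill() method, which this sketch uses:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Forward fill down each column (same effect as fillna(method='ffill'));
# a column with no earlier non-null value, like column 3, stays NaN
data_dim_fill = data_dim.ffill()
print(data_dim_fill)
```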
• This piece of code fills missing values in a DataFrame called data_dim using the
fillna() method.
• The method parameter is set to 'ffill', which stands for forward fill.
• This means that any missing values in the DataFrame will be filled with the last
non-null value in the same column.
• The resulting DataFrame is stored in a new variable called data_dim_fill and is
printed to the console.
• You can also limit the number of fills above.
• For example, fill up only two places in the columns...
• Also, if you pass axis = 1 this will fill out row value accordingly.
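A sketch of the limited forward fill (again using .ffill(), the modern equivalent of fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Forward fill down each column, but fill at most 2 consecutive NaNs
data_dim_fill = data_dim.ffill(limit=2)
print(data_dim_fill)
```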

• This piece of code fills missing values in a DataFrame called data_dim using the fillna() method.
• The method parameter is set to 'ffill', which stands for forward fill.
• This means that missing values will be filled with the last known value in the column.
• The limit parameter is set to 2, which means that the forward fill will only be applied to a maximum of
2 consecutive missing values.
• The resulting DataFrame is stored in a new variable called data_dim_fill and is printed to the console
• This new DataFrame will have missing values replaced with the last known value in the column, up to a
maximum of 2 consecutive missing values.
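The row-wise fill described next can be sketched as:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# axis=1 fills along each row, carrying the last non-null value
# in that row forward across the columns
data_dim_fill = data_dim.ffill(axis=1)
print(data_dim_fill)
```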
• This piece of code fills in missing values in a DataFrame called
data_dim using the forward fill method.
• The fillna() method is called on the DataFrame with the axis
parameter set to 1, indicating that we want to fill in missing values
along the rows.
• The method parameter is set to 'ffill', which stands for forward fill,
meaning that missing values are filled with the last non-null value in
the same row.
• The resulting DataFrame with filled values is then printed to the
console.
• With some understanding of the data and your use-case, you can use the fillna() function in many other ways
than simply filling it with numbers.
• You could fill it up using the mean with the mean() function, or the median with the median() function, as well...
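A sketch of the mean-based fill:

```python
import numpy as np
import pandas as pd

data_dim = pd.DataFrame([[1, 2, 3, np.nan],
                         [4, 5, np.nan, np.nan],
                         [7, np.nan, np.nan, np.nan],
                         [np.nan, np.nan, np.nan, np.nan]])

# Fill each column's NaNs with that column's mean;
# a column that is entirely NaN has no mean, so it stays NaN
data_dim_fill = data_dim.fillna(data_dim.mean())
print(data_dim_fill)
```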
• This piece of code fills the missing values (NaN) in a pandas DataFrame called data_dim with the mean value of
each column.
• The fillna() method is called on the data_dim DataFrame with the argument data_dim.mean(), which calculates
the mean value of each column in the DataFrame.
• The resulting DataFrame with the filled values is assigned to a new variable called data_dim_fill.
• Finally, the data_dim_fill DataFrame is printed to the console.
Data Transformation
Replacing Values
• So far, you have only worked with missing data (NaN), but there could
be situations where you would want to replace a non-null value with
a different value.
• Or maybe a null value is recorded as a random number, and hence
needs to be processed as NaN rather than a number.
• This is where the replace() function comes in handy...
Let's create a different dataset this time. • This piece of code creates a Pandas Series object called
"data" that contains a list of integers. •
• The integers include 1, 2, -99, 4, 5, -99, 7, 8, and -99.
• The "pd" in the first line refers to the Pandas library,
which is imported at the beginning of the code.
• The "Series" function is used to create a one-dimensional
labeled array that can hold any data type.
• The "data" variable is then printed to the console.
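The Series creation and the replacement described next can be sketched as:

```python
import numpy as np
import pandas as pd

# A Series where -99 stands in for missing values
data = pd.Series([1, 2, -99, 4, 5, -99, 7, 8, -99])
print(data)

# Replace every occurrence of -99 with NaN
print(data.replace(-99, np.nan))
```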
• This piece of code replaces all occurrences of the value -99 in the variable data with NaN (Not a Number)
• The replace() method is called on the data object and takes two arguments: the value to be replaced (-99) and
the value to replace it with (np.nan).
• np.nan is a constant from the NumPy library that represents NaN.
• This piece of code is useful for cleaning up data that contains placeholder values that are not valid data points.
• You will no longer see the -99, because it is replaced by NaN and hence not shown.
• Similarly, you can pass multiple values to be replaced.
• To do this, we will create another series and then concatenate the original data series with the
new series and then apply the multiple value replace function.
Concatenating Pandas Series
• To do this, we can use the concat() function in pandas.
• To continue the indexing after applying the concatenation, you can pass the ignore_index = True argument to it.
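The concatenation described above can be sketched as:

```python
import pandas as pd

data = pd.Series([1, 2, -99, 4, 5, -99, 7, 8, -99])
new_data = pd.Series([-100, 11, 12, 13])

# Concatenate the two Series and renumber the index from 0
combined_series = pd.concat([data, new_data], ignore_index=True)
print(combined_series)
```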

• This piece of code creates a new Pandas Series called new_data with four elements: -100, 11, 12, and 13.
• Then, it uses the pd.concat() function to concatenate this new series with an existing series called data.
• The ignore_index=True parameter ensures that the resulting series has a new index that starts from 0 and
increments by 1 for each element.
• Finally, the resulting combined series is printed to the console.
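The multi-value replacement described next can be sketched as:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, -99, 4, 5, -99, 7, 8, -99])
new_data = pd.Series([-100, 11, 12, 13])
combined_series = pd.concat([data, new_data], ignore_index=True)

# Replace both sentinel values, -99 and -100, with NaN in one call
data_replaced = combined_series.replace([-99, -100], np.nan)
print(data_replaced)
```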
• This piece of code replaces the values of
-99 and -100 with NaN (Not a Number)
in the combined_series object.
• The replace() method is called on the
combined_series object with the values
to be replaced and the replacement
value as arguments.
• In this case, the values to be replaced
are -99 and -100, and the replacement
value is np.nan, which is a constant
from the NumPy library that represents
NaN.
• The resulting object with replaced
values is stored in the data_replaced
variable and printed to the console.
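The dictionary-based variant described next can be sketched as:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, -99, 4, 5, -99, 7, 8, -99])
new_data = pd.Series([-100, 11, 12, 13])
combined_series = pd.concat([data, new_data], ignore_index=True)

# Map each value to its own replacement: -99 -> NaN, -100 -> 0
data_replaced = combined_series.replace({-99: np.nan, -100: 0})
# Equivalent alternative: combined_series.replace([-99, -100], [np.nan, 0])
print(data_replaced)
```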
• This piece of code replaces specific values in a Pandas
Series object called combined_series.
• The replace() method is used with a dictionary
argument that maps the values to be replaced with
their corresponding replacement values.
• In this case, the values -99 are replaced with np.nan
and the values -100 are replaced with 0.
• The resulting Series object with the replaced values is
stored in a new variable called data_replaced.
• The code also includes a comment that shows an
alternative way to use the replace() method by passing
two lists of values to be replaced and their
corresponding replacement values.
• Overall, this code demonstrates how to replace specific
values in a Pandas Series object using the replace()
method.
Adding knowledge - Map Function

data_number = pd.DataFrame({'english': ['zero','one','two','three','four','five'], 'digits': [0,1,2,3,4,5]})
data_number

• This piece of code creates a pandas
DataFrame called data_number with
two columns: english and digits.
• The english column contains strings
representing the English words for
the numbers 0-5, while the digits
column contains the corresponding
numerical values.
• The pd.DataFrame() function is used
to create the DataFrame, with the
data passed in as a dictionary where
the keys are the column names and
the values are lists of data for each
column.
• Finally, the data_number DataFrame
is printed to the console.
• Let's say you now want to add another column indicating multiples of two as 'Yes'
and the rest as 'No'.
• You can write down a mapping from each distinct English word to its corresponding
'Yes' or 'No'.

• Then you can call the map function to add the column for when the English column is a multiple of 2.
• What gets filled in the other non-multiple number columns? Let's find out...
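A sketch of the mapping and the map() call; the exact contents of english_to_multiple are an assumption consistent with the description (only the multiples of two are mapped, so the remaining words come out as NaN):

```python
import pandas as pd

data_number = pd.DataFrame({'english': ['zero','one','two','three','four','five'],
                            'digits': [0, 1, 2, 3, 4, 5]})

# Assumed mapping: only multiples of two are listed,
# so every unmapped word becomes NaN in the new column
english_to_multiple = {'zero': 'Yes', 'two': 'Yes', 'four': 'Yes'}

data_number['multiple'] = data_number['english'].map(english_to_multiple)
print(data_number)
```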

• This piece of code creates a new column called 'multiple' in the 'data_number'
DataFrame.
• The values in this column are obtained by applying the 'english_to_multiple'
function to the values in the 'english' column of the same DataFrame using the
'map' method.
• The other columns are filled with NaN values, and you already know how to
further work with missing values.
Discretization - Cut Function
• Sometimes you might want to categorize based on some logic and put all the data into discrete buckets or
bins for analysis purpose.
• You can use the cut() function for this.
• For example, let's first create a dataset containing 30 random numbers between 1 - 100.
• Say we want to categorize those in terms of some bucket we define ourselves: numbers between
1 - 25, then 25 - 35, 40 - 60 and then 60 - 80 and then the rest.
• So we define a bucket...
• Then we will use the cut function.
• This piece of code uses the pandas library.
• The code defines a list of values called bucket, which will be
used to create bins for the data.
• These values represent the edges of the bins.
• The pd.cut() function is then used to cut the data into these
bins.
• The data variable is the data that will be cut into bins.
• The resulting output will be a new pandas Series object that
contains the binned data.
• Overall, this code is used to discretize continuous data into
categorical data by creating bins based on the specified
values in the bucket list.
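A sketch of those steps; the random seed is added here only to make the sketch reproducible and is not from the original:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # seed chosen for reproducibility (assumption)

# 30 random numbers between 1 and 100
data = pd.Series(np.random.randint(1, 101, 30))

# Bin edges: (1-25], (25-35], (35-60], (60-80], (80-100]
bucket = [1, 25, 35, 60, 80, 100]
binned = pd.cut(data, bucket)
print(binned)
```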

[(35, 60], (35, 60], (1, 25], (80, 100], (25, 35], ..., (60, 80], (60, 80], (80, 100], (35, 60], (35, 60]]
Length: 30
Categories (5, interval[int64]): [(1, 25] < (25, 35] < (35, 60] < (60, 80] < (80, 100]]

• The variable has been divided into 5 categories based on the intervals of values it
takes.
• The intervals are represented as tuples of integers, and the categories are represented
as intervals of type interval[int64].
Dummy variables and One-Hot Encodings
• This topic is particularly useful when you want to be able to convert some categorical data into numerical values
so that you can use it in your data analysis model.
• This is particularly handy, especially when doing machine learning modeling, where the concept of one-hot
encoding is famous. Using more technical words: one-hot encoding is the process of converting categorical values
into a 1-dimensional numerical vector.
• One way of doing this using pandas is to use the get_dummies() function.
• If a column in your dataframe has 'n' distinct values, the function will derive a matrix with 'n' columns containing
all 1s and 0s.
• Let's see this with an example:
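The Series creation the next bullets describe can be sketched as:

```python
import pandas as pd

# Turn the string into a list of single characters, then into a Series
data = pd.Series(list('abcdababcdabcd'))
print(data)
```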

• This code creates a Pandas Series object called data by converting a list of individual
characters 'abcdababcdabcd' into a Series.
• The pd.Series() function is used to create the Series object, and the list() function is used to
convert the string of characters into a list.
• The resulting Series contains each individual character as a separate element.
• The output of the code is the data Series object, which is printed to the console.
• Let's say now you want to have individual vectors indicating the appearance of each character to feed it to a
function.
• Something like this: for 'a' = [1,0,0,0,1,0,1,0,0,0,1,0,0,0] where 1 is in the position where 'a' exists and 0 for where it
does not.
• Using the get_dummies() function will make the task easier.
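A sketch of the get_dummies() call; dtype=int is passed so the output shows 1s and 0s, since recent pandas versions otherwise default to booleans:

```python
import pandas as pd

data = pd.Series(list('abcdababcdabcd'))

# One indicator column per distinct character
dummies = pd.get_dummies(data, dtype=int)
print(dummies)
```

The 'a' column of the result is exactly the vector [1,0,0,0,1,0,1,0,0,0,1,0,0,0] described above.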
