
Python Programming

Unit 2

Data Wrangling

• Data Wrangling is the process of transforming data from its original "raw" form into a more
digestible format and organizing sets from various sources into a singular coherent whole for
further processing.

• Data wrangling is also called data munging.

• The primary purpose of data wrangling can be described as getting data in coherent shape.
In other words, it is making raw data usable. It provides substance for further proceedings.

• Data wrangling covers the following processes:

1. Getting data from various sources into one place

2. Piecing the data together according to the desired structure

3. Cleaning the data of noise, errors, and missing elements.

• Data wrangling is the process of cleaning, structuring and enriching raw data into a desired
format for better decision making in less time.

Significance of Data Cleaning


Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying
and correcting or handling errors, inconsistencies, and inaccuracies in datasets. The purpose
of data cleaning is to improve the quality of the data, making it more reliable, accurate, and
suitable for analysis.
 Accuracy and Reliability: Clean data is more accurate and reliable. Removing errors,
duplications, and inconsistencies ensures that the data reflects the true values and
relationships in the real world. This, in turn, improves the quality of analyses and
decision-making based on the data.

 Consistency: Data cleaning helps maintain consistency within a dataset. Consistent
data is easier to work with and less prone to misinterpretation. It ensures that the same
type of information is represented uniformly across the dataset.

 Completeness: Data cleaning addresses missing values in a dataset. Missing data can
lead to biased analyses and incorrect conclusions. By filling in or appropriately
handling missing values, data cleaning ensures that the dataset is more complete.
 Relevance: Cleaning data involves identifying and removing irrelevant or unnecessary
information. This improves the overall efficiency of data analysis by focusing on the
most relevant variables and reducing noise in the dataset.

 Conformity to Standards: Data cleaning helps ensure that the dataset adheres to
specific standards and formatting conventions. This is particularly important when
combining data from different sources or when preparing data for specific applications
or systems.

 Improved Data Integration: When combining data from multiple sources, data
cleaning is essential to align different datasets. This includes handling variations in
naming conventions, units of measurement, and other discrepancies that may arise
when integrating diverse datasets.

 Better Decision-Making: Clean data leads to better-informed decisions. Decision-makers
rely on accurate and reliable information to make strategic choices and
develop insights. Data cleaning contributes to the creation of trustworthy datasets that
support sound decision-making.

 Increased Productivity: Working with clean data is more efficient. Analysts and data
scientists spend less time troubleshooting and addressing data issues, allowing them to
focus on deriving meaningful insights and adding value to the organization.

Uncleaned data may lead to:


 Inaccuracies in Analysis: Uncleaned data may contain errors, outliers, or
inconsistencies that can lead to inaccurate analyses. Misleading results can, in turn,
influence decision-making based on flawed insights.

 Biased Conclusions: Data that is not cleaned may have biases due to missing values,
outliers, or inconsistencies. Biased data can result in skewed conclusions, leading to
inaccurate assessments and predictions.

 Misinterpretation of Trends: Incomplete or inconsistent data can lead to
misinterpretation of trends or patterns. Analysts may draw incorrect conclusions about
the relationships between variables or fail to identify significant insights due to noise
in the data.

 Data Integration Challenges: When combining data from multiple sources, the lack
of cleaning can make it challenging to integrate datasets. Incompatible formats, units,
or naming conventions may hinder the merging of data, leading to inconsistencies and
errors.

 Operational Inefficiencies: Uncleaned data may require more time and effort to
handle during analysis. Analysts may spend a significant portion of their time
troubleshooting issues, rather than focusing on deriving meaningful insights and value
from the data.
 Missed Opportunities: Valuable insights may be overlooked or missed if data is not
cleaned. Inaccurate or incomplete data can obscure important trends and opportunities
that could otherwise be leveraged for strategic decision-making.

DataFrames
 The majority of data analysis in Python is performed in pandas DataFrames. These
are rectangular datasets consisting of rows and columns.

 An observation contains all the values or variables related to a single instance of the
objects being analyzed. For example, in a dataset of movies, each movie would be an
observation.
 A variable is an attribute for the object, across all the observations. For example, the
release dates for all the movies.

 Tidy data provides a standard way to organize data. Having a consistent shape for
datasets enables you to worry less about data structures and more about getting useful
results. The principles of tidy data are:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.

 Pandas features a number of functions for reading tabular data as a DataFrame object.
The following table summarizes some of them; pandas.read_csv() is one of the most
frequently used.

Function         Description
read_csv         Load delimited data from a file, URL, or file-like object; use comma as default delimiter
read_excel       Read tabular data from an Excel XLS or XLSX file
read_html        Read all tables found in the given HTML document
read_json        Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object
read_spss        Read a data file created by SPSS
read_sql         Read the results of a SQL query
read_sql_table   Read a whole SQL table; equivalent to using a query that selects everything in that table using read_sql
read_xml         Read a table of data from an XML file
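For instance, a minimal sketch of loading a CSV file into a DataFrame and inspecting it (the file name movies.csv is hypothetical):

import pandas as pd

# "movies.csv" is a hypothetical file path used only for illustration
df = pd.read_csv("movies.csv")

print(df.head())     # first few rows (observations)
print(df.columns)    # the variables
print(df.shape)      # (number of rows, number of columns)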

Handling missing values


Handling missing values is a crucial step in data preprocessing to ensure that your dataset is
ready for analysis. Missing values can occur for various reasons, such as data collection
errors, equipment malfunctions, or the nature of the data itself.
Some common strategies for handling missing values in Python using libraries like pandas:

Identify Missing Values:


Before handling missing values, it's important to identify them. We can use the following
functions to detect missing values in a DataFrame:

isna() Returns Boolean values indicating which values are missing/NA.

notna() Negation of isna; returns True for non-NA values and False for NA values.

Example:
import pandas as pd

# Load your dataset


df = pd.read_csv('your_dataset.csv')

# Check for missing values


print(df.isnull().sum())

Dropping the rows and columns having missing values:


If the number of missing values is relatively small, or if the affected rows or columns are not
critical, you can choose to remove them using the dropna() function.
By default, dropna() drops any row containing a missing value, while passing how="all" drops
only rows whose values are all NA.
These functions return new objects by default and do not modify the contents of the original
object. To make the changes in the original DataFrame we need to set inplace=True.
To drop columns in the same way, pass axis=1 (or axis="columns").
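A minimal sketch of these options on a small, made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                   "b": [np.nan, np.nan, 6.0, 7.0]})

print(df.dropna())                   # drop every row containing at least one NA
print(df.dropna(how="all"))          # drop only rows where all values are NA
print(df.dropna(axis=1, how="all"))  # drop columns that are entirely NA
df.dropna(inplace=True)              # modify df itself instead of returning a new object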

Imputation (Filling Missing Values):


Imputation involves filling missing values with estimated or calculated values. Common
imputation methods include using the mean, median, or mode of the column. Calling fillna()
with a constant replaces missing values with that value.
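A short illustrative sketch (the column name score is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [10.0, np.nan, 30.0, np.nan, 50.0]})

print(df.fillna(0))                              # replace NA with a constant
print(df.fillna(df["score"].mean()))             # impute with the column mean
print(df["score"].fillna(df["score"].median()))  # or with the median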

Removing duplicates
Duplicates can distort analyses and lead to inaccurate results. Python's pandas library
provides convenient methods for identifying and removing duplicate rows from a
DataFrame.
Identify and Display Duplicates:
The DataFrame method duplicated() returns a Boolean Series indicating whether each row is
a duplicate (its column values are exactly equal to those in an earlier row) or not.
Remove Duplicates:
drop_duplicates() returns a DataFrame with the rows where the duplicated array is False
filtered out.

This method considers all of the columns by default; alternatively, you can specify any subset
of them to detect duplicates. Suppose we had an additional column of values and wanted to
filter duplicates based only on the "k1" column.

duplicated and drop_duplicates keep the first observed value combination by default; passing
keep="last" will keep the last one instead.
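A sketch of these methods on a small example frame (the columns k1, k2 and v1 are illustrative):

import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

print(data.duplicated())        # True where a row repeats an earlier row
print(data.drop_duplicates())   # keep only the first occurrence of each row

data["v1"] = range(7)           # an additional column of values
print(data.drop_duplicates(subset=["k1"]))                     # deduplicate on "k1" only
print(data.drop_duplicates(subset=["k1", "k2"], keep="last"))  # keep the last occurrence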

Transforming data using a function or mapping


For many datasets, you may wish to perform some transformation based on the values in an
array, Series, or column in a DataFrame.

Consider a dataset containing a column with the values A, B and C.

Suppose we want to add a column indicating the labels High, Medium and Low respectively
for A, B and C. We create a mapping from each value to its respective label.

The map() method on a Series accepts a function or a dictionary-like object containing a
mapping and uses it to transform the values. We can also pass a function that defines the
mapping.

Thus, using map is a convenient way to perform element-wise transformations and other data
cleaning-related operations.
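A brief sketch, assuming a hypothetical column named grade:

import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "C", "A", "C"]})

label_map = {"A": "High", "B": "Medium", "C": "Low"}

# map with a dictionary-like object
df["label"] = df["grade"].map(label_map)

# equivalently, map with a function that defines the mapping
df["label"] = df["grade"].map(lambda g: label_map[g])
print(df)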
Discretization and Binning
Discretization and binning are techniques used in data preprocessing to convert continuous
data into discrete intervals or bins. This process is often useful for simplifying complex data,
reducing noise, and preparing data for analysis.

Discretization Using pandas cut Function:

The cut function in pandas is commonly used for discretization. It divides the continuous
data into bins and assigns discrete labels to each bin. Here's an example:

The object pandas.cut returns is a special Categorical object. The output describes the bins
computed by pandas.cut; each bin is identified by a special (unique to pandas) interval value
type containing the lower and upper limit of the bin.

In the string representation of an interval, a parenthesis means that side is open (exclusive),
while a square bracket means it is closed (inclusive). You can change which side is closed by
passing right=False.

You can also override the default interval-based bin labeling by passing a list or array to the
labels option.
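A sketch of pandas.cut using made-up age data and bin edges:

import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)                # intervals such as (18, 25], right-closed by default
print(cats)

print(pd.cut(ages, bins, right=False))   # left-closed intervals like [18, 25)

labels = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
print(pd.cut(ages, bins, labels=labels)) # custom labels instead of interval values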
Binning Using pandas qcut Function:

The qcut function in pandas is useful when you want to bin data based on quantiles, ensuring
an approximately equal number of data points in each bin.
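A sketch of pandas.qcut on randomly generated data:

import numpy as np
import pandas as pd

data = pd.Series(np.random.standard_normal(1000))

quartiles = pd.qcut(data, 4)     # four bins with roughly equal numbers of points
print(quartiles.value_counts())  # each bin holds about 250 observations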

Filtering outliers
Outliers are data points that deviate significantly from the rest of the data in a dataset. They
are observations that lie an abnormal distance from other values in a random sample from a
population.

Outliers can have a substantial impact on summary statistics such as the mean and standard
deviation. The presence of outliers can lead to inaccurate estimates of central tendency and
dispersion.

They may disproportionately affect parameter estimates, leading to biased predictions or
inaccurate model behavior.

Outliers can often be visually identified through data visualization techniques such as box
plots

For example, outliers can be filtered using percentiles.

Suppose 90% of the data is observed to be less than 0.98, while the maximum (100th percentile)
is 50708.00; this indicates the presence of an outlier somewhere in the last ten percent of the
data.

We can then further break down the last 10% of the data for a more detailed view. If this shows
an anomaly only in the last 1% of the data, the outlier lies in that last one percent.

We can then simply remove this 1% of the data and work with the remaining 99% of the data,
which is free of outliers.
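A hedged sketch of this percentile-based filtering on synthetic data (the injected extreme values are only illustrative):

import numpy as np
import pandas as pd

# Synthetic column: mostly small values plus a few injected extremes
s = pd.Series(np.random.exponential(scale=0.3, size=1000))
s.iloc[:5] = [5000.0, 12000.0, 30000.0, 45000.0, 50708.0]

# Inspect the upper percentiles to see where the values jump abruptly
print(s.quantile([0.5, 0.9, 0.99, 1.0]))

# Drop the top 1% and keep the remaining 99% of the data
cutoff = s.quantile(0.99)
filtered = s[s < cutoff]
print(filtered.describe())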

Computing Dummy variables


Dummy variables, also known as indicator variables or binary variables, are used in statistical
modeling, particularly in the context of regression analysis, to represent categorical data. These
variables are binary, taking on values of 0 or 1, and are used to encode categorical information
into a format that can be easily incorporated into models.
Many statistical models, such as linear regression, require numerical inputs. Dummy variables
allow the representation of categorical variables (e.g., gender, region, product type) as binary
values.

For a categorical variable with more than two categories, such as "Region" (North, South, East,
West), dummy variables are created for each category:

North (N): 1 if the individual is in the North region, 0 otherwise.


South (S): 1 if the individual is in the South region, 0 otherwise.
East (E): 1 if the individual is in the East region, 0 otherwise.
West (W): 1 if the individual is in the West region, 0 otherwise.

In general, for a categorical variable with n categories, n-1 dummy variables are created. This
avoids the "dummy variable trap" (multicollinearity), a situation where perfect
multicollinearity occurs because one dummy variable can be predicted from the others. In
pandas this can be done by passing drop_first=True to get_dummies().
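A sketch using pandas.get_dummies with a hypothetical Region column:

import pandas as pd

df = pd.DataFrame({"Region": ["North", "South", "East", "West", "North"]})

# One indicator column per category (boolean in recent pandas; dtype=int gives 0/1)
print(pd.get_dummies(df["Region"], dtype=int))

# n-1 columns, dropping the first category to avoid the dummy variable trap
print(pd.get_dummies(df["Region"], drop_first=True, dtype=int))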

Group BY: Split, Apply, Combine


Categorizing a dataset and applying a function to each group, whether an aggregation or
transformation, can be a critical component of a data analysis workflow. Pandas provides a
versatile groupby() interface, enabling you to slice, dice, and summarize datasets in a natural
way.
We can perform quite complex group operations by expressing them as custom Python
functions that manipulate the data associated with each group.
 Split a pandas object into pieces using one or more keys (in the form of functions,
arrays, or DataFrame column names)

 Calculate group summary statistics, like count, mean, or standard deviation, or a
user-defined function

 Apply within-group transformations or other manipulations, like normalization, linear
regression, rank, or subset selection

Hadley Wickham coined the term split-apply-combine for describing group operations. In the
first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or
otherwise, is split into groups based on one or more keys that you provide. The splitting is
performed on a particular axis of an object. For example, a DataFrame can be grouped on its
rows (axis="index") or its columns (axis="columns"). Once this is done, a function is applied
to each group, producing a new value. Finally, the results of all those function applications
are combined into a result object. The form of the resulting object will usually depend on
what’s being done to the data.

For example, we can compute multiple aggregation functions of a particular column grouped
by another column. You can also apply different aggregate functions to different columns by
specifying a dictionary of column-to-function mappings. This allows you to customize the
aggregation for each column within each group.
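A sketch of both ideas, using made-up dept, salary and bonus columns:

import pandas as pd

df = pd.DataFrame({
    "dept":   ["Sales", "Sales", "HR", "HR", "IT"],
    "salary": [50000, 62000, 45000, 47000, 70000],
    "bonus":  [5000, 7000, 3000, 3500, 9000],
})

# Several aggregation functions on one column, grouped by another column
print(df.groupby("dept")["salary"].agg(["count", "mean", "std"]))

# Different aggregate functions for different columns
print(df.groupby("dept").agg({"salary": "mean", "bonus": ["min", "max"]}))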

Pivot Tables and cross-tabulation


A pivot table is a data summarization tool frequently found in spreadsheet programs and
other data analysis software. It aggregates a table of data by one or more keys, arranging the
data in a rectangle with some of the group keys along the rows and some along the columns.
DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table
function.

Some key arguments of pivot_table:


Argument       Description
values         Column name or names to aggregate; by default, aggregates all numeric columns
index          Column names or other group keys to group on the rows of the resulting pivot table
columns        Column names or other group keys to group on the columns of the resulting pivot table
aggfunc        Aggregation function or list of functions ("mean" by default); can be any function valid in a groupby context
fill_value     Replace missing values in the result table
dropna         If True, do not include columns whose entries are all NA
margins        Add row/column subtotals and grand total (False by default)
margins_name   Name to use for the margin row/column labels when passing margins=True; defaults to "All"

Example:

You can use multi-index grouping in pivot_table and apply the aggregates.
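A sketch with hypothetical day, smoker, tip and size columns:

import pandas as pd

df = pd.DataFrame({
    "day":    ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "tip":    [2.5, 3.0, 2.0, 3.5, 4.0, 5.0],
    "size":   [2, 3, 2, 4, 3, 5],
})

# Mean tip and party size; rows grouped by day, columns by smoker,
# with margins=True adding the "All" subtotals
print(df.pivot_table(values=["tip", "size"], index="day",
                     columns="smoker", aggfunc="mean", margins=True))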

Cross-Tabulation:
A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes
group frequencies. Here is an example:
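A sketch with hypothetical gender and handedness columns:

import pandas as pd

df = pd.DataFrame({"gender":     ["M", "F", "F", "M", "F"],
                   "handedness": ["Right", "Left", "Right", "Right", "Left"]})

# Frequency of each gender/handedness combination, with "All" totals
print(pd.crosstab(df["gender"], df["handedness"], margins=True))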
Arguments :
index : array-like, Series, or list of arrays/Series, Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series, Values to group by in the columns.
values : array-like, optional, array of values to aggregate according to the factors. Requires
`aggfunc` be specified.
rownames : sequence, default None, If passed, must match number of row arrays passed.
colnames : sequence, default None, If passed, must match number of column arrays passed.
aggfunc : function, optional, If specified, requires `values` be specified as well.
margins : bool, default False, Add row/column margins (subtotals).
margins_name : str, default ‘All’, Name of the row/column that will contain the totals when
margins is True.
dropna : bool, default True, Do not include columns whose entries are all NaN.

Pivot_table vs Cross-tabulation:
Functionality: pivot_table is more versatile; it handles complex scenarios and missing values
and supports multi-level indices. crosstab is specialized for creating contingency tables and
quick summaries of counts, with simpler functionality.
Function call: df.pivot_table(...) versus pd.crosstab(...).
Use cases: pivot_table suits aggregating numerical data based on multiple criteria, handling
missing values, and advanced reshaping. crosstab suits counting frequencies of combinations
of two categorical variables and exploring relationships between categorical variables.
Syntax: pivot_table has more parameters and supports multi-level indices; crosstab has a
simpler syntax focused on counts and frequencies.
Flexibility: pivot_table offers more flexibility and multiple aggregation functions and handles
missing values; crosstab is simpler and focused on creating contingency tables.
Complexity: pivot_table is more complex and suitable for a wide range of scenarios; crosstab
is simpler and more straightforward for specific use cases.
String Manipulation: String Object Methods
Python has long been a popular raw data manipulation language in part due to its ease of use
for string and text processing. Most text operations are made simple with the string object’s
built-in methods.
Method Description
capitalize() Converts the first character to upper case
endswith() Returns true if the string ends with the specified value
find() Search the string for a specified value and returns the position where it is found
isalnum() Returns True if all characters in the string are alphanumeric
isalpha() Returns True if all characters in the string are in the alphabet
isdigit() Returns True if all characters in the string are digits
islower() Returns True if all characters in the string are lower case
isnumeric() Returns True if all characters in the string are numeric
isspace() Returns True if all characters in the string are whitespaces
istitle() Returns True if the string follows the rules of a title
isupper() Returns True if all characters in the string are upper case
lower() Converts a string into lower case
replace() Returns a string where a specified value is replaced with another specified value
split() Splits the string at the specified separator, and returns a list
splitlines() Splits the string at line breaks and returns a list
startswith() Returns true if the string starts with the specified value
strip() Returns a trimmed version of the string
swapcase() Swaps cases, lower case becomes upper case and vice versa
title() Converts the first character of each word to upper case
upper() Converts a string into upper case
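A few of these methods in action on a sample string:

s = "data wrangling with python"

print(s.capitalize())                # 'Data wrangling with python'
print(s.title())                     # 'Data Wrangling With Python'
print(s.upper())                     # 'DATA WRANGLING WITH PYTHON'
print(s.split(" "))                  # ['data', 'wrangling', 'with', 'python']
print(s.replace("python", "pandas")) # 'data wrangling with pandas'
print(s.startswith("data"))          # True
print("  padded  ".strip())          # 'padded'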

Regular expressions
Regular expressions provide a flexible way to search or match (often more complex) string
patterns in text. A single expression, commonly called a regex, is a string formed according
to the regular expression language. Python’s built-in re module is responsible for applying
regular expressions to strings.

Metacharacters are characters with a special meaning:


Character Description Example
[] A set of characters "[a-m]"
\ Signals a special sequence (can also be used to escape special "\d"
characters)
. Any character (except newline character) "he..o"
^ Starts with "^hello"
$ Ends with "planet$"
* Zero or more occurrences "he.*o"
+ One or more occurrences "he.+o"
? Zero or one occurrence "he.?o"
{} Exactly the specified number of occurrences "he.{2}o"
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a
special meaning:

Character Description Example


\d Returns a match where the string contains digits (numbers "\d"
from 0-9)
\D Returns a match where the string DOES NOT contain digits "\D"
\s Returns a match where the string contains a white space "\s"
character
\S Returns a match where the string DOES NOT contain a white "\S"
space character
\w Returns a match where the string contains any word characters "\w"
(characters from a to Z, digits from 0-9, and the underscore _
character)
\W Returns a match where the string DOES NOT contain any "\W"
word characters

RegEx Functions
The re module offers a set of functions that allows us to search a string for a match:
Function Description
findall Returns a list containing all matches
search Returns a Match object if there is a match anywhere in the string
split Returns a list where the string has been split at each match
sub Replaces one or many matches with a string

Example:
1.) re.findall()
The re.findall() method returns a list of strings containing all matches.
If the pattern is not found, re.findall() returns an empty list.
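A short illustration (the text and patterns are arbitrary):

import re

text = "The order shipped 12 items on day 305 of 2024"
print(re.findall(r"\d+", text))   # ['12', '305', '2024']
print(re.findall(r"xyz", text))   # [] -- empty list when nothing matches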

2.) re.split()
The re.split method splits the string where there is a match and returns a list of strings where
the splits have occurred.
If the pattern is not found, re.split() returns a list containing the original string.
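A short illustration with arbitrary strings:

import re

print(re.split(r"\s+", "one   two\tthree"))   # ['one', 'two', 'three']
print(re.split(r";", "no delimiter here"))    # ['no delimiter here']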
3.) re.sub()
The syntax of re.sub() is:

re.sub(pattern, replace, string)

The method returns a string where the matched occurrences are replaced with the content of
the replace argument.
If the pattern is not found, re.sub() returns the original string.
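A short illustration with arbitrary strings:

import re

print(re.sub(r"\s+", "-", "collapse   the   spaces"))  # 'collapse-the-spaces'
print(re.sub(r"xyz", "-", "nothing to replace"))        # original string unchanged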

4.) re.search()
The re.search() method takes two arguments: a pattern and a string. The method looks for the
first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None.
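A short illustration with an arbitrary pattern and string:

import re

match = re.search(r"\d{4}", "Released in 2014, remastered in 2021")
if match:
    print(match.group())   # '2014' -- only the first matching location is returned
else:
    print("no match")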
Vectorized string functions in pandas
Pandas provides a set of vectorized string functions that allow you to efficiently perform operations on entire
columns of string data without needing to loop through each element. These functions are part of the .str
accessor in pandas and are designed to handle strings in a vectorized manner, taking advantage of optimized
underlying implementations and providing better performance than element-wise operations or traditional
Python string methods.

Vectorized string functions gracefully handle missing (NaN) values in a DataFrame, preventing errors that
might arise from attempting to operate on missing data using traditional string methods.
Nearly all of Python's built-in string methods are mirrored by a pandas vectorized string method. Here is a
list of pandas str methods that mirror Python string methods:

len() lower() translate() islower()


ljust() upper() startswith() isupper()
rjust() find() endswith() isnumeric()
center() rfind() isalnum() isdecimal()
zfill() index() isalpha() split()
strip() rindex() isdigit() rsplit()
rstrip() capitalize() isspace() partition()
lstrip() swapcase() istitle() rpartition()

Examples:
1.) .str.lower() and .str.upper(): Convert strings to lowercase or uppercase.

2.) .str.len(): Compute the length of each string.


3.) .str.startswith() and .str.endswith(): Check if a string starts or ends with a specified substring.

4.) .str.split(): Split a string into a list of substrings based on a specified delimiter.
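A sketch of the four operations above, on a hypothetical Series of names (note how the missing value passes through as NaN rather than raising an error):

import pandas as pd

names = pd.Series(["Alice Smith", "Bob Jones", None, "Carol White"])

print(names.str.lower())           # lowercase; the missing value stays NaN
print(names.str.upper())           # uppercase
print(names.str.len())             # length of each string
print(names.str.startswith("A"))   # element-wise True/False
print(names.str.endswith("e"))
print(names.str.split(" "))        # list of substrings per element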

High-performance Pandas: eval() and query()


eval() and query() are two features in pandas designed to enhance the performance of certain
operations, especially for large DataFrames. They provide a way to perform expression
evaluation and filtering in a more efficient manner.

eval(): High-Performance Expression Evaluation

The eval() function allows you to efficiently evaluate string expressions containing pandas
objects as variables. It is particularly useful for performing arithmetic and boolean operations
on large DataFrames.

eval() performs the expression evaluation using a highly optimized engine, which can lead to
significant performance improvements for large datasets.
It provides a concise and readable syntax for expressing operations.
query(): High-Performance DataFrame Querying
The query() method allows you to filter a DataFrame using a query expression written in a
query language that resembles SQL WHERE clauses.

query() is optimized for querying large DataFrames, providing faster performance compared
to traditional boolean indexing. It allows you to express filtering conditions in a readable and
SQL-like syntax.

Both eval() and query() are designed for performance improvements, especially when
dealing with large datasets. They leverage optimized engines for expression evaluation and
filtering.

Use Cases:
Use eval() for performing element-wise operations on columns.
Use query() for filtering rows based on conditions.
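A hedged sketch of both on a large, randomly generated DataFrame (the column names a, b and c are arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 3), columns=["a", "b", "c"])

# eval(): element-wise arithmetic written as a string expression
df["d"] = df.eval("a + b * c")

# query(): filter rows with an SQL-WHERE-like expression
subset = df.query("a > 0.5 and b < 0.2")

# Equivalent boolean indexing, typically slower and more memory hungry on large frames
subset2 = df[(df["a"] > 0.5) & (df["b"] < 0.2)]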
