Unit 2
Data Wrangling
• Data Wrangling is the process of transforming data from its original "raw" form into a more
digestible format and organizing sets from various sources into a singular coherent whole for
further processing.
• The primary purpose of data wrangling is to get data into a coherent shape. In other
words, it is making raw data usable and providing a sound basis for further processing.
• Data wrangling is the process of cleaning, structuring and enriching raw data into a desired
format for better decision making in less time.
Data cleaning is important for the following reasons:
Completeness: Data cleaning addresses missing values in a dataset. Missing data can
lead to biased analyses and incorrect conclusions. By filling in or appropriately
handling missing values, data cleaning ensures that the dataset is more complete.
Relevance: Cleaning data involves identifying and removing irrelevant or unnecessary
information. This improves the overall efficiency of data analysis by focusing on the
most relevant variables and reducing noise in the dataset.
Conformity to Standards: Data cleaning helps ensure that the dataset adheres to
specific standards and formatting conventions. This is particularly important when
combining data from different sources or when preparing data for specific applications
or systems.
Improved Data Integration: When combining data from multiple sources, data
cleaning is essential to align different datasets. This includes handling variations in
naming conventions, units of measurement, and other discrepancies that may arise
when integrating diverse datasets.
Increased Productivity: Working with clean data is more efficient. Analysts and data
scientists spend less time troubleshooting and addressing data issues, allowing them to
focus on deriving meaningful insights and adding value to the organization.
Failing to clean data has the following consequences:
Biased Conclusions: Data that is not cleaned may have biases due to missing values,
outliers, or inconsistencies. Biased data can result in skewed conclusions, leading to
inaccurate assessments and predictions.
Data Integration Challenges: When combining data from multiple sources, the lack
of cleaning can make it challenging to integrate datasets. Incompatible formats, units,
or naming conventions may hinder the merging of data, leading to inconsistencies and
errors.
Operational Inefficiencies: Uncleaned data may require more time and effort to
handle during analysis. Analysts may spend a significant portion of their time
troubleshooting issues, rather than focusing on deriving meaningful insights and value
from the data.
Missed Opportunities: Valuable insights may be overlooked or missed if data is not
cleaned. Inaccurate or incomplete data can obscure important trends and opportunities
that could otherwise be leveraged for strategic decision-making.
DataFrames
The majority of data analysis in Python is performed in pandas DataFrames. These
are rectangular datasets consisting of rows and columns.
An observation contains all the values or variables related to a single instance of the
objects being analyzed. For example, in a dataset of movies, each movie would be an
observation.
A variable is an attribute for the object, across all the observations. For example, the
release dates for all the movies.
Tidy data provides a standard way to organize data. Having a consistent shape for
datasets enables you to worry less about data structures and more on getting useful
results. The principles of tidy data are:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
Pandas features a number of functions for reading tabular data as a DataFrame object.
The following table summarizes some of them; pandas.read_csv() is one of the most
frequently used.
Function         Description
read_csv         Load delimited data from a file, URL, or file-like object; uses comma as the default delimiter
read_excel       Read tabular data from an Excel XLS or XLSX file
read_html        Read all tables found in the given HTML document
read_json        Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object
read_spss        Read a data file created by SPSS
read_sql         Read the results of a SQL query
read_sql_table   Read a whole SQL table; equivalent to using a query that selects everything in that table with read_sql
read_xml         Read a table of data from an XML file
Example:
import pandas as pd
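# A minimal sketch; the filename "examples.csv" and its contents are assumed:
df = pd.read_csv("examples.csv")  # comma is the default delimiter
df.head()                         # inspect the first few rows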
Handling Missing Data
The dropna() method filters out missing data, dropping any row that contains an NA
value. Passing how="all" will drop only rows that are all NA.
These functions return new objects by default and do not modify the contents of the original
object. To make the changes in the original DataFrame, set inplace=True.
To drop columns in the same way, pass axis=1, as in the sketch below:
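A minimal sketch of dropna, with hypothetical data values:

import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.0]])
data.dropna()                         # drops every row containing at least one NA
data.dropna(how="all")                # drops only the row where all values are NA
data.dropna(axis=1, how="all")        # the same logic applied to columns
data.dropna(how="all", inplace=True)  # modifies data itself and returns None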
Removing duplicates
Duplicates can distort analyses and lead to inaccurate results. Python's pandas library
provides convenient methods for identifying and removing duplicate rows from a
DataFrame.
Identify and Display Duplicates:
The DataFrame method duplicated() returns a Boolean Series indicating whether each row is
a duplicate (its column values are exactly equal to those in an earlier row) or not:
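A minimal sketch, using a hypothetical DataFrame whose last row repeats the one before it:

import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})
data.duplicated()  # False for rows 0 through 5, True for row 6 (a repeat of row 5)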
Remove Duplicates:
drop_duplicates returns a DataFrame keeping only the rows for which duplicated() is False;
that is, the duplicate rows are filtered out.
By default this method considers all of the columns; alternatively, you can specify any subset
of them to detect duplicates. Suppose we had an additional column of values and wanted to
filter duplicates based only on the "k1" column.
duplicated and drop_duplicates by default keep the first observed value combination. Passing
keep="last" will return the last one, as in the sketch below:
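Continuing the hypothetical data DataFrame from the sketch above:

data["v1"] = range(7)                                   # an additional column of values
data.drop_duplicates(subset=["k1"])                     # keeps only rows 0 and 1 (first "one", first "two")
data.drop_duplicates(subset=["k1", "k2"], keep="last")  # keeps row 6 rather than row 5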
The Series map method accepts a function or a dictionary-like object containing a mapping
and applies it element-wise. Using map is a convenient way to perform element-wise
transformations and other data cleaning-related operations.
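A minimal sketch of map, with a hypothetical mapping of foods to source animals:

import pandas as pd

data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon", "pastrami"],
                     "ounces": [4, 3, 12, 6]})
meat_to_animal = {"bacon": "pig", "pulled pork": "pig", "pastrami": "cow"}
data["animal"] = data["food"].map(meat_to_animal)  # element-wise lookup in the dictionary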
Discretization and Binning
Discretization and binning are techniques used in data preprocessing to convert continuous
data into discrete intervals or bins. This process is often useful for simplifying complex data,
reducing noise, and preparing data for analysis.
The cut function in pandas is commonly used for discretization. It divides continuous data
into bins and assigns a discrete label to each bin.
The object pandas.cut returns is a special Categorical object. Its output describes the bins
computed by pandas.cut; each bin is identified by a special (unique to pandas) interval value
type containing the lower and upper limit of the bin.
In the string representation of an interval, a parenthesis means that the side is open
(exclusive), while a square bracket means it is closed (inclusive). You can change which side
is closed by passing right=False.
You can also override the default interval-based bin labeling by passing a list or array to the
labels option. All three variations appear in the sketch below:
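A minimal sketch using hypothetical ages and bin edges:

import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
pd.cut(ages, bins)               # intervals such as (18, 25] and (25, 35]
pd.cut(ages, bins, right=False)  # closes the left side instead: [18, 25)
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
pd.cut(ages, bins, labels=group_names)  # custom labels instead of intervals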
Binning Using pandas qcut Function:
The qcut function in pandas is useful when you want to bin data based on quantiles, ensuring
an approximately equal number of data points in each bin.
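A minimal sketch, binning synthetic normally distributed data into quartiles:

import numpy as np
import pandas as pd

data = np.random.standard_normal(1000)
quartiles = pd.qcut(data, 4, precision=2)  # four quantile-based bins
pd.Series(quartiles).value_counts()        # roughly 250 observations in each bin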
Filtering outliers
Outliers are data points that deviate significantly from the rest of the data in a dataset. They
are observations that lie an abnormal distance from other values in a random sample from a
population.
Outliers can have a substantial impact on summary statistics such as the mean and standard
deviation. The presence of outliers can lead to inaccurate estimates of central tendency and
dispersion.
Outliers can often be visually identified through data visualization techniques such as box
plots.
For example, we can inspect the distribution with custom percentiles and then further break
down the last 10% of the data for a more detailed view. If the output shows an anomaly in the
last 1% of the data, that indicates the presence of outliers in the last one percent.
We can then simply remove this 1% of the data and work with the remaining 99%, which is
free of outliers, as in the sketch below:
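A minimal sketch of this percentile-based trimming, with two extreme values appended to
synthetic data for illustration:

import numpy as np
import pandas as pd

np.random.seed(42)
values = pd.Series(np.append(np.random.standard_normal(1000), [40.0, 55.0]))
values.describe(percentiles=[0.90, 0.99])  # the gap between the 99th percentile and max exposes the anomaly
cutoff = values.quantile(0.99)
trimmed = values[values <= cutoff]         # keep the lowest 99% of the data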
Dummy Variables
A categorical variable can be converted into a set of indicator (dummy) columns of 0s and 1s.
For a categorical variable with more than two categories, such as "Region" (North, South, East,
West), dummy variables are created for each category.
In general, for a categorical variable with n categories, n-1 dummy variables are created. This
avoids the "dummy variable trap" (multicollinearity), a situation where perfect
multicollinearity occurs because one dummy variable can be predicted from the others. This
can be done by passing drop_first=True, as in the sketch below:
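A minimal sketch using pandas get_dummies on a hypothetical "Region" column:

import pandas as pd

df = pd.DataFrame({"Region": ["North", "South", "East", "West", "South"]})
pd.get_dummies(df["Region"])                   # one indicator column per category
pd.get_dummies(df["Region"], drop_first=True)  # n-1 columns, avoiding the dummy variable trap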
Group Operations
Grouping lets you calculate group summary statistics, like count, mean, or standard deviation,
or apply a user-defined function to each group.
Hadley Wickham coined the term split-apply-combine for describing group operations. In the
first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or
otherwise, is split into groups based on one or more keys that you provide. The splitting is
performed on a particular axis of an object. For example, a DataFrame can be grouped on its
rows (axis="index") or its columns (axis="columns"). Once this is done, a function is applied
to each group, producing a new value. Finally, the results of all those function applications
are combined into a result object. The form of the resulting object will usually depend on
what’s being done to the data.
Example:
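A minimal sketch of split-apply-combine with a hypothetical two-column DataFrame:

import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "A", "B", "A"],
                   "data": [1, 2, 3, 4, 5]})
grouped = df.groupby("key")["data"]  # split: rows grouped by the "key" column
grouped.mean()                       # apply and combine: one mean per group (A: 3.0, B: 3.0)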
You can also use multi-index grouping in pivot_table and apply aggregates, as in the sketch
below:
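A minimal sketch with hypothetical sales data:

import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South", "South", "North"],
                      "product": ["A", "B", "A", "B", "A"],
                      "amount": [100, 150, 200, 250, 120]})
sales.pivot_table(values="amount", index=["region", "product"], aggfunc="mean")  # multi-index grouping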
Cross-Tabulation:
A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes
group frequencies. Here is an example:
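A minimal sketch with hypothetical survey data:

import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "M", "F", "M"],
                   "handedness": ["Right", "Left", "Right", "Right", "Left"]})
pd.crosstab(df["gender"], df["handedness"], margins=True)  # frequency counts plus "All" subtotals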
Arguments:
index : array-like, Series, or list of arrays/Series. Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series. Values to group by in the columns.
values : array-like, optional. Array of values to aggregate according to the factors; requires
aggfunc to be specified.
rownames : sequence, default None. If passed, must match the number of row arrays passed.
colnames : sequence, default None. If passed, must match the number of column arrays passed.
aggfunc : function, optional. If specified, requires values to be specified as well.
margins : bool, default False. Add row/column margins (subtotals).
margins_name : str, default 'All'. Name of the row/column that will contain the totals when
margins is True.
dropna : bool, default True. Do not include columns whose entries are all NaN.
Pivot_table vs Cross-tabulation:
Feature Pivot Table Crosstabulation (crosstab)
Functionality More versatile, handles complex Specialized for creating contingency
scenarios, handles missing values, tables, quick summary of counts,
supports multi-level indices simpler functionality
Function df.pivot_table(...) pd.crosstab(...)
Call
Use Cases Aggregating numerical data based Counting frequencies of
on multiple criteria, handling combinations of two categorical
missing values, advanced variables, exploring relationships
reshaping between categorical variables
Syntax More parameters, supports multi- Simpler syntax, focused on counts
level indices and frequencies
Flexibility More flexibility, multiple Simpler and focused on creating
aggregation functions, handles contingency tables
missing values
Complexity More complex, suitable for a wide Simpler and more straightforward for
range of scenarios specific use cases
String Manipulation: String Object Methods
Python has long been a popular raw data manipulation language in part due to its ease of use
for string and text processing. Most text operations are made simple with the string object’s
built-in methods.
Method Description
capitalize() Converts the first character to upper case
endswith() Returns True if the string ends with the specified value
find() Searches the string for a specified value and returns the position of the first occurrence (-1 if not found)
isalnum() Returns True if all characters in the string are alphanumeric
isalpha() Returns True if all characters in the string are in the alphabet
isdigit() Returns True if all characters in the string are digits
islower() Returns True if all characters in the string are lower case
isnumeric() Returns True if all characters in the string are numeric
isspace() Returns True if all characters in the string are whitespaces
istitle() Returns True if the string follows the rules of a title
isupper() Returns True if all characters in the string are upper case
lower() Converts a string into lower case
replace() Returns a string where a specified value is replaced with another specified value
split() Splits the string at the specified separator, and returns a list
splitlines() Splits the string at line breaks and returns a list
startswith() Returns True if the string starts with the specified value
strip() Returns a trimmed version of the string
swapcase() Swaps cases, lower case becomes upper case and vice versa
title() Converts the first character of each word to upper case
upper() Converts a string into upper case
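A short sketch exercising a few of these methods on an assumed string:

s = "data wrangling with python"
s.capitalize()                 # 'Data wrangling with python'
s.title()                      # 'Data Wrangling With Python'
s.split(" ")                   # ['data', 'wrangling', 'with', 'python']
s.replace("python", "pandas")  # 'data wrangling with pandas'
s.startswith("data")           # True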
Regular expressions
Regular expressions provide a flexible way to search or match (often more complex) string
patterns in text. A single expression, commonly called a regex, is a string formed according
to the regular expression language. Python’s built-in re module is responsible for applying
regular expressions to strings.
RegEx Functions
The re module offers a set of functions that allows us to search a string for a match:
Function Description
findall Returns a list containing all matches
search Returns a Match object if there is a match anywhere in the string
split Returns a list where the string has been split at each match
sub Replaces one or many matches with a string
Example:
1.) re.findall()
The re.findall() method returns a list of strings containing all matches.
If the pattern is not found, re.findall() returns an empty list.
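A minimal sketch, with an assumed pattern that matches runs of digits:

import re

re.findall(r"\d+", "abc 123 def 456")  # ['123', '456']
re.findall(r"\d+", "no digits here")   # [] (an empty list, since nothing matches)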
2.) re.split()
The re.split method splits the string where there is a match and returns a list of strings where
the splits have occurred.
If the pattern is not found, re.split() returns a list containing the original string.
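A minimal sketch, splitting an assumed string wherever digits occur:

import re

re.split(r"\d+", "one1two22three333four")  # ['one', 'two', 'three', 'four']
re.split(r"\d+", "no digits here")         # ['no digits here'] (original string in a list)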
3.) re.sub()
The syntax of re.sub() is: re.sub(pattern, replace, string)
The method returns a string where matched occurrences are replaced with the content of the
replace variable.
If the pattern is not found, re.sub() returns the original string.
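A minimal sketch, removing all whitespace from an assumed string:

import re

text = "abc 12 de 23\nf45 6"
re.sub(r"\s+", "", text)   # 'abc12de23f456': every whitespace run replaced
re.sub(r"xyz", "-", text)  # no match, so the original string is returned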
4.) re.search()
The re.search() method takes two arguments: a pattern and a string. The method looks for the
first location where the RegEx pattern produces a match with the string.
If the search is successful, re.search() returns a match object; if not, it returns None.
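A minimal sketch, with an assumed pattern and string:

import re

match = re.search(r"\d+", "Order number 3425 shipped")
if match:
    print(match.group())  # '3425', the first match in the string
    print(match.start())  # 13, the index where the match begins
else:
    print("No match found")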
Vectorized string functions in pandas
Pandas provides a set of vectorized string functions that allow you to efficiently perform operations on entire
columns of string data without needing to loop through each element. These string functions are part of the
.str accessor in pandas.
These functions are designed to handle strings in a vectorized manner, taking advantage of
optimized underlying implementations and providing better performance than looping element
by element or using traditional Python string methods.
Vectorized string functions gracefully handle missing (NaN) values in a DataFrame, preventing errors that
might arise from attempting to operate on missing data using traditional string methods.
Nearly all of Python's built-in string methods are mirrored by a pandas vectorized string method.
Pandas str methods that mirror Python string methods include: len(), lower(), upper(),
capitalize(), title(), swapcase(), strip(), lstrip(), rstrip(), split(), replace(), find(),
startswith(), endswith(), isalnum(), isalpha(), isdigit(), isspace(), islower(), and isupper().
Examples:
1.) .str.lower() and .str.upper(): Convert strings to lowercase or uppercase.
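A minimal sketch with a hypothetical Series (note that the missing value passes through as NaN):

import pandas as pd

names = pd.Series(["Alice", "BOB", None, "carol"])
names.str.lower()  # alice, bob, NaN, carol
names.str.upper()  # ALICE, BOB, NaN, CAROL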
2.) .str.split(): Split a string into a list of substrings based on a specified delimiter.
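A minimal sketch with hypothetical comma-separated values:

import pandas as pd

parts = pd.Series(["a,b,c", "d,e", None])
parts.str.split(",")               # [a, b, c], [d, e], NaN
parts.str.split(",", expand=True)  # spreads the pieces into separate DataFrame columns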
eval(): Efficient Expression Evaluation
The pandas eval() function allows you to efficiently evaluate string expressions containing
pandas objects as variables. It is particularly useful for performing arithmetic and boolean
operations on large DataFrames.
eval() performs the expression evaluation using a highly optimized engine, which can lead to
significant performance improvements for large datasets.
It provides a concise and readable syntax for expressing operations.
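A minimal sketch with two hypothetical DataFrames of random numbers:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.random.rand(1000, 3), columns=["a", "b", "c"])
result = pd.eval("df1 + df2")            # same result as df1 + df2, via the optimized engine
df1.eval("d = a + b * c", inplace=True)  # arithmetic on columns, referenced by name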
query(): High-Performance DataFrame Querying
The query() method allows you to filter a DataFrame using a query expression written in a
query language that resembles SQL WHERE clauses.
query() is optimized for querying large DataFrames, providing faster performance compared
to traditional boolean indexing. It allows you to express filtering conditions in a readable and
SQL-like syntax.
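A minimal sketch filtering a hypothetical DataFrame:

import pandas as pd

people = pd.DataFrame({"age": [25, 32, 47, 51],
                       "city": ["Pune", "Delhi", "Pune", "Mumbai"]})
people.query("age > 30 and city == 'Pune'")
# equivalent to: people[(people["age"] > 30) & (people["city"] == "Pune")]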
Both eval() and query() are designed for performance improvements, especially when
dealing with large datasets. They leverage optimized engines for expression evaluation and
filtering.
Use Cases:
Use eval() for performing element-wise operations on columns.
Use query() for filtering rows based on conditions.