DAP Module4 Notes

Module 4 of the Data Analytics using Python course covers data loading and wrangling techniques, including merging datasets, reshaping data with hierarchical indexing, and reading text data using Pandas. It also discusses handling missing data with sentinel values, data transformation mechanisms, discretization, and pattern matching using regular expressions. Additionally, it provides practical examples of data cleaning and manipulation, as well as creating visualizations with Matplotlib.


Module 4 [20MCA31] Data Analytics using Python

Module 4
Data Loading and Data Wrangling

1. Explain merging of datasets for the following situations:

i) Many to one   ii) Many to many
Merge or join operations combine data sets by linking rows using one or more keys. The
merge function in pandas is used to combine datasets.

Let’s start with a simple example:

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df1
df2

• The example below shows a many-to-one merge situation: the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column.

pd.merge(df1, df2)               # performs inner join on the common 'key' column
pd.merge(df1, df2, on='key')     # same, with the join column named explicitly
pd.merge(df1, df2, how='outer')  # performs outer join

Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection. The outer join takes the
union of the keys, combining the effect of applying both left and right joins.
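The effect of the two join types can be checked with a short runnable sketch, reusing the df1 and df2 defined above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

inner = pd.merge(df1, df2, on='key')               # intersection of keys
outer = pd.merge(df1, df2, on='key', how='outer')  # union of keys

print(sorted(inner['key'].unique()))  # ['a', 'b']
print(sorted(outer['key'].unique()))  # ['a', 'b', 'c', 'd']
```

Rows with keys 'c' and 'd' appear in the outer result with NaN in the columns that have no match.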

ROOPA.H.M, Dept of MCA, RNSIT Page 1



• The example below shows a many-to-many merge:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],'data2': range(5)})

df1
df2

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in
the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method
only affects the distinct key values appearing in the result:

pd.merge(df1, df2, how='inner')
pd.merge(df1, df2, on='key', how='left')

2. Describe reshaping with hierarchical indexing with suitable examples.


Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.
There are two primary actions:
• stack: this “rotates” or pivots from the columns in the data to the rows
• unstack: this pivots from the rows into the columns

Consider a small DataFrame with string arrays as row and column indexes:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'], name='number'))
data


• Using the stack method on this data pivots the columns into the rows, producing a Series:

result = data.stack()
result
result.unstack()

The data can be rearranged back into a DataFrame with unstack().
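The round trip can be verified with a short sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

result = data.stack()        # Series with a MultiIndex (state, number)
restored = result.unstack()  # pivots 'number' back into the columns

print(restored.equals(data))  # True
```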

3. What are the different functions available in Pandas library to read text or tabular
data? Give examples.
Pandas features a number of functions for reading tabular data as a DataFrame object. The most commonly used ones include:
• read_csv: load delimited data from a file, URL, or file-like object (comma is the default delimiter)
• read_table: load delimited data (tab, '\t', is the default delimiter)
• read_fwf: read data in fixed-width column format (no delimiters)
• read_clipboard: version of read_table that reads data from the clipboard
• read_excel: read tabular data from an Excel file

read_csv and read_table are the most frequently used functions.

Let’s start with a small comma-separated (CSV) text file, ex1.csv:

df = pd.read_csv('ex1.csv')
pd.read_table('ex1.csv', sep=',')   # equivalent, with the delimiter given explicitly

Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If the file uses a different delimiter, read_table can be used with the delimiter passed via the sep argument.
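Since the contents of ex1.csv are not reproduced here, a self-contained sketch using equivalent hypothetical data read from an in-memory buffer:

```python
import pandas as pd
from io import StringIO

# hypothetical stand-in for ex1.csv
csv_text = "a,b,c,message\n1,2,3,hello\n4,5,6,world\n"

df = pd.read_csv(StringIO(csv_text))
print(df.shape)          # (2, 4)
print(list(df.columns))  # ['a', 'b', 'c', 'message']
```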


4. What are sentinel values? How can they be converted to NaN values?
Missing data is usually either not present at all (an empty string) or marked by some placeholder value, which is called a sentinel value.

By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and
NULL.

Let’s start with a small comma-separated (CSV) text file, Sample.csv:

result = pd.read_csv('Sample.csv')

• The na_values option can take either a list or set of strings to consider missing values:

result = pd.read_csv('Sample.csv', na_values=['NULL'])

• Different NA sentinels can be specified for each column in a dict:

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

pd.read_csv('Sample.csv', na_values=sentinels)
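A self-contained sketch of per-column sentinels, using hypothetical inline data in place of Sample.csv:

```python
import pandas as pd
from io import StringIO

# hypothetical stand-in for Sample.csv
csv_text = "something,a,b,message\none,1,2,foo\ntwo,5,6,NA\nthree,9,10,world\n"

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(StringIO(csv_text), na_values=sentinels)

print(result['message'].isna().sum())    # 2 ('foo' and 'NA' both treated as missing)
print(result['something'].isna().sum())  # 1 ('two' treated as missing)
```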

5. Describe the following data transformation mechanisms:


i) Removing duplicates ii) Filtering outliers iii) Replacing Values
i) Removing duplicates
Duplicate rows may be found in a DataFrame using the duplicated method, which returns a boolean Series indicating whether each row is a duplicate or not. Relatedly, drop_duplicates returns a DataFrame keeping only the rows for which the duplicated array is False.


import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

data
data.duplicated()        # boolean Series: True for rows 1, 4 and 6
data.drop_duplicates()   # rows 1, 4 and 6 are dropped

ii) Filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations.


Consider a DataFrame with some normally distributed data. (Note : while writing
answers, write your own random numbers between 0 and 1)

• Suppose we wanted to find values in one of the columns exceeding one in magnitude:

• To select all rows having a value exceeding 1 or -1, we can use the any method on a
boolean DataFrame:
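Since the snippets for this part are not reproduced above, a runnable sketch of both steps on synthetic normally distributed data:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))  # 1000 rows of normal data

# values in column 2 exceeding 1 in magnitude
col = data[2]
big = col[np.abs(col) > 1]

# all rows having a value exceeding 1 or -1 in any column
outliers = data[(np.abs(data) > 1).any(axis=1)]
print(len(outliers), "of", len(data), "rows contain an outlier")
```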


iii) Replacing Values

• Sometimes it is necessary to replace particular values with NaN or with other specific values. This can be done using the replace method. Let’s consider this Series:

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

• The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:

data.replace(-999, np.nan)

• To replace multiple values at once, pass a list of values followed by the single substitute value:

data.replace([-999, -1000], np.nan)

• To use a different replacement for each value, pass a list of substitutes:

data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})   # equivalently, as a dict


6. Write a short note on discretization and binning.


• Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose we have data about a group of people in a study, and we want to group them into
discrete age buckets:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do
so, we have to use cut, a function in pandas:

import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

• The object pandas returns is a special Categorical object. We can treat it like an array of strings indicating the bin name; internally it contains a categories array with the distinct interval names and a codes array labeling each age (in older pandas versions these attributes were called levels and labels):

cats.codes
cats.categories   # IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]])
pd.value_counts(cats)

Consistent with mathematical notation for intervals, a parenthesis means that the side is
open while the square bracket means it is closed (inclusive).
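The bins can also be given readable names with the labels option (the group names below are illustrative):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

cats = pd.cut(ages, bins, labels=group_names)
counts = pd.Series(cats).value_counts()
print(counts['Youth'], counts['Senior'])  # 5 1
```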

7. List and describe different functions used for pattern matching in re module with
example.
• Regular expressions provide a flexible way to search or match string patterns in text.

• A single expression, commonly called a regex, is a string formed according to the regular
expression language. Python’s built-in re module is responsible for applying regular

expressions to strings.

• The re module functions fall into three categories:


— splitting
— pattern matching
— substitution

Splitting

• To split a string with a variable number of whitespace characters (tabs, spaces, and newlines), use the regex r'\s+', which describes one or more whitespace characters:

import re
text = "good better\t best\t excellent"
re.split(r'\s+', text)   # ['good', 'better', 'best', 'excellent']

When we call re.split(r'\s+', text), the regular expression is first compiled, then its split method is called on the passed text.

• We can compile the regex ourselves with re.compile, forming a reusable regex object:

regex = re.compile(r'\s+')
regex.split(text)   # ['good', 'better', 'best', 'excellent']

Creating a regex object with re.compile is highly recommended if you intend to apply the
same expression to many strings; doing so will save CPU cycles.

Pattern matching
The re module offers a set of functions that allows us to search a string for a match. By
using these functions we can search required pattern. They are as follows:

• findall(): finds all substrings where the RE matches and returns them as a list. It scans the whole string and returns every non-overlapping occurrence of the pattern.

import re
# the email addresses here are illustrative placeholders
text = """Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
# ['steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

• match(): re.match() determines whether the RE matches at the beginning of the string. The method returns a match object if the search is successful; if not, it returns None. Since the text does not begin with an email address:

print(regex.match(text))
# None


• search(): the search() function scans the string for a match and returns a Match object if one is found. If there is more than one match, only the first occurrence is returned:

m = regex.search(text)
m
# <re.Match object; span=(6, 21), match='steve@gmail.com'>

Substitution

sub() will return a new string with occurrences of the pattern replaced by the new string:

print(regex.sub('Son', text))
# Steve Son
# Rob Son
# Ryan Son

8. Write a Python program to demonstrate i) importing datasets, ii) cleaning the data, iii) DataFrame manipulation using NumPy.

import pandas as pd
import numpy as np

# importing the dataset
cars_data = pd.read_csv("Toyota.csv")
cars_data.head()

#Cleaning the Data


cars_data2 =cars_data.copy()
cars_data3 =cars_data.copy()

cars_data2 = cars_data.drop(['Doors','Weight'],axis='columns')
cars_data2.shape

# identifying missing values(NaN -> Not a Number)


cars_data2.isna().sum()

#subsetting the rows that have one or more missing values


missing = cars_data2[cars_data2.isnull().any(axis=1)]
print(missing)

#To fill the numerical values


cars_data2.describe()

#calculating the mean value of the 'Age' variable


cars_data2['Age'].mean()

#To fill NA/NAN values using the specified value


cars_data2['Age'].fillna(cars_data2['Age'].mean(), inplace = True)


#imputing missing values of categorical variables


cars_data2['FuelType'].value_counts()

# to get the mode value of FuelType


cars_data2['FuelType'].value_counts().index[0]
cars_data2['FuelType'].fillna(cars_data2['FuelType'].value_counts().index[0], inplace = True)

cars_data2['MetColor'].fillna(cars_data2['MetColor'].mode()[0], inplace = True)

#Data frame manipulation using Numpy


print(cars_data.shape)
print(cars_data.index)
cars_data.ndim

9. Explain how text files can be read in pieces. Give examples.


• When processing very large files or figuring out the right set of arguments to correctly
process a large file, sometimes we may only want to read in a small piece of a file or iterate
through smaller chunks of the file.

• If you want to only read out a small number of rows (avoiding reading the entire file),
specify that with nrows:

• To read out a file in pieces, specify a chunksize as a number of rows:

• The TextParser object returned by read_csv allows us to iterate over the parts of the file
according to the chunksize.
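A self-contained sketch of both options, using a small in-memory CSV in place of a real file:

```python
import pandas as pd
from io import StringIO

# hypothetical 100-row CSV
csv_text = "key,value\n" + "".join(f"k{i},{i}\n" for i in range(100))

# read only the first 5 rows, avoiding reading the whole file
small = pd.read_csv(StringIO(csv_text), nrows=5)
print(small.shape)  # (5, 2)

# iterate through the file in chunks of 25 rows
chunks = pd.read_csv(StringIO(csv_text), chunksize=25)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [25, 25, 25, 25]
```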


Module 5

1. Explain how a simple line plot can be created using Matplotlib. Show the adjustments made to the plot w.r.t. line colors.
The simplest of all plots is the visualization of a single function y = f(x). Here we will create a simple line plot.

In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that holds all the objects representing axes, graphics, text, and labels. The axes (an instance of the class plt.Axes) is a bounding box with ticks and labels, which will eventually contain the plot elements that make up the visualization.

Alternatively, we can use the pylab interface, which creates the figure and axes in the
background. Ex: plt.plot(x, np.sin(x))
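A minimal runnable sketch of both styles:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# explicit figure and axes
fig = plt.figure()
ax = plt.axes()
ax.plot(x, np.sin(x))

# pylab-style shortcut: the figure and axes are created in the background
plt.figure()
plt.plot(x, np.sin(x))
```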

Adjusting the Plot: Line Colors

The plt.plot() function takes additional arguments that can be used to specify the color
keyword, which accepts a string argument representing virtually any imaginable color. The
color can be specified in a variety of ways.


2. What is Matplotlib? Mention its advantages.


Matplotlib is a multiplatform data visualization library built on NumPy arrays.

Advantages

• One of Matplotlib’s most important features is its ability to play well with many
operating systems and graphics backends. Matplotlib supports dozens of backends and
output types, which means you can count on it to work regardless of which operating
system you are using or which output format you wish. This cross-platform, everything-
to-everyone approach has been one of the great strengths of Matplotlib.

• It has led to a large userbase, which in turn has led to an active developer base and
Matplotlib’s powerful tools and ubiquity within the scientific Python world.

• Libraries such as pandas provide wrappers around Matplotlib’s API. Even with wrappers like these, it is still often useful to dive into Matplotlib’s syntax to adjust the final plot output.

3. Distinguish MATLAB style and object-oriented interfaces of Matplotlib.


MATLAB-style interface
• Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact.
• The MATLAB-style tools are contained in the pyplot (plt) interface.
• The interface is stateful: it keeps track of the “current” figure and axes, where all plt commands are applied. Once a second panel is created, going back and adding something to the first is complicated.

Object-oriented interface
• The object-oriented interface is available for more complicated situations, and for when we want more control over the figure.
• Rather than depending on some notion of an “active” figure or axes, the plotting functions are methods of explicit Figure and Axes objects.

4. Write the lines of code to create a simple histogram using matplotlib library.

A simple histogram can be useful in understanding a dataset. The code below creates a simple histogram.
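A minimal sketch (the data here is synthetic, drawn from a standard normal distribution):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)  # 1000 synthetic data points

plt.hist(data, bins=30, alpha=0.5, color='steelblue')
plt.xlabel('value')
plt.ylabel('count')
```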

5. What are the two ways to adjust axis limits of the plot using Matplotlib? Explain with the example
for each.

Matplotlib does a decent job of choosing default axis limits for your plot, but sometimes it’s nice to have finer control.

The two ways to adjust axis limits are:

• using plt.xlim() and plt.ylim() methods


• using plt.axis()

The plt.axis( ) method allows you to set the x and y limits with a single call, by passing a
list that specifies [xmin, xmax, ymin, ymax].
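Both approaches, sketched on a simple sine curve:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))

# Way 1: set each limit separately
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)

# Way 2: set all four limits in one call: [xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5])
```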

6. List out the dissimilarities between plot() and scatter() functions while plotting scatter plot.

• The difference between the two functions: with pyplot.plot(), any property you apply (color, shape, size of points) is applied across all points, whereas pyplot.scatter() gives more control over each point’s appearance. That is, with plt.scatter() the color, shape, and size of each dot (data point) can vary based on another variable.

• While it doesn’t matter as much for small amounts of data, as datasets get larger than a
few thousand points, plt.plot can be noticeably more efficient than plt.scatter. The reason is
that plt.scatter has the capability to render a different size and/or color for each point, so
the renderer must do the extra work of constructing each point individually. In plt.plot, on
the other hand, the points are always essentially clones of each other, so the work of
determining the appearance of the points is done only once for the entire set of data.

• For large datasets, the difference between these two can lead to vastly different performance,
and for this reason, plt.plot should be preferred over plt.scatter for large datasets.

7. How to customize the default plot settings of Matplotlib w.r.t runtime configuration
and stylesheets? Explain with the suitable code snippet.
• Each time Matplotlib loads, it defines a runtime configuration (rc) containing the default
styles for every plot element we create.

• We can adjust this configuration at any time using the plt.rc convenience routine.


• To modify the rc parameters, we’ll start by saving a copy of the current rcParams
dictionary, so we can easily reset these changes in the current session:

IPython_default = plt.rcParams.copy()

• Now we can use the plt.rc function to change some of these settings:
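A sketch of adjusting a few rc settings and restoring them afterwards (the particular values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# save the current settings so they can be restored later
IPython_default = plt.rcParams.copy()

# change some defaults with the plt.rc convenience routine
plt.rc('figure', figsize=(6, 4))
plt.rc('lines', linewidth=2)
plt.rc('axes', grid=True)

changed_lw = plt.rcParams['lines.linewidth']  # 2.0

# restore the saved defaults for the rest of the session
plt.rcParams.update(IPython_default)
```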

8. Elaborate on Seaborn versus Matplotlib with suitable examples.


The Seaborn library is built on top of Matplotlib. Here is a detailed comparison between the two:


Functionality
• Seaborn provides a variety of visualization patterns, uses less syntax, and has interesting default themes. It specializes in statistical visualization and is used to summarize data in visualizations and show the distribution of the data.
• Matplotlib is mainly deployed for basic plotting. Visualization using Matplotlib generally consists of bars, pies, lines, scatter plots and so on.

Handling multiple figures
• Seaborn automates the creation of multiple figures. This sometimes leads to OOM (out of memory) issues.
• In Matplotlib, multiple figures can be opened, but they need to be closed explicitly: plt.close() only closes the current figure, while plt.close('all') closes them all.

Visualization
• Seaborn is more integrated for working with Pandas data frames. It extends the Matplotlib library for creating beautiful graphics with Python using a more straightforward set of methods.
• Matplotlib is a graphics package for data visualization in Python, well integrated with NumPy and Pandas. The pyplot module mirrors the MATLAB plotting commands closely, so MATLAB users can easily transition to plotting with Python.

Data frames and arrays
• Seaborn works with the dataset as a whole and is much more intuitive than Matplotlib. relplot() is its entry API, with a 'kind' parameter to specify the type of plot, which could be line, scatter, or one of the other types. Seaborn is not stateful, so plotting functions require passing the data object.
• Matplotlib works with data frames and arrays and has stateful APIs for plotting. Figures and axes are represented by objects, so plot()-like calls without parameters suffice, without having to manage state explicitly.

Flexibility
• Seaborn avoids a ton of boilerplate by providing default themes which are commonly used.
• Matplotlib is highly customizable and powerful.

Use cases
• Seaborn is for more specific use cases; it is Matplotlib under the hood and is specially meant for statistical plotting.
• Pandas uses Matplotlib; it is a neat wrapper around Matplotlib.


Let us assume

x=[10,20,30,45,60]

y=[0.5,0.2,0.5,0.3,0.5]

Matplotlib (classic style):

import matplotlib.pyplot as plt
plt.style.use('classic')
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')

Seaborn (default theme):

import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')

9. List and describe different categories of colormaps with the suitable code snippets.
Three different categories of colormaps:

Sequential colormaps : These consist of one continuous sequence of colors


(e.g., binary or viridis).

Divergent colormaps : These usually contain two distinct colors, which show positive and
negative deviations from a mean (e.g., RdBu or PuOr).

Qualitative colormaps : These mix colors with no particular sequence (e.g., rainbow or jet).

import numpy as np
import matplotlib.pyplot as plt

# I is a 2-D image array; built here from a smooth function so the snippet is self-contained
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])

# add 1% random speckle noise to the image
speckles = (np.random.random(I.shape) < 0.01)
I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))

plt.figure(figsize=(10, 3.5))
plt.subplot(1, 2, 1)
plt.imshow(I, cmap='RdBu')
