DAP Module4 Notes
Module 4
Data Loading and Data Wrangling
import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df1
df2
• The example below shows a many-to-one merge; the data in df1 has multiple
rows labeled a and b, whereas df2 has only one row for each value in the key column.
Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection of the keys in both
tables. An 'outer' join takes the union of the keys, combining the effect of applying both left
and right joins.
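A minimal sketch of the inner and outer joins described above, using the df1 and df2 defined earlier. The variable names inner and outer are illustrative and not part of the notes.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Default how='inner': only keys present in both frames ('a' and 'b') survive.
inner = pd.merge(df1, df2, on='key')

# how='outer': union of the keys; cells with no match are filled with NaN.
outer = pd.merge(df1, df2, on='key', how='outer')

print(inner)
print(outer)
```

Passing on='key' explicitly is optional here, since 'key' is the only column the two frames share, but it makes the join column obvious to the reader.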
Many-to-many joins form the Cartesian product of the rows that share a key. Since there were
3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The
join method only affects the distinct key values appearing in the result:
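The row counts above can be checked with a small sketch. The two frames below are assumed for illustration (they are not given in the notes); they are chosen so that 'b' appears 3 times on the left and 2 times on the right.

```python
import pandas as pd

# Illustrative frames (an assumption, not taken from the notes).
left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
right = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

# Every left 'b' row pairs with every right 'b' row: 3 * 2 = 6 rows.
merged_left = pd.merge(left, right, on='key', how='left')
merged_inner = pd.merge(left, right, on='key', how='inner')

print(merged_left)
print(merged_inner)
```

The left join keeps the unmatched key 'c' while the inner join drops it; the number of 'b' rows is the same in both.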
Consider a small DataFrame with string arrays as row and column indexes:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
data
• Using the stack method on this data pivots the columns into the rows, producing a Series:
result = data.stack()
result
result.unstack()
3. What are the different functions available in Pandas library to read text or tabular
data? Give examples.
Pandas features a number of functions for reading tabular data as a DataFrame object.
Table below shows a summary of all of them:
Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If a file
uses a different delimiter, read_table can be used with the delimiter given in the sep argument.
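A short sketch of both functions. To keep it self-contained, the file contents are held in in-memory strings (an assumption standing in for files such as ex1.csv), and the pipe character is just one example of a non-comma delimiter.

```python
import io
import pandas as pd

# In-memory stand-ins for files on disk (illustrative data).
comma_text = "a,b,c\n1,2,3\n4,5,6\n"
pipe_text = "a|b|c\n1|2|3\n4|5|6\n"

# read_csv assumes a comma delimiter by default.
df_csv = pd.read_csv(io.StringIO(comma_text))

# read_table assumes tabs by default; sep overrides the delimiter.
df_pipe = pd.read_table(io.StringIO(pipe_text), sep='|')

print(df_csv)
print(df_pipe)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.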
4. What are sentinel values? How can they be converted to NaN values?
Missing data is usually either not present (empty string) or marked by some placeholder
value, which is called a sentinel value.
By default, pandas recognizes a set of commonly occurring sentinels, such as NA, -1.#IND, and
NULL.
• The na_values option can take either a list or a set of strings to treat as missing values:
pd.read_csv('Sample.csv', na_values=sentinels)
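As a hedged illustration of na_values, the sketch below reads a small in-memory CSV. The column names and the sentinel strings 'missing' and '-999' are assumptions made up for this example; they are not the contents of Sample.csv.

```python
import io
import pandas as pd

# Hypothetical file contents with two custom sentinels.
csv_text = "name,score\nasha,90\nravi,missing\nkiran,-999\n"

# Every cell equal to one of these strings is read as NaN.
sentinels = ['missing', '-999']
df = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

print(df)
print(df['score'].isna().sum())
```

na_values also accepts a dict mapping column names to sentinel lists when different columns use different markers.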
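data = pd.DataFrame(
    {'k1': ['one'] * 3 + ['two'] * 4,
     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
# Boolean Series: True for each row that repeats an earlier row
data.duplicated()
# DataFrame with the repeated rows removed
data.drop_duplicates()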
• Suppose we want to find the values in one of the columns that exceed one in magnitude:
• To select all rows having a value greater than 1 or less than -1, we can use the any method
on a boolean DataFrame:
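Both selections can be sketched as follows. The DataFrame and its column names are hypothetical, chosen only so that a few entries exceed one in magnitude.

```python
import pandas as pd

# Hypothetical data for illustration.
data = pd.DataFrame({'a': [0.5, -1.7, 0.2, 0.9],
                     'b': [1.2, 0.3, -0.4, 0.1],
                     'c': [-0.8, 0.6, 0.7, -2.5]})

# Values in a single column whose magnitude exceeds one.
col = data['a']
big_in_a = col[col.abs() > 1]

# Rows where at least one column exceeds one in magnitude:
# (data.abs() > 1) is a boolean DataFrame, any(axis=1) reduces it per row.
rows_with_big = data[(data.abs() > 1).any(axis=1)]

print(big_in_a)
print(rows_with_big)
```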
• Sometimes it is necessary to replace markers for missing data with specific values or with
NaN. This can be done using the replace method. Let’s consider this Series:
• The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:
data.replace(-999, np.nan)
• To replace multiple values at once, pass a list of the values to be replaced, followed by
the substitute value:
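A minimal sketch of both forms of replace. The Series values are assumed for illustration, with -999 and -1000 playing the role of sentinels.

```python
import numpy as np
import pandas as pd

# Hypothetical Series containing sentinel values.
data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Replace a single sentinel; replace returns a new Series.
single = data.replace(-999, np.nan)

# Replace several sentinels at once by passing a list.
multi = data.replace([-999, -1000], np.nan)

print(single)
print(multi)
```

The original data Series is left unchanged in both cases.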
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do
so, we have to use cut, a function in pandas:
import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
• The object pandas returns is a special Categorical object. We can treat it like an array of
strings indicating the bin name; internally it contains a categories array indicating the distinct
category names, along with an integer code for each value of the ages data in the codes
attribute (older pandas versions called these attributes levels and labels):
cats.codes
cats.categories
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
pd.Series(cats).value_counts()
Consistent with mathematical notation for intervals, a parenthesis means that the side is
open while the square bracket means it is closed (inclusive).
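To see the effect of the open and closed sides, the sketch below bins the same ages twice, once with the default right-closed intervals and once with right=False. The variable names are illustrative.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default: intervals are closed on the right, e.g. (18, 25].
right_closed = pd.cut(ages, bins)

# right=False: intervals are closed on the left, e.g. [18, 25).
left_closed = pd.cut(ages, bins, right=False)

# The age 25 (position 2) moves from the first bin to the second.
print(right_closed[2], left_closed[2])
```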
7. List and describe different functions used for pattern matching in re module with
example.
• Regular expressions provide a flexible way to search or match string patterns in text.
• A single expression, commonly called a regex, is a string formed according to the regular
expression language. Python’s built-in re module is responsible for applying regular
expressions to strings.
Splitting
• Suppose we want to split a string on a variable number of whitespace characters (tabs,
spaces, and newlines). The regex describing one or more whitespace characters is \s+:
import re
text = "good better\t best\t excellent"
re.split(r'\s+', text)
Output:
['good', 'better', 'best', 'excellent']
When we call re.split(r'\s+', text), the regular expression is first compiled, then its split
method is called on the passed text.
• We can compile the regex ourselves with re.compile, forming a reusable regex object:
Creating a regex object with re.compile is highly recommended if you intend to apply the
same expression to many strings; doing so will save CPU cycles.
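A small sketch of a compiled pattern reused across several strings. The second string and the variable names are assumptions added for illustration.

```python
import re

# Compile once, reuse many times.
whitespace = re.compile(r'\s+')

text = "good better\t best\t excellent"
other = "one  two\nthree"  # hypothetical second input

parts = whitespace.split(text)
other_parts = whitespace.split(other)

print(parts)
print(other_parts)
```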
Pattern matching
The re module offers a set of functions that allow us to search a string for a match. Using
these functions we can search for the required pattern. They are as follows:
• findall(): Finds all substrings where the RE matches and returns them as a list. It scans
the string from left to right and returns all non-overlapping occurrences of the pattern.
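import re
text = """Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
Output:
['[email protected]', '[email protected]', '[email protected]']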
• match(): re.match() determines whether the RE matches at the beginning of the string. The
method returns a match object if the match succeeds. If not, it returns None.
print(regex.match(text))
Output:
None
• search(): The search() function searches the string for a match, and returns a Match
object if there is a match. If there is more than one match, only the first occurrence
of the match is returned.
m = regex.search(text)
m
Output:
<re.Match object; span=(6, 21), match='[email protected]'>
Substitution
sub() returns a new string with occurrences of the pattern replaced by the new string:
print(regex.sub('Son', text))
Output:
Steve Son
Rob Son
Ryan Son
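The four operations can be tried together in one self-contained sketch. The e-mail addresses below are made up (example.com style placeholders) and the replacement string 'REDACTED' is arbitrary; only the pattern is taken from the notes.

```python
import re

# Hypothetical sample text with made-up addresses.
text = """Steve steve@example.com
Rob rob@example.org
Ryan ryan@example.net
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

found = regex.findall(text)               # every address, as a list
at_start = regex.match(text)              # None: the text starts with a name
first = regex.search(text)                # Match object for the first address
replaced = regex.sub('REDACTED', text)    # new string with addresses replaced

print(found)
print(at_start)
print(first.group(), first.span())
print(replaced)
```

match returns None here because the text begins with "Steve ", which is not followed by an @ sign, whereas search keeps scanning until it reaches the first address.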
cars_data2 = cars_data.drop(['Doors','Weight'],axis='columns')
cars_data2.shape
• To read only a small number of rows (avoiding reading the entire file),
specify the number with nrows:
• When chunksize is passed, read_csv returns a TextFileReader object (called TextParser in
older pandas versions) that allows us to iterate over the file in pieces of chunksize rows.
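A minimal sketch of nrows and chunksize. The CSV content is generated in memory and is purely illustrative; with a real file the path would be passed instead of io.StringIO.

```python
import io
import pandas as pd

# Hypothetical 10-row CSV held in memory.
csv_text = "a,b\n" + "".join(f"{i},{i * i}\n" for i in range(10))

# Read only the first 3 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Iterate over the file 4 rows at a time and aggregate as we go.
chunk_sizes = []
total_b = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total_b += int(chunk['b'].sum())

print(head)
print(chunk_sizes, total_b)
```

Each chunk is an ordinary DataFrame, so any per-chunk computation can be accumulated without loading the whole file.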
Module 5
1. Explain how a simple line plot can be created using matplotlib. Show the adjustments
that can be made to the plot w.r.t. line colors.
The simplest of all plots is the visualization of a single function y = f(x). Here we will create
a simple line plot.
In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single
container that contains all the objects representing axes, graphics, text, and labels. The
axes (an instance of the class plt.Axes) is what we see above: a bounding box with ticks
and labels, which will eventually contain the plot elements that make up the visualization.
Alternatively, we can use the pylab interface, which creates the figure and axes in the
background. Ex: plt.plot(x, np.sin(x))
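Both ways of drawing the sine curve can be sketched as follows. The Agg backend line is an assumption added so the sketch runs without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)

# Explicit figure and axes, then plot on the axes.
fig = plt.figure()
ax = plt.axes()
line_explicit, = ax.plot(x, np.sin(x))

# pyplot creates the figure and axes in the background.
plt.figure()
line_implicit, = plt.plot(x, np.sin(x))
implicit_axes = plt.gca()
```

Calling plt.show() (interactive use) or fig.savefig(...) afterwards displays or stores the result.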
The plt.plot() function takes additional arguments that control the appearance of the line.
The color keyword accepts a string argument representing virtually any imaginable color, and
the color can be specified in a variety of ways.
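A sketch of several common ways to specify a color. The particular colors are arbitrary examples, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots()

color_specs = [
    'blue',             # color by name
    'g',                # short color code (rgbcmyk)
    '0.75',             # grayscale between 0 and 1
    '#FFDD44',          # hex code (RRGGBB)
    (1.0, 0.2, 0.3),    # RGB tuple, values 0 to 1
    'chartreuse',       # HTML color name
]

# One shifted sine curve per color specification.
for offset, spec in enumerate(color_specs):
    ax.plot(x, np.sin(x - offset), color=spec)
```

If no color is given, Matplotlib cycles through a default set of colors for successive lines.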
Advantages
• One of Matplotlib’s most important features is its ability to play well with many
operating systems and graphics backends. Matplotlib supports dozens of backends and
output types, which means you can count on it to work regardless of which operating
system you are using or which output format you wish. This cross-platform,
everything-to-everyone approach has been one of the great strengths of Matplotlib.
• It has led to a large userbase, which in turn has led to an active developer base and
Matplotlib’s powerful tools and ubiquity within the scientific Python world.
• The Pandas library itself can be used as a wrapper around Matplotlib’s API. Even with
wrappers like these, it is still often useful to dive into Matplotlib’s syntax to adjust the
final plot output.
MATLAB-style (pyplot) interface:
• Matplotlib was originally written as a Python alternative for MATLAB users, and much of its
syntax reflects that fact.
• The MATLAB-style tools are contained in the pyplot (plt) interface.
• The interface is stateful: it keeps track of the “current” figure and axes, where all plt
commands are applied. Once the second panel is created, going back and adding something
to the first is a bit complex.
Object-oriented interface:
• The object-oriented interface is available for these more complicated situations, and for
when we want more control over our figure.
• Rather than depending on some notion of an “active” figure or axes, in the object-oriented
interface the plotting functions are methods of explicit Figure and Axes objects.
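The contrast between the two interfaces can be sketched with a two-panel figure. The data and the title text are illustrative, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# MATLAB-style: each plt call acts on the "current" axes.
plt.figure()
plt.subplot(2, 1, 1)          # (rows, columns, panel number)
plt.plot(x, np.sin(x))
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x))
stateful_fig = plt.gcf()

# Object-oriented: methods are called on explicit Axes objects.
fig, ax = plt.subplots(2)
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x))
# Going back to the first panel is just another method call.
ax[0].set_title('sine')
```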
4. Write the lines of code to create a simple histogram using matplotlib library.
A simple histogram can be useful in understanding a dataset. The code below creates a
simple histogram.
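A minimal sketch, assuming 1000 normally distributed random values as the dataset; the bin count and color are arbitrary choices, and the Agg backend line is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical dataset: 1000 draws from a standard normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# plt.hist returns the bin counts, the bin edges, and the drawn patches.
counts, bin_edges, patches = plt.hist(data, bins=30, color='steelblue')

plt.xlabel('value')
plt.ylabel('frequency')
```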
5. What are the two ways to adjust axis limits of the plot using Matplotlib? Explain with the example
for each.
Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes
it’s nice to have finer control.
• using plt.axis()
The plt.axis() method allows you to set the x and y limits with a single call, by passing a
list that specifies [xmin, xmax, ymin, ymax].
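The sketch below sets the same limits in two ways: with plt.xlim() and plt.ylim(), which is assumed here to be the other of the two approaches the question asks for, and with a single plt.axis() call. The limit values are arbitrary, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Way 1: set the x and y limits separately.
plt.figure()
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
limits_separate = (plt.xlim(), plt.ylim())

# Way 2: set all four limits in one call: [xmin, xmax, ymin, ymax].
plt.figure()
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5])
limits_single_call = plt.axis()
```

Called without arguments, plt.xlim(), plt.ylim() and plt.axis() return the current limits, which is how the two results are read back here.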
6. List out the dissimilarities between plot() and scatter() functions while plotting scatter plot.
• The difference between the two functions is: with pyplot.plot() any property you apply
(color, shape, size of points) will be applied across all points whereas in pyplot.scatter() you
have more control in each point’s appearance. That is, in plt.scatter() you can have the color,
shape and size of each dot (datapoint) to vary based on another variable.
• While it doesn’t matter as much for small amounts of data, as datasets get larger than a
few thousand points, plt.plot can be noticeably more efficient than plt.scatter. The reason is
that plt.scatter has the capability to render a different size and/or color for each point, so
the renderer must do the extra work of constructing each point individually. In plt.plot, on
the other hand, the points are always essentially clones of each other, so the work of
determining the appearance of the points is done only once for the entire set of data.
• For large datasets, the difference between these two can lead to vastly different performance,
and for this reason, plt.plot should be preferred over plt.scatter for large datasets.
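The difference can be sketched by drawing the same random points with both functions. The data, sizes, and colors are made up for illustration, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 100 random points with a color and size per point.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = rng.normal(size=100)
colors = rng.random(100)
sizes = 1000 * rng.random(100)

fig, (ax1, ax2) = plt.subplots(1, 2)

# plot: one marker style shared by every point.
ax1.plot(x, y, 'o', color='black')

# scatter: color (c) and size (s) vary point by point.
points = ax2.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
```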
7. How to customize the default plot settings of Matplotlib w.r.t runtime configuration
and stylesheets? Explain with the suitable code snippet.
• Each time Matplotlib loads, it defines a runtime configuration (rc) containing the default
styles for every plot element we create.
• We can adjust this configuration at any time using the plt.rc convenience routine.
• To modify the rc parameters, we’ll start by saving a copy of the current rcParams
dictionary, so we can easily reset these changes in the current session:
IPython_default = plt.rcParams.copy()
• Now we can use the plt.rc function to change some of these settings:
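A hedged sketch of the workflow: save the defaults, change a few settings with plt.rc, restore them, and then apply a stylesheet temporarily. The specific settings and the 'ggplot' stylesheet are illustrative choices, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# Save a copy of the current settings so they can be restored later.
IPython_default = plt.rcParams.copy()

# Change a few defaults with plt.rc(group, **settings).
plt.rc('axes', facecolor='#E6E6E6', axisbelow=True, grid=True)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('lines', linewidth=2)
changed_width = plt.rcParams['lines.linewidth']

# Restore the saved defaults.
plt.rcParams.update(IPython_default)
restored_width = plt.rcParams['lines.linewidth']

# Stylesheets: apply a predefined style only inside the with block.
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    grid_inside = plt.rcParams['axes.grid']
grid_outside = plt.rcParams['axes.grid']
```

plt.style.use('ggplot') would apply the same stylesheet for the rest of the session instead of only within the block.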
Seaborn Matplotlib
Let us assume
x=[10,20,30,45,60]
y=[0.5,0.2,0.5,0.3,0.5]
Matplotlib:
#to plot the graph
import matplotlib.pyplot as plt
plt.style.use('classic')
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')
Seaborn:
#to plot the graph
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')
9. List and describe different categories of colormaps with the suitable code snippets.
Three different categories of colormaps:
Sequential colormaps : These consist of one continuous sequence of colors (e.g., binary or
viridis).
Divergent colormaps : These usually contain two distinct colors, which show positive and
negative deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps : These mix colors with no particular sequence (e.g., rainbow or jet).
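A short sketch that draws the same array with one colormap per category. Here viridis is assumed as an example of a sequential colormap (the third standard category), while RdBu and jet follow the grouping used in these notes. The data is made up, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: values running from -1 to 1 on a 10 x 10 grid.
data = np.linspace(-1, 1, 100).reshape(10, 10)

# One example colormap per category, grouped as in the notes.
cmap_by_category = {
    'sequential': 'viridis',
    'divergent': 'RdBu',
    'qualitative': 'jet',
}

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
used = []
for ax, (category, name) in zip(axes, cmap_by_category.items()):
    image = ax.imshow(data, cmap=name)   # cmap selects the colormap by name
    ax.set_title(f"{category}: {name}")
    used.append(image.get_cmap().name)
```

Any plotting function that accepts a cmap argument, such as plt.scatter or plt.contourf, can be given these names in the same way.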