DAP Module4 Notes

Module 4 of the Data Analytics using Python course covers data loading and wrangling techniques, including merging datasets, reshaping data with hierarchical indexing, and reading text data using Pandas. It also discusses handling missing data with sentinel values, data transformation mechanisms, discretization, and pattern matching using regular expressions. Additionally, it provides practical examples of data cleaning and manipulation, as well as creating visualizations with Matplotlib.


Module 4 [20MCA31] Data Analytics using Python

Module 4
Data Loading and Data Wrangling

1. Explain merging of datasets for the following situations:

i) Many to one   ii) Many to many
Merge or join operations combine data sets by linking rows using one or more keys. The
merge function in pandas is used to combine datasets.

Let’s start with a simple example:

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df1
df2

• The example below shows a many-to-one merge situation: the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column.

pd.merge(df1, df2)               # performs inner join on the common 'key' column
pd.merge(df1, df2, on='key')     # same, with the join column named explicitly
pd.merge(df1, df2, how='outer')  # performs outer join

Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection. The outer join takes the
union of the keys, combining the effect of applying both left and right joins.
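The effect of the two join types can be checked with a short runnable sketch, reusing the df1 and df2 defined above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

inner = pd.merge(df1, df2, on='key')               # intersection of keys
outer = pd.merge(df1, df2, on='key', how='outer')  # union of keys

print(sorted(inner['key'].unique()))  # ['a', 'b']
print(sorted(outer['key'].unique()))  # ['a', 'b', 'c', 'd']
```

Rows with keys 'c' and 'd' appear in the outer result with NaN in the columns that have no match.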

ROOPA.H.M, Dept of MCA, RNSIT Page 1



• The example below shows a many-to-many merge:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],'data2': range(5)})

df1
df2

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in
the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method
only affects the distinct key values appearing in the result:

pd.merge(df1, df2, how='inner')
pd.merge(df1, df2, on='key', how='left')

2. Describe reshaping with hierarchical indexing with suitable examples.


Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.
There are two primary actions:
• stack: this “rotates” or pivots from the columns in the data to the rows
• unstack: this pivots from the rows into the columns

Consider a small DataFrame with string arrays as row and column indexes:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'], name='number'))
data


• Using the stack method on this data pivots the columns into the rows, producing a Series:

result = data.stack()
result
result.unstack()

The data can be rearranged back into a DataFrame with unstack().
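The round trip can be verified with a short sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

result = data.stack()        # Series with a MultiIndex (state, number)
restored = result.unstack()  # pivots 'number' back into the columns

print(restored.equals(data))  # True
```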

3. What are the different functions available in Pandas library to read text or tabular
data? Give examples.
Pandas features a number of functions for reading tabular data as a DataFrame object. The most commonly used ones include:
• read_csv: load delimited data from a file, URL, or file-like object (comma is the default delimiter)
• read_table: load delimited data (tab, '\t', is the default delimiter)
• read_fwf: read data in fixed-width column format (no delimiters)
• read_clipboard: version of read_table that reads data from the clipboard
• read_excel: read tabular data from an Excel file

read_csv and read_table are the most frequently used functions.

Let’s start with a small comma-separated (CSV) text file, ex1.csv:

df = pd.read_csv('ex1.csv')
pd.read_table('ex1.csv', sep=',')   # equivalent, with the delimiter given explicitly

Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If the file uses a different delimiter, read_table can be used with the delimiter passed via the sep argument.
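Since the contents of ex1.csv are not reproduced here, a self-contained sketch using equivalent hypothetical data read from an in-memory buffer:

```python
import pandas as pd
from io import StringIO

# hypothetical stand-in for ex1.csv
csv_text = "a,b,c,message\n1,2,3,hello\n4,5,6,world\n"

df = pd.read_csv(StringIO(csv_text))
print(df.shape)          # (2, 4)
print(list(df.columns))  # ['a', 'b', 'c', 'message']
```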


4. What are sentinel values? How can they be converted to NaN values?
Missing data is usually either not present at all (an empty string) or marked by some placeholder value, which is called a sentinel value.

By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and
NULL.

Let’s start with a small comma-separated (CSV) text file, Sample.csv:

result = pd.read_csv('Sample.csv')

• The na_values option can take either a list or set of strings to consider missing values:

result = pd.read_csv('Sample.csv', na_values=['NULL'])

• Different NA sentinels can be specified for each column in a dict:

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

pd.read_csv('Sample.csv', na_values=sentinels)
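A self-contained sketch of per-column sentinels, using hypothetical inline data in place of Sample.csv:

```python
import pandas as pd
from io import StringIO

# hypothetical stand-in for Sample.csv
csv_text = "something,a,b,message\none,1,2,foo\ntwo,5,6,NA\nthree,9,10,world\n"

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(StringIO(csv_text), na_values=sentinels)

print(result['message'].isna().sum())    # 2 ('foo' and 'NA' both treated as missing)
print(result['something'].isna().sum())  # 1 ('two' treated as missing)
```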

5. Describe the following data transformation mechanisms:


i) Removing duplicates ii) Filtering outliers iii) Replacing Values
i) Removing duplicates
Duplicate rows may be found in a DataFrame using the duplicated method, which returns a boolean Series indicating whether each row is a duplicate or not. Relatedly, drop_duplicates returns a DataFrame keeping only the rows for which the duplicated array is False.


import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

data
data.duplicated()        # boolean Series: True for rows 1, 4 and 6
data.drop_duplicates()   # rows 1, 4 and 6 are dropped

ii) Filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations.


Consider a DataFrame with some normally distributed data. (Note : while writing
answers, write your own random numbers between 0 and 1)

• Suppose we wanted to find values in one of the columns exceeding one in magnitude:

• To select all rows having a value exceeding 1 or -1, we can use the any method on a
boolean DataFrame:
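Since the snippets for this part are not reproduced above, a runnable sketch of both steps on synthetic normally distributed data:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))  # 1000 rows of normal data

# values in column 2 exceeding 1 in magnitude
col = data[2]
big = col[np.abs(col) > 1]

# all rows having a value exceeding 1 or -1 in any column
outliers = data[(np.abs(data) > 1).any(axis=1)]
print(len(outliers), "of", len(data), "rows contain an outlier")
```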


iii) Replacing Values

• Sometimes it is necessary to replace particular values with NaN or with other specific values. This can be done using the replace method. Let’s consider this Series:

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

• The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:

data.replace(-999, np.nan)

• To replace multiple values at once, pass a list of values followed by the single substitute value:

data.replace([-999, -1000], np.nan)

• To use a different replacement for each value, pass a list of substitutes:

data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})   # equivalently, as a dict


6. Write a short note on discretization and binning.


• Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose we have data about a group of people in a study, and we want to group them into
discrete age buckets:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do
so, we have to use cut, a function in pandas:

import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

• The object pandas returns is a special Categorical object. We can treat it like an array of strings indicating the bin name; internally it contains a categories array with the distinct interval names and a codes array labeling each age (in older pandas versions these attributes were called levels and labels):

cats.codes
cats.categories   # IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]])
pd.value_counts(cats)

Consistent with mathematical notation for intervals, a parenthesis means that the side is
open while the square bracket means it is closed (inclusive).
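The bins can also be given readable names with the labels option (the group names below are illustrative):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

cats = pd.cut(ages, bins, labels=group_names)
counts = pd.Series(cats).value_counts()
print(counts['Youth'], counts['Senior'])  # 5 1
```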

7. List and describe different functions used for pattern matching in re module with
example.
• Regular expressions provide a flexible way to search or match string patterns in text.

• A single expression, commonly called a regex, is a string formed according to the regular
expression language. Python’s built-in re module is responsible for applying regular

expressions to strings.

• The re module functions fall into three categories:


— splitting
— pattern matching
— substitution

Splitting

• To split a string with a variable number of whitespace characters (tabs, spaces, and newlines), use the regex r'\s+', which describes one or more whitespace characters:

import re
text = "good better\t best\t excellent"
re.split(r'\s+', text)   # ['good', 'better', 'best', 'excellent']

When we call re.split(r'\s+', text), the regular expression is first compiled, then its split method is called on the passed text.

• We can compile the regex ourselves with re.compile, forming a reusable regex object:

regex = re.compile(r'\s+')
regex.split(text)   # ['good', 'better', 'best', 'excellent']

Creating a regex object with re.compile is highly recommended if you intend to apply the
same expression to many strings; doing so will save CPU cycles.

Pattern matching
The re module offers a set of functions that allows us to search a string for a match. By
using these functions we can search required pattern. They are as follows:

• findall(): finds all substrings where the RE matches and returns them as a list. It scans the whole string and returns every non-overlapping occurrence of the pattern.

import re
# the email addresses here are illustrative placeholders
text = """Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
# ['steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

• match(): re.match() determines whether the RE matches at the beginning of the string. The method returns a match object if the search is successful; if not, it returns None. Since the text does not begin with an email address:

print(regex.match(text))
# None


• search(): the search() function scans the string for a match and returns a Match object if one is found. If there is more than one match, only the first occurrence is returned:

m = regex.search(text)
m
# <re.Match object; span=(6, 21), match='steve@gmail.com'>

Substitution

sub() will return a new string with occurrences of the pattern replaced by the new string:

print(regex.sub('Son', text))
# Steve Son
# Rob Son
# Ryan Son

8. Write a Python program to demonstrate i) importing datasets, ii) cleaning the data, iii) DataFrame manipulation using NumPy.

import pandas as pd
import numpy as np

# importing the dataset
cars_data = pd.read_csv("Toyota.csv")
cars_data.head()

#Cleaning the Data


cars_data2 =cars_data.copy()
cars_data3 =cars_data.copy()

cars_data2 = cars_data.drop(['Doors','Weight'],axis='columns')
cars_data2.shape

# identifying missing values(NaN -> Not a Number)


cars_data2.isna().sum()

#subsetting the rows that have one or more missing values


missing = cars_data2[cars_data2.isnull().any(axis=1)]
print(missing)

#To fill the numerical values


cars_data2.describe()

#calculating the mean value of the 'Age' variable


cars_data2['Age'].mean()

#To fill NA/NAN values using the specified value


cars_data2['Age'].fillna(cars_data2['Age'].mean(), inplace = True)


#imputing missing values of categorical variables


cars_data2['FuelType'].value_counts()

# to get the mode value of FuelType


cars_data2['FuelType'].value_counts().index[0]
cars_data2['FuelType'].fillna(cars_data2['FuelType'].value_counts().index[0], inplace = True)

cars_data2['MetColor'].fillna(cars_data2['MetColor'].mode()[0], inplace = True)

#Data frame manipulation using Numpy


print(cars_data.shape)
print(cars_data.index)
cars_data.ndim

9. Explain how text files can be read in pieces. Give examples.


• When processing very large files or figuring out the right set of arguments to correctly
process a large file, sometimes we may only want to read in a small piece of a file or iterate
through smaller chunks of the file.

• If you want to only read out a small number of rows (avoiding reading the entire file),
specify that with nrows:

• To read out a file in pieces, specify a chunksize as a number of rows:

• The TextParser object returned by read_csv allows us to iterate over the parts of the file
according to the chunksize.
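A self-contained sketch of both options, using a small in-memory CSV in place of a real file:

```python
import pandas as pd
from io import StringIO

# hypothetical 100-row CSV
csv_text = "key,value\n" + "".join(f"k{i},{i}\n" for i in range(100))

# read only the first 5 rows, avoiding reading the whole file
small = pd.read_csv(StringIO(csv_text), nrows=5)
print(small.shape)  # (5, 2)

# iterate through the file in chunks of 25 rows
chunks = pd.read_csv(StringIO(csv_text), chunksize=25)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [25, 25, 25, 25]
```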


Module 5

1. Explain how a simple line plot can be created using Matplotlib. Show the adjustments made to the plot w.r.t. line colors.
The simplest of all plots is the visualization of a single function y = f(x). Here we will create a simple line plot.

In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that holds all the objects representing axes, graphics, text, and labels. The axes (an instance of the class plt.Axes) is a bounding box with ticks and labels, which will eventually contain the plot elements that make up the visualization.

Alternatively, we can use the pylab interface, which creates the figure and axes in the
background. Ex: plt.plot(x, np.sin(x))
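A minimal runnable sketch of both styles:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# explicit figure and axes
fig = plt.figure()
ax = plt.axes()
ax.plot(x, np.sin(x))

# pylab-style shortcut: the figure and axes are created in the background
plt.figure()
plt.plot(x, np.sin(x))
```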

Adjusting the Plot: Line Colors

The plt.plot() function takes additional arguments that can be used to specify the color
keyword, which accepts a string argument representing virtually any imaginable color. The
color can be specified in a variety of ways.


2. What is Matplotlib? Mention its advantages.


Matplotlib is a multiplatform data visualization library built on NumPy arrays.

Advantages

• One of Matplotlib’s most important features is its ability to play well with many
operating systems and graphics backends. Matplotlib supports dozens of backends and
output types, which means you can count on it to work regardless of which operating
system you are using or which output format you wish. This cross-platform, everything-
to-everyone approach has been one of the great strengths of Matplotlib.

• It has led to a large userbase, which in turn has led to an active developer base and
Matplotlib’s powerful tools and ubiquity within the scientific Python world.

• Libraries such as pandas provide wrappers around Matplotlib’s API. Even with wrappers like these, it is still often useful to dive into Matplotlib’s syntax to adjust the final plot output.

3. Distinguish MATLAB style and object-oriented interfaces of Matplotlib.


MATLAB-style interface
• Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact.
• The MATLAB-style tools are contained in the pyplot (plt) interface.
• The interface is stateful: it keeps track of the “current” figure and axes, where all plt commands are applied. Once a second panel is created, going back and adding something to the first is complicated.

Object-oriented interface
• The object-oriented interface is available for more complicated situations, and for when we want more control over the figure.
• Rather than depending on some notion of an “active” figure or axes, the plotting functions are methods of explicit Figure and Axes objects.

4. Write the lines of code to create a simple histogram using matplotlib library.

A simple histogram can be useful in understanding a dataset. The code below creates a simple histogram.
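A minimal sketch (the data here is synthetic, drawn from a standard normal distribution):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)  # 1000 synthetic data points

plt.hist(data, bins=30, alpha=0.5, color='steelblue')
plt.xlabel('value')
plt.ylabel('count')
```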

5. What are the two ways to adjust axis limits of the plot using Matplotlib? Explain with the example
for each.

Matplotlib does a decent job of choosing default axis limits for your plot, but sometimes it’s nice to have finer control.

The two ways to adjust axis limits are:

• using plt.xlim() and plt.ylim() methods


• using plt.axis()

The plt.axis( ) method allows you to set the x and y limits with a single call, by passing a
list that specifies [xmin, xmax, ymin, ymax].
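Both approaches, sketched on a simple sine curve:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))

# Way 1: set each limit separately
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)

# Way 2: set all four limits in one call: [xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5])
```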

6. List out the dissimilarities between plot() and scatter() functions while plotting scatter plot.

• The difference between the two functions: with pyplot.plot(), any property you apply (color, shape, size of points) is applied across all points, whereas pyplot.scatter() gives more control over each point’s appearance. That is, with plt.scatter() the color, shape, and size of each dot (data point) can vary based on another variable.

• While it doesn’t matter as much for small amounts of data, as datasets get larger than a
few thousand points, plt.plot can be noticeably more efficient than plt.scatter. The reason is
that plt.scatter has the capability to render a different size and/or color for each point, so
the renderer must do the extra work of constructing each point individually. In plt.plot, on
the other hand, the points are always essentially clones of each other, so the work of
determining the appearance of the points is done only once for the entire set of data.

• For large datasets, the difference between these two can lead to vastly different performance,
and for this reason, plt.plot should be preferred over plt.scatter for large datasets.

7. How to customize the default plot settings of Matplotlib w.r.t runtime configuration
and stylesheets? Explain with the suitable code snippet.
• Each time Matplotlib loads, it defines a runtime configuration (rc) containing the default
styles for every plot element we create.

• We can adjust this configuration at any time using the plt.rc convenience routine.


• To modify the rc parameters, we’ll start by saving a copy of the current rcParams
dictionary, so we can easily reset these changes in the current session:

IPython_default = plt.rcParams.copy()

• Now we can use the plt.rc function to change some of these settings:
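A sketch of adjusting a few rc settings and restoring them afterwards (the particular values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# save the current settings so they can be restored later
IPython_default = plt.rcParams.copy()

# change some defaults with the plt.rc convenience routine
plt.rc('figure', figsize=(6, 4))
plt.rc('lines', linewidth=2)
plt.rc('axes', grid=True)

changed_lw = plt.rcParams['lines.linewidth']  # 2.0

# restore the saved defaults for the rest of the session
plt.rcParams.update(IPython_default)
```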

8. Elaborate on Seaborn versus Matplotlib with suitable examples.


The Seaborn library is built on top of Matplotlib. Here is a detailed comparison between the two:


Functionality
• Seaborn provides a variety of visualization patterns, uses less syntax, and has interesting default themes. It specializes in statistical visualization and is used to summarize data in visualizations and show the distribution of the data.
• Matplotlib is mainly deployed for basic plotting. Visualization using Matplotlib generally consists of bars, pies, lines, scatter plots and so on.

Handling multiple figures
• Seaborn automates the creation of multiple figures. This sometimes leads to OOM (out of memory) issues.
• In Matplotlib, multiple figures can be opened, but they need to be closed explicitly: plt.close() only closes the current figure, while plt.close('all') closes them all.

Visualization
• Seaborn is more integrated for working with Pandas data frames. It extends the Matplotlib library for creating beautiful graphics with Python using a more straightforward set of methods.
• Matplotlib is a graphics package for data visualization in Python, well integrated with NumPy and Pandas. The pyplot module mirrors the MATLAB plotting commands closely, so MATLAB users can easily transition to plotting with Python.

Data frames and arrays
• Seaborn works with the dataset as a whole and is much more intuitive than Matplotlib. relplot() is its entry API, with a 'kind' parameter to specify the type of plot, which could be line, scatter, or one of the other types. Seaborn is not stateful, so plotting functions require passing the data object.
• Matplotlib works with data frames and arrays and has stateful APIs for plotting. Figures and axes are represented by objects, so plot()-like calls without parameters suffice, without having to manage state explicitly.

Flexibility
• Seaborn avoids a ton of boilerplate by providing default themes which are commonly used.
• Matplotlib is highly customizable and powerful.

Use cases
• Seaborn is for more specific use cases; it is Matplotlib under the hood and is specially meant for statistical plotting.
• Pandas uses Matplotlib; it is a neat wrapper around Matplotlib.


Let us assume

x=[10,20,30,45,60]

y=[0.5,0.2,0.5,0.3,0.5]

Matplotlib (classic style):

import matplotlib.pyplot as plt
plt.style.use('classic')
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')

Seaborn (default theme):

import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')

9. List and describe different categories of colormaps with the suitable code snippets.
Three different categories of colormaps:

Sequential colormaps : These consist of one continuous sequence of colors


(e.g., binary or viridis).

Divergent colormaps : These usually contain two distinct colors, which show positive and
negative deviations from a mean (e.g., RdBu or PuOr).

Qualitative colormaps : These mix colors with no particular sequence (e.g., rainbow or jet).

import numpy as np
import matplotlib.pyplot as plt

# I is a 2-D image array; built here from a smooth function so the snippet is self-contained
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])

# add 1% random speckle noise to the image
speckles = (np.random.random(I.shape) < 0.01)
I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))

plt.figure(figsize=(10, 3.5))
plt.subplot(1, 2, 1)
plt.imshow(I, cmap='RdBu')
