Unit 2

This document covers data manipulation techniques in Python, focusing on tools such as Python Shell, Jupyter Notebook, NumPy, and Pandas. It details various operations including array manipulations, data wrangling, handling missing data, and creating DataFrames, along with advanced features like hierarchical indexing and pivot tables. The content is aimed at providing a comprehensive understanding of data manipulation for better decision-making and analysis.

Uploaded by

P SANTHIYA

UNIT 2

DATA MANIPULATION
PYTHON SHELL - JUPYTER NOTEBOOK - IPYTHON MAGIC COMMANDS -
NUMPY ARRAYS-UNIVERSAL FUNCTIONS – AGGREGATIONS – COMPUTATION
ON ARRAYS – FANCY INDEXING – SORTING ARRAYS-STRUCTURED DATA –
DATA MANIPULATION WITH PANDAS – DATA INDEXING AND SELECTION –
HANDLING MISSING DATA – HIERARCHICAL INDEXING – COMBINING
DATASETS – AGGREGATION AND GROUPING –STRING OPERATIONS –
WORKING WITH TIME SERIES – HIGH PERFORMANCE
Python Shell
• Python is an interpreted language, which means code is executed statement by statement. Python provides a Python Shell, which executes a single Python command and displays the result.
• The shell is also known as a REPL (Read-Eval-Print Loop): it reads a command, evaluates it, prints the result, and loops back to read the next command.
• It provides an easy, interactive way to write and test small pieces of Python code.
• It is a useful tool for debugging. For example, if you have an issue in a larger program, you can use the Python shell to test specific lines of code or to try different approaches to solving a problem.
• It is a good way to learn about the built-in functions and modules in Python.
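The read, evaluate, print, loop cycle can be imitated in a few lines. This is only a minimal sketch of the idea, not how the real shell is implemented; the function name repl and the sample commands are made up for illustration.

```python
def repl(commands):
    """Run each command string in turn and collect what a shell would print."""
    namespace = {}                 # names defined earlier stay available, as in a shell
    printed = []
    for source in commands:                    # Read the next command
        try:
            value = eval(source, namespace)    # Evaluate it as an expression
        except SyntaxError:
            exec(source, namespace)            # statements such as assignments have no value
            value = None
        if value is not None:
            printed.append(repr(value))        # Print the result
    return printed                             # the Loop ends when the input runs out


print(repl(["x = 6", "x * 7", "len('shell')"]))
```

The assignment produces no output, while the two expressions are echoed back, which matches what happens when the same three lines are typed into the Python shell.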
Jupyter Notebook
• The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
• Jupyter supports over 40 programming languages, and Python is one of them. Python is required to install the Jupyter Notebook itself. The requirement was originally Python 3.3 or greater, or Python 2.7; current Jupyter releases run on Python 3 only.
• Jupyter Notebook can be installed in either of the two ways described below:
• 1. Install Jupyter Notebook with Anaconda (steps for Windows):
• Step 1: Download Anaconda
• Step 2: Run the Anaconda installer
• Step 3: Launch Jupyter Notebook
• 2. Install Jupyter with pip, the package manager used to install and manage software packages and libraries written in Python:
• Step 1: Install the Python programming language
• Step 2: Install Jupyter Notebook (pip install notebook)
• Step 3: Start Jupyter Notebook (jupyter notebook)
Magic Commands
• Magic commands, also known as magic functions, are special commands in IPython that modify the behavior of a code cell or simplify common tasks such as timing code execution and profiling. Magic commands have the prefix '%' or '%%' followed by the command name. There are two types of magic commands:
• Line magic commands: prefixed with '%', they apply to a single line of input.
• Cell magic commands: prefixed with '%%', they apply to the whole cell.
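Magic commands only work inside IPython or Jupyter, so they cannot be run as a plain Python script. As a rough illustration, the line magic %timeit times a single statement; the sketch below does a comparable measurement with the standard timeit module. The statement being timed is an arbitrary example.

```python
import timeit

# Plain-Python counterpart of the IPython line magic:
#     %timeit sum(range(1000))
# timeit.timeit runs the statement `number` times and returns the total seconds.
runs = 1000
total_seconds = timeit.timeit("sum(range(1000))", number=runs)
per_call = total_seconds / runs

print(f"{per_call * 1e6:.2f} microseconds per call")
```

In a notebook, the cell magic form %%timeit placed on the first line of a cell would time the whole cell instead of one line.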
Data Wrangling
• Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format so that better decisions can be made in less time.
• Data wrangling is also called data munging.
• Data wrangling covers the following processes:
• 1. Gathering data from various sources into one place
• 2. Piecing the data together according to the required structure
• 3. Cleaning the data of noise and of erroneous or missing elements.
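The three processes above can be sketched with pandas. The two small tables are hypothetical stand-ins for data coming from separate sources, and the column names are invented for the example.

```python
import numpy as np
import pandas as pd

# 1. Gather: two hypothetical sources brought into one place (memory).
orders = pd.DataFrame({'item': ['pen', 'ink', 'pad'], 'qty': [2, 1, 4]})
prices = pd.DataFrame({'item': ['pen', 'pad', 'ink'], 'price': [1.5, 3.0, np.nan]})

# 2. Piece together: join the sources on their shared key.
merged = orders.merge(prices, on='item', how='left')

# 3. Clean: drop rows that still have missing elements.
cleaned = merged.dropna().reset_index(drop=True)

print(cleaned)
```

Dropping incomplete rows is only one possible cleaning choice; filling the gaps with a default value is another.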
NumPy
• NumPy, short for Numerical Python, is the core library for scientific computing in Python.
• Its central object, the NumPy array, is a powerful N-dimensional array; a two-dimensional array is organized in rows and columns.
• It is designed specifically for basic and advanced array operations, and it primarily supports multi-dimensional arrays and vectors for complex arithmetic operations.
• It provides a high-performance multidimensional array object and tools for working with these arrays.
• It contains:
• a) A powerful N-dimensional array object
• b) Basic linear algebra functions
• c) Basic Fourier transforms
• d) Sophisticated random number capabilities
• e) Tools for integrating Fortran code
• f) Tools for integrating C/C++ code.
Basic array manipulations are as follows:
• 1. Attributes of arrays: These define the size, shape, memory consumption, and data type of an array.
• 2. Indexing of arrays: Getting and setting the value of individual array elements.
• 3. Slicing of arrays: Getting and setting smaller subarrays within a larger array.
• 4. Reshaping of arrays: Changing the shape of a given array.
• 5. Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array into many.
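The five manipulations listed above can be tried on one small array. This is a minimal sketch; the array contents and variable names are arbitrary.

```python
import numpy as np

a = np.arange(12)                  # 1-D array holding 0..11

# 1. Attributes: dimensions, shape, element count, data type, bytes used
print(a.ndim, a.shape, a.size, a.dtype, a.nbytes)

# 2. Indexing: read and set individual elements
first, last = a[0], a[-1]
a[0] = 100

# 3. Slicing: a smaller subarray (positions 4 to 7)
middle = a[4:8]

# 4. Reshaping: the same 12 values arranged as 3 rows by 4 columns
grid = a.reshape(3, 4)

# 5. Joining and splitting
joined = np.concatenate([a, a])    # 24 elements
left, right = np.split(a, 2)       # two halves of 6 elements each
```

Note that slices and reshaped arrays are usually views of the original data, so changing middle or grid would also change a.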
Aggregations
• An aggregation function takes multiple individual values and returns a summary. In the majority of cases, this summary is a single value.
• The most common aggregation functions are a simple average or a summation of values.
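A short sketch of aggregation on a NumPy array follows. The numbers are arbitrary sample values; the axis argument shows the case where the summary is one value per column or per row rather than a single value.

```python
import numpy as np

scores = np.array([[4, 31, 2],
                   [25, 57, 62]])

total = scores.sum()               # one value summarizing the whole array
average = scores.mean()            # simple average of all six values
column_max = scores.max(axis=0)    # one maximum per column
row_sum = scores.sum(axis=1)       # one sum per row

print(total, average, column_max, row_sum)
```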
Computations on Arrays
• Computation on NumPy arrays can be very fast, or it can be very slow. Fast computation is possible with vectorized operations, which are implemented through NumPy's universal functions (ufuncs).
• A universal function (ufunc) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features.
• A ufunc is a "vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.
NumPy's ufuncs:
• Ufuncs are of two types: unary ufuncs and binary ufuncs.
• Unary ufuncs operate on a single input; binary ufuncs operate on two inputs.
• Arithmetic operators (+, -, *, /), which wrap np.add, np.subtract, np.multiply, and np.divide
• Absolute value. Syntax: np.abs(x), also available as np.absolute(x) or through Python's built-in abs(x)
• Trigonometric functions such as np.sin, np.cos, and np.tan
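The sketch below shows one unary and one binary ufunc on a small array, together with the operator form of the binary one. The input values are arbitrary.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Unary ufuncs: one input array, applied element by element.
magnitudes = np.abs(x)
sines = np.sin(x)

# Binary ufuncs: two inputs. The + operator dispatches to np.add,
# and the scalar 10 is broadcast across every element of x.
shifted = x + 10
same = np.add(x, 10)

print(magnitudes, shifted)
```

No Python loop is written anywhere; the looping happens inside the compiled ufunc, which is where the speed of vectorized code comes from.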
Fancy Indexing
• With NumPy fancy indexing, an array can be indexed with another NumPy array, a Python list, or a sequence of integers, whose values select elements in the indexed array.
• Example: We first create a NumPy array with 11 floating-point numbers and then index it with another NumPy array and with a Python list, to extract the elements at positions 0, 2, and 4 of the original array:
import numpy as np
A = np.linspace(0, 1, 11)        # 11 evenly spaced values from 0 to 1
print(A)
print(A[np.array([0, 2, 4])])    # index with a NumPy array
# The same thing can be accomplished by indexing with a Python list
print(A[[0, 2, 4]])
Output:
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[0.  0.2 0.4]
[0.  0.2 0.4]
Structured data
A structured NumPy array is an array of structures (records).
NumPy arrays are homogeneous, i.e. they can contain data of one type only. A structured array keeps this rule by using a single compound data type whose named fields can each have a different type.
import numpy as np
# Creating the type of a structure
dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]
# Creating a structured NumPy array
structuredArr = np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6),
                          ('Iresh', 99.9, 7)], dtype=dtype)
print(structuredArr.dtype)
Output:
[('Name', '<U10'), ('Marks', '<f8'), ('GradeLevel', '<i4')]
Creating structured arrays:
• 1. Dictionary method:
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ('U10', 'i4', 'f8')})
Output: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
Numerical types can be specified with Python types or NumPy dtypes instead:
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ((np.str_, 10), int, np.float32)})
Output: dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])
(The size chosen for the Python int type depends on the platform; '<i8' is the usual result on 64-bit Linux and macOS.)
A compound type can also be specified as a list of tuples:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
Output: dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
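Once a structured array exists, its fields can be read by name. The sketch below reuses the student records from the structured data example; the variable names are chosen only for illustration.

```python
import numpy as np

dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]
students = np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5),
                     ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)], dtype=dtype)

names = students['Name']           # one field across all records
first_record = students[0]         # one record with all of its fields

# A boolean mask built from one field selects whole records.
passed = students[students['Marks'] > 50]['Name']

print(names, first_record, passed)
```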

Data Manipulation with Pandas
• Pandas is a high-level library for data manipulation and analysis, developed by Wes McKinney.
• It is built on the NumPy package, and its key data structure is called the DataFrame.
• DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
Create DataFrame with Duplicate Data
import pandas as pd
raw_data = {'first_name': ['rupali', 'rupali', 'rakshita', 'sangeeta', 'mahesh', 'vilas'],
            'last_name': ['dhotre', 'dhotre', 'dhotre', 'Auti', 'jadhav', 'bagad'],
            'RNo': [12, 12, 1111111, 36, 24, 73],
            'TestScore1': [4, 4, 4, 31, 2, 3],
            'TestScore2': [25, 25, 25, 57, 62, 70]}
# Build the DataFrame; rows 0 and 1 are exact duplicates of each other
df = pd.DataFrame(raw_data)
Drop duplicates
• Drop rows that repeat an earlier row in every column:
df.drop_duplicates()
• Drop duplicates in the first_name column, but keep the last observation in each duplicated set:
df.drop_duplicates(['first_name'], keep='last')
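The effect of the two calls can be checked on a cut-down copy of the table. This sketch keeps only three of the columns for brevity; that trimming is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['rupali', 'rupali', 'rakshita', 'sangeeta', 'mahesh', 'vilas'],
    'last_name': ['dhotre', 'dhotre', 'dhotre', 'Auti', 'jadhav', 'bagad'],
    'RNo': [12, 12, 1111111, 36, 24, 73],
})

# Row 1 repeats row 0 in every column, so it is dropped.
unique_rows = df.drop_duplicates()

# Only first_name is compared; within each duplicated set the last row is kept,
# so row 0 is dropped instead.
keep_last = df.drop_duplicates(['first_name'], keep='last')

print(len(df), len(unique_rows), len(keep_last))
```

Neither call changes df itself; each returns a new DataFrame, so the result has to be assigned to a name to be kept.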

Creating a Data Map and Data Plan
import pandas as pd
# Each hill maps to a tuple of (height in metres, latitude, longitude)
scottish_hills = {'Ben Nevis': (1345, 56.79685, -5.003508),
                  'Ben Macdui': (1309, 57.070453, -3.668262),
                  'Braeriach': (1296, 57.078628, -3.728024),
                  'Cairn Toul': (1291, 57.054611, -3.71042),
                  'Sgòr an Lochain Uaine': (1258, 57.057999, -3.725416)}
# Each dictionary key becomes a column; the tuple positions become rows 0, 1, 2
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)
Manipulating and Creating Categorical Variables
import pandas as pd
# The allowed categories
cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')
# Values outside the allowed categories ('Yellow', 'Purple') become NaN
cycle_data = pd.Series(pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
                                      categories=cycle_colors.tolist(), ordered=False))
# Flag the entries that did not match any category
find_entries = pd.isnull(cycle_data)
print(cycle_colors)
print()
print(cycle_data)
print()
print(find_entries[find_entries == True])
Dealing with Dates and Times Values
• Python provides two ways of turning a date and time value into text:
• 1. str(): turns a datetime value into a string without any custom formatting.
• 2. strftime(): lets the user define how the datetime value should appear after conversion.
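The difference between the two approaches is easiest to see side by side. This sketch uses a fixed datetime value, chosen arbitrarily, so that the result is repeatable.

```python
from datetime import datetime

moment = datetime(2020, 7, 21, 11, 31, 1)

plain = str(moment)                           # no formatting choices available
custom = moment.strftime('%d.%m.%Y %H:%M')    # day.month.year, hours:minutes

print(plain)     # 2020-07-21 11:31:01
print(custom)    # 21.07.2020 11:31
```

The format codes (%d, %m, %Y, %H, %M) are the standard strftime directives; rearranging them changes only how the value is displayed.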
Using pandas.to_datetime() with a date
import pandas as pd
# input in dd.mm.yyyy format
date = ['21.07.2020']
# stating the format explicitly avoids any day/month ambiguity
# output in yyyy-mm-dd format
print(pd.to_datetime(date, format='%d.%m.%Y'))
Using pandas.to_datetime() with a date and time
import pandas as pd
# date (dd.mm.yyyy) and time (H:MM:SS AM/PM)
date = ['21.07.2020 11:31:01 AM']
# output in yyyy-mm-dd HH:MM:SS
print(pd.to_datetime(date, format='%d.%m.%Y %I:%M:%S %p'))
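Missing Data
# load and summarize the dataset
from pandas import read_csv
# load the dataset ('csv file name' is a placeholder for the path to your file)
dataset = read_csv('csv file name', header=None)
# summarize the dataset; the count row shows the non-missing values in each numeric column
print(dataset.describe())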
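Beyond summarizing, pandas can locate, drop, or fill missing values. The sketch below uses a small made-up table in place of a CSV file, with gaps marked as NaN; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [4.0, 5.0, np.nan]})

missing_per_column = dataset.isnull().sum()   # number of gaps in each column
complete_rows = dataset.dropna()              # keep only rows without gaps
filled = dataset.fillna(dataset.mean())       # replace each gap with its column mean

print(missing_per_column)
print(complete_rows)
print(filled)
```

Whether to drop or fill depends on the data: dropping loses observations, while filling with the mean keeps them at the cost of inventing values.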
Hierarchical Indexing
• Hierarchical indexing is a method of creating structured group relationships in data.
• A MultiIndex, or hierarchical index, comes in when our data has more than two dimensions: several index levels on one axis let a two-dimensional DataFrame represent it.
• As we already know, a Series is a one-dimensional labelled array built on NumPy, and a DataFrame is usually a two-dimensional table whose columns are Series.
To create a DataFrame with player ratings of a few players from the FIFA 19 dataset

In [1]: import pandas as pd

In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',
'MF', 'MF', 'MF', 'CF', 'CF', 'CF'], 'Name': ['De Gea', 'Coutois', 'Allison',
'VanDijk', 'Ramos', 'Godin', 'Hazard', 'Kante', 'De Bruyne', 'Ronaldo',
'Messi', 'Neymar'], 'Overall': ['91', '88', '89', '89', '91', '90', '91', '90',
'92', '94', '93', '92'], 'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd', '2nd',
'3rd', '1st', '1st', '2nd', '3rd']}
In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name',
'Overall', 'Rank'])
In [4]: fifa19
# set_index returns a new DataFrame, so the result is assigned back to fifa19
In [5]: fifa19 = fifa19.set_index(['Position', 'Rank'], drop=False)

In [6]: fifa19
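With a two-level index in place, rows can be selected by the outer level alone or by both levels together. This is a minimal sketch on a shortened, hypothetical version of the player table, with ratings stored as integers.

```python
import pandas as pd

ratings = pd.DataFrame({
    'Position': ['GK', 'GK', 'DF', 'DF'],
    'Rank': ['1st', '2nd', '1st', '2nd'],
    'Name': ['De Gea', 'Allison', 'Ramos', 'Godin'],
    'Overall': [91, 89, 91, 90],
})

# Position becomes the outer index level and Rank the inner one.
indexed = ratings.set_index(['Position', 'Rank']).sort_index()

goalkeepers = indexed.loc['GK']                   # every row under the outer label 'GK'
top_keeper = indexed.loc[('GK', '1st'), 'Name']   # one row picked by both levels

print(goalkeepers)
print(top_keeper)
```

Sorting the index first is optional here, but it keeps lookups on the MultiIndex efficient.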
Aggregation and Grouping
• Pandas aggregation methods are as follows:
• a) count(): Total number of items
• b) first(), last(): First and last item
• c) mean(), median(): Mean and median
• d) min(), max(): Minimum and maximum
• e) std(), var(): Standard deviation and variance
• f) mad(): Mean absolute deviation (removed in pandas 2.0)
• g) prod(): Product of all items
• h) sum(): Sum of all items.
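These methods are most often applied per group. The sketch below groups a small table by one column and then aggregates; the data is a shortened, hypothetical version of the test score table.

```python
import pandas as pd

df = pd.DataFrame({
    'last_name': ['dhotre', 'dhotre', 'Auti', 'jadhav'],
    'TestScore1': [4, 4, 31, 2],
    'TestScore2': [25, 25, 57, 62],
})

grouped = df.groupby('last_name')                 # split the rows by last_name

counts = grouped['TestScore1'].count()            # items per group
totals = grouped['TestScore2'].sum()              # sum per group
summary = grouped['TestScore1'].agg(['min', 'max', 'mean'])   # several at once

print(summary)
```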
Pivot Tables
• A pivot table is an operation similar to grouping that is commonly seen in spreadsheets and other programs that operate on tabular data.
• The pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
• A pivot table is a table of statistics that helps summarize the data of a larger table by "pivoting" that data.
• Pandas gives access to creating pivot tables in Python using the .pivot_table() function.
pivot method in Pandas
The pivot method takes three main parameters:
• 1. index: Which column should be used to identify and order the rows vertically.
• 2. columns: Which column should be used to create the new columns in the reshaped DataFrame. Each unique value in the column stated here will create a column in the new DataFrame.
• 3. values: Which column(s) should be used to fill the values in the cells of the DataFrame.
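The three parameters can be seen at work on a small table. The sales data and its column names are invented for this sketch, and it uses pivot_table with a sum so that repeated region and quarter pairs are combined.

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1'],
    'amount': [10, 20, 30, 40, 5],
})

# index -> row labels, columns -> new column labels, values -> cell contents
table = pd.pivot_table(sales, index='region', columns='quarter',
                       values='amount', aggfunc='sum')

print(table)
```

The plain pivot method accepts the same three parameters but does no aggregation, so it raises an error when an index and column pair occurs more than once, as North and Q1 does here.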
