
Unit 2

This document covers data manipulation techniques in Python, focusing on tools such as Python Shell, Jupyter Notebook, NumPy, and Pandas. It details various operations including array manipulations, data wrangling, handling missing data, and creating DataFrames, along with advanced features like hierarchical indexing and pivot tables. The content is aimed at providing a comprehensive understanding of data manipulation for better decision-making and analysis.

Uploaded by

P SANTHIYA
Copyright
© All Rights Reserved
UNIT 2

DATA MANIPULATION
PYTHON SHELL - JUPYTER NOTEBOOK - IPYTHON MAGIC COMMANDS -
NUMPY ARRAYS-UNIVERSAL FUNCTIONS – AGGREGATIONS – COMPUTATION
ON ARRAYS – FANCY INDEXING – SORTING ARRAYS-STRUCTURED DATA –
DATA MANIPULATION WITH PANDAS – DATA INDEXING AND SELECTION –
HANDLING MISSING DATA – HIERARCHICAL INDEXING – COMBINING
DATASETS – AGGREGATION AND GROUPING –STRING OPERATIONS –
WORKING WITH TIME SERIES – HIGH PERFORMANCE
Python Shell
• Python is an interpreted language: it executes code line by line. Python
provides the Python Shell, which executes a single Python command and
displays the result.
• It is also known as a REPL (Read, Evaluate, Print, Loop): it reads a
command, evaluates it, prints the result, and loops back to read the next
command.
• It provides an easy and interactive way to write and test small pieces of
Python code.
• It can be a useful tool for debugging code. For example, if you are having an
issue with a larger program, you can use the Python shell to test specific
lines of code or to try different approaches to solving a problem.
• It is a good way to learn about the built-in functions and modules in
Python.
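The read, evaluate, print, loop cycle described above can be sketched in a few lines. This is only an illustration: the function name tiny_repl is hypothetical, it handles expressions only, and the real Python Shell also handles statements, errors, and multi-line input.

```python
def tiny_repl(commands):
    """Evaluate each expression string in turn and collect what a shell would print."""
    printed = []
    for command in commands:          # Read: take the next command
        value = eval(command)         # Evaluate: run the expression
        printed.append(repr(value))   # Print: the shell displays repr(value)
    return printed                    # Loop: ends when there is no more input


results = tiny_repl(["2 + 3", "len('python')", "max(4, 9, 1)"])
print(results)  # ['5', '6', '9']
```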
Jupyter Notebook

• The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations,
visualizations, and narrative text.
• Jupyter supports over 40 programming languages, and Python is one of them.
Python itself is required to install the Jupyter Notebook (originally
Python 3.3 or greater, or Python 2.7; current releases require Python 3).
• Jupyter Notebook can be installed in either of the two ways described below:
• 1. Install Jupyter Notebook with Anaconda (shown here for Windows):
   Step 1: Download Anaconda
   Step 2: Run the Anaconda Installer
   Step 3: Launch Jupyter Notebook
• 2. Install Jupyter using pip, the package manager used to install and
manage software packages/libraries written in Python:
   Step 1: Install the Python programming language
   Step 2: Install Jupyter Notebook (pip install notebook)
   Step 3: Start Jupyter Notebook (jupyter notebook)
Magic Commands

• Magic commands, also known as magic functions, are special commands in
IPython that provide extra functionality, such as explicitly modifying the
behavior of a code cell or simplifying common tasks like timing code
execution and profiling. Magic commands have the prefix '%' or '%%'
followed by the command name. There are two types of magic commands:
• Line magic commands: prefixed with '%', they apply to a single line.
• Cell magic commands: prefixed with '%%', they apply to the whole cell.
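As an illustration, the %timeit magic times a piece of code. Magic syntax is only valid inside IPython or Jupyter, so the sketch below shows the magics as comments and uses the standard-library timeit module as a rough plain-Python stand-in; the timed expression is an arbitrary example.

```python
# In IPython or Jupyter, a line magic is written on one line:
#     %timeit sum(range(1000))
# and a cell magic goes on the first line of a cell and covers the whole cell:
#     %%timeit
#     total = 0
#     for i in range(1000):
#         total += i

# Rough plain-Python stand-in for what %timeit measures:
import timeit

seconds = timeit.timeit("sum(range(1000))", number=1000)
print(f"1000 runs took {seconds:.4f} s")
```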
Data Wrangling

• Data wrangling is the process of cleaning, structuring and enriching raw
data into a desired format for better decision making in less time.
• Data wrangling is also called data munging.
• Data wrangling covers the following processes:
• 1. Getting data from the various sources into one place
• 2. Piecing the data together according to the determined setting
• 3. Cleaning the data of noise and of erroneous or missing elements.
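The three processes above can be sketched with pandas. All table names, column names, and values here are hypothetical, and dropping incomplete rows is only one of several possible cleaning choices.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data arriving from two sources, plus a lookup table
sales_a = pd.DataFrame({"id": [1, 2], "amount": [10.0, np.nan]})
sales_b = pd.DataFrame({"id": [3], "amount": [7.5]})
customers = pd.DataFrame({"id": [1, 2, 3], "city": ["Pune", "Chennai", "Delhi"]})

# 1. Get the data from the various sources into one place
combined = pd.concat([sales_a, sales_b], ignore_index=True)

# 2. Piece the data together: attach the city to each sale
joined = combined.merge(customers, on="id", how="left")

# 3. Clean the data: drop rows whose amount is missing
cleaned = joined.dropna(subset=["amount"])
print(cleaned)
```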
Numpy
• NumPy, short for Numerical Python, is the core library for scientific
computing in Python.
• Its central object, the NumPy array, is a powerful N-dimensional array; a
two-dimensional array, for example, is laid out in rows and columns.
• It has been designed specifically for performing basic and advanced
array operations.
• It primarily supports multi-dimensional arrays and vectors for complex
arithmetic operations.
• It provides a high-performance multidimensional array object and tools
for working with these arrays.
• It contains:
   a) A powerful N-dimensional array object
   b) Basic linear algebra functions
   c) Basic Fourier transforms
   d) Sophisticated random number capabilities
   e) Tools for integrating Fortran code
   f) Tools for integrating C/C++ code.
Basic array manipulations are as follows:
• 1. Attributes of arrays: They define the size, shape, memory
consumption, and data type of an array.
• 2. Indexing of arrays: Getting and setting the value of individual
array elements.
• 3. Slicing of arrays: Getting and setting smaller subarrays within a
larger array.
• 4. Reshaping of arrays: Changing the shape of a given array.
• 5. Joining and splitting of arrays: Combining multiple arrays into
one, and splitting one array into many.
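A minimal sketch of the five manipulations listed above, using a small hypothetical 3 x 4 array:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # values 0..11 in 3 rows and 4 columns

# 1. Attributes: dimensions, shape, element count, data type, bytes used
print(x.ndim, x.shape, x.size, x.dtype, x.nbytes)

# 2. Indexing: get and set a single element (row, column)
first_row_second_col = x[0, 1]
x[0, 0] = 100

# 3. Slicing: rows 0-1 and columns 1-2 as a smaller subarray
block = x[:2, 1:3]

# 4. Reshaping: same 12 elements arranged as 2 rows and 6 columns
reshaped = x.reshape(2, 6)

# 5. Joining and splitting
stacked = np.concatenate([x, x], axis=0)   # shape (6, 4)
left, right = np.split(x, 2, axis=1)       # two arrays of shape (3, 2)
```

Note that a slice such as block is a view on x, not a copy, so writing into it also changes x.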
Aggregations

• An aggregation function is one which takes multiple individual values
and returns a summary. In the majority of cases, this summary is a
single value.
• The most common aggregation functions are a simple average or
summation of values.
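For instance, NumPy arrays expose these aggregations as methods. The values below are arbitrary sample data.

```python
import numpy as np

values = np.array([4, 8, 15, 16, 23, 42])

total = values.sum()       # summation of all values
average = values.mean()    # simple average
smallest = values.min()
largest = values.max()
print(total, average, smallest, largest)   # 108 18.0 4 42

# On a 2-D array, axis selects the direction that is collapsed
table = np.array([[1, 2], [3, 4]])
column_sums = table.sum(axis=0)   # one summary value per column
```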
Computations on Arrays
Computation on NumPy arrays can be very fast, or it can be very slow.
Fast computation is possible with vectorized operations, which are
implemented through NumPy's universal functions (ufuncs).
• A universal function (ufunc) is a function that operates on ndarrays in
an element-by-element fashion, supporting array broadcasting, type
casting, and several other standard features.
A ufunc is a "vectorized" wrapper for a function that takes a fixed
number of specific inputs and produces a fixed number of specific
outputs.
• NumPy's ufuncs:
• Ufuncs are of two types: unary ufuncs and binary ufuncs.
• Unary ufuncs operate on a single input; binary ufuncs operate on two
inputs.
• Arithmetic operators: +, -, *, / and so on are wrappers around ufuncs
such as np.add and np.subtract
• Absolute value: np.absolute(x) or its alias np.abs(x); Python's built-in
abs(x) also works on arrays
• Trigonometric functions: np.sin, np.cos, np.tan and their inverses
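A short sketch of the ufuncs named above, with arbitrary sample arrays. The explicit Python loop is included only to show what the vectorized form replaces.

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 8.0])

# Slow style: an explicit Python loop over the elements
loop_result = np.array([1.0 / v for v in values])
# Fast style: one vectorized operation, applied element by element
vector_result = 1.0 / values

x = np.array([-2, -1, 0, 1, 2])
shifted = x + 5            # arithmetic operator ...
same_shift = np.add(x, 5)  # ... and the binary ufunc behind it
magnitudes = np.abs(x)     # unary ufunc: absolute value

theta = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(theta)      # unary ufunc: trigonometric function
```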
Fancy Indexing
• With NumPy fancy indexing, an array can be indexed with another NumPy array,
a Python list, or a sequence of integers, whose values select elements in the indexed
array.
• Example: We first create a NumPy array with 11 floating-point numbers and then
index it with another NumPy array and with a Python list, to extract the elements at
indices 0, 2 and 4 from the original array:
import numpy as np
A = np.linspace(0, 1, 11)
print(A)
print(A[np.array([0, 2, 4])])
# The same thing can be accomplished by indexing with a Python list
print(A[[0, 2, 4]])
Output:
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[0.  0.2 0.4]
[0.  0.2 0.4]
Structured data
A structured NumPy array is an array of structures.
Ordinary NumPy arrays are homogeneous, i.e. they can contain data of only
one type. In a structured array, each element is a structure whose named
fields can have different types.
import numpy as np
# Records to store: (name, marks, grade level)
records = [('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
# Creating the type of a structure
dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]
# Creating a structured NumPy array
structuredArr = np.array(records, dtype=dtype)
print(structuredArr.dtype)
Output:
[('Name', '<U10'), ('Marks', '<f8'), ('GradeLevel', '<i4')]
Creating structured arrays:

• 1. Dictionary method:
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ('U10', 'i4', 'f8')})
Output: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
Numerical types can be specified with Python types or NumPy dtypes instead:
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ((np.str_, 10), int, np.float32)})
Output: dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])
(Python's int maps to '<i8' on most 64-bit platforms.)
A compound type can also be specified as a list of tuples:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
Output: dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
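Once a compound dtype is defined, it can be used to build an array whose fields are filled and read by name. The sketch below uses the list-of-tuples form with 'U10' for the name field; the people and their values are hypothetical sample data.

```python
import numpy as np

person_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# An empty structured array with three records
people = np.zeros(3, dtype=person_dtype)

# Fill each field by name
people['name'] = ['Alice', 'Bob', 'Cathy']
people['age'] = [25, 45, 37]
people['weight'] = [55.0, 85.5, 68.0]

# Read a whole field, or filter records on one field and read another
ages = people['age']
older_names = people[people['age'] > 30]['name']
print(older_names)
```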


Data Manipulation with Pandas

• Pandas is a high-level data manipulation tool developed by Wes McKinney.
• It is built on the NumPy package and its key data structure is called
the DataFrame.
• DataFrames allow you to store and manipulate tabular data in rows of
observations and columns of variables.
• Pandas is the library for data manipulation and analysis.
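A minimal sketch of a DataFrame with rows of observations and columns of variables. The student names and marks are hypothetical.

```python
import pandas as pd

# Each row is one observation (a student); each column is one variable
students = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meena'],
    'marks': [82.5, 67.0, 91.0],
})

print(students.shape)                    # (rows, columns)
top = students[students['marks'] > 80]   # select rows by a condition
average = students['marks'].mean()       # summarize one column
```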
Create DataFrame with Duplicate Data
import pandas as pd
raw_data = {'first_name': ['rupali', 'rupali', 'rakshita', 'sangeeta', 'mahesh', 'vilas'],
            'last_name': ['dhotre', 'dhotre', 'dhotre', 'Auti', 'jadhav', 'bagad'],
            'RNo': [12, 12, 1111111, 36, 24, 73],
            'TestScore1': [4, 4, 4, 31, 2, 3],
            'TestScore2': [25, 25, 25, 57, 62, 70]}
# Build the DataFrame; the first two rows are exact duplicates
df = pd.DataFrame(raw_data)
print(df)
Drop duplicates
• Drop rows that are exact duplicates of an earlier row:
df.drop_duplicates()
• Drop duplicates in the first_name column, but take the last
observation in the duplicated set:
df.drop_duplicates(['first_name'], keep='last')
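The effect of the two calls can be checked on a small frame. This sketch uses a hypothetical three-row subset of the columns, not the full data set.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['rupali', 'rupali', 'rakshita'],
    'RNo': [12, 12, 36],
    'TestScore1': [4, 4, 31],
})

# duplicated() flags every row that repeats an earlier row
flags = df.duplicated()

# Default: keep the first row of each duplicated set
unique_rows = df.drop_duplicates()

# Compare on first_name only and keep the last row of each set
last_kept = df.drop_duplicates(['first_name'], keep='last')
print(unique_rows.index.tolist(), last_kept.index.tolist())   # [0, 2] [1, 2]
```

drop_duplicates() returns a new DataFrame, so df itself is unchanged unless the result is assigned back.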


Creating a Data Map and Data Plan
import pandas as pd
# Each hill maps to a tuple of (height in metres, latitude, longitude)
scottish_hills = {'Ben Nevis': (1345, 56.79685, -5.003508),
                  'Ben Macdui': (1309, 57.070453, -3.668262),
                  'Braeriach': (1296, 57.078628, -3.728024),
                  'Cairn Toul': (1291, 57.054611, -3.71042),
                  'Sgòr an Lochain Uaine': (1258, 57.057999, -3.725416)}
# Each dictionary key becomes a column; rows 0, 1, 2 hold the tuple values
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)
Manipulating and Creating Categorical Variables
import pandas as pd
# The allowed categories
cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')
# Values outside the allowed categories ('Yellow', 'Purple') become NaN
cycle_data = pd.Series(pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
                                      categories=cycle_colors, ordered=False))
# Flag the entries that did not match any category
find_entries = pd.isnull(cycle_data)
print(cycle_colors)
print()
print(cycle_data)
print()
print(find_entries[find_entries == True])
Dealing with Date and Time Values
• Python provides two ways of formatting a date and time value:
• 1. str(): turns a datetime value into a string without any custom
formatting.
• 2. strftime(): lets the user define how the datetime value should
appear after conversion.
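The difference between the two can be seen on a single value. The timestamp and the format string below are arbitrary examples.

```python
from datetime import datetime

moment = datetime(2020, 7, 21, 11, 31, 1)

# str(): fixed ISO-like layout, no control over the format
plain = str(moment)

# strftime(): the format codes decide the layout
custom = moment.strftime('%d.%m.%Y %H:%M:%S')

print(plain)    # 2020-07-21 11:31:01
print(custom)   # 21.07.2020 11:31:01
```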
Using pandas.to_datetime() with a date
import pandas as pd
# input in dd.mm.yyyy format
date = ['21.07.2020']
# dayfirst=True tells pandas that the day comes before the month
# output in yyyy-mm-dd format
print(pd.to_datetime(date, dayfirst=True))
Using pandas.to_datetime() with a date and time
import pandas as pd
# date (dd.mm.yyyy) and time (H:MM:SS AM/PM)
date = ['21.07.2020 11:31:01 AM']
# dayfirst=True tells pandas that the day comes before the month
# output in yyyy-mm-dd HH:MM:SS
print(pd.to_datetime(date, dayfirst=True))
Missing Data

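# load and summarize the dataset
from pandas import read_csv
# load the dataset ('csv file name' is a placeholder for the path to the CSV file)
dataset = read_csv('csv file name', header=None)
# summarize the dataset; the count row of describe() excludes missing values
print(dataset.describe())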
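Once the data is loaded, missing values can be counted, dropped, or filled. The sketch below builds a small hypothetical frame in memory instead of reading a CSV file, and filling with the column mean is only one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
dataset = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [4.0, 5.0, np.nan]})

# Count the missing values per column
missing_per_column = dataset.isnull().sum()

# Option 1: drop every row that contains a missing value
dropped = dataset.dropna()

# Option 2: fill each missing value with the mean of its column
filled = dataset.fillna(dataset.mean())
print(filled)
```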
Hierarchical Indexing

• Hierarchical indexing is a method of creating structured group
relationships in data.
• A MultiIndex, or hierarchical index, is useful when the data has more
than two dimensions, because it lets several index levels share one axis.
• As we already know, a Series is a one-dimensional labelled array and a
DataFrame is usually a two-dimensional table whose columns are Series.
To create a DataFrame with player ratings of a few players from the
FIFA 19 dataset:

In [1]: import pandas as pd

In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',
   ...:                      'MF', 'MF', 'MF', 'CF', 'CF', 'CF'],
   ...:         'Name': ['De Gea', 'Coutois', 'Allison', 'VanDijk', 'Ramos', 'Godin',
   ...:                  'Hazard', 'Kante', 'De Bruyne', 'Ronaldo', 'Messi', 'Neymar'],
   ...:         'Overall': ['91', '88', '89', '89', '91', '90',
   ...:                     '91', '90', '92', '94', '93', '92'],
   ...:         'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd',
   ...:                  '2nd', '3rd', '1st', '1st', '2nd', '3rd']}

In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name', 'Overall', 'Rank'])

In [4]: fifa19

In [5]: # set_index returns a new DataFrame, so assign the result back
   ...: fifa19 = fifa19.set_index(['Position', 'Rank'], drop=False)

In [6]: fifa19
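With a two-level index in place, rows can be selected by one level or by both. This sketch uses a four-player subset of the data above, with Overall stored as integers (an assumption; the original stores them as strings) and the index sorted before lookup.

```python
import pandas as pd

ratings = pd.DataFrame({
    'Position': ['GK', 'GK', 'DF', 'DF'],
    'Rank': ['1st', '2nd', '1st', '2nd'],
    'Name': ['De Gea', 'Allison', 'Ramos', 'Godin'],
    'Overall': [91, 89, 91, 90],
})

# Two index levels: Position (outer) and Rank (inner)
indexed = ratings.set_index(['Position', 'Rank']).sort_index()

# Select by the outer level only: all goalkeepers
goalkeepers = indexed.loc['GK']

# Select by both levels: the first-ranked defender's name
best_defender = indexed.loc[('DF', '1st'), 'Name']
print(best_defender)   # Ramos
```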
Aggregation and Grouping
• Pandas aggregation methods are as follows:
   a) count(): Total number of items
   b) first(), last(): First and last item
   c) mean(), median(): Mean and median
   d) min(), max(): Minimum and maximum
   e) std(), var(): Standard deviation and variance
   f) mad(): Mean absolute deviation (deprecated in pandas 1.5 and removed
      in pandas 2.0)
   g) prod(): Product of all items
   h) sum(): Sum of all items.
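These methods are typically applied after grouping. A minimal sketch with a hypothetical table of team scores:

```python
import pandas as pd

scores = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B', 'B'],
    'points': [10, 20, 5, 15, 25],
})

# Split the rows by team, then aggregate the points within each group
grouped = scores.groupby('team')['points']
summary = grouped.agg(['count', 'mean', 'min', 'max', 'sum'])
print(summary)

# A single aggregation can also be called directly
team_totals = grouped.sum()
```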
Pivot Tables
• A pivot table is an operation similar to grouping that is commonly seen
in spreadsheets and other programs that operate on tabular data.
• The pivot table takes simple column-wise data as input, and groups the
entries into a two-dimensional table that provides a multidimensional
summarization of the data.
• A pivot table is a table of statistics that helps summarize the data of
a larger table by "pivoting" that data.
• Pandas gives access to creating pivot tables in Python using
the .pivot_table() function.
pivot method in Pandas
The pivot method takes three arguments:
• 1. index: Which column should be used to identify and order the rows
vertically.
• 2. columns: Which column should be used to create the new columns in
the reshaped DataFrame. Each unique value in the column stated here will
create a column in the new DataFrame.
• 3. values: Which column(s) should be used to fill the values in the
cells of the DataFrame.
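The three arguments can be seen at work on a small table. The sales data and column names below are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb'],
    'product': ['pen', 'book', 'pen', 'book'],
    'units': [10, 4, 7, 6],
})

# pivot: reshape only; each (month, product) pair must occur once
wide = sales.pivot(index='month', columns='product', values='units')
print(wide)

# pivot_table: reshape and aggregate, here total units per month
totals = sales.pivot_table(index='month', values='units', aggfunc='sum')
```

pivot raises an error when an index/column pair is repeated, whereas pivot_table combines the repeated entries with the chosen aggregation function.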
