*By the end of this course, you will be able to:
• Explain the need for Python libraries
• Use Numpy to work with arrays
• Use Pandas to load, explore, manipulate, analyze and process data
• Derive statistical outcomes of a real dataset
• Visualize data
• Create a machine learning model for predictive analysis
*About Python
Python is an open-source, general-purpose programming language. It supports both
structured and object-oriented styles of programming.
It can be used to develop a wide range of applications, including web
applications, data analytics and machine learning applications.
Python provides various data types and data structures for storing and processing
data. For handling single values, there are
data types like int, float, str, and bool. For handling data in groups, Python
provides data structures like list, tuple, dictionary,
set, etc.
Python has a wide range of libraries and built-in functions which aid in rapid
development of applications. Python libraries are
collections of pre-written code that perform specific tasks. This eliminates the
need to rewrite the code from scratch.
*Why Python Libraries?
Let us consider the following scenario:
John is a software developer. His project requires developing an application that
connects to various database servers like MySQL,
PostgreSQL, MongoDB etc. To implement this requirement from scratch, John needs to
invest his time and effort to understand the
underlying architectures of the respective databases. Instead, John can choose to
use pre-defined libraries to perform the database
operations, which abstract away the complexities involved.
Use of libraries will help John in the following ways:
Faster application development – Libraries promote code reusability and help the
developers save time and focus on building the
functional logic.
Enhance code efficiency – Use of pre-tested libraries enhances the quality and
stability of the application.
Achieve code modularization – Libraries can be coupled or decoupled based on
requirement.
Over the last two decades, Python has emerged as a first-choice tool for tasks that
involve scientific computing, including the
analysis and visualization of large datasets. Python has gained popularity,
particularly in the field of data science, because of its
large and active ecosystem of third-party libraries.
A few of the popular libraries in data science are NumPy, Pandas, Matplotlib and
Scikit-Learn.
Let us proceed to understand these libraries in detail.
*Business scenario
XYZ Custom Cars is an automobile restoration company based in New York, USA. This
company is renowned for restoring vintage and
muscle cars. Their team takes great pride in each of their projects, no matter how
big or small. They offer paint jobs, frame
build-ups, engine restoration, body work etc. They are also involved in buying and
reselling cars.
Every car owner who comes to XYZ Custom Cars gets a document drafted that
contains important information about the car.
This information is related to the car’s performance and manufacturing. The company
maintains this database with proper diligence.
The data consists of features like acceleration, horsepower, region, model year,
etc. The board of directors thinks that this
data can help generate insights about their projects. These
insights would help restore similar cars with similar
standards and procedures, and would also help predict better reselling
prices in the future. In short, these insights would help
generate greater revenue for the company through cost cutting and a data-driven
approach to its processes.
For example, the company may be interested in setting up different workstations
that cater to specific categories of cars as follows –
Category | Description | Features coming in play
Fuel efficient | Cars designed with low power and high fuel efficiency | High MPG, Low horsepower, Low weight
Muscle Cars | Intermediate sized cars designed for high performance | High displacement, High horsepower, Moderate weight
SUV | Big sized cars designed for high performance, long distance trips and family comfort | High horsepower, High weight
Racecar | Cars specifically designed for race tracks | High horsepower, Low weight, High acceleration
This would allow the company to place specialized mechanics and equipment in
specific workstations, thereby creating a hassle-free and efficient work
atmosphere. Another useful application would be predicting the fuel efficiency of
cars after restoration, based on the available data, to minimize field testing.
The defined business scenarios will be used to understand the following Python
libraries:
1. Numpy
• To get the mathematical and structural understanding of such data
• To build a base for Pandas, the data manipulation library
• To get familiar with terms like arrays, axis, vectorization etc.
2. Pandas
• To read the data
• To explore the data
• To do operations on the data
• To manipulate the data
• To draw simple visualizations for self-consumption
• To generate insights
3. Matplotlib
• To visualize the data
• To generate deep insights
• To present the data to the leadership
• To get a visual understanding of various features
4. Scikit-learn
• To get data ready for model building
• To build predictive models
• To evaluate the models
• To infer the model results and present to the leadership
*Revisiting Python List
A Python List can be used to store a group of elements together in a sequence. It
can contain heterogeneous elements.
Following are some examples of List:
item_list = ['Bread', 'Milk', 'Eggs', 'Butter', 'Cocoa']
student_marks = [78, 47, 96, 55, 34]
hetero_list = [1, 2, 3.0, 'text', True, 3+2j]
To perform operations on the List elements, one needs to iterate through the List.
For example, if five extra marks need to be awarded
to all the entries in the student marks list, the following approach can be used:
student_marks = [78, 47, 96, 55, 34]
for i in range(len(student_marks)):
    student_marks[i]+=5
print(student_marks)
It can be observed that a loop is required. The code is lengthy and becomes
computationally expensive as the size of the List increases.
Data Science is a field that utilizes scientific methods and algorithms to generate
insights from data. These insights can
then be made actionable and applied across a broad range of application domains.
Data Science deals with large datasets. Operating
on such data with lists and loops is time consuming and computationally expensive.
*Comparing Python List and Numpy performance.
Let us understand why Python Lists can become a bottleneck if they are used for
large data.
Consider that about 1 million numbers from two different lists must be added element by element.
%%time
#Used to calculate total operation time
list1 = list(range(1,1000000))
list2 = list(range(2,1000001))
list3 = []
for i in range(len(list1)):
    list3.append(list1[i]+list2[i])
Note: Time taken will be different in different systems.
Let us see how Numpy can solve the same problem in much less time.
Note: Ignore the syntax for now and focus only on the output.
%%time
#Used to calculate total operation time
#Importing Numpy
import numpy as np
#Creating a numpy array of 1 million numbers
a = np.arange(1,1000000)
b = np.arange(2,1000001)
c = a+b
It can be observed that the same operation completed in 12 milliseconds with Numpy,
compared to 395 milliseconds with the
Python List. As the data size and the complexity of operations increase, the
performance gap between Numpy and
Python Lists widens.
In Data Science, there are millions of records to be dealt with. The performance
limitations of Python Lists can
be overcome by using advanced Python libraries like Numpy.
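The `%%time` magic used above only works inside Jupyter/IPython. As a rough sketch, the same comparison can be run as a plain script with `time.perf_counter`. The variable names `n`, `list_seconds` and `numpy_seconds` are illustrative, and the measured times will differ from system to system.

```python
import time
import numpy as np

n = 1_000_000  # number of element pairs to add (illustrative size)

# Element-wise addition with Python lists and an explicit loop
list1 = list(range(1, n + 1))
list2 = list(range(2, n + 2))
start = time.perf_counter()
list3 = []
for i in range(n):
    list3.append(list1[i] + list2[i])
list_seconds = time.perf_counter() - start

# The same addition with Numpy arrays (no Python-level loop)
a = np.arange(1, n + 1)
b = np.arange(2, n + 2)
start = time.perf_counter()
c = a + b
numpy_seconds = time.perf_counter() - start

print(f"List loop: {list_seconds:.4f} s")
print(f"Numpy:     {numpy_seconds:.4f} s")
```

Both approaches produce the same values; only the time taken differs.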
*Introduction to Numpy
Numerical Python (Numpy) is a Python library used for numeric and scientific
operations. It serves as a building block for
many libraries available in Python.
Data structures in Numpy
The main data structure of NumPy is the ndarray or n-dimensional array.
The ndarray is a multidimensional container of elements of the same type. It can
easily deal with matrix and
vector operations.
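The "same type" rule can be seen with a small sketch. When values of different types are passed, Numpy converts them to one common type. The array names below are illustrative.

```python
import numpy as np

# Integers mixed with a float: everything is stored as float
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)        # float64

# Numbers mixed with a string: everything is stored as a string
mixed_text = np.array([1, 2, 'three'])
print(mixed_text.dtype.kind)      # U (Unicode string)

# ndim gives the number of dimensions of an ndarray
matrix = np.array([[1, 2], [3, 4]])
print(matrix.ndim)                # 2
```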
*Getting Started
*Importing Numpy
The Numpy library needs to be imported into the environment before it can be used, as
shown below. 'np' is the standard alias used for Numpy.
import numpy as np
Numpy object creation
A Numpy array can be created using the array() function. The array() function in Numpy
returns an array object of type ndarray.
Syntax: np.array(object, dtype)
object – A Python object (for example, a list)
dtype – desired data type of the array elements (for example, integer); optional
Example: Consider the following marks scored by students:
Student ID | Marks
1 | 78
2 | 92
3 | 36
4 | 64
5 | 89
These marks can be represented in a one-dimensional Numpy array as shown below:
import numpy as np
student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr
This is one way to create a simple one-dimensional array.
*Numpy object creation demo- 1D array
The following dataset has been provided by XYZ Custom Cars. This data comes in a
csv file format.
There are various columns in this dataset. Each column contains multiple values.
These values can be represented as lists of items.
Since each column contains homogenous values, Numpy arrays can be used to represent
them.
Let us see how to represent the car ‘horsepower’ values in a Numpy array.
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
horsepower_arr
*Numpy object creation demo- 2D array
*How can multiple columns be represented together?
This can be achieved by creating the Numpy array from List of Lists.
Let us see how to represent the car 'mpg', ‘horsepower’, and
'acceleration' values in a Numpy array.
#creating a list of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from car_attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr
The example demonstrates that the Numpy array created using the List of Lists
results in a two-dimensional array.
*Shape of ndarray
The numpy.ndarray.shape attribute returns a tuple that describes the shape of the array.
For example:
a one-dimensional array having 10 elements will have a shape of (10,)
a two-dimensional array having 10 elements distributed evenly in two rows will have
a shape of (2,5)
Let us find out the shape of the car attributes array.
#creating a list of lists of mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr.shape
The output is (3, 5). Here, 3 represents the number of rows and 5 represents the number
of elements in each row.
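The two shapes mentioned in the examples above, (10,) and (2,5), can be reproduced with a short sketch. It uses `np.arange` and `reshape` only to build arrays of the right size; the names `one_d` and `two_d` are illustrative.

```python
import numpy as np

one_d = np.arange(10)           # ten values: 0 to 9
print(one_d.shape)              # (10,)

two_d = one_d.reshape(2, 5)     # the same ten values arranged as 2 rows x 5 columns
print(two_d.shape)              # (2, 5)
print(two_d.size)               # 10 (total number of elements)
```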
*'dtype' of ndarray
'dtype' refers to the data type of the data contained in the array. Numpy supports
multiple data types like integer, float, string, boolean etc.
Below is an example of using the dtype property to identify the data type of elements
in an array.
#creating a list of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#creating a numpy array from attributes list
car_attributes_arr = np.array(car_attributes)
car_attributes_arr.dtype
Changing dtype
Numpy dtype can be changed as per requirements. For example, an array of integers
can be converted to float.
Below is an example of using dtype as an argument of np.array() function to convert
the data type of elements from integer to float.
#creating a list of lists of 5 mpg, horsepower and acceleration values
car_attributes = [[18, 15, 18, 16, 17],[130, 165, 150, 150, 140],[307, 350, 318,
304, 302]]
#converting dtype
car_attributes_arr = np.array(car_attributes, dtype = 'float')
print(car_attributes_arr)
print(car_attributes_arr.dtype)
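An existing array can also be converted with the `astype()` method, which returns a new array and leaves the original unchanged. The following is a minimal sketch using a shortened version of the car attributes data.

```python
import numpy as np

car_attributes_arr = np.array([[18, 15, 18], [130, 165, 150]])

# astype() returns a converted copy; the original keeps its integer dtype
as_float = car_attributes_arr.astype('float64')
print(as_float.dtype)                    # float64
print(car_attributes_arr.dtype.kind)     # i (integer)

# Converting float to integer drops the fractional part
back_to_int = np.array([18.9, 15.2]).astype('int64')
print(back_to_int)                       # [18 15]
```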
*Accessing Numpy arrays
The elements in the ndarray are accessed using an index within square brackets
[ ]. In Numpy, both positive and negative indices
can be used to access elements in the ndarray. Positive indices start from the
beginning of the array, while negative indices start
from the end of the array. Array indexing starts from 0 in positive indexing and
from -1 in negative indexing.
Below are some examples of accessing data from Numpy arrays:
1. Accessing an element from a 1D array
#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing the second car from the array
cars[1]
2. Accessing elements from a 2D array
#Creating a 2D array consisting of car names and horsepower
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
car_hp_arr = np.array([car_names, horsepower])
car_hp_arr
#Accessing car names
car_hp_arr[0]
#Accessing horsepower
car_hp_arr[1]
#Accessing second car - 0 represents 1st row and 1 represents 2nd element of the row
car_hp_arr[0,1]
#Accessing name of last car using negative indexing
car_hp_arr[0,-1]
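One point worth noting about an array that mixes names and numbers: since an ndarray holds a single data type, the horsepower values are stored as strings. The sketch below, on a shortened two-car version of the data, shows this and converts the row back to integers with `astype()`. The name `hp_numeric` is illustrative.

```python
import numpy as np

car_names = ['chevrolet chevelle malibu', 'buick skylark 320']
horsepower = [130, 165]
car_hp_arr = np.array([car_names, horsepower])

print(car_hp_arr.dtype.kind)    # U: every element is a string
print(car_hp_arr[1])            # ['130' '165']

# Convert the horsepower row back to integers before doing arithmetic
hp_numeric = car_hp_arr[1].astype(int)
print(hp_numeric.mean())        # 147.5
```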
*Slicing
Slicing is a way to access and obtain subsets of ndarray in Numpy.
Syntax: array_name[start : end] – index starts at ‘start’ and ends at ‘end - 1’.
1. Slicing from a 1D array
#creating an array of cars
cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino'])
#accessing a subset of cars from the array
cars[1:4]
2. Slicing from a 2D array
#Creating a 2D array consisting of car names, horsepower and acceleration
car_names = ['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino']
horsepower = [130, 165, 150, 150, 140]
acceleration = [18, 15, 18, 16, 17]
car_hp_acc_arr = np.array([car_names, horsepower, acceleration])
car_hp_acc_arr
#Accessing name and horsepower
car_hp_acc_arr[0:2]
#Accessing name and horsepower of last two cars
car_hp_acc_arr[0:2, 3:5]
#Accessing name, horsepower and acceleration of first three cars
car_hp_acc_arr[0:3, 0:3]
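A few more slicing patterns are often useful: selecting a whole column with `:`, taking every second element with a step, and slicing with negative indices. The sketch below uses a purely numeric two-row array for illustration.

```python
import numpy as np

car_attributes_arr = np.array([[18, 15, 18, 16, 17],
                               [130, 165, 150, 150, 140]])

# All rows, first column: both attributes of the first car
first_car = car_attributes_arr[:, 0]
print(first_car)             # [ 18 130]

# Second row, every second value (start:end:step with step 2)
every_second = car_attributes_arr[1, ::2]
print(every_second)          # [130 150 140]

# All rows, last two columns using a negative start index
last_two = car_attributes_arr[:, -2:]
print(last_two.shape)        # (2, 2)
```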
*Mean and Median
Problem Statement:
The engineers at XYZ Custom Cars want to know about the mean and median of
horsepower.
Solution:
The mean and median can be calculated with the help of the following code:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#mean horsepower
print("Mean horsepower = ",np.mean(horsepower_arr))
#median horsepower
print("Median horsepower = ",np.median(horsepower_arr))
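For a 2D array, the same functions accept an `axis` argument. As a small sketch, `axis=1` computes one value per row, so each attribute gets its own mean and median. The names `row_means` and `row_medians` are illustrative.

```python
import numpy as np

# Row 0: mpg values, row 1: horsepower values
car_attributes_arr = np.array([[18, 15, 18, 16, 17],
                               [130, 165, 150, 150, 140]])

row_means = np.mean(car_attributes_arr, axis=1)      # one mean per row
row_medians = np.median(car_attributes_arr, axis=1)  # one median per row

print(row_means)      # mean mpg is 16.8, mean horsepower is 147.0
print(row_medians)    # median mpg is 17.0, median horsepower is 150.0
```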
*Min and max
Problem Statement:
The engineers at XYZ Custom Cars want to know about the minimum and maximum
horsepower.
Solution:
The min and max can be calculated with the help of the following code:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
print("Minimum horsepower: ", np.min(horsepower_arr))
print("Maximum horsepower: ", np.max(horsepower_arr))
Finding the index of minimum and maximum values:
'argmin()' and 'argmax()' return the index of the minimum and maximum values in an
array respectively.
print("Index of Minimum horsepower: ", np.argmin(horsepower_arr))
print("Index of Maximum horsepower: ", np.argmax(horsepower_arr))
*Querying/searching in an array
Problem Statement:
The engineers at XYZ Custom Cars want to know the horsepower values that are
greater than or equal to 150.
Solution:
The 'where' function can be used for this requirement. Given a condition, the 'where'
function returns the indexes of the array where the condition is satisfied. Using
these indexes, the respective values from the array can be obtained.
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
x = np.where(horsepower_arr >= 150)
print(x) # gives the indices
# With the indices , we can find those values
horsepower_arr[x]
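`np.where` also has a three-argument form, `np.where(condition, value_if_true, value_if_false)`, which builds a new array instead of returning indexes. The sketch below labels each car; the labels 'high' and 'low' are made up for illustration.

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# One-argument form returns a tuple of index arrays; [0] takes the first (only) axis
indices = np.where(horsepower_arr >= 150)[0]
print(indices)    # [1 2 3]

# Three-argument form picks a value per element based on the condition
labels = np.where(horsepower_arr >= 150, 'high', 'low')
print(labels)     # ['low' 'high' 'high' 'high' 'low']
```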
*Filter data
Problem Statement:
The engineers at XYZ Custom Cars want to create a separate array consisting of
filtered values of horsepower greater than 135.
Solution:
Getting some elements out of an existing array based on certain conditions and
creating a new array out of them is called filtering.
The following code can be used to accomplish this:
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#creating filter array
filter_arr = horsepower_arr > 135
newarr = horsepower_arr[filter_arr]
print(filter_arr)
print(newarr)
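Filters can also combine several conditions. In this sketch, `&` means "and", and each condition is wrapped in parentheses because `&` binds more tightly than the comparison operators. The bounds 135 and 160 are arbitrary example values.

```python
import numpy as np

horsepower_arr = np.array([130, 165, 150, 150, 140])

# Keep values strictly between 135 and 160
mid_range = horsepower_arr[(horsepower_arr > 135) & (horsepower_arr < 160)]
print(mid_range)      # [150 150 140]

# Count how many values satisfy a condition without building the filtered array
count_above_135 = np.count_nonzero(horsepower_arr > 135)
print(count_above_135)    # 4
```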
*Sorting an array
Problem Statement:
The engineers at XYZ Custom Cars want the horsepower in sorted order.
Solution:
The Numpy array can be sorted either by passing it to the function np.sort(array) or
by calling the method array.sort().
So, what is the difference between the two, given that both provide the same
functionality?
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using sort(array)
print('original array: ', horsepower_arr)
print('Sorted array: ', np.sort(horsepower_arr))
print('original array after sorting: ', horsepower_arr)
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]
#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)
#using array.sort()
print('original array: ', horsepower_arr)
horsepower_arr.sort()
print('original array after sorting: ', horsepower_arr)
The difference is that array.sort() sorts the original array in place, whereas
np.sort(array) returns a sorted copy and leaves the original array unchanged.
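Two related patterns, shown here as a sketch: reversing a sorted copy to get descending order, and using `np.argsort` to obtain the ordering as indexes so that a second array (the car names) can be arranged in the same order. `kind='stable'` is chosen so that equal values keep their original relative order.

```python
import numpy as np

cars = np.array(['chevrolet chevelle malibu', 'buick skylark 320',
                 'plymouth satellite', 'amc rebel sst', 'ford torino'])
horsepower_arr = np.array([130, 165, 150, 150, 140])

# Descending order: sort ascending, then reverse with a step of -1
descending = np.sort(horsepower_arr)[::-1]
print(descending)             # [165 150 150 140 130]

# Indexes that would sort the array in ascending order
order = np.argsort(horsepower_arr, kind='stable')
print(order)                  # [0 4 2 3 1]

# Car names arranged from lowest to highest horsepower
print(cars[order])
```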
*Vectorized operations
Mathematical operations can be performed on Numpy arrays. Numpy makes use of
optimized, pre-compiled code to perform mathematical operations on each array
element. This eliminates the need for loops, thereby enhancing
performance. This process is called vectorization.
Numpy provides various mathematical functions such as sum(), add(), subtract(), log(),
sin() etc. which use vectorization.
Consider an example of marks scored by a student:
Subject | Marks
English | 78
Mathematics | 92
Physics | 36
Chemistry | 64
Biology | 89
Problem Statement:
Calculate the sum of all the marks.
Solution:
The sum() function can be used, which internally uses vectorization.
student_marks_arr = np.array([78, 92, 36, 64, 89])
print(np.sum(student_marks_arr))
Problem Statement:
Award extra marks in subjects as follows:
English: +2
Mathematics: +2
Physics: +5
Chemistry: +10
Biology: +2
Solution:
Below is the solution to the problem:
additional_marks = [2, 2, 5, 10, 2]
student_marks_arr += additional_marks
student_marks_arr
Also, the same operation can be performed as shown below:
student_marks_arr = np.array([78, 92, 36, 64, 89])
student_marks_arr = np.add(student_marks_arr, additional_marks)
student_marks_arr
Both the above methods use vectorization internally, eliminating the need for loops.
Other arithmetic operations can also be performed in a similar manner.
In addition to arithmetic operations, several other mathematical operations
like exponents, logarithms and trigonometric functions are also available in Numpy.
This makes Numpy a very useful tool for
scientific computing.
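A few of these functions are sketched below on small arrays. Each call works on every element at once, without an explicit loop. The names `reduced`, `roots`, `logs` and `fractions` are illustrative.

```python
import numpy as np

student_marks_arr = np.array([78, 92, 36, 64, 89])

# Subtract the same value from every element
reduced = np.subtract(student_marks_arr, 5)
print(reduced)       # [73 87 31 59 84]

# Square root and natural logarithm, applied element-wise
roots = np.sqrt(np.array([4, 9, 16]))       # 2.0, 3.0, 4.0
logs = np.log(np.array([1.0, np.e]))        # 0.0, 1.0

# Arithmetic operators are vectorized too: marks as a fraction of 100
fractions = student_marks_arr / 100
print(roots, logs, fractions)
```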
*Broadcasting
"Broadcasting" refers to the term on how Numpy handles arrays with different shapes
during arithmetic operations. The smaller array is conceptually stretched or copied across
the larger array so that the two shapes match.
For example, consider the following arithmetic operation across 2 arrays:
import numpy as np
# Array 1
array1=np.array([5, 10, 15])
# Array 2
array2=np.array([5])
array3= array1 * array2
array3
In this example, the array2 is being stretched or copied to match array1 during the
arithmetic operation resulting in new array array3 with the same shape as array1.
The following diagram explains broadcasting:
In the first operation, the shape of first array is 1x3 and the shape of second
array is 1x1. Hence, according to broadcasting
rules, the second array gets stretched to match the shape of first array and the
shape of the resulting array is 1x3.
In the second operation, the shape of first array is 3x3 and the shape of second
array is 1x3. Hence, according to broadcasting
rules, the second array gets stretched to match the shape of first array and the
shape of the resulting array is 3x3.
In the third operation, the shape of first array is 3x1 and the shape of second
array is 1x3. Hence, according to broadcasting
rules, both first and second arrays get stretched and the shape of the resulting
array is 3x3.
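The three shape combinations described above can be checked with a short sketch. The values in the arrays are arbitrary; only the shapes (1x3 with 1x1, 3x3 with 1x3, 3x1 with 1x3) follow the description.

```python
import numpy as np

row = np.array([[0, 1, 2]])            # shape (1, 3)
single = np.array([[5]])               # shape (1, 1)
grid = np.ones((3, 3), dtype=int)      # shape (3, 3)
column = np.array([[0], [1], [2]])     # shape (3, 1)

case1 = row + single     # (1, 3) with (1, 1): second array is stretched
case2 = grid + row       # (3, 3) with (1, 3): second array is stretched down the rows
case3 = column + row     # (3, 1) with (1, 3): both arrays are stretched

print(case1.shape)       # (1, 3)
print(case2.shape)       # (3, 3)
print(case3.shape)       # (3, 3)
print(case3)
```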
*Broadcasting - demo
Consider the following table consisting of marks scored by four students in two
different subjects:
Students | Chemistry | Physics
Subodh | 67 | 45
Ram | 90 | 92
Abdul | 66 | 72
John | 32 | 40
The teacher of these students wants to represent their marks in an array. To do so,
the marks can be stored using a 4x2 array as follows:
#Marks of 4 students in 2 subjects
students_marks = np.array([[67, 45],[90, 92],[66, 72],[32, 40]])
students_marks
Problem Statement:
Now the teacher wants to award extra five marks in Chemistry and extra ten marks in
Physics.
Solution:
#Marks of 4 students in 2 subjects
students_marks = np.array([[67, 45],[90, 92],[66, 72],[32, 40]])
#Broadcasting
students_marks += [5,10]
students_marks
The student marks array is a 2D array of shape 4x2. The marks to be added are in
the form of a 1D array with 2 elements, which is treated as shape 1x2. According to
the broadcasting rules, the marks to
be added get stretched to match the shape of the student marks array, and the shape of
the resulting array is 4x2.
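Broadcasting only works when the shapes are compatible. As a sketch, suppose each student (rather than each subject) gets a different bonus; the bonus values here are hypothetical. A 1D array of 4 elements cannot be broadcast against a 4x2 array directly, but it can after being reshaped to 4x1.

```python
import numpy as np

students_marks = np.array([[67, 45], [90, 92], [66, 72], [32, 40]])
bonus_per_student = np.array([1, 2, 3, 4])   # hypothetical bonus, one per student

# Shape (4,) against (4, 2) is not compatible: the last dimensions are 4 and 2
try:
    students_marks + bonus_per_student
    broadcast_failed = False
except ValueError as error:
    broadcast_failed = True
    print("Shapes do not broadcast:", error)

# Reshaping to (4, 1) lets the bonus stretch across both subject columns
adjusted = students_marks + bonus_per_student.reshape(4, 1)
print(adjusted)
```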
*Image as a Numpy matrix
Images are stored as arrays of hundreds, thousands or even millions of picture
elements called pixels. Therefore, images can also
be treated as Numpy arrays, as they can be represented as a matrix of pixels.
Certain basic operations and manipulations can be carried out on images using Numpy
and the scikit-image package. Scikit-image is an
image processing package.
The package is imported as skimage.
Importing an image:
#Importing path and skimage i/o library
import os.path
from skimage.io import imread
from skimage import data_dir
#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))
#Using matplotlib.pyplot to visualize the image
import matplotlib.pyplot as plt
plt.imshow(img)
To view the image as a matrix, print it:
print(img)
Properties of image
Let us understand the type, dimensions and shape of the image.
print('Type of image: ', type(img))
print('Dimensions of image: ', img.ndim)
print('Shape of image:', img.shape)
*Indexing and selection
So far, you have become familiar with how to retrieve the basic attributes of the
image. Let us proceed to some
examples of indexing and selection on images.
Cutting the rocket out of the image
#Importing path and skimage i/o library
import os.path
from skimage.io import imread
from skimage import data_dir
#reading the astronaut image
img = imread(os.path.join(data_dir, 'astronaut.png'))
#Slicing out the rocket
img_slice = img.copy()
img_slice = img_slice[0:300,360:480]
plt.figure()
plt.imshow(img_slice)
In this case, the image has been sliced corresponding to the rocket from the
original image.
Assigning 0 to the values corresponding to the sliced region in the original image:
img[0:300,360:480,:] = 0
plt.imshow(img)
#In the sliced copy, setting pixels whose red channel value lies between 100 and 150 to 0
img_slice[np.greater_equal(img_slice[:,:,0],100) &
np.less_equal(img_slice[:,:,0],150)] = 0
plt.figure()
plt.imshow(img_slice)
The region where the rocket was present initially is now filled with
black, because 0 is assigned to the
values corresponding to the sliced region.
Replacing the ‘rocket’ back to its original place:
img[0:300,360:480,:] = img_slice
plt.imshow(img)
In the above picture, the black region from the previous step is replaced with the
sliced ‘rocket’.
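The same copy, blank and restore steps can be tried without scikit-image on a tiny synthetic image. This sketch assumes a 4x6 RGB image filled with white; the region coordinates are arbitrary and much smaller than the ones used for the astronaut picture.

```python
import numpy as np

# Synthetic 4x6 RGB image: every pixel starts as white (255, 255, 255)
img = np.full((4, 6, 3), 255, dtype=np.uint8)

# Copy a 2x2 region so that later changes to img do not affect it
patch = img[0:2, 2:4].copy()

# Blank the region in the original image (0 in all channels is black)
img[0:2, 2:4, :] = 0
blanked_sum = int(img[0:2, 2:4].sum())

# Put the saved region back in its original place
img[0:2, 2:4, :] = patch

print(img.shape)       # (4, 6, 3)
print(blanked_sum)     # 0
```

The `.copy()` call matters: a plain slice is a view of the same data, so blanking the original would also blank the slice.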
*Summary
So far, you have learnt the following key features of Numpy:
Numpy offers multi-dimensional arrays.
It provides array operations that are better than Python list operations in terms
of speed, efficiency and ease of writing code.
Numpy provides fast and convenient operations in the form of vectorization and
broadcasting.
Numpy offers additional capabilities to perform Linear Algebra and scientific
computing. These are out of the scope of this module.
Pandas is another Python library, built on top of Numpy, that is used for
analysis and manipulation of tabular data.
Let us proceed to learn more about it.
*Additional methods to create numpy arrays - Arange and Linspace
Arange
This method returns evenly spaced values between the given intervals excluding the
end limit. The values are generated based on
the step value and by default, the step value is 1.
#start and end limit
np.arange(0,10000)
#step value = 2
np.arange(0,10,2)
Linspace
This method returns the given number of evenly spaced values, between the given
intervals. By default, the number of values
between a given interval is 50.
#Generating values between 0 and 10
arr = np.linspace(0,10)
print(arr)
print('Length of arr: ',len(arr))
#Number of values = 3
print(np.linspace(0,10,3))
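A key difference between the two methods is the end limit: `arange` excludes it, while `linspace` includes it unless `endpoint=False` is passed. The following sketch compares them; `retstep=True` additionally returns the spacing that `linspace` used.

```python
import numpy as np

stepped = np.arange(0, 10, 2)                          # 0, 2, 4, 6, 8 (10 is excluded)
with_end = np.linspace(0, 10, 5)                       # 0, 2.5, 5, 7.5, 10 (10 is included)
without_end = np.linspace(0, 10, 5, endpoint=False)    # 0, 2, 4, 6, 8

# retstep=True returns the values together with the spacing between them
values, spacing = np.linspace(0, 10, 5, retstep=True)

print(stepped)
print(with_end)
print(without_end)
print(spacing)     # 2.5
```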
*Zeroes and Ones
Zeros
Returns an array of given shape filled with zeros.
#1D
np.zeros(5)
#2D
np.zeros([2,3])
Ones
Returns an array of given shape filled with ones.
#1D
np.ones(3)
#2D
np.ones([2,1])
*Full and Eye
Full:
Returns an array of given shape, filled with given value, irrespective of datatype.
#number=5, value=8
np.full(5,8)
#shape=[3,3], value='numpy'
np.full([3,3],'numpy')
Eye
Returns an identity matrix for the given shape.
#3x3 identity matrix
np.eye(3)
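These creation functions accept a few more options. As a sketch: `zeros` and `ones` produce floats by default but take a `dtype` argument, `full` infers its data type from the fill value, and `eye` can also build a non-square matrix when a second size is given.

```python
import numpy as np

# Integer zeros instead of the default float zeros
int_zeros = np.zeros((2, 3), dtype=int)
print(int_zeros.dtype.kind)     # i

# Data type is inferred from the fill value
float_full = np.full((2, 2), 7.5)
print(float_full.dtype)         # float64

# 2 rows and 3 columns, with ones on the main diagonal
wide_identity = np.eye(2, 3)
print(wide_identity.shape)      # (2, 3)
```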
*Random
NumPy has numerous ways to create random number arrays. Random numbers from a
uniform distribution over [0, 1) can be created by passing the
required length to the random.rand function.
#generating 5 random numbers from a uniform distribution
np.random.rand(5)
Note: Output might not be same as it is randomly generated.
Similarly, to generate random numbers from a standard Normal distribution, use the
random.randn function.
Random numbers of type 'integer' can also be generated using the random.randint
function. Shown below is an example of creating five random integers from 1
(inclusive) to 10 (exclusive).
#random integer values low=1, high=10, number of values=5
np.random.randint(1,10, size=5)
Similarly, two-dimensional arrays of random numbers can also be created by passing
the shape instead of number of values.
#random integer values high=100, shape = (3,5)
x = np.random.randint(100, size=(3, 5))
print(x)
print(type(x))
To generate a random number from a predefined set of values present in an array,
the choice() method can be used.
The choice() method takes an array as a parameter and randomly returns the values
based on the size.
#returns a single random value from the array
x = np.random.choice([9, 3, 7, 5])
print(x)
#returns 3*5 random values from the array
x = np.random.choice([9, 3, 7, 5], size=(3, 5)) # sampling to create an nd-array
print(x)
print(type(x))
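Because the output changes on every run, it is often useful to make random results repeatable. A minimal sketch: setting a seed with `np.random.seed` makes the same sequence come out again. The seed value 42 is arbitrary. Recent NumPy versions also offer a generator object through `np.random.default_rng`, shown at the end.

```python
import numpy as np

# Same seed, same sequence
np.random.seed(42)
first = np.random.randint(1, 10, size=5)
np.random.seed(42)
second = np.random.randint(1, 10, size=5)
print(first)
print(second)     # identical to first

# Generator-based alternative: integers() also excludes the high value by default
rng = np.random.default_rng(42)
from_generator = rng.integers(1, 10, size=5)
print(from_generator)
```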
*Introduction to pandas
Pandas is an open-source library for real-world data analysis in Python. It is
built on top of Numpy. Using Pandas, data can
be cleaned, transformed, manipulated, and analyzed. It is suited for different
kinds of data, including tabular data as in a SQL table
or an Excel spreadsheet, time series data, and observational or statistical datasets.
The steps involved in performing data analysis using Pandas are as follows:
*Steps in data analysis
Reading the data
The first step is to read the data. There are multiple formats in which data can be
obtained such as '.csv', '.json', '.xlsx' etc.
Below are the examples:
Example of an excel file:
Example of a json (javascript object notation) file:
Example of a csv (comma separated values) file:
Exploring the data
The next step is to explore the data. Exploring data helps to:
know the shape(number of rows and columns) of the data
understand the nature of the data by obtaining subsets of the data
identify missing values and treat them accordingly
get insights about the data using descriptive statistics
Performing operations on the data
Some of the operations supported by pandas for data manipulation are as follows:
Grouping operations
Sorting operations
Masking operations
Merging operations
Concatenating operations
Visualizing data
The next step is to visualize the data to get a clear picture of the various
relationships among the data. The following plots can help visualize the data:
Scatter plot
Box plot
Bar plot
Histogram and many more
Generating Insights
All the above steps help in generating insights about the data.
*Why Pandas?
Pandas is one of the most popular data wrangling and analysis tools because it:
has the capability to load large amounts of data easily
provides extremely streamlined forms of data representation
can handle heterogeneous data, has an extensive set of data manipulation features and
makes data flexible and customizable
*Getting started with Pandas
To get started with Pandas, Numpy and Pandas need to be imported as shown below:
#Importing libraries
#python library for numerical and scientific computing. pandas is built on top of numpy
import numpy as np
#importing pandas
import pandas as pd
In a nutshell, Pandas objects are advanced versions of NumPy structured arrays in
which the rows and columns are identified with
labels instead of simple integer indices.
The basic data structures of Pandas are Series and DataFrame.
*Pandas Series object
A Series is a one-dimensional labelled array. It supports different data types like
integer, float, string etc. Let us understand more about Series with the following
example.
Consider the scenario where marks of students are given as shown in the following
table:
Student ID | Marks
1 | 78
2 | 92
3 | 36
4 | 64
5 | 89
The pandas series object can be used to represent this data in a meaningful manner.
Series is created using the following syntax:
Syntax:
pd.Series(data, index, dtype)
data – It can be a list, a list of lists or even a dictionary.
index – The index can be explicitly defined for the different values if
required (optional parameter).
dtype – This represents the data type used in the series (optional
parameter).
series = pd.Series(data = [78, 92, 36, 64, 89])
series
As shown in the above output, the series object provides the values along with
their index attributes.
Series.values provides the values.
series.values
Series.index provides the index.
series.index
Accessing data in series:
Data can be accessed by the associated index using [ ].
series[1]
Slicing a series:
series[1:3]
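With the default index (0, 1, 2, ...), labels and positions coincide, so `series[1]` and position 1 refer to the same value. As a sketch, the `.iloc` accessor makes position-based access explicit, and a Series supports the same kind of aggregate functions as a Numpy array.

```python
import pandas as pd

series = pd.Series(data=[78, 92, 36, 64, 89])

# Position-based access: second element, then elements at positions 1 and 2
second_mark = series.iloc[1]
middle_marks = series.iloc[1:3].tolist()
print(second_mark)      # 92
print(middle_marks)     # [92, 36]

# Aggregations work directly on the Series
print(series.mean())    # 71.8
print(series.max())     # 92
```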
*Custom index in series