FDS - Unit 1
Science
21CSS202T
Unit I
Unit-1: INTRODUCTION TO DATA SCIENCE 9 hours
Benefits and uses of Data science, Facets of data, The data
science process
You can think of the relationship between big data and data
science as being like the relationship between crude oil and
an oil refinery.
Characteristics of Big Data
• Volume—How much data is there?
• Variety—How diverse are different types of data?
• Velocity—At what speed is new data generated?
Benefits and uses of data
science and big data
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
5. Data Science Makes Data Better
6. Data Scientists are Highly Prestigious
7. No More Boring Tasks
8. Data Science Makes Products Smarter
9. Data Science can Save Lives
Facets of data
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured Data
• Structured data is data that depends on a data model and
resides in a fixed field within a record.
Unstructured data
• Unstructured data is data that isn’t easy to fit into a data
model because the content is context-specific or varying.
Natural language
• Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of
specific data science techniques and linguistics.
import numpy as np
array=np.arange(20)
Output:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
1. Using the NumPy functions
a. Check the dimensions by using array.shape.
(20,)
b. Creating two-dimensional arrays in NumPy
array=np.arange(20).reshape(4,5)
Output:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
1. Using the NumPy functions
c. Using other NumPy functions
np.zeros((2,4))
np.ones((3,6))
np.full((2,2), 3)
Output:
array([[0., 0., 0., 0.],
[0., 0., 0., 0.]])
array([[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.]])
c. Using other NumPy functions
import numpy as np
a=np.zeros((2,4))
b=np.ones((3,6))
c=np.empty((2,3))
d=np.full((2,2), 3)
e= np.eye(3,3)
f=np.linspace(0, 10, num=4)
Output (printing a, b, c and d; np.empty() leaves c uninitialized, so its values vary):
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]]
[[1.14137702e-316 0.00000000e+000 6.91583610e-310]
 [6.91583609e-310 6.91583601e-310 6.91583601e-310]]
[[3 3]
 [3 3]]
import numpy as np
array=np.array([4,5,6])
print(array)
list=[4,5,6]
print(list)
Output:
[4 5 6]
[4, 5, 6]
Working with Ndarray
• np.ndarray(shape, dtype)
• Creates an array of the given shape whose values are uninitialized (arbitrary), not random.
• np.array(array_object)
• Creates an array from a list or tuple.
• np.zeros(shape)
• Creates an array of the given shape with all zeros.
• np.ones(shape)
• Creates an array of the given shape with all ones.
• np.full(shape, fill_value, dtype)
• Creates an array of the given shape with every element set to
fill_value.
• np.arange(range)
• Creates an array with the specified range.
NumPy Basic Array Operations
There is a vast range of built-in operations that we can
perform on these arrays.
1. ndim – Returns the number of dimensions of the array.
2. itemsize – Returns the size in bytes of each element.
3. dtype – Returns the data type of the elements.
4. reshape – Returns the same data with a new shape (a view where possible).
5. slicing – Extracts a particular set of elements.
6. linspace – Returns evenly spaced values over an interval.
7. max/min, sum, sqrt – Maximum, minimum, sum and element-wise square root.
8. ravel – Flattens the array into one dimension.
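The operations listed above can be tried on one small array. This is a minimal sketch; the array values and the variable names (arr, reshaped, flat) are chosen only for illustration.

```python
import numpy as np

# Illustrative 2-D array; the values are arbitrary.
arr = np.array([[1, 4, 9], [16, 25, 36]])

print(arr.ndim)       # number of dimensions
print(arr.itemsize)   # bytes used by each element (depends on the dtype)
print(arr.dtype)      # data type of the elements

reshaped = arr.reshape(3, 2)   # same data, new shape
print(reshaped)

print(arr[0, 1:3])    # slicing: columns 1 and 2 of row 0

print(np.linspace(0, 1, num=5))   # five evenly spaced values from 0 to 1

print(arr.max(), arr.min(), arr.sum())
print(np.sqrt(arr))   # element-wise square root

flat = arr.ravel()    # flattened 1-D result
print(flat)
```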
Arrays in NumPy
Checking Array Dimensions in
NumPy
import numpy as np
a = np.array(10)
b = np.array([1,1,1,1])
c = np.array([[1, 1, 1], [2,2,2]])
d = np.array([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]])
print(a.ndim) #0
print(b.ndim) #1
print(c.ndim) #2
print(d.ndim) #3
Higher Dimensional Arrays in NumPy
import numpy as np
arr = np.array([1, 1, 1, 1, 1], ndmin=10)
print(arr)
print('number of dimensions :', arr.ndim)
[[[[[[[[[[1 1 1 1 1]]]]]]]]]]
number of dimensions : 10
Indexing and Slicing in NumPy
Indexing
import numpy as np
arr = np.array([1, 2, 5, 6, 7])
print(arr[3]) # 6
Slicing
import numpy as np
arr = np.array([1, 2, 5, 6, 7])
print(arr[2:5]) # [5 6 7]
Indexing and Slicing
Indexing and Slicing in 2-D
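For a 2-D array, an index or slice is given for each axis, separated by a comma. The sketch below uses a hypothetical 2 x 5 array; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10]])

element = arr[1, 3]        # row 1, column 3
row_slice = arr[0, 1:4]    # row 0, columns 1 to 3
column = arr[:, 2]         # column 2 of every row
block = arr[0:2, 1:3]      # rows 0 to 1, columns 1 to 2

print(element)
print(row_slice)
print(column)
print(block)
```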
Copying Arrays
Copy from one array to another
• Method 1: Using np.empty_like() function
• Method 2: Using np.copy() function
• Method 3: Using Assignment Operator (this binds a second name to
the same array rather than copying the data)
Using np.empty_like( )
• This function returns a new array with the same shape and
type as a given array.
Syntax:
• numpy.empty_like(prototype, dtype = None, order = 'K', subok = True)
Using np.empty_like( )
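import numpy as np
ary = np.array([13, 99, 100, 34, 65, 11, 66, 81, 632, 44])
# allocate an array with the same shape and dtype, then copy the values in
copy = np.empty_like(ary)
copy[:] = ary
# modifying the original array leaves the copy unchanged
ary[1] = 13
print(copy[1])   # 99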
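The three methods can be compared side by side. This is a sketch under assumed data: the 2 x 3 contents of org_array and the names copy1, copy2 and alias are illustrative. It shows that the first two methods produce independent copies, while assignment does not.

```python
import numpy as np

org_array = np.array([[1, 2, 3], [4, 5, 6]])

# Method 1: allocate with empty_like(), then copy the values in.
copy1 = np.empty_like(org_array)
copy1[:] = org_array

# Method 2: np.copy() returns an independent copy.
copy2 = np.copy(org_array)

# Method 3: plain assignment only binds a second name to the same array.
alias = org_array

# modifying org_array
org_array[1, 2] = 13

print(copy1[1, 2])   # unchanged
print(copy2[1, 2])   # unchanged
print(alias[1, 2])   # follows org_array
```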
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    print(x)
Output:
[1 2 3]
[4 5 6]
Iterating Arrays
• To return the actual values, the scalars, we have to iterate
the arrays in each dimension.
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    for y in x:
        print(y)
1
2
3
4
5
6
Iterating Arrays
• Iterating 3-D Arrays
• In a 3-D array it will go through all the 2-D arrays.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)
Output:
[[1 2 3]
 [4 5 6]]
[[ 7  8  9]
 [10 11 12]]
Iterating Arrays
• Iterating 3-D Arrays
• To return the actual values, the scalars, we have to iterate the
arrays in each dimension.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    for y in x:
        for z in y:
            print(z)
Output: 1 2 3 4 5 6 7 8 9 10 11 12 (each value on its own line)
nditer()
• The function nditer() is a helper function that can be
used for very basic to very advanced iterations.
• Iterating on Each Scalar Element
• With basic for loops, iterating through each scalar of an array
requires n nested for loops, which can be difficult to write
for arrays with very high dimensionality.
import numpy as np
# example 3-D array holding the values 1 to 8, matching the output below
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in np.nditer(arr):
    print(x)
Output:
1
2
3
4
5
6
7
8
Identity array
• The identity array is a square array with ones on the main
diagonal.
• The identity() function returns the identity array.
Identity
• numpy.identity(n, dtype = None): Returns an identity
matrix, i.e. a square matrix with ones on the main diagonal.
• Parameters:
• n : [int] Dimension n x n of output array
• dtype : [optional, float by default] Data type of returned
array
Identity array
import numpy as geek
# 2x2 matrix with 1's on main diagonal
b = geek.identity(2, dtype = float)
print("Matrix b : \n", b)
a = geek.identity(4)
print("\nMatrix a : \n", a)
Output:
Matrix b :
[[ 1. 0.]
[ 0. 1.]]
Matrix a :
[[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 0. 0. 0. 1.]]
eye( )
• numpy.eye(R, C = None, k = 0, dtype = float)
: Returns a matrix with 1's on the k-th diagonal and 0's
elsewhere. (The NumPy documentation names the first two parameters N and M.)
• R : Number of rows
C : [optional] Number of columns; by default C = R
k : [int, optional, 0 by default]
Index of the diagonal; k > 0 selects a diagonal above the main
diagonal and k < 0 one below it.
dtype : [optional, float by default] Data type of returned
array.
Identity( ) vs eye( )
• np.identity returns a square matrix (a special case of a
2-D array): an identity matrix with 1's on the main
diagonal (i.e. k = 0) and 0's elsewhere. You
cannot change the diagonal k here.
• np.eye returns a 2-D array that fills the diagonal k,
which can be set, with 1's and the rest with 0's.
• So the choice depends on the requirement. If you
want an identity matrix, you can use identity right away,
or call np.eye and leave the other arguments at their defaults.
• But if you need a matrix of 1's and 0's with a particular
shape or want control over the diagonal, use the
eye method.
Identity( ) vs eye( )
import numpy as np
print(np.eye(3,5,1))
print(np.eye(8,4,0))
print(np.eye(8,4,-1))
print(np.eye(8,4,-2))
print(np.identity(4))
Shape of an Array
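import numpy as np
# example 2 x 4 array, consistent with the output below
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
• Output: (2, 4)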
Reshaping arrays
• Reshaping means changing the shape of an array.
• The shape of an array is the number of elements in each
dimension.
• By reshaping we can add or remove dimensions or change
number of elements in each dimension.
Reshape From 1-D to 2-D
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
Output:
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
Reshape From 1-D to 3-D
• The outermost dimension will have 2 arrays, each containing 3 arrays
of 2 elements.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
Output:
[[[ 1 2]
[ 3 4]
[ 5 6]]
[[ 7 8]
[ 9 10]
[11 12]]]
Can we Reshape into any
Shape?
• Yes, as long as both shapes contain the same total number of
elements.
• We can reshape an 8-element 1-D array into a 2-D array with 2 rows
of 4 elements, but we cannot reshape it into a 2-D array with 3 rows
of 3 elements, as that would require 3x3 = 9 elements.
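import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
newarr = arr.reshape(3, 3)   # raises ValueError: 8 elements cannot fill a 3x3 array
print(newarr)
• Flattening: reshape(-1) converts a multidimensional array into a 1-D array.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
newarr = arr.reshape(-1)
print(newarr)
• Output: [1 2 3 4 5 6]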
• NumPy has many functions for changing the shape of arrays,
such as flatten and ravel, and for rearranging elements, such as
rot90, flip, fliplr and flipud. These fall under the
intermediate to advanced part of NumPy.
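A short sketch of the functions named above, using an illustrative 2 x 3 array. It also shows one practical difference between flatten and ravel: flatten always returns a copy, while ravel returns a view of the original data when it can.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = arr.flatten()   # always a new 1-D copy
flat_view = arr.ravel()     # 1-D view of the same data when possible

flat_copy[0] = 100          # does not touch arr
flat_view[1] = 200          # for this contiguous array, also changes arr[0, 1]

print(arr)
print(np.fliplr(arr))       # reverse the column order
print(np.flipud(arr))       # reverse the row order
print(np.rot90(arr))        # rotate 90 degrees counter-clockwise
```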
Operations on NumPy
1.NumPy Arithmetic
Operations
import numpy as np
a = np.array([1, 2, 3, 4])
# add 5 to every element
print(a + 5)
# subtract 2 from each element
print(a - 2)
# multiply each element by 10
print(a * 10)
# divide each element by 2
print(a / 2)
Output:
[6 7 8 9]
[-1 0 1 2]
[10 20 30 40]
[0.5 1. 1.5 2. ]
2. NumPy Unary Operators
import numpy as np
arr = np.array([[1,5, 12], [2,32, 20], [3, 40, 13]])
print(arr.max(axis = 1))
print(arr.max(axis = 0))
print (arr.min(axis = 0))
print(arr.min(axis = 1))
print (arr.sum( ))
print ( arr.sum(axis=0))
print( arr.sum(axis=1))
[12 32 40]
[ 3 40 20]
[ 1 5 12]
[1 2 3]
128
[6 77 45]
[18 54 56]
3. NumPy Binary Operators
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[4, 3], [2, 1]])
print (a + b)
print (a*b)
[[5 5]
[5 5]]
[[4 6]
[6 4]]
NumPy Universal Functions
import numpy as np
a = np.array([0, np.pi/2, np.pi])
print ( np.sin(a))
a = np.array([0, 1, 2, 3])
print ( np.exp(a))
print ( np.sqrt(a))
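import numpy as np
# join two 1-D arrays with concatenate()
arr = np.concatenate((np.array([1, 2, 3]), np.array([4, 5, 6])))
print(arr)
Output:[1 2 3 4 5 6]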
Join two 2-D arrays along
rows (axis=1)
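import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr)
Output:[[1 2 5 6] [3 4 7 8]]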
stack( )
• Stacking is same as concatenation, the only difference is
that stacking is done along a new axis.
• We can concatenate two 1-D arrays along the second axis
which would result in putting them one over the other, ie.
stacking.
• We pass a sequence of arrays that we want to join to
the stack() method along with the axis. If axis is not
explicitly passed it is taken as 0.
stack( )
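import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2))   # axis not passed, so axis=0 is used
print(arr)
# hstack() joins the same two arrays end to end, which gives the output below
arr = np.hstack((arr1, arr2))
print(arr)
Output of the hstack() call: [1 2 3 4 5 6]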
vstack( ) - Stacking Along
Columns
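• NumPy provides a helper function, vstack(), to stack arrays
vertically: each input array becomes one row of the result.
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.vstack((arr1, arr2))
print(arr)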
Split
• array_split( )
• hsplit( )
• vsplit( )
Splitting
• We use array_split() for splitting arrays; we pass it the
array we want to split and the number of splits.
• Note: The return value is a list containing three arrays.
• If the array has fewer elements than required, it will adjust
from the end accordingly.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5, 6])]
split( )
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.split(arr, 3)
print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5, 6])]
array_split( )
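import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)
print(newarr)
newarr = np.split(arr, 4)   # raises ValueError: 6 elements cannot be divided into 4 equal parts
print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5]), array([6])]
Error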
Split Into Arrays
• The return value of the array_split() method is a list
containing each of the splits as an array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr[0])
print(newarr[1])
print(newarr[2])
Output:
[1 2]
[3 4]
[5 6]
Splitting 2-D Arrays
• Use the same syntax when splitting 2-D arrays.
• Use the array_split() method, pass in the array you want to
split and the number of splits you want to do.
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)
newarr = np.vsplit(arr, 3)
print(newarr)
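hsplit() is listed among the split functions but not illustrated. The sketch below contrasts it with vsplit() on a hypothetical 2 x 4 array; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

left, right = np.hsplit(arr, 2)   # split the columns into two halves
top, bottom = np.vsplit(arr, 2)   # split the rows into two halves

print(left)
print(right)
print(top)
print(bottom)
```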
Searching
• You can search an array for a certain value, and return the
indexes that get a match.
• To search an array, use the where() method.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
Output:(array([3, 5, 6]),)
Sorting
• Sorting means putting elements in an ordered sequence.
• Ordered sequence is any sequence that has an order
corresponding to elements, like numeric or alphabetical,
ascending or descending.
• NumPy provides the function np.sort(), which sorts a specified
array.
• Note: np.sort() returns a sorted copy of the array, leaving the
original array unchanged.
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
Output:[0 1 2 3]
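NumPy offers two ways to sort: the function np.sort() returns a sorted copy, while the array method sort() reorders the array in place. A minimal sketch with illustrative values:

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

sorted_copy = np.sort(arr)   # new sorted array; arr keeps its order
print(sorted_copy)
print(arr)

arr.sort()                   # sorts arr itself and returns None
print(arr)

words = np.array(["banana", "cherry", "apple"])
print(np.sort(words))        # strings are sorted alphabetically
```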
Search
ARRAY SHAPE
MANIPULATION
Reshaping Array
List and Array
Linear Algebra with Numpy
Linear Algebra Module
• The Linear Algebra module of NumPy offers various
methods to apply linear algebra on any NumPy array.
One can find:
• rank, determinant, trace, etc. of an array
• eigenvalues of matrices
• matrix and vector products (dot, inner, outer, etc.),
matrix exponentiation
• solutions of linear or tensor equations, and much more
Example
# Importing numpy as np
import numpy as np
A = np.array([[6, 1, 1],
[4, -2, 5],
[2, 8, 7]])
# Rank of a matrix
print("Rank of A:", np.linalg.matrix_rank(A))
# Trace of matrix A
print("\nTrace of A:", np.trace(A))
# Determinant of a matrix
print("\nDeterminant of A:", np.linalg.det(A))
# Inverse of matrix A
print("\nInverse of A:\n", np.linalg.inv(A))
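The remaining items from the list (eigenvalues, products, matrix powers and solving linear equations) can be sketched in the same way. The matrix and vectors below are illustrative; a diagonal matrix is used so that the results are easy to check by hand.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
b = np.array([4.0, 9.0])

x = np.linalg.solve(A, b)                # solves A @ x = b
eigenvalues = np.linalg.eigvals(A)       # eigenvalues of A
A_cubed = np.linalg.matrix_power(A, 3)   # A @ A @ A

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

print(x)
print(eigenvalues)
print(A_cubed)
print(np.dot(v, w))     # dot product
print(np.inner(v, w))   # inner product (same as dot for 1-D arrays)
print(np.outer(v, w))   # outer product, a 3 x 3 matrix
```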
Random
• A random number does NOT mean a different number every
time. Random means something that cannot be predicted
logically.
• Pseudorandom
• Computers work on programs, and programs are a definitive
set of instructions. So there must be some
algorithm to generate a random number as well.
• If there is a program to generate a random number, it can be
predicted, thus it is not truly random.
• Random numbers generated through a generation
algorithm are called pseudo-random.
• True random
• In order to generate a truly random number on our
computers we need to get the random data from some
outside source. This outside source is generally our
keystrokes, mouse movements, data on the network, etc.
• We do not need truly random numbers unless it is related to
security (e.g. encryption keys) or the basis of the application is
randomness (e.g. digital roulette wheels).
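The predictability of pseudo-random numbers can be seen by fixing the generator's starting state with random.seed(). In this sketch the seed value 42 is arbitrary; resetting to the same seed reproduces the same sequence.

```python
from numpy import random

random.seed(42)                        # fix the generator's starting state
first = random.randint(100, size=5)

random.seed(42)                        # reset to the same state
second = random.randint(100, size=5)

print(first)
print(second)
print((first == second).all())         # the two sequences are identical
```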
Generate Random number
from numpy import random
x = random.randint(100)
print(x)
Output:45
Generate Random Float
from numpy import random
x = random.rand()
print(x)
Output:0.20589891226659818
Generate Random Array
• In NumPy we work with arrays, and you can use the two
methods from the above examples to make random arrays.
Integers
• The randint() method takes a size parameter where you
can specify the shape of an array.
x=random.randint(100, size=(5))
print(x)
Output:[61 66 32 13 16]
Data Distribution
• Data Distribution is a list of all possible values, and how
often each value occurs.
• Such lists are important when working with statistics and
data science.
• The random module offers methods that return randomly
generated data distributions.
Random Distribution
• A random distribution is a set of random numbers that
follow a certain probability density function.
• Probability Density Function: A function that describes a
continuous probability distribution, i.e. how likely each
value is to occur.
• We can generate random numbers based on defined
probabilities using the choice() method of
the random module.
• The choice() method allows us to specify the probability for
each value.
• The probability is set by a number between 0 and 1, where
0 means that the value will never occur and 1 means that
the value will always occur.
Example
• Generate a 1-D array containing 100 values, where each
value has to be 3, 5, 7 or 9.
• The probability for the value to be 3 is set to be 0.1
• The probability for the value to be 5 is set to be 0.3
• The probability for the value to be 7 is set to be 0.6
• The probability for the value to be 9 is set to be 0
from numpy import random
x = random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=(100))
print(x)
Output:
[3 7 5 7 7 7 5 7 3 7 3 7 7 7 7 5 7 5 7 7 7 7 7 5 3 7 5 7 7 7 3 5
 3 7 5 7 7 5 7 5 7 5 7 7 5 7 7 3 5 3 5 7 7 5 7 7 7 7 5 7 5 3 5 5
 7 7 7 3 7 7 7 7 5 7 7 5 7 7 7 7 7 7 5 5 3 5 5 7 5 5 7 7 5 3 3 7
 7 5 7 7]
Example
• You can return arrays of any shape and size by specifying
the shape in the size parameter.
• Same example as above, but return a 2-D array with 3 rows,
each containing 5 values.
x = random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=(3, 5))
print(x)
Output:
[[7 7 7 7 7]
 [5 3 5 7 5]
 [5 7 5 7 5]]
Normal Distribution
• The Normal Distribution is one of the most important
distributions.
• It is also called the Gaussian Distribution after the German
mathematician Carl Friedrich Gauss.
• It fits the probability distribution of many events, eg. IQ
Scores, Heartbeat etc.
• Use the random.normal() method to get a Normal Data
Distribution.
• It has three parameters:
• loc - (Mean) where the peak of the bell exists.
• scale - (Standard Deviation) how flat the graph distribution
should be.
• size - The shape of the returned array.
Normal Distribution
from numpy import random
x = random.normal(size=(2, 3))
print(x)
Output:
Run1:
[[ 0.15001821 -1.31355388 -1.35020654]
 [-1.31067087 -0.48537757 -0.02052509]]
Run2:
[[-2.0610908  -0.3081812   0.99886608]
 [ 0.56001902  0.38363428 -0.07954767]]
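The loc and scale parameters can be checked on a larger sample. This sketch assumes illustrative values (mean 100, standard deviation 15) and an arbitrary seed; the sample mean and standard deviation should land close to loc and scale.

```python
from numpy import random

random.seed(0)   # arbitrary seed so that the run is repeatable

# Illustrative parameters: mean 100, standard deviation 15.
x = random.normal(loc=100, scale=15, size=10000)

print(x.mean())   # close to loc
print(x.std())    # close to scale
```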
Visualization of Normal
Distribution
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.kdeplot(random.normal(size=1000))  # density curve only; distplot() is deprecated in recent seaborn
plt.show()
Exponential Distribution
• Exponential distribution is used for describing time till next
event e.g. failure/success etc.
• It has two parameters:
• scale - inverse of rate ( see lam in poisson distribution )
defaults to 1.0.
• size - The shape of the returned array.
Exponential Distribution
Time Between Customers
• The number of minutes between customers who enter a certain shop
can be modeled by the exponential distribution.
• For example, suppose a new customer enters a shop every two
minutes, on average. After a customer arrives, find the probability
that a new customer arrives in less than one minute.
To solve this, we can start by knowing that the average time between
customers is two minutes. Thus, the rate can be calculated as:
• λ = 1/μ
• λ = 1/2
• λ = 0.5
• We can plug in λ = 0.5 and x = 1 to the formula for the CDF:
• P(X ≤ x) = 1 - e^(-λx)
• P(X ≤ 1) = 1 - e^(-0.5(1))
• P(X ≤ 1) = 0.3935
The probability that we’ll have to wait less than one minute for the
next customer to arrive is 0.3935.
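The calculation above can be reproduced in code and cross-checked by simulation. This is a sketch: the variable names, the seed and the sample size are illustrative choices, and random.exponential() takes the scale 1/λ (here 2 minutes).

```python
import math
from numpy import random

mean_gap = 2.0        # average minutes between customers
lam = 1 / mean_gap    # rate, lambda = 0.5
x = 1.0               # waiting time of interest, in minutes

cdf = 1 - math.exp(-lam * x)   # P(X <= x) = 1 - e^(-lambda * x)
print(round(cdf, 4))           # 0.3935

# Cross-check: fraction of simulated waiting times below one minute
random.seed(1)                 # arbitrary seed for repeatability
samples = random.exponential(scale=mean_gap, size=100000)
estimate = (samples < x).mean()
print(estimate)                # close to the value computed above
```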
Exponential Distribution
• Draw out a sample for exponential distribution with 2.0
scale with 2x3 size:
from numpy import random
x = random.exponential(scale=2, size=(2, 3))
print(x)
print(x)
Output:[5 7 6 5 4 7 5 4 6 5]
[5 3 6 4 3 3 3 5 5 5]
Visualization of Binomial Distribution
x = random.poisson(lam=2, size=10)
print(x)
Uniform Distribution
• Used to describe probability where every event has equal
chances of occurring.
• E.g. Generation of random numbers.
from numpy import random
x = random.uniform(size=(2, 3))
print(x)