BTech 5 CSE Data Analytics With Python Unit 2 and 3 Notes
Unit 2
An Introduction to Data Analysis
Knowledge Domains of the Data Analysis
A knowledge domain refers to a specialized area of expertise within a
broader discipline. It encompasses a defined body of knowledge, including
theories, principles, methodologies, and best practices that are essential for
proficiency in that area. Knowledge domains serve as frameworks for
professionals to structure their learning, problem-solving approaches, and
application of skills.
In the field of data analysis, there are several important knowledge domains that
provide a foundation for understanding, analyzing, and deriving insights from
data. These include:
1. Data Cleaning and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Data Visualization and Communication
4. Statistics
5. Machine Learning
6. Data Mining
7. Programming and Scripting
8. Big Data Technologies
9. Data Management
10. Ethics and Privacy in Data
1. Data Cleaning and Preprocessing: Data cleaning and preprocessing
involves preparing data for analysis by ensuring its quality and usability. It
starts with data collection, gathering accurate and relevant data from
various sources. Next, data cleaning addresses missing values, outliers, and
errors. Data transformation follows, where tasks like normalization,
scaling, and encoding categorical variables are applied. Finally, data
reduction techniques, such as PCA, reduce dimensionality, simplifying the
dataset while preserving essential information for analysis.
2. Exploratory Data Analysis (EDA): Data visualization, summarization,
and feature engineering are key steps in data analysis. Data visualization
uses plots like scatter plots, histograms, and bar charts to reveal patterns
and trends in the data. Summarization identifies key relationships and
correlations, enabling better understanding. Feature engineering creates
new features to enhance model performance and capture important data
patterns.
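The cleaning and transformation steps described above, together with a first EDA summary, can be sketched with pandas. This is only an illustration: the column names and values are hypothetical, and PCA is left out because it would need an additional library.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one categorical column
raw = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "city": ["Bhilai", "Raipur", "Bhilai", "Durg"],
    "income": [30000.0, 42000.0, 50000.0, 58000.0],
})

# Cleaning: fill the missing age with the column median
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Transformation: min-max scale income to the range [0, 1]
income = clean["income"]
clean["income_scaled"] = (income - income.min()) / (income.max() - income.min())

# Transformation: one-hot encode the categorical column
encoded = pd.get_dummies(clean, columns=["city"])

# EDA: summary statistics and the correlation between age and income
print(encoded.describe())
print(clean["age"].corr(clean["income"]))
```

Filling with the median is one of several possible choices; dropping the row or using the mean would be equally valid depending on the data.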
SRGI, BHILAI
Key Features:
- Data Protection Rights: It grants individuals rights such as the right to access
their data, the right to be forgotten, and the right to data portability.
- Consent: Organizations must obtain clear consent from individuals before
processing their personal data.
- Accountability: Businesses are required to implement appropriate technical
and organizational measures to protect personal data and report data breaches
within 72 hours.
- Fines: Non-compliance can lead to significant penalties, including fines of up
to 4% of annual global revenue or €20 million, whichever is higher.
In summary, while both GDPR and HIPAA aim to protect personal data, GDPR
focuses on the privacy rights of individuals in the EU regarding all types of
personal data, whereas HIPAA specifically addresses the privacy and security of
health information in the United States.
Quantitative Data
Quantitative data refers to numerical information that can be measured or
counted. This type of data is used for statistical analysis and often involves
operations like addition, subtraction, or averaging.
• Types:
o Discrete Data: Countable values (e.g., number of students).
o Continuous Data: Measurable values within a range (e.g., height,
weight).
Example in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Creating a DataFrame with quantitative data
data = {
'Student_ID': [1, 2, 3, 4, 5],
'Math_Score': [78, 85, 92, 88, 76],
'Science_Score': [72, 89, 95, 84, 80],
'Study_Hours': [3.5, 4.0, 5.0, 4.5, 3.0]
}
df = pd.DataFrame(data)
# Statistical Analysis
print("Summary Statistics:")
print(df[['Math_Score', 'Science_Score', 'Study_Hours']].describe())
# Plotting the data
plt.scatter(df['Study_Hours'], df['Math_Score'])
plt.title("Study Hours vs Math Score")
plt.xlabel("Study Hours")
plt.ylabel("Math Score")
plt.show()
Output:
• The summary statistics include the count, mean, standard deviation, minimum,
quartiles (the 50% row is the median), and maximum of each column.
• The scatter plot visualizes the relationship between study hours and math
scores.
Qualitative Data
Qualitative data refers to non-numerical information that describes qualities or
characteristics. This type of data is often categorical and used for classification or
grouping.
• Types:
o Nominal Data: Categories without an order (e.g., colors, gender).
o Ordinal Data: Categories with a meaningful order (e.g., satisfaction
levels).
Example in Python:
1. Customer Feedback in a Shopping App
• Data:
o "Satisfied," "Neutral," "Dissatisfied," "Very Satisfied."
• Type: Ordinal Data (since there is an order of satisfaction levels).
Python Example:
import pandas as pd
data = {'Customer_ID': [101, 102, 103, 104],
'Feedback': ['Satisfied', 'Neutral', 'Dissatisfied', 'Very Satisfied']}
df = pd.DataFrame(data)
print(df)
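Plain strings do not carry the ordering of the satisfaction levels. A minimal sketch of one way to record it, assuming pandas' ordered categorical type is acceptable here; the list name levels is introduced only for this illustration:

```python
import pandas as pd

# Satisfaction levels listed from lowest to highest
levels = ["Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]

df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104],
    "Feedback": ["Satisfied", "Neutral", "Dissatisfied", "Very Satisfied"],
})

# Store Feedback as an ordered categorical so sorting and comparisons respect the order
df["Feedback"] = pd.Categorical(df["Feedback"], categories=levels, ordered=True)

sorted_df = df.sort_values("Feedback")
print(sorted_df)

# Ordered categories can be compared against one of their levels
at_least_satisfied = df[df["Feedback"] >= "Satisfied"]
print(at_least_satisfied)
```

Nominal data, such as the roles or genres in the following examples, would use the same type with ordered=False, since no level ranks above another.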
2. Product Categories
• Data:
3. Employee Roles
• Data:
o "Manager," "Engineer," "Analyst," "Technician."
• Type: Nominal Data.
Python Example:
data = {'Employee_ID': [1, 2, 3, 4],
'Role': ['Manager', 'Engineer', 'Analyst', 'Technician']}
df = pd.DataFrame(data)
print(df)
4. Movie Genres
• Data:
o "Action," "Comedy," "Drama," "Horror," "Sci-Fi."
• Type: Nominal Data.
Python Example:
data = {'Movie_ID': [101, 102, 103, 104],
'Genre': ['Action', 'Comedy', 'Drama', 'Horror']}
df = pd.DataFrame(data)
print(df)
5. Education Levels
• Data:
o "High School," "Bachelor's," "Master's," "PhD."
• Type: Ordinal Data (since education levels follow a meaningful order).
Python Example:
data = {'Person_ID': [1, 2, 3, 4],
'Education_Level': ["High School", "Bachelor's", "Master's", "PhD"]}
df = pd.DataFrame(data)
print(df)
6. Car Colors
• Data:
o "Red," "Blue," "Black," "White," "Green."
• Type: Nominal Data.
Python Example:
data = {'Car_ID': [1001, 1002, 1003, 1004],
'Color': ['Red', 'Blue', 'Black', 'White']}
df = pd.DataFrame(data)
print(df)
7. Survey Responses
• Data:
o "Yes," "No," "Maybe."
• Type: Nominal Data.
Python Example:
data = {'Respondent_ID': [1, 2, 3],
8. Marital Status
• Data:
o "Single," "Married," "Divorced," "Widowed."
• Type: Nominal Data.
Python Example:
data = {'Person_ID': [1, 2, 3, 4],
'Marital_Status': ['Single', 'Married', 'Divorced', 'Widowed']}
df = pd.DataFrame(data)
print(df)
9. Weather Descriptions
• Data:
o "Sunny," "Cloudy," "Rainy," "Windy."
• Type: Nominal Data.
Python Example:
data = {'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday'],
'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Windy']}
df = pd.DataFrame(data)
print(df)
Visualization: Quantitative data is typically shown with line charts, scatter plots, and
histograms; qualitative data with bar charts and pie charts.
Applications in Python:
1. Quantitative: Calculating trends, correlation, regression analysis.
Both types of data are often used together to provide a comprehensive analysis in data science
projects.
Unit 3
An array object represents a multidimensional, homogeneous array of
fixed-size items. An associated data-type object describes the format of each
element in the array (its byte-order, how many bytes it occupies in memory,
whether it is an integer, a floating-point number, or something else, etc.)
Arrays should normally be constructed using np.array, np.zeros, or np.empty.
Calling the ndarray(…) constructor directly is a low-level method for
instantiating an array.
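A short sketch of the three constructors named above and of the dtype information attached to every array. The variable names are chosen only for this illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)  # built from existing data
z = np.zeros((2, 3))   # filled with 0.0, float64 by default
e = np.empty((2, 3))   # uninitialized memory, so the values are arbitrary

print(a.ndim, a.shape)       # number of dimensions and size along each axis
print(a.dtype, a.itemsize)   # element type and bytes per element
print(z.dtype, z.itemsize)
print(a.nbytes)              # total bytes = number of elements * itemsize
```

np.empty is only worth using when every element will be overwritten afterwards; otherwise np.zeros is the safer default.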
Indexing on ndarray
ndarrays can be indexed using the standard Python x[obj] syntax,
where x is the array and obj the selection. There are different kinds of indexing
available depending on obj: basic indexing, advanced indexing and field access.
Most of the following examples show the use of indexing when referencing data
in an array. The examples work just as well when assigning to an array.
Note that in Python, x[(exp1, exp2, ..., expN)] is equivalent
to x[exp1, exp2, ..., expN]; the latter is just syntactic sugar for the former.
Basic indexing
Single element indexing
Single element indexing works exactly like that for other standard Python
sequences. It is 0-based, and accepts negative indices for indexing from the end
of the array.
>>> x = np.arange(10)
>>> x[2]
2
>>> x[-2]
8
It is not necessary to separate each dimension’s index into its own set of square
brackets.
>>> x.shape = (2, 5) # now x is 2-dimensional
>>> x[1, 3]
8
>>> x[1, -1]
9
Note that if one indexes a multidimensional array with fewer indices than
dimensions, one gets a subdimensional array. For example:
>>> x[0]
array([0, 1, 2, 3, 4])
That is, each index specified selects the array corresponding to the rest of
the dimensions selected. In the above example, choosing 0 means that the
remaining dimension of length 5 is being left unspecified, and that what is
returned is an array of that dimensionality and size. It must be noted that the
returned array is a view, i.e., it is not a copy of the original, but points to the same
values in memory as does the original array. In this case, the 1-D array at the first
position (0) is returned. So using a single index on the returned array results in a
single element being returned. That is:
>>> x[0][2]
2
So note that x[0, 2] == x[0][2], though the second form is less efficient because a new
temporary array is created after the first index and is then indexed by 2.
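The view behaviour described above can be checked directly. A minimal sketch, using np.shares_memory to confirm that the row returned by x[0] points at the same data as x:

```python
import numpy as np

x = np.arange(10).reshape(2, 5)

row = x[0]                        # basic indexing returns a view, not a copy
print(np.shares_memory(x, row))

row[2] = 99                       # writing through the view changes the original
print(x[0, 2])

# Both spellings select the same element; x[0, 2] avoids the temporary row
print(x[0, 2] == x[0][2])
```

When an independent array is needed, calling row.copy() breaks the link to the original.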
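Slicing and striding
• The basic slice syntax is i:j:k, where i is the starting index, j is the stopping
index, and k is the step (k ≠ 0). This selects the m elements (in the corresponding
dimension) with index values i, i + k, …, i + (m - 1) k, where m = q + (r≠0) and
q and r are the quotient and remainder obtained by dividing j - i by k:
j - i = q k + r, so that i + (m - 1) k < j. For example: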
>>> x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> x[1:7:2]
array([1, 3, 5])
• Negative i and j are interpreted as n + i and n + j where n is the number of
elements in the corresponding dimension. Negative k makes stepping go towards
smaller indices. From the above example:
>>> x[-2:10]
array([8, 9])
>>> x[-3:3:-1]
array([7, 6, 5, 4])
• Assume n is the number of elements in the dimension being sliced. Then, if i is
not given it defaults to 0 for k > 0 and n - 1 for k < 0. If j is not given it defaults
to n for k > 0 and -n-1 for k < 0. If k is not given it defaults to 1. Note that :: is
the same as : and means select all indices along this axis. From the above
example:
>>> x[5:]
array([5, 6, 7, 8, 9])
• If the number of objects in the selection tuple is less than N, then : is assumed for
any subsequent dimensions. For example:
>>> x = np.array([[[1],[2],[3]], [[4],[5],[6]]])
>>> x.shape
(2, 3, 1)
>>> x[1:2]
array([[[4],
        [5],
        [6]]])
• An integer, i, returns the same values as i:i+1 except the dimensionality of the
returned object is reduced by 1. In particular, a selection tuple with the p-th
element an integer (and all other entries :) returns the corresponding sub-array
with dimension N - 1. If N = 1 then the returned object is an array scalar.
• If the selection tuple has all entries : except the p-th entry, which is a slice
object i:j:k, then the returned array has dimension N formed by stacking, along
the p-th axis, the sub-arrays returned by integer indexing of elements i, i+k, …, i
+ (m - 1) k < j.
• Basic slicing with more than one non-: entry in the slicing tuple acts like repeated
application of slicing using a single non-: entry, where the non-: entries are
successively taken (with all other non-: entries replaced by :).
Thus, x[ind1, ..., ind2, :] acts like x[ind1][..., ind2, :] under basic slicing.
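The slicing rules above can be sketched in a few lines. The helper slice_length is hypothetical and only covers the simple case k > 0 with 0 <= i <= j <= n; it applies the formula m = q + (r≠0) and can be compared against what NumPy actually returns.

```python
import numpy as np

x = np.arange(10)

def slice_length(i, j, k):
    """Number of elements selected by i:j:k (hypothetical helper, k > 0 only)."""
    q, r = divmod(j - i, k)
    return q + (r != 0)

print(x[1:7:2], slice_length(1, 7, 2))
print(x[::-1])          # i and j omitted with k < 0: walk from the end to the start

y = np.array([[[1], [2], [3]], [[4], [5], [6]]])   # shape (2, 3, 1)
print(y[1].shape)       # an integer index drops that dimension
print(y[1:2].shape)     # the equivalent slice keeps it with length 1
```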
Array Concatenation
import numpy as np
# Creating two arrays
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
# Concatenating along axis 0 (row-wise)
concatenated = np.concatenate((arr1, arr2), axis=0)
print(concatenated)
# Concatenating along axis 1 (column-wise)
concatenated_col = np.concatenate((arr1, arr2), axis=1)
print(concatenated_col)
Output:
# Concatenation along axis 0
[[1 2]
[3 4]
[5 6]
[7 8]]
# Concatenation along axis 1
[[1 2 5 6]
[3 4 7 8]]
Splitting Array
# Creating an array
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# Splitting into 2 arrays along axis 1 (column-wise)
split_arr = np.hsplit(arr, 2)
print(split_arr)
Shape manipulation
In Python's NumPy library, shape manipulation allows you to change the structure
of arrays without changing the data they contain. Common shape manipulation
functions include reshaping, flattening, transposing, expanding, and squeezing
arrays.
Common shape manipulation methods in NumPy:
1. Reshape
• Changes the shape of an array to a specified new shape, provided the total
number of elements remains the same.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
Output:
[[1 2 3]
[4 5 6]]
2. Flatten
• Converts a multi-dimensional array into a 1D array.
arr = np.array([[1, 2, 3], [4, 5, 6]])
flattened_arr = arr.flatten()
print(flattened_arr)
Output:
[1 2 3 4 5 6]
3. Transpose
• Reverses or permutes the axes of an array, commonly used for matrices.
arr = np.array([[1, 2, 3], [4, 5, 6]])
transposed_arr = arr.T
print(transposed_arr)
Output:
[[1 4]
[2 5]
[3 6]]
4. Expand Dimensions
• Adds an extra dimension to an array, useful for aligning shapes for
operations like broadcasting.
arr = np.array([1, 2, 3])
expanded_arr = np.expand_dims(arr, axis=0)
print(expanded_arr)
print("Shape:", expanded_arr.shape)
Output:
[[1 2 3]]
Shape: (1, 3)
5. Squeeze
• Removes single-dimensional entries from the shape of an array, often used
to simplify results.
arr = np.array([[[1, 2, 3]]])
squeezed_arr = np.squeeze(arr)
print(squeezed_arr)
print("Shape:", squeezed_arr.shape)
Output:
[1 2 3]
Shape: (3,)
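Two details of the methods above are easy to miss: reshape can infer one dimension when it is given as -1, and flatten always copies while ravel returns a view when the memory layout allows it. A small sketch under those assumptions (ravel is not covered in the list above):

```python
import numpy as np

arr = np.arange(6)

# -1 asks NumPy to infer that dimension from the total number of elements
m = arr.reshape(2, -1)
print(m.shape)

flat_copy = m.flatten()   # always a new array
flat_view = m.ravel()     # a view here, because m is contiguous in memory

flat_copy[0] = 100        # does not touch arr
flat_view[1] = 200        # writes through to arr
print(arr)
```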
Array Manipulations:
Array manipulation in Python, especially with NumPy, allows for powerful
operations like adding, removing, splitting, and modifying elements.
Some common array manipulation techniques:
1. Appending Elements
• Use np.append() to add elements to an array. It returns a new array with the
appended values.
import numpy as np
arr = np.array([1, 2, 3])
appended_arr = np.append(arr, [4, 5, 6])
print(appended_arr)
Output:
[1 2 3 4 5 6]
2. Inserting Elements
• Use np.insert() to insert values at a specific index.
arr = np.array([1, 2, 3])
inserted_arr = np.insert(arr, 1, [9, 10])
print(inserted_arr)
Output:
[ 1 9 10 2 3]
Here, [9, 10] is inserted starting at index 1
3. Deleting Elements
• Use np.delete() to remove elements at specific indices.
arr = np.array([1, 2, 3, 4, 5])
deleted_arr = np.delete(arr, [1, 3]) # Remove elements at indices 1 and 3
print(deleted_arr)
Output:
[1 3 5]
4. Concatenating Arrays
• Combine arrays along an existing axis using np.concatenate().
arr1 = np.array([1, 2])
arr2 = np.array([3, 4])
concatenated_arr = np.concatenate((arr1, arr2))
print(concatenated_arr)
Output:
[1 2 3 4]
5. Splitting Arrays
• Use np.split() to split an array into multiple sub-arrays.
arr = np.array([1, 2, 3, 4, 5, 6])
split_arr = np.split(arr, 3) # Split into 3 equal parts
print(split_arr)
Output:
[array([1, 2]), array([3, 4]), array([5, 6])]
6. Reshaping Arrays
• reshape() changes the shape of an array without modifying the data.
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
Output:
[[1 2 3]
[4 5 6]]
7. Flattening Arrays
• Convert a multi-dimensional array into a 1D array with flatten().
arr = np.array([[1, 2], [3, 4]])
flattened_arr = arr.flatten()
print(flattened_arr)
Output:
[1 2 3 4]
8. Stacking Arrays
• Stack arrays vertically (row-wise) with np.vstack() or horizontally
(column-wise) with np.hstack(). To join arrays along a new axis, use np.stack().
arr1 = np.array([1, 2])
arr2 = np.array([3, 4])
vstacked_arr = np.vstack((arr1, arr2))
hstacked_arr = np.hstack((arr1, arr2))
print("Vertical Stack:\n", vstacked_arr)
print("Horizontal Stack:\n", hstacked_arr)
Output:
Vertical Stack:
[[1 2]
[3 4]]
Horizontal Stack:
[1 2 3 4]
9. Reversing an Array
• Reverse an array with slicing or by using np.flip().
arr = np.array([1, 2, 3, 4, 5])
reversed_arr = np.flip(arr)
print(reversed_arr)
Output:
[5 4 3 2 1]
10. Repeating Elements
• Use np.repeat() to repeat each element a specified number of times.
arr = np.array([1, 2, 3])
repeated_arr = np.repeat(arr, 2)
print(repeated_arr)
Output:
[1 1 2 2 3 3]
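Item 9 mentions reversing with slicing but only shows np.flip(). The slicing form is sketched below, together with a check that functions such as np.append() leave the input array unchanged and return a new one.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

reversed_view = arr[::-1]     # slicing with a step of -1
print(reversed_view)
print(np.array_equal(reversed_view, np.flip(arr)))

# np.append returns a new array; arr itself keeps its original size
appended = np.append(arr, 6)
print(arr.size, appended.size)
```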
Vectorization:
In Python, vectorization refers to performing operations on entire arrays rather
than individual elements, allowing for faster execution, especially with large
datasets. Libraries like NumPy provide tools to make operations on entire arrays
faster and more memory-efficient.
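To make the contrast concrete, the sketch below computes the same squares twice: once with an explicit Python loop and once with a single vectorized expression. It only checks that the results agree; it does not measure speed.

```python
import numpy as np

values = np.arange(1, 6)          # [1 2 3 4 5]

# Element-by-element loop in pure Python
squares_loop = []
for v in values.tolist():
    squares_loop.append(v * v)

# Vectorized: one expression applied to the whole array
squares_vec = values * values

print(squares_loop)
print(squares_vec)
```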
1. Adding Two Arrays
Let's add two arrays element-wise using vectorization.
import numpy as np
# Creating two arrays
array1 = np.array([1, 2, 3, 4])
array2 = np.array([10, 20, 30, 40])
# Adding arrays using vectorized operation
result = array1 + array2
print(result)
Output:
[11 22 33 44]
2. Scalar Operations on Arrays
Performing a scalar operation on each element in an array without a loop.
# Multiply each element in the array by 5
array = np.array([1, 2, 3, 4, 5])
result = array * 5
print(result)
Output:
[ 5 10 15 20 25]
3. Element-wise Multiplication
In this example, we'll multiply two arrays element by element.
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
# Element-wise multiplication
result = array1 * array2
print(result)
Output:
[ 5 12 21 32]
4. Using Mathematical Functions on Arrays
Vectorized operations can be applied using mathematical functions on entire
arrays.
# Creating an array
array = np.array([0, np.pi / 2, np.pi, 3 * np.pi / 2])
# Applying the sine function to each element
result = np.sin(array)
print(result)
Output:
[ 0.0000000e+00 1.0000000e+00 1.2246468e-16 -1.0000000e+00]
5. Boolean Indexing
Vectorization also allows for conditional operations on arrays.
# Creating an array
array = np.array([1, 2, 3, 4, 5])
# Get elements greater than 3
result = array[array > 3]
print(result)
Output:
[4 5]
Broadcasting
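Broadcasting lets NumPy combine operands of different shapes in a single arithmetic operation by treating the smaller one as if it were stretched to match the larger one. A minimal sketch of the simplest case, a scalar added to a 1D array; the values [1, 2, 3, 4] are an assumption chosen to agree with the output and explanation that follow.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])   # assumed values, consistent with the output below

# The scalar 10 is broadcast to every element of arr
result = arr + 10
print(result)
```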
Output:
[11 12 13 14]
In this example, the scalar 10 is treated as if it were an array of the same shape as
arr ([10, 10, 10, 10]), and the addition is applied element-wise.
Example 2: Broadcasting a 1D Array to a 2D Array
Broadcasting also allows you to apply operations between arrays of different
dimensions. Let’s take a 2D array and a 1D array.
import numpy as np
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
vector = np.array([10, 20, 30])
# Broadcasting the 1D array 'vector' across each row of the 2D array 'matrix'
result = matrix + vector
print(result)
Output:
[[11 22 33]
[14 25 36]]
Here, the shape of matrix is (2, 3), and the shape of vector is (3,). Since the trailing
dimension (3) matches, vector is broadcast across each row of matrix. NumPy
treats the vector as if it had shape (1, 3) and replicates it to match the (2, 3) shape
of matrix.
Example 3: Broadcasting with Arrays of Different Shapes
In this case, let’s broadcast two 2D arrays whose shapes differ.
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
b = np.array([[10],
[20],
[30]])
result = a + b
print(result)
Output:
[[11 12 13]
[24 25 26]
[37 38 39]]
Here:
• a has shape (3, 3).
• b has shape (3, 1) (a column vector).
NumPy broadcasts b to match the shape of a by "stretching" b along the columns,
treating b as if it were:
[[10, 10, 10],
[20, 20, 20],
[30, 30, 30]]
Then, the element-wise addition is performed.
Example 4: Broadcasting Across Multiple Dimensions
Let’s look at an example with higher-dimensional arrays.
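import numpy as np
# 3D array of shape (2, 3, 4)
a = np.array([[[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]],
[[13, 14, 15, 16],
[17, 18, 19, 20],
[21, 22, 23, 24]]])
# 1D array of shape (4,)
b = np.array([1, 2, 3, 4])
result = a + b
print(result)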
Output:
[[[ 2 4 6 8]
[ 6 8 10 12]
[10 12 14 16]]
[[14 16 18 20]
[18 20 22 24]
[22 24 26 28]]]
Here:
• a has shape (2, 3, 4) (a 3D array),
• b has shape (4,) (a 1D array).
The array b is aligned with the last dimension of a, so it is added to every row of
length 4 in each (3, 4) subarray of a.
Example 5: Incompatible Shapes (Broadcasting Failure)
Broadcasting will fail if the shapes of the arrays are not compatible under the
broadcasting rules.
import numpy as np
a = np.array([[1, 2],
[3, 4]])
b = np.array([1, 2, 3])
# This will raise a ValueError because the shapes are incompatible
result = a + b
Error:
ValueError: operands could not be broadcast together with shapes (2,2) (3,)
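In this case, a has shape (2, 2) and b has shape (3,). The trailing dimensions (2 and 3)
are different and neither is 1, so the arrays cannot be broadcast together.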
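The compatibility rule behind this error can be sketched as follows: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1; a missing dimension counts as 1. The helper can_broadcast is hypothetical and written only for illustration. np.broadcast_shapes is NumPy's own check and assumes NumPy 1.20 or later.

```python
from itertools import zip_longest

import numpy as np

def can_broadcast(shape_a, shape_b):
    """Check whether two shapes are broadcast-compatible (hypothetical helper)."""
    # Walk both shapes from the last axis backwards; a missing dimension counts as 1
    for da, db in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if da != db and da != 1 and db != 1:
            return False
    return True

print(can_broadcast((2, 3), (3,)))      # shapes from Example 2
print(can_broadcast((3, 3), (3, 1)))    # shapes from Example 3
print(can_broadcast((2, 2), (3,)))      # shapes from Example 5

# NumPy's own check raises ValueError for incompatible shapes
try:
    np.broadcast_shapes((2, 2), (3,))
except ValueError as exc:
    print("ValueError:", exc)
```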
Structured Arrays
A structured array is created using a custom dtype (data type) specification that
defines the name and data type of each field. Each field is assigned a data type
and, optionally, a shape.
import numpy as np
# Define structured array with named fields
person_dtype = np.dtype([('name', 'S20'), ('age', 'i4'), ('height', 'f4')])
# Create a structured array
Here, each entry has a name (string), age (integer), and height (float).
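A minimal sketch of creating such an array, assuming the person_dtype definition shown earlier. The record for Ram follows the values quoted in these notes; Shyam's height is not stated anywhere, so 6.0 is a placeholder.

```python
import numpy as np

person_dtype = np.dtype([('name', 'S20'), ('age', 'i4'), ('height', 'f4')])

# Records are given as tuples, one value per field.
# Shyam's height (6.0) is a placeholder value, not taken from the notes.
people = np.array([('Ram', 25, 5.5), ('Shyam', 30, 6.0)], dtype=person_dtype)

print(people.dtype.names)   # the field names
print(people.shape)         # one entry per record
print(people[0])            # a single record
print(people.itemsize)      # 20 + 4 + 4 bytes per record
```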
In structured arrays in NumPy, S20, i4, and f4 are type codes that give the data type
and size of each field. Let’s break them down:
• S20: a byte string of up to 20 bytes.
• i4: a 4-byte (32-bit) signed integer.
• f4: a 4-byte (32-bit) floating-point number.
In the array representation array([(b'Ram', 25, 5.5)]), the b before the string
'Ram' indicates that it is a byte string (a bytes literal) rather than a
regular Unicode string. In Python, text can be stored either as Unicode (str) or as
bytes; an S20 field stores bytes, which is why the b prefix appears.
You can access individual fields (columns) in a structured array using the field
names.
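A sketch of field access, assuming the same two records as before (Shyam's height is again a placeholder). Indexing with a field name returns an ordinary array holding that column.

```python
import numpy as np

person_dtype = np.dtype([('name', 'S20'), ('age', 'i4'), ('height', 'f4')])
# Shyam's height (6.0) is a placeholder value, not taken from the notes
people = np.array([('Ram', 25, 5.5), ('Shyam', 30, 6.0)], dtype=person_dtype)

print(people['name'])   # all names as an array of byte strings
print(people['age'])    # all ages as an int32 array
```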
Output:
[b'Ram' b'Shyam']
[25 30]
We can treat fields like individual arrays, which can be useful for data
manipulation.
You can modify the values in structured arrays using standard NumPy array
indexing and assignment.
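The sketch below shows three ways of assigning into a structured array: one field of one record, a whole record, and a whole field at once. The data are the same illustrative records as above, with Shyam's height as a placeholder.

```python
import numpy as np

person_dtype = np.dtype([('name', 'S20'), ('age', 'i4'), ('height', 'f4')])
# Shyam's height (6.0) is a placeholder value, not taken from the notes
people = np.array([('Ram', 25, 5.5), ('Shyam', 30, 6.0)], dtype=person_dtype)

people['age'][0] = 26             # one field of one record
people[1] = ('Shyam', 31, 6.0)    # a whole record at once
people['age'] += 1                # a whole field, vectorized

print(people['age'])
```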
Structured arrays also support more complex data types, such as arrays within
fields.
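A field can hold a small array by adding a shape to its definition. The sketch below assumes a field called grades with three float values per student; the names and grade values are made up for illustration.

```python
import numpy as np

# The third element of the field tuple gives the shape of the sub-array
student_dtype = np.dtype([('name', 'S20'), ('grades', 'f4', (3,))])

# Illustrative data only
students = np.array(
    [('Ram', [85.0, 90.0, 78.0]),
     ('Shyam', [70.0, 88.0, 92.0])],
    dtype=student_dtype,
)

print(students['grades'].shape)    # one row of three grades per student
means = students['grades'].mean(axis=1)
print(means)                       # average grade per student
```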
Here, each student has a list of three grades stored in the grades field.
You can even nest structured arrays, where one field is itself another structured
array.
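A sketch of a nested dtype, using the names, ages, and addresses that appear in the printed output further down in these notes. Sorting by the age field with np.sort(..., order='age') is an assumption about how sorted_people was produced; the records are listed out of order here so that the sort has a visible effect.

```python
import numpy as np

address_dtype = np.dtype([('street', 'S20'), ('city', 'S20')])
person_dtype = np.dtype([('name', 'S20'), ('age', 'i4'), ('address', address_dtype)])

people = np.array(
    [('Shyam', 30, ('456 St', 'Chicago')),
     ('Ram', 25, ('123 Ave', 'New York'))],
    dtype=person_dtype,
)

# Nested fields are reached by chaining field names
print(people['address']['city'])

# Sort the records by one field (assumed here to be age)
sorted_people = np.sort(people, order='age')
print(sorted_people['name'])
```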
print(sorted_people)
Output:
array([(b'Ram', 25, (b'123 Ave', b'New York')), (b'Shyam', 30, (b'456 St',
b'Chicago'))],
dtype=[('name', 'S20'), ('age', '<i4'), ('address', [('street', 'S20'), ('city',
'S20')])])
Reading array data typically means loading arrays from files or converting existing
data into arrays.
• From a Python list: You can convert a list into an array with np.array().
Example:
import numpy as np
data = [1, 2, 3, 4, 5]
array = np.array(data)
print(array)
Output:
[1 2 3 4 5]
• From a file: You can read array data from a file using functions like
np.loadtxt() or np.genfromtxt(), which are useful for text files.
Example: Let's assume you have a file called data.txt with the following content:
1, 2, 3
4, 5, 6
7, 8, 9
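import numpy as np
# Comma-separated values; loadtxt returns floats by default
array = np.loadtxt('data.txt', delimiter=',')
print(array)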
Output:
[[1. 2. 3.]
[4. 5. 6.]
[7. 8. 9.]]
You can write array data to files using NumPy's built-in functions such as
np.savetxt() or np.save().
• Saving as text file (CSV format): You can save the array as a text file
using savetxt().
Example:
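import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# fmt='%d' writes integers; the default format writes floats in scientific notation
np.savetxt('output.txt', array, delimiter=',', fmt='%d')
This will create a file called output.txt with the following content: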
1,2,3
4,5,6
7,8,9
• Saving as a binary file (for faster I/O): You can save arrays as binary
files using np.save() for more efficient storage.
Example:
import numpy as np
array = np.array([1, 2, 3, 4, 5])
np.save('array_data.npy', array)  # write the array in NumPy's binary .npy format
array_loaded = np.load('array_data.npy')
print(array_loaded)
Output:
[1 2 3 4 5]
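Summary of Functions:
• np.array(): converts existing data, such as a Python list, into an array.
• np.loadtxt() / np.genfromtxt(): read array data from a text file.
• np.savetxt(): writes an array to a text file.
• np.save() / np.load(): write and read an array in binary .npy format.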
These methods make it easy to work with arrays in both textual and binary
formats.
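The text and binary routes can be compared in one round trip. The sketch below writes to a temporary directory so that it leaves no files behind; the file names mirror the ones used above, and the temporary directory is an addition for this illustration.

```python
import os
import tempfile

import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "output.txt")
    npy_path = os.path.join(tmp, "array_data.npy")

    # Text: human-readable, but values come back as float64 by default
    np.savetxt(txt_path, array, delimiter=",", fmt="%d")
    from_text = np.loadtxt(txt_path, delimiter=",")

    # Binary: keeps the dtype and shape exactly
    np.save(npy_path, array)
    from_binary = np.load(npy_path)

print(from_text.dtype, from_binary.dtype)
print(np.array_equal(array, from_text), np.array_equal(array, from_binary))
```

The text format is convenient for inspection and exchange with other tools, while the binary format is the safer choice when the exact dtype matters.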