BTech 5 CSE Data Analytics With Python Unit 2 and 3 Notes

These notes provide an introduction to data analysis, outlining key knowledge domains such as data cleaning, exploratory data analysis, statistics, and machine learning. They also discuss the importance of ethics and privacy in data management, highlighting regulations like GDPR and HIPAA, and differentiate between quantitative and qualitative data, with examples and applications in Python for both types.

Unit 2
An Introduction to Data Analysis
Knowledge Domains of Data Analysis
A knowledge domain refers to a specialized area of expertise within a
broader discipline. It encompasses a defined body of knowledge, including
theories, principles, methodologies, and best practices that are essential for
proficiency in that area. Knowledge domains serve as frameworks for
professionals to structure their learning, problem-solving approaches, and
application of skills.
In the field of data analysis, there are several important knowledge domains that
provide a foundation for understanding, analyzing, and deriving insights from
data. These include:
1. Data Cleaning and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Data Visualization and Communication
4. Statistics
5. Machine Learning
6. Data Mining
7. Programming and Scripting
8. Big Data Technologies
9. Data Management
10. Ethics and Privacy in Data
1. Data Cleaning and Preprocessing: Data cleaning and preprocessing
involves preparing data for analysis by ensuring its quality and usability. It
starts with data collection, gathering accurate and relevant data from
various sources. Next, data cleaning addresses missing values, outliers, and
errors. Data transformation follows, where tasks like normalization,
scaling, and encoding categorical variables are applied. Finally, data
reduction techniques, such as PCA, reduce dimensionality, simplifying the
dataset while preserving essential information for analysis.
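For illustration, these steps can be sketched in a few lines of pandas; this is a minimal sketch, and the column names and values below are made up rather than taken from a real dataset.
import pandas as pd
# Toy data with a missing numeric value and a missing category
df = pd.DataFrame({'age': [25, None, 40, 35],
'city': ['Delhi', 'Mumbai', None, 'Delhi']})
# Data cleaning: impute missing values
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna('Unknown')
# Data transformation: min-max scaling of a numeric column
df['age_scaled'] = (df['age'] - df['age'].min()) / (df['age'].max() - df['age'].min())
# Encoding a categorical variable as one-hot (dummy) columns
df = pd.get_dummies(df, columns=['city'])
print(df)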
2. Exploratory Data Analysis (EDA): Data visualization, summarization,
and feature engineering are key steps in EDA. Data visualization
uses plots like scatter plots, histograms, and bar charts to reveal patterns
and trends in the data. Summarization identifies key relationships and
correlations, enabling better understanding. Feature engineering creates
new features to enhance model performance and capture important data
patterns.
3. Data Visualization and Communication: Mastery of visualization tools
like Matplotlib, Seaborn, Power BI, and Tableau is crucial for presenting
data insights clearly. Effective storytelling with data translates complex
information into compelling narratives for decision-makers. Additionally,
dashboarding allows for the creation of interactive, real-time dashboards,
enabling continuous monitoring and up-to-date insights for informed
decision-making.
4. Statistics: Descriptive statistics provides a summary of data through key
metrics such as mean, median, mode, and standard deviation, offering
insights into its central tendency and variability. In contrast, inferential
statistics enables predictions and inferences about a population based on
sample data, utilizing methods like hypothesis testing and confidence
intervals. Additionally, probability theory addresses the likelihood of
events, forming a foundational understanding of distributions, risks, and
patterns within data, thereby enhancing the analytical framework for data-
driven decision-making.
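As a small illustration (assuming NumPy and SciPy are installed; the sample values are made up), descriptive summaries and a basic hypothesis test can be computed as follows:
import numpy as np
from scipy import stats
sample = np.array([78, 85, 92, 88, 76, 81, 90])
# Descriptive statistics
print("Mean:", np.mean(sample))
print("Median:", np.median(sample))
print("Std deviation:", np.std(sample, ddof=1))
# Inferential statistics: one-sample t-test against a hypothesised mean of 80
t_stat, p_value = stats.ttest_1samp(sample, popmean=80)
print("t-statistic:", t_stat, "p-value:", p_value)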
5. Machine Learning: Supervised learning involves models that utilize
labeled data to make predictions, employing algorithms such as regression
and classification techniques like decision trees and support vector
machines (SVM). In contrast, unsupervised learning identifies patterns in
data without labels, using methods like clustering and dimensionality
reduction to uncover hidden structures. Reinforcement learning, on the
other hand, focuses on learning optimal actions based on feedback or
rewards in dynamic environments, enabling agents to improve their
decision-making over time.
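A minimal supervised-learning sketch (assuming scikit-learn is installed; the tiny dataset is made up for illustration):
from sklearn.tree import DecisionTreeClassifier
X = [[3.0], [4.5], [5.0], [2.0], [6.0]]   # feature: study hours
y = [0, 1, 1, 0, 1]                       # label: 0 = fail, 1 = pass
model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[4.0]]))             # predicted class for 4 study hours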
6. Data Mining: Data mining involves techniques for extracting insights
from large datasets, including pattern recognition, which identifies and
extracts meaningful trends. Association rules, such as market basket
analysis, uncover relationships between data points, revealing item
purchase patterns. Additionally, clustering groups similar items based on
attributes using methods like K-means and DBSCAN, facilitating a better
understanding of the dataset's structure and aiding in informed decision-
making.
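A minimal clustering sketch with K-means (assuming scikit-learn is installed; the 2-D points are made up):
import numpy as np
from sklearn.cluster import KMeans
points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # coordinates of the two cluster centres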
7. Programming and Scripting: Programming and scripting are vital in data
analysis, utilizing various languages and libraries. Python is prominent,
featuring libraries like Pandas for data manipulation, NumPy for numerical
computations, Scikit-learn for machine learning, and TensorFlow for deep
learning. R is popular for statistical analysis and machine learning due to
its extensive packages. SQL is essential for querying relational databases
and manipulating data, while specialized environments like SAS and
Matlab offer advanced tools for data manipulation and statistical modeling.
8. Big Data Technologies: Big Data technologies are important for managing
and processing large datasets effectively. Hadoop and Spark are key
distributed computing frameworks that enable parallel processing and
efficient data analysis. NoSQL databases like MongoDB and Cassandra
offer flexible, non-relational data storage solutions. Additionally, cloud
computing platforms such as AWS, Google Cloud, and Azure provide
scalable resources for data processing and analytics, enhancing the
deployment and management of big data applications.
9. Data Management: Database management involves various practices
essential for organizing and analyzing large sets of structured data. Data
warehousing focuses on the efficient organization and storage of this data,
enabling comprehensive analysis. The ETL process, which stands for
Extract, Transform, Load, prepares data from various sources for analysis,
ensuring it is clean and structured appropriately. Understanding the
distinctions between SQL and NoSQL databases is also necessary, as SQL
databases are relational and structured, while NoSQL databases offer more
flexible, non-relational storage options.
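A toy ETL sketch with pandas (the file names and column names here are hypothetical, purely to illustrate the Extract-Transform-Load flow):
import pandas as pd
# Extract: read raw data from a source file
raw = pd.read_csv('sales_raw.csv')
# Transform: clean and aggregate the data
raw = raw.dropna(subset=['amount'])
raw['amount'] = raw['amount'].astype(float)
monthly = raw.groupby('month', as_index=False)['amount'].sum()
# Load: write the prepared data to a target used for analysis
monthly.to_csv('sales_monthly.csv', index=False)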
10. Ethics and Privacy in Data: Ethics and privacy in data are critical
considerations in data management. Data governance encompasses the
policies, procedures, and controls necessary to maintain data quality and
security. Adhering to data privacy regulations like GDPR and HIPAA is
essential to ensure compliance and protect individuals' personal
information. Additionally, recognizing and mitigating bias in data analysis
and machine learning models is crucial for promoting fairness and
accuracy in decision-making processes.
GDPR (General Data Protection Regulation) and HIPAA (Health Insurance
Portability and Accountability Act) are two significant regulations focused on
data protection and privacy, but they apply to different contexts and types of
information.

GDPR (General Data Protection Regulation)


- Purpose: GDPR is a comprehensive data protection law in the European Union
(EU) that governs how personal data of EU citizens is collected, processed, and
stored.

Key Features:
- Data Protection Rights: It grants individuals rights such as the right to access
their data, the right to be forgotten, and the right to data portability.
- Consent: Organizations must obtain clear consent from individuals before
processing their personal data.
- Accountability: Businesses are required to implement appropriate technical
and organizational measures to protect personal data and report data breaches
within 72 hours.
- Fines: Non-compliance can lead to significant penalties, including fines of up
to 4% of annual global revenue or €20 million, whichever is higher.

HIPAA (Health Insurance Portability and Accountability Act)


- Purpose: HIPAA is a U.S. regulation that sets standards for the protection of
sensitive patient health information.
Key Features:
- Protected Health Information (PHI): HIPAA defines and safeguards PHI,
which includes any individually identifiable health information held by covered
entities.
- Privacy and Security Rules: It mandates the confidentiality, integrity, and
availability of PHI, requiring healthcare providers and related entities to
implement security measures.
- Patient Rights: Individuals have the right to access their medical records,
request corrections, and receive notifications of breaches affecting their health
information.
- Penalties: Violations can result in civil and criminal penalties, including fines
and, in severe cases, imprisonment.

In summary, while both GDPR and HIPAA aim to protect personal data, GDPR
focuses on the privacy rights of individuals in the EU regarding all types of
personal data, whereas HIPAA specifically addresses the privacy and security of
health information in the United States.

Quantitative Data
Quantitative data refers to numerical information that can be measured or
counted. This type of data is used for statistical analysis and often involves
operations like addition, subtraction, or averaging.
• Types:
o Discrete Data: Countable values (e.g., number of students).
o Continuous Data: Measurable values within a range (e.g., height,
weight).
Example in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Creating a DataFrame with quantitative data
data = {
'Student_ID': [1, 2, 3, 4, 5],
'Math_Score': [78, 85, 92, 88, 76],
'Science_Score': [72, 89, 95, 84, 80],
'Study_Hours': [3.5, 4.0, 5.0, 4.5, 3.0]
}
df = pd.DataFrame(data)
# Statistical Analysis
print("Summary Statistics:")
print(df[['Math_Score', 'Science_Score', 'Study_Hours']].describe())
# Plotting the data
plt.scatter(df['Study_Hours'], df['Math_Score'])
plt.title("Study Hours vs Math Score")
plt.xlabel("Study Hours")
plt.ylabel("Math Score")
plt.show()
Output:
• The summary statistics provide the mean, median, standard deviation, etc.
• The scatter plot visualizes the relationship between study hours and math
scores.

Qualitative Data
Qualitative data refers to non-numerical information that describes qualities or
characteristics. This type of data is often categorical and used for classification or
grouping.
• Types:
o Nominal Data: Categories without an order (e.g., colors, gender).
o Ordinal Data: Categories with a meaningful order (e.g., satisfaction
levels).
Example in Python:
1. Customer Feedback in a Shopping App
• Data:
o "Satisfied," "Neutral," "Dissatisfied," "Very Satisfied."
• Type: Ordinal Data (since there is an order of satisfaction levels).
Python Example:
import pandas as pd
data = {'Customer_ID': [101, 102, 103, 104],
'Feedback': ['Satisfied', 'Neutral', 'Dissatisfied', 'Very Satisfied']}
df = pd.DataFrame(data)
print(df)
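Since the feedback values are ordinal, the order can be made explicit with pandas' ordered Categorical type; a small, optional extension of the example above:
order = ['Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
df['Feedback'] = pd.Categorical(df['Feedback'], categories=order, ordered=True)
print(df.sort_values('Feedback'))   # rows sorted by satisfaction level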

2. Product Categories
• Data:
o "Electronics," "Clothing," "Groceries," "Books."
• Type: Nominal Data (no inherent order among categories).
Python Example:
data = {'Product_ID': [1, 2, 3, 4],
'Category': ['Electronics', 'Clothing', 'Groceries', 'Books']}
df = pd.DataFrame(data)
print(df)

3. Employee Roles
• Data:
o "Manager," "Engineer," "Analyst," "Technician."
• Type: Nominal Data.
Python Example:
data = {'Employee_ID': [1, 2, 3, 4],
'Role': ['Manager', 'Engineer', 'Analyst', 'Technician']}
df = pd.DataFrame(data)
print(df)

4. Movie Genres
• Data:
o "Action," "Comedy," "Drama," "Horror," "Sci-Fi."
• Type: Nominal Data.
Python Example:
data = {'Movie_ID': [101, 102, 103, 104],
'Genre': ['Action', 'Comedy', 'Drama', 'Horror']}
df = pd.DataFrame(data)
print(df)

5. Education Levels
• Data:
o "High School," "Bachelor's," "Master's," "PhD."
• Type: Ordinal Data (since education levels follow a meaningful order).
Python Example:
data = {'Person_ID': [1, 2, 3, 4],
'Education_Level': ["High School", "Bachelor's", "Master's", "PhD"]}
df = pd.DataFrame(data)
print(df)

6. Car Colors
• Data:
o "Red," "Blue," "Black," "White," "Green."
• Type: Nominal Data.
Python Example:
data = {'Car_ID': [1001, 1002, 1003, 1004],
'Color': ['Red', 'Blue', 'Black', 'White']}
df = pd.DataFrame(data)
print(df)

7. Survey Responses
• Data:
o "Yes," "No," "Maybe."
• Type: Nominal Data.
Python Example:
data = {'Respondent_ID': [1, 2, 3],
'Response': ['Yes', 'No', 'Maybe']}
df = pd.DataFrame(data)
print(df)

8. Marital Status
• Data:
o "Single," "Married," "Divorced," "Widowed."
• Type: Nominal Data.
Python Example:
data = {'Person_ID': [1, 2, 3, 4],
'Marital_Status': ['Single', 'Married', 'Divorced', 'Widowed']}
df = pd.DataFrame(data)
print(df)

9. Weather Descriptions
• Data:
o "Sunny," "Cloudy," "Rainy," "Windy."
• Type: Nominal Data.
Python Example:
data = {'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday'],
'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Windy']}
df = pd.DataFrame(data)
print(df)

10. Social Media Sentiments


• Data:
o "Positive," "Negative," "Neutral."
• Type: Ordinal Data (can also be nominal if no order is implied).


Python Example:
data = {'Post_ID': [1, 2, 3],
'Sentiment': ['Positive', 'Negative', 'Neutral']}
df = pd.DataFrame(data)
print(df)
Key Differences

Aspect         | Quantitative Data                       | Qualitative Data
Nature         | Numerical (e.g., age, salary)           | Non-numerical (e.g., gender, color)
Analysis Type  | Statistical and mathematical analysis   | Classification and grouping
Representation | Numbers                                 | Text or categories
Visualization  | Line charts, scatter plots, histograms  | Bar charts, pie charts

Applications in Python:
1. Quantitative: Calculating trends, correlation, regression analysis.

2. Qualitative: Sentiment analysis, clustering, decision tree classification.

Both types of data are often used together to provide a comprehensive analysis in data science
projects.
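As a small combined illustration (the values are made up), a quantitative column can be analysed with correlation while a qualitative column is summarised by category counts:
import pandas as pd
df = pd.DataFrame({
'Study_Hours': [3.5, 4.0, 5.0, 4.5, 3.0],    # quantitative
'Math_Score': [78, 85, 92, 88, 76],          # quantitative
'Feedback': ['Neutral', 'Satisfied', 'Very Satisfied', 'Satisfied', 'Neutral']})  # qualitative
# Quantitative: correlation between study hours and marks
print(df['Study_Hours'].corr(df['Math_Score']))
# Qualitative: frequency of each feedback category
print(df['Feedback'].value_counts())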

Unit 3
NumPy Arrays (ndarray)
An array object represents a multidimensional, homogeneous array of
fixed-size items. An associated data-type object describes the format of each
element in the array (its byte-order, how many bytes it occupies in memory,
whether it is an integer, a floating-point number, or something else, etc.)
Arrays should normally be constructed using array, zeros or empty; the
ndarray(...) constructor itself is a low-level method for instantiating an array.
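A short sketch of these constructors:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)   # build from nested lists
z = np.zeros((2, 3))                                    # pre-filled with zeros
e = np.empty((2, 3))                                    # uninitialised (contents are arbitrary)
print(a.shape, a.dtype)   # (2, 3) int32
print(z)
print(e)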

Indexing on ndarray
ndarrays can be indexed using the standard Python x[obj] syntax,
where x is the array and obj the selection. There are different kinds of indexing
available depending on obj: basic indexing, advanced indexing and field access.
Most of the following examples show the use of indexing when referencing data
in an array. The examples work just as well when assigning to an array.
Note that in Python, x[(exp1, exp2, ..., expN)] is equivalent
to x[exp1, exp2, ..., expN]; the latter is just syntactic sugar for the former.
Basic indexing
Single element indexing
Single element indexing works exactly like that for other standard Python
sequences. It is 0-based, and accepts negative indices for indexing from the end
of the array.
>>> x = np.arange(10)
>>> x[2]
2
>>> x[-2]
8
It is not necessary to separate each dimension’s index into its own set of square
brackets.
>>> x.shape = (2, 5) # now x is 2-dimensional
>>> x[1, 3]
8
>>> x[1, -1]
9

Note that if one indexes a multidimensional array with fewer indices than
dimensions, one gets a sub dimensional array. For example:
>>> x[0]
array([0, 1, 2, 3, 4])
That is, each index specified selects the array corresponding to the rest of
the dimensions selected. In the above example, choosing 0 means that the
remaining dimension of length 5 is being left unspecified, and that what is
returned is an array of that dimensionality and size. It must be noted that the
returned array is a view, i.e., it is not a copy of the original, but points to the same
values in memory as does the original array. In this case, the 1-D array at the first
position (0) is returned. So using a single index on the returned array, results in a
single element being returned. That is:
>>> x[0][2]
2
So note that x[0, 2] == x[0][2] though the second case is more inefficient as a new
temporary array is created after the first index that is subsequently indexed by 2.

Slicing and striding


Basic slicing extends Python’s basic concept of slicing to N dimensions. Basic
slicing occurs when obj is a slice object (constructed by start:stop:step notation
inside of brackets), an integer, or a tuple of slice objects and integers. Ellipsis and
newaxis objects can be interspersed with these as well.
The simplest case of indexing with N integers returns an array scalar representing
the corresponding item. As in Python, all indices are zero-based: for the i-th
index ni, the valid range is 0≤ni<di where di is the i-th element of the shape of
the array. Negative indices are interpreted as counting from the end of the array
(i.e., if ni<0, it means ni+di).
All arrays generated by basic slicing are always views of the original array.
The standard rules of sequence slicing apply to basic slicing on a per-dimension
basis (including using a step index). Some useful concepts to remember include:
• The basic slice syntax is i:j:k where i is the starting index, j is the stopping
index, and k is the step (k ≠ 0). This selects the m elements (in the corresponding
dimension) with index values i, i + k, ..., i + (m - 1)k, where m = q + (r ≠ 0) and
q and r are the quotient and remainder obtained by dividing j - i by k:
j - i = qk + r, so that i + (m - 1)k < j. For example:
• >>> x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
• >>> x[1:7:2]
• array([1, 3, 5])
• Negative i and j are interpreted as n + i and n + j where n is the number of
elements in the corresponding dimension. Negative k makes stepping go towards
smaller indices. From the above example:
• >>> x[-2:10]
• array([8, 9])
• >>> x[-3:3:-1]
• array([7, 6, 5, 4])
• Assume n is the number of elements in the dimension being sliced. Then, if i is
not given it defaults to 0 for k > 0 and n - 1 for k < 0 . If j is not given it defaults
to n for k > 0 and -n-1 for k < 0 . If k is not given it defaults to 1. Note that :: is
the same as : and means select all indices along this axis. From the above
example:
• >>> x[5:]
• array([5, 6, 7, 8, 9])
• If the number of objects in the selection tuple is less than N, then : is assumed for
any subsequent dimensions. For example:
• >>> x = np.array([[[1],[2],[3]], [[4],[5],[6]]])
• >>> x.shape
• (2, 3, 1)
• >>> x[1:2]
• array([[[4],
•         [5],
•         [6]]])
• An integer, i, returns the same values as i:i+1 except the dimensionality of the
returned object is reduced by 1. In particular, a selection tuple with the p-th
element an integer (and all other entries :) returns the corresponding sub-array
with dimension N - 1. If N = 1 then the returned object is an array scalar.
• If the selection tuple has all entries : except the p-th entry which is a slice
object i:j:k, then the returned array has dimension N formed by stacking, along
the p-th axis, the sub-arrays returned by integer indexing of elements i, i+k, …, i
+ (m - 1) k < j.
• Basic slicing with more than one non-: entry in the slicing tuple, acts like repeated
application of slicing using a single non-: entry, where the non-: entries are
successively taken (with all other non-: entries replaced by :).
Thus, x[ind1, ..., ind2,:] acts like x[ind1][..., ind2, :] under basic slicing.
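This equivalence can be checked directly; a small sketch with a 4-D array:
import numpy as np
x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
a = x[1, ..., 2, :]          # the ellipsis expands to the axes in the middle
b = x[1][..., 2, :]
print(a.shape)               # (3, 5)
print(np.array_equal(a, b))  # True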

Array Concatenation
import numpy as np
# Creating two arrays
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
# Concatenating along axis 0 (row-wise)
concatenated = np.concatenate((arr1, arr2), axis=0)
print(concatenated)
# Concatenating along axis 1 (column-wise)
concatenated_col = np.concatenate((arr1, arr2), axis=1)
print(concatenated_col)
Output:
# Concatenation along axis 0
[[1 2]
[3 4]
[5 6]
[7 8]]
# Concatenation along axis 1
[[1 2 5 6]
[3 4 7 8]]

Splitting Array
# Creating an array
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# Splitting into 2 arrays along axis 1 (column-wise)
split_arr = np.hsplit(arr, 2)
print(split_arr)
# Splitting into 2 arrays along axis 0 (row-wise)
split_arr_row = np.vsplit(arr, 2)
print(split_arr_row)
Output:
# Splitting column-wise
[array([[1, 2],
[5, 6]]), array([[3, 4],
[7, 8]])]
# Splitting row-wise
[array([[1, 2, 3, 4]]), array([[5, 6, 7, 8]])]

Shape manipulation
In Python's NumPy library, shape manipulation allows you to change the structure
of arrays without changing the data they contain. Common shape manipulation
functions include reshaping, flattening, transposing, expanding, and squeezing
arrays.
Common shape manipulation methods in NumPy include:
1. Reshape
• Changes the shape of an array to a specified new shape, provided the total
number of elements remains the same.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
Output:
[[1 2 3]
[4 5 6]]
2. Flatten
• Converts a multi-dimensional array into a 1D array.
arr = np.array([[1, 2, 3], [4, 5, 6]])
flattened_arr = arr.flatten()
print(flattened_arr)
Output:
[1 2 3 4 5 6]
3. Transpose
• Reverses or permutes the axes of an array, commonly used for matrices.
arr = np.array([[1, 2, 3], [4, 5, 6]])
transposed_arr = arr.T
print(transposed_arr)
Output:
[[1 4]
[2 5]
[3 6]]
4. Expand Dimensions
• Adds an extra dimension to an array, useful for aligning shapes for
operations like broadcasting.
arr = np.array([1, 2, 3])
expanded_arr = np.expand_dims(arr, axis=0)
print(expanded_arr)
print("Shape:", expanded_arr.shape)
Output:
[[1 2 3]]
Shape: (1, 3)
5. Squeeze
• Removes single-dimensional entries from the shape of an array, often used
to simplify results.
arr = np.array([[[1, 2, 3]]])
squeezed_arr = np.squeeze(arr)
print(squeezed_arr)
print("Shape:", squeezed_arr.shape)
Output:
[1 2 3]
Shape: (3,)

Array Manipulations:
Array manipulation in Python, especially with NumPy, allows for powerful
operations like adding, removing, splitting, and modifying elements.
Some common array manipulation techniques:
1. Appending Elements
• Use np.append() to add elements to an array. It returns a new array with the
appended values.
import numpy as np
arr = np.array([1, 2, 3])
appended_arr = np.append(arr, [4, 5, 6])
print(appended_arr)
Output:
[1 2 3 4 5 6]
2. Inserting Elements
• Use np.insert() to insert values at a specific index.
arr = np.array([1, 2, 3])
inserted_arr = np.insert(arr, 1, [9, 10])
print(inserted_arr)
Output:
[ 1 9 10 2 3]
Here, [9, 10] is inserted starting at index 1.

3. Deleting Elements
• Use np.delete() to remove elements at specific indices.
arr = np.array([1, 2, 3, 4, 5])
deleted_arr = np.delete(arr, [1, 3]) # Remove elements at indices 1 and 3
print(deleted_arr)
Output:
[1 3 5]
4. Concatenating Arrays
• Combine arrays along an existing axis using np.concatenate().
arr1 = np.array([1, 2])
arr2 = np.array([3, 4])
concatenated_arr = np.concatenate((arr1, arr2))
print(concatenated_arr)
Output:
[1 2 3 4]
5. Splitting Arrays
• Use np.split() to split an array into multiple sub-arrays.
arr = np.array([1, 2, 3, 4, 5, 6])
split_arr = np.split(arr, 3) # Split into 3 equal parts
print(split_arr)
Output:
[array([1, 2]), array([3, 4]), array([5, 6])]

6. Reshaping Arrays
• reshape() changes the shape of an array without modifying the data.
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
Output:
[[1 2 3]
[4 5 6]]
7. Flattening Arrays
• Convert a multi-dimensional array into a 1D array with flatten().
arr = np.array([[1, 2], [3, 4]])
flattened_arr = arr.flatten()
print(flattened_arr)
Output:
[1 2 3 4]
8. Stacking Arrays
• Stack arrays along a new axis using np.vstack() for vertical stacking or
np.hstack() for horizontal stacking.
arr1 = np.array([1, 2])
arr2 = np.array([3, 4])
vstacked_arr = np.vstack((arr1, arr2))
hstacked_arr = np.hstack((arr1, arr2))
print("Vertical Stack:\n", vstacked_arr)
print("Horizontal Stack:\n", hstacked_arr)

Output:
Vertical Stack:
[[1 2]
[3 4]]
Horizontal Stack:
[1 2 3 4]
9. Reversing an Array
• Reverse an array with slicing or by using np.flip().
arr = np.array([1, 2, 3, 4, 5])
reversed_arr = np.flip(arr)
print(reversed_arr)
Output:
[5 4 3 2 1]
10. Repeating Elements
• Use np.repeat() to repeat each element a specified number of times.
arr = np.array([1, 2, 3])
repeated_arr = np.repeat(arr, 2)
print(repeated_arr)
Output:
[1 1 2 2 3 3]

Vectorization:
In Python, vectorization refers to performing operations on entire arrays rather
than individual elements, allowing for faster execution, especially with large
datasets. Libraries like NumPy provide tools to make operations on entire arrays
faster and more memory-efficient.
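A rough timing comparison of a Python loop against the vectorized form (exact numbers depend on the machine; the point is the relative difference):
import time
import numpy as np
data = np.arange(1_000_000)
start = time.perf_counter()
squares_loop = [x * x for x in data]      # element-by-element Python loop
loop_time = time.perf_counter() - start
start = time.perf_counter()
squares_vec = data * data                 # single vectorized operation
vec_time = time.perf_counter() - start
print("loop:", round(loop_time, 4), "s, vectorized:", round(vec_time, 4), "s")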
1. Adding Two Arrays
Let's add two arrays element-wise using vectorization.
import numpy as np
# Creating two arrays
array1 = np.array([1, 2, 3, 4])
array2 = np.array([10, 20, 30, 40])
# Adding arrays using vectorized operation
result = array1 + array2
print(result)
Output:
[11 22 33 44]
2. Scalar Operations on Arrays
Performing a scalar operation on each element in an array without a loop.
# Multiply each element in the array by 5
array = np.array([1, 2, 3, 4, 5])
result = array * 5
print(result)
Output:
[ 5 10 15 20 25]
3. Element-wise Multiplication
In this example, we'll multiply two arrays element by element.
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
# Element-wise multiplication
result = array1 * array2
print(result)
Output:
[ 5 12 21 32]
4. Using Mathematical Functions on Arrays
Vectorized operations can be applied using mathematical functions on entire
arrays.
# Creating an array
array = np.array([0, np.pi / 2, np.pi, 3 * np.pi / 2])
# Applying the sine function to each element
result = np.sin(array)
print(result)
Output:
[ 0.0000000e+00 1.0000000e+00 1.2246468e-16 -1.0000000e+00]
5. Boolean Indexing
Vectorization also allows for conditional operations on arrays.
# Creating an array
array = np.array([1, 2, 3, 4, 5])
# Get elements greater than 3
result = array[array > 3]
print(result)
Output:
[4 5]

6. Dot Product of Vectors


The dot product of vectors is a common operation that can be vectorized in
Python.
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
# Dot product
result = np.dot(array1, array2)
print(result)
Output:
32
7. Matrix Multiplication
Matrix multiplication can also be efficiently performed through vectorization.
# Creating two 2x2 matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])
# Matrix multiplication
result = np.dot(matrix1, matrix2)
print(result)
Output:
[[19 22]
[43 50]]

Broadcasting
In NumPy, broadcasting refers to the ability to perform element-wise
operations on arrays of different shapes in a way that avoids making
unnecessary copies of data. Broadcasting allows NumPy to "stretch" smaller
arrays along specific dimensions so that they are compatible with larger arrays
when performing operations like addition, multiplication, and comparison.
Instead of forcing arrays to have the same shape by manually reshaping or
duplicating data, broadcasting automatically adjusts the shapes of the arrays so
that operations can be performed efficiently.
Broadcasting Rules
For broadcasting to work, NumPy follows specific rules to determine how arrays
of different shapes are treated:
1. Rule 1: If the arrays differ in the number of dimensions, prepend ones to
the shape of the smaller array until both arrays have the same number of
dimensions.
2. Rule 2: If the size of the dimensions of the arrays match or one of the
dimensions is 1, the arrays are compatible in that dimension and can be
broadcasted.
3. Rule 3: If the arrays are not compatible based on the above two rules,
broadcasting will fail, resulting in a ValueError.
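These rules can be checked programmatically; a minimal sketch (np.broadcast_shapes requires NumPy 1.20 or newer):
import numpy as np
print(np.broadcast_shapes((2, 3), (3,)))    # (2, 3) -> compatible (Rules 1 and 2)
print(np.broadcast_shapes((3, 1), (1, 4)))  # (3, 4) -> both sides are stretched
try:
    np.broadcast_shapes((2, 2), (3,))       # incompatible -> Rule 3
except ValueError as err:
    print("Not broadcastable:", err)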

Example 1: Broadcasting a Scalar with an Array


Broadcasting is simplest when a scalar is involved. NumPy broadcasts the scalar
across the entire array.
import numpy as np
arr = np.array([1, 2, 3, 4])
scalar = 10
# Broadcasting the scalar to each element in the array
result = arr + scalar
print(result)

Output:
[11 12 13 14]
In this example, the scalar 10 is treated as if it were an array of the same shape as
arr ([10, 10, 10, 10]), and the addition is applied element-wise.
Example 2: Broadcasting a 1D Array to a 2D Array
Broadcasting also allows you to apply operations between arrays of different
dimensions. Let’s take a 2D array and a 1D array.
import numpy as np
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
vector = np.array([10, 20, 30])
# Broadcasting the 1D array 'vector' across each row of the 2D array 'matrix'
result = matrix + vector
print(result)
Output:
[[11 22 33]
[14 25 36]]
Here, the shape of matrix is (2, 3), and the shape of vector is (3). Since the number
of columns (3) matches, vector is broadcasted to each row of matrix. NumPy
treats the vector as if it had shape (1, 3) and replicates it to match the (2, 3) shape
of matrix.
Example 3: Broadcasting with Arrays of Different Shapes
In this case, let’s broadcast arrays of different dimensions.
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
b = np.array([[10],
[20],
[30]])

# Broadcasting the column vector across each row


result = a + b
print(result)
Output:
[[11 12 13]
[24 25 26]
[37 38 39]]
Here:
• a has shape (3, 3).
• b has shape (3, 1) (a column vector).
NumPy broadcasts b to match the shape of a by "stretching" b along the columns,
treating b as if it were:
[[10, 10, 10],
[20, 20, 20],
[30, 30, 30]]
Then, the element-wise addition is performed.
Example 4: Broadcasting Across Multiple Dimensions
Let’s look at an example with higher-dimensional arrays.
import numpy as np
# 3D array of shape (2, 3, 4)
a = np.array([[[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]],

[[13, 14, 15, 16],


[17, 18, 19, 20],
[21, 22, 23, 24]]])
# 1D array of shape (4)
b = np.array([1, 2, 3, 4])
# Broadcasting 'b' across the last dimension of 'a'
result = a + b
print(result)
Output:
[[[ 2 4 6 8]
[ 6 8 10 12]
[10 12 14 16]]

[[14 16 18 20]
[18 20 22 24]
[22 24 26 28]]]
Here:
• a has shape (2, 3, 4) (a 3D array),
• b has shape (4) (a 1D array).
The array b is broadcast across the last dimension of a so that it applies to each
subarray of shape (3, 4) in a.
Example 5: Incompatible Shapes (Broadcasting Failure)
Broadcasting will fail if the shapes of the arrays are not compatible under the
broadcasting rules.
import numpy as np
a = np.array([[1, 2],
[3, 4]])
b = np.array([1, 2, 3])
# This will raise a ValueError because the shapes are incompatible
result = a + b
Error:
ValueError: operands could not be broadcast together with shapes (2,2) (3,)
In this case:
• a has shape (2, 2).


• b has shape (3).
These shapes are not compatible because the number of columns in a (which is
2) does not match the size of b (which is 3), and neither can be broadcasted to fit
the other.
Example 6: Broadcasting with Uneven Shapes
Broadcasting can also handle cases where only one of the arrays has a dimension
of size 1, enabling "stretching."
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6]])
b = np.array([[1],
[2]])
# Broadcasting 'b' to match the shape of 'a'
result = a * b
print(result)
Output:
[[ 1 2 3]
[ 8 10 12]]
Here:
• a has shape (2, 3).
• b has shape (2, 1).
b is broadcasted along the second dimension, effectively treating it as:
[[1, 1, 1],
[2, 2, 2]]
The multiplication is then performed element-wise.

Structured Arrays
A structured array in NumPy is a specialized ndarray that allows for the
storage and manipulation of complex, heterogeneous data, where each element of
the array is a collection of fields, each with its own name and data type. This
enables users to represent tabular or record-based data (e.g., rows of a database
table) with a more structured format than a typical homogeneous ndarray.
Structured arrays in NumPy allow you to store arrays of complex data types
with named fields, like a table in a database. Each field can have a different data
type, allowing you to represent data more clearly and efficiently.
Features of a Structured Array:
1. Named Fields: Each element in the structured array consists of fields with
distinct names, similar to columns in a database or fields in a record.
2. Heterogeneous Data Types: Fields can have different data types (e.g.,
integers, floats, strings), allowing for complex data structures.
3. Efficient Memory Management: Structured arrays ensure efficient
memory storage and access through fixed-size fields, enabling fast
computations and data retrieval.
4. Field Access: You can access fields by name (e.g., array['field_name']),
allowing for clear and intuitive data manipulation.
5. Support for Nested Structures: Structured arrays can support fields
containing other arrays or even nested structured arrays, enabling complex
data hierarchies.
Syntax for Creating a Structured Array:

A structured array is created using a custom dtype (data type) specification that
defines the names and data types of each field.

Creating a Structured Array

You can create a structured array by defining a dtype with named fields. Each
field is assigned a data type and optionally a shape.

Example 1: Basic Structured Array

import numpy as np
# Define structured array with named fields
person_dtype = np.dtype([('name', 'S20'), ('age', 'i4'), ('height', 'f4')])
# Create a structured array
people = np.array([('Ram', 25, 5.5), ('Shyam', 30, 6.0)], dtype=person_dtype)
# Access the structured array
print(people)

Output:

array([(b'Ram', 25, 5.5), (b'Shyam', 30, 6. )],


dtype=[('name', 'S20'), ('age', '<i4'), ('height', '<f4')])

Here, each entry has a name (string), age (integer), and height (float).

In structured arrays in NumPy, S20, i4, and f4 refer to data types and their sizes.
These codes define the type of each field in the array. Let’s break them down:

1. S20 (String Data Type)

• S: Refers to a string (character) data type.


• 20: Specifies the maximum number of bytes (characters) the string can
have. In this case, S20 means the string can have up to 20 characters.

2. i4 (Integer Data Type)

• i: Refers to a signed integer data type.


• 4: Specifies the number of bytes (size) used to store the integer. In this
case, i4 means a 4-byte (32-bit) signed integer.

3. f4 (Floating-Point Data Type)

• f: Refers to a floating-point (decimal) data type.


• 4: Specifies the number of bytes used to store the floating-point number.
In this case, f4 means a 4-byte (32-bit) floating-point number, which is
equivalent to a float in Python.

In the array representation array([(b'Ram', 25, 5.5)]), the b before the string
'Ram' indicates that the string is a byte string or a bytes literal, rather than a
regular Unicode string. In Python, strings can be stored as either Unicode or
bytes:

• Unicode String: A regular string in Python, represented by str. It


supports various encodings like UTF-8 and can handle characters from
different languages.
• Byte String: A sequence of bytes (binary data), represented by bytes in
Python, which is why the letter b precedes the string literal.

Accessing Fields in Structured Arrays

You can access individual fields (columns) in a structured array using the field
names.

Example 2: Accessing a Field

# Access the 'name' field


print(people['name'])
# Access the 'age' field
print(people['age'])

Output:

[b'Ram' b'Shyam']
[25 30]

We can treat fields like individual arrays, which can be useful for data
manipulation.

Modifying Structured Arrays

You can modify the values in structured arrays using standard NumPy array
indexing and assignment.

Example 3: Modifying a Field

# Modify 'age' field for the first person


people['age'][0] = 26
# Print the updated array
print(people)

Output:

array([(b'Ram', 26, 5.5), (b'Shyam', 30, 6. )],


dtype=[('name', 'S20'), ('age', '<i4'), ('height', '<f4')])

Complex Data Types

Structured arrays also support more complex data types, such as arrays within
fields.

Example 4: Structured Array with Arrays in Fields


# Define a structured array with a field containing an array


complex_dtype = np.dtype([('name', 'S20'), ('grades', 'i4', (3,))])
students = np.array([('Ram', [85, 90, 92]), ('Shyam', [75, 80, 85])],
dtype=complex_dtype)
# Access the array
print(students)
# Access the grades field
print(students['grades'])

Output:

array([(b'Ram', [85, 90, 92]), (b'Shyam', [75, 80, 85])],


dtype=[('name', 'S20'), ('grades', '<i4', (3,))])
[[85 90 92]
[75 80 85]]

Here, each student has a list of three grades stored in the grades field.

Nested Structured Arrays

You can even nest structured arrays, where one field is itself another structured
array.

Example 5: Nested Structured Array

# Define nested dtype


address_dtype = np.dtype([('street', 'S20'), ('city', 'S20')])
person_dtype = np.dtype([('name', 'S20'), ('age', 'i4'), ('address', address_dtype)])

# Create array with nested dtype


people = np.array([('Ram', 25, ('123 Ave', 'New York')),
('Shyam', 30, ('456 St', 'Chicago'))], dtype=person_dtype)

# Access the nested field


print(people['address']['city'])

Output:

[b'New York' b'Chicago']

Operations on Structured Arrays


You can perform NumPy operations on structured arrays, such as sorting or
filtering by field.

Example 6: Sorting by Field

# Sort by 'age' field


sorted_people = np.sort(people, order='age')

print(sorted_people)

Output:

array([(b'Ram', 25, (b'123 Ave', b'New York')), (b'Shyam', 30, (b'456 St',
b'Chicago'))],
dtype=[('name', 'S20'), ('age', '<i4'), ('address', [('street', 'S20'), ('city',
'S20')])])

In this example, we sorted the structured array by the age field.

Reading and Writing Array Data


In Python, you can read and write array data using libraries like NumPy,
which provides efficient methods to handle array-like data structures. Let's go
over the basic methods for reading and writing arrays.

1. Reading Array Data

Reading array data typically means loading arrays from files or converting data
into arrays.

• From a list (manually creating an array): You can manually create an


array from a Python list using NumPy's array() function.

Example:

import numpy as np

# Creating an array from a list

data = [1, 2, 3, 4, 5]

array = np.array(data)
print(array)

Output:

[1 2 3 4 5]

• From a file: You can read array data from a file using functions like
np.loadtxt() or np.genfromtxt(), which are useful for text files.

Example: Let's assume you have a file called data.txt with the following content:

1, 2, 3
4, 5, 6
7, 8, 9

You can read this file as follows:

import numpy as np

# Reading array from a file

array = np.loadtxt('data.txt', delimiter=',')

print(array)

Output:

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

2. Writing Array Data

You can write array data to files using NumPy's built-in functions such as
np.savetxt() or np.save().

• Saving as text file (CSV format): You can save the array as a text file
using savetxt().

Example:

import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Writing array to a text file

np.savetxt('output.txt', array, delimiter=',', fmt='%d')

This will create a file called output.txt with the following content:

1,2,3
4,5,6
7,8,9

• Saving as a binary file (for faster I/O): You can save arrays as binary
files using np.save() for more efficient storage.

Example:

import numpy as np

array = np.array([1, 2, 3, 4, 5])

# Writing array to a binary file

np.save('array_data.npy', array)

You can later load this binary file using np.load():

array_loaded = np.load('array_data.npy')

print(array_loaded)

Output:

[1 2 3 4 5]

Summary of Functions:

• np.array(data): Convert a list to an array.


• np.loadtxt('filename', delimiter=','): Read array from a text file.
• np.genfromtxt('filename', delimiter=','): Another method to read array
data, handling missing values (see the sketch below).
• np.savetxt('filename', array, delimiter=','): Save array as a text file.
• np.save('filename.npy', array): Save array as a binary .npy file.
• np.load('filename.npy'): Load a binary .npy file.

These methods make it easy to work with arrays in both textual and binary
formats.
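A sketch of np.genfromtxt handling missing values (the file name 'data_missing.txt' is hypothetical; assume it contains comma-separated rows where one field, such as "4, , 6", is left empty):
import numpy as np
# Missing fields are replaced by the given filling value instead of nan
array = np.genfromtxt('data_missing.txt', delimiter=',', filling_values=0)
print(array)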
