Fundamentals of Data Curation Using Python
Contents
1.5.9 simple programs on strings ....................................................................................... 59
1.6 Sequence Data Types......................................................................................................... 59
1.6.1 list ................................................................................................................................. 59
1.6.2 tuple ............................................................................................................................. 60
1.6.3 Dictionary .................................................................................................................... 60
1.6.4 Indexing and accessing elements of lists, tuples and dictionaries ......................... 61
1.6.5 slicing in list, tuple ...................................................................................................... 64
1.6.6 concatenation on list, tuple and dictionary .............................................................. 66
1.6.7 Concept of mutability ............................................................................................... 68
1.6.8 Other operations on list, tuple and dictionary ......................................................... 70
1.7 Functions ............................................................................................................................ 75
1.7.1 Top-down Approach of Problem Solving .................................................................. 75
1.7.2 Modular Programming and Functions ...................................................................... 76
1.7.3 Advantages of Modular Design .................................................................................. 76
1.7.4 Function and function parameters ............................................................................ 77
1.7.5 How to Define a Function in Python.......................................................................... 77
1.7.6 How to Define and Call a Basic Function in Python ................................................. 78
1.7.7 How to Define and Call Functions with Parameters ................................................ 79
1.7.8 Local Variables ............................................................................................................ 80
1.7.9 The Return Statement ................................................................................................ 81
1.7.10 Default argument values .......................................................................................... 82
1.8 Library functions: ............................................................................................................... 84
1.8.1 input() .......................................................................................................................... 84
1.8.2 eval() ............................................................................................................................ 84
1.8.3 print() function ........................................................................................................... 85
1.8.4 String Functions: ......................................................................................................... 86
1.8.5 count() function .......................................................................................................... 86
1.8.6 find() function ............................................................................................................. 87
1.8.7 rfind() function ........................................................................................................... 87
1.8.8 Various string functions capitalize(), title(), lower(), upper() and swapcase() .... 88
1.8.9 Various string functions islower(), isupper() and istitle(). ..................................... 89
1.8.10 Replace() and strip() function ................................................................................. 90
1.8.11 numeric Functions: ................................................................................................... 91
1.8.12 Date and time functions ........................................................................................... 91
1.8.13 recursion ................................................................................................................... 92
1.8.14 Packages and modules ............................................................................................. 93
1.9 File Handling ............................................................................................................. 98
1.9.1 Introduction to File Handling in Python ................................................................... 98
1.9.1 Basic File Handling Operations in Python ................................................................ 99
1.10 Understanding the Basics of Python Libraries: ................................................ 100
1.10.1 Working of Python Library: ................................................................................ 100
1.10.2 Python standard Libraries: ................................................................................. 100
Assessment Criteria ............................................................................................................... 102
References .............................................................................................................................. 103
Exercise 1 ............................................................................................................................... 103
Multiple Choice Questions ................................................................................................. 103
State whether statement is true or false .......................................................................... 104
Fill in the blanks ................................................................................................................. 105
Lab Practice Questions ...................................................................................................... 105
Exercise 2 ............................................................................................................................... 106
Multiple Choice Questions ................................................................................................. 106
State whether statement is true or false .......................................................................... 107
Fill in the blanks ................................................................................................................. 107
Lab Practice Questions ...................................................................................................... 107
Exercise 3 ............................................................................................................................... 107
Multiple choice questions .................................................................................................. 107
State whether statement is true or false .......................................................................... 109
Fill in the blanks ................................................................................................................. 109
Lab Practice Questions ...................................................................................................... 109
Chapter 2: ................................................................................................................................... 111
Basics of Artificial Intelligence & Data Science ....................................................................... 111
2.1 Introduction to AI ............................................................................................................ 111
2.1.1 Understanding the basic concepts and evolution of Artificial
Intelligence. ........................................................................................................................ 112
2.1.2 Understanding the key components of Artificial Intelligence: ........... 112
2.4 Introduction to Data Science and Analytics.............................................................. 114
2.4.2 Framing the problem .......................................................................................... 116
2.2.2 Collecting Data .......................................................................................................... 117
2.2.3 Processing.................................................................................................................. 118
2.2.5 Cleaning and Munging Data ..................................................................................... 120
2.3 Exploratory Data Analysis .............................................................................................. 121
2.3.1 Visualizing results ..................................................................................................... 123
2.4 Types of Machine Learning Algorithms (supervised, unsupervised) ......................... 124
2.4.1 Supervised Machine Learning ................................................................................. 124
2.4.2. Unsupervised Machine Learning ............................................................................ 125
2.4.3. Semi-supervised Machine Learning ....................................................................... 126
2.4.4. Reinforcement Machine Learning .......................................................................... 127
2.5 Machine Learning Workflow .......................................................................................... 128
2.5.1 Feature engineering.................................................................................................. 128
2.5.2 Preparing Data .......................................................................................................... 129
2.5.3 Training Data, Test data ........................................................................................... 130
2.5.4 Data Validation .......................................................................................................... 131
2.5.5 Introduction to different Machine Learning Algorithms ....................................... 131
2.6 Applications of Machine Learning. ................................................................................. 132
2.6.1 Image Recognition: ................................................................................................... 133
2.6.2 Speech Recognition:.................................................................................................. 133
2.6.3 Traffic prediction: ..................................................................................................... 134
2.6.4 Product recommendations: ..................................................................................... 134
2.6.5 Self-driving cars: ....................................................................................................... 135
2.6.6 Email Spam and Malware Filtering: ........................................................................ 135
2.6.7 Virtual Personal Assistant: ....................................................................................... 136
2.6.8 Online Fraud Detection: ........................................................................................... 136
2.6.9 Stock Market trading: ............................................................................................... 136
2.6.10 Medical Diagnosis: .................................................................................................. 136
2.6.11 Automatic Language Translation: ......................................................................... 136
2.7 Common Applications of AI: ........................................................................................... 137
2.7.1 AI Application in E-Commerce:................................................................................ 137
2.7.2 Applications of Artificial Intelligence in Education: .............................................. 137
2.7.3 Applications of Artificial Intelligence in Lifestyle: ................................................. 138
2.7.4 Applications of Artificial intelligence in Navigation: ............................................. 139
2.7.5 Applications of Artificial Intelligence in Robotics: ................................................. 139
2.7.6 Applications of Artificial Intelligence in Human Resource.................................... 139
2.7.7 Applications of Artificial Intelligence in Healthcare .............................................. 139
2.7.8 Applications of Artificial Intelligence in Agriculture ............................................. 140
2.7.9 Applications of Artificial Intelligence in Gaming .................................................... 140
2.8 Advantages and Disadvantages of AI ............................................................................. 140
2.8.1 Advantages of Artificial Intelligence ....................................................................... 140
2.8.2 Disadvantages of Artificial Intelligence .................................................................. 140
2.9 Common examples of AI using python .......................................................................... 141
2.10 Introduction To Numpy ................................................................................................ 144
2.10.1 Array Processing Package ...................................................................................... 144
2.10.2 Array types .............................................................................................................. 145
2.10.3 Array slicing ............................................................................................................ 146
2.10.4 Negative Slicing ....................................................................................................... 147
2.10.5 Slicing 2-D Array ..................................................................................................... 148
2.11 Computation on NumPy Arrays – Universal functions ............................................... 149
2.11.1 Array arithmetic ..................................................................................................... 150
2.11.2 Aggregations: Min, Max, etc. .................................................................................. 151
2.11.3 Python numpy sum:................................................................................................ 152
2.11.4 Python numpy average: ......................................................................................... 152
2.11.5 Python numpy min : ............................................................................................... 153
2.11.6 Python numpy max ................................................................................................. 154
2.11.7 N-Dimensional arrays ............................................................................................ 155
2.11.8 Broadcasting ........................................................................................................... 157
2.11.9 Fancy indexing ........................................................................................................ 160
2.11.10 Sorting Arrays ....................................................................................................... 161
Assessment Criteria ............................................................................................................... 164
References .............................................................................................................................. 164
Exercise .................................................................................................................................. 165
Objective Type Question.................................................................................................... 165
Subjective Type Questions ................................................................................................ 169
True False Questions ......................................................................................................... 170
Lab Practice Questions ...................................................................................................... 170
Chapter 3: ................................................................................................................................... 172
Introduction to Data Curation .................................................................................................. 172
3.1 Introduction and scope of Data Curation ...................................................................... 172
3.2 Data curation in AI and Machine Learning .................................................................... 173
3.3 Examples of Data Curation in AI and Machine Learning .............................................. 173
3.4 Importance of Data Curation in AI and Machine Learning .......................................... 173
3.5 The Future of Data Curation in AI and Machine Learning ........................................... 174
3.6 The Data Curation Process: From Collection to Analysis ............................................. 174
3.7 Real-World Applications of Data Curation ................................................................... 176
3.8 Challenges in Data Curation............................................................................................ 177
3.9 Key Steps in Data Curation: ............................................................................................ 179
3.10 Data Collection: Sources and Methods ........................................................................ 182
3.10.1 Sources of Data Collection ..................................................................................... 182
3.10.2 Methods of Data Collection .................................................................................... 184
3.10.3 Challenges in Data Collection ................................................................................ 184
3.11 Data Cleaning: Handling Missing, Duplicate, and Inconsistent Data......................... 185
3.11.1 Handling Missing Data ........................................................................................... 185
3.11.3 Handling Inconsistent Data.................................................................................... 186
Data Curation Vs. Data Management Vs. Data Cleaning ..................................................... 186
3.12 Data Transformation: Preparing Data for Analysis .................................................... 187
3.12.1 What is Data Transformation?............................................................................... 187
3.12.2 Key Steps in Data Transformation ........................................................................ 187
3.12.3 Tools for Data Transformation .............................................................................. 188
3.12.4 Why is Data Transformation Important? ............................................................. 188
3.13 Data Storage and Organization .................................................................................... 189
3.13.1 What is Data Storage and Organization? .............................................................. 189
3.13.2 Types of Data Storage ............................................................................................. 189
3.13.3 Data Organization Techniques .............................................................................. 189
3.13.4 Data Indexing & Retrieval ...................................................................................... 190
3.13.5 Data Backup & Security .......................................................................................... 190
3.13.6 Choosing the Right Storage & Organization Strategy .......................................... 190
3.13.7 Tools for Data Curation .......................................................................................... 190
3.13.8 Python-Based Tools ................................................................................................ 191
3.13.9 No-Code/Low-Code Data Curation Tools ............................................................. 191
3.13.10 Database & Big Data Curation Tools ................................................................... 192
3.13.11 Specialized Data Curation Tools .......................................................................... 192
3.14 Different Data Types and Data Sensitivities................................................................ 193
3.14.1 Different Data Types............................................................................................... 193
3.14.2 Data Sensitivities .................................................................................................... 195
3.14.3 How AI and Machine Learning Handle Different Data Types ............................. 197
3.14.4 Hands-On Exercise: Identifying Data Types in Real-World ................................ 199
3.14.5 Data Sensitivities Scenarios ................................................................................... 202
3.14.6 Legal and Ethical Considerations in Handling Sensitive Data ............................ 204
3.14.7 Ethical Considerations in Handling Sensitive Data .............................................. 206
3.14.8 Industry-Specific Data Sensitivity ......................................................................... 207
3.14.9 Healthcare Industry: Protecting Patient Data ...................................................... 207
3.14.10 Financial Industry: Securing Transactions & Customer Data ........................... 207
3.14.11. Retail Industry: Protecting Customer & Payment Data .................................... 208
3.14.12 Industry Comparison: Data Sensitivity & Security Requirements ................... 209
3.14.13 Case Study: Data Sensitivity in Healthcare (HIPAA Compliance)..................... 209
3.14.14 Tools and Technologies for Data Curation and Sensitivity ............................... 211
3.15 Open-Source Tools for Data Curation .......................................................................... 212
3.16 Cloud-Based Data Curation Solutions .......................................................................... 214
3.17 Tools for Handling Sensitive Data ................................................................................ 217
3.18 Automating Data Curation with AI and Machine Learning ........................................ 220
3.19 Hands-On Exercise: Using Python Pandas for Data Cleaning and Transformation . 222
Assessment criteria ............................................................................................................... 225
References .............................................................................................................................. 225
Exercise .................................................................................................................................. 226
Multiple Choice Questions ................................................................................................. 226
True/False Questions: ....................................................................................................... 227
Fill in the Blanks Questions: .............................................................................................. 228
Lab Practice Questions ...................................................................................................... 228
Chapter 4 : .................................................................................................................................. 229
Data Collection & Acquisition Methods ................................................................................... 229
4.1 Data collection ................................................................................................................. 229
4.1.1 Definition and Importance of Data Collection ........................................................ 229
4.1.2 Steps Involved in Data Collection: ........................................................................... 230
4.1.3 Goal-Setting: Defining Objectives for Data Collection ........................................... 233
4.1.4 Choosing Appropriate Methods for Different Scenarios ....................................... 234
4.1.5 Real-World Applications of Data Collection (e.g., Market Research, Healthcare,
Finance) .............................................................................................................................. 237
4.1.6 Challenges in Data Collection ................................................................................... 238
4.2 Data Analysis Tool: Pandas ............................................................................................. 239
4.2.1 Introduction to the Data Analysis Library Pandas ................................................. 239
4.2.2 Pandas objects – Series and Data frames ................................................................ 240
4.2.3 Pandas Series ............................................................................................................ 240
4.2.4 Pandas Dataframe ..................................................................................................... 241
4.2.5 Nan objects ................................................................................................................ 250
4.2.6 Filtering ..................................................................................................................... 260
4.2.7 Slicing ......................................................................................................................... 263
4.2.8 Sorting ....................................................................................................................... 265
4.2.9 Ufunc ......................................................................................................................... 268
4.3 Methods of Acquiring Data ............................................................................................ 268
4.3.1 Web Scraping: Extracting Data from Websites ...................................................... 268
4.3.2 Tools for Web Scraping (e.g., BeautifulSoup, Scrapy) ............................................ 271
4.3.3 Ethical Considerations in Web Scraping ................................................................. 273
4.3.4 API Usage: Accessing Data from APIs ...................................................................... 275
4.3.5 Types of APIs (REST) ................................................................................................ 278
4.4 Data Quality Issues and Techniques for Cleaning and Transforming Data ................ 280
4.4.1 Types of Data Quality Issues: ................................................................................... 281
4.4.2 Outliers ...................................................................................................................... 282
4.4.3 Impact of Data Quality on AI and Machine Learning Models ................................ 285
4.4.4 Case Study: Identifying Data Quality Issues in a Real-World Dataset .................. 286
4.4.5 Data Cleaning: Handling Missing and Inconsistent Data ....................................... 288
4.4.6 Techniques for Imputing Missing Data ................................................................... 291
4.4.7 Removing Duplicates and Outliers..................................................................... 293
4.5 Data Transformation: Preparing Data for Analysis ...................................................... 295
4.6 Hands-On Exercise: Cleaning and Transforming a Dataset ......................................... 297
4.7 Data Enrichment Methods .............................................................................................. 298
4.7.1 Data Enrichment - Definition and Importance ....................................................... 298
4.7.2 Augmenting Datasets with External Data ............................................................... 299
4.7.3 Sources of External Data (e.g., Public Datasets, APIs) ........................................... 300
4.7.4 Text Enrichment Techniques: .................................................................................. 301
Assessment Criteria ............................................................................................................... 304
References .............................................................................................................................. 304
Exercise .................................................................................................................................. 305
Multiple Choice Questions ................................................................................................. 305
True/False Questions ........................................................................................................ 306
LAB Exercise ....................................................................................................................... 307
Chapter 5 : .................................................................................................................................. 308
Data Integration, Storage and Visualization ........................................................................... 308
5.1 Introduction to ETL Processes and Data Consolidation ............................................... 308
5.1.1 Introduction to Data Integration ............................................................................. 310
5.1.2 What is ETL? (Extract, Transform, Load) ............................................................... 312
5.1.3 Common Data Sources (Databases, APIs, CSV Files) ............................................. 313
5.1.4 Step-by-Step ETL Process ........................................................................................ 314
5.1.5 Tools for ETL ( Apache NiFi, Talend, Python Pandas) ........................................... 316
5.1.6 Data Cleaning and Transformation Techniques ..................................................... 316
5.1.7 Consolidating Data into a Unified Dataset .............................................................. 317
5.1.8 Real-World Examples of ETL in Action ................................................................... 319
5.2 Understanding Modern Data Storage Architectures - Data Lakes vs. Data Warehouses
................................................................................................................................................. 321
5.2.1 Introduction to Data Storage, Data Lake ................................................................. 323
5.2.2 Key Differences Between Data Lakes and Data Warehouses ................................ 325
5.2.3 Use Cases for Data Lakes and Data Warehouses .................................................... 327
5.2.4 Introduction to Distributed Databases ................................................................... 328
5.2.5 Cloud Storage Solutions (AWS S3, Azure Data Lake, Google Cloud Storage)....... 329
5.3 Interactive Data Visualization: Building Dashboards with Plotly and Matplotlib ..... 330
5.3.1 Introduction to Data Visualization .......................................................................... 334
5.3.2 Why Visualization Matters in Data Analysis ........................................................... 334
5.3.3 Getting Started with Matplotlib ............................................................................... 335
5.3.4 Creating Basic Charts (Line, Bar, Pie)...................................................................... 337
5.3.5 Introduction to Plotly for Interactive Visualizations ............................................. 339
5.3.6 Building Interactive Dashboards ............................................................................. 343
5.3.7 Real-Time Data Visualization ................................................................................... 345
5.4 Cloud Storage Solutions: Security, Scalability, and Compliance for Data Management
................................................................................................................................................. 348
5.4.1 Introduction to Cloud Storage ................................................................................. 352
5.4.2 Overview of Cloud Providers (AWS, Azure, Google Cloud) ................................... 353
5.4.3 Key Features of Cloud Storage (Scalability, Security, Compliance) ...................... 355
5.4.4 Data Security in the Cloud (Encryption, Access Control) ...................................... 357
5.4.5 Compliance Requirements (GDPR, HIPAA) ........................................................... 359
5.4.6 Cost Management in Cloud Storage ......................................................................... 361
5.4.7 Hands-On Project: Storing and Retrieving Data from the Cloud .......................... 362
5.4.8 Best Practices for Cloud Data Management ........................................................... 367
5.4.9 Case Studies: Cloud Storage in Real-World Scenarios ........................................... 369
5.4.10 Future of Cloud Storage ......................................................................................... 370
Assessment Criteria ............................................................................................................... 373
References .............................................................................................................................. 373
Exercise .................................................................................................................................. 374
Multiple Choice Questions ................................................................................................. 374
True False Questions ......................................................................................................... 375
Lab Practice Questions ...................................................................................................... 376
Chapter 6 : .................................................................................................................................. 377
Data Quality and Governance ................................................................................................... 377
6.1 Ensuring and Maintaining High Data Quality Standards ............................................. 377
6.1.1 Understanding Data Quality ..................................................................................... 377
6.1.2 Data Quality Metrics and Assessment ................................................................... 379
6.1.3 Ensuring Data Integrity ............................................................................................ 384
6.1.4 Data Cleansing and Standardization .................................................................... 388
6.1.5 Continuous Monitoring and Improvement ........................................................... 392
6.2 Effective Implementation and Management of Data Governance ............................. 398
6.2.1 Introduction to Data Governance ............................................................................ 398
6.2.2 Data Governance Frameworks ................................................................................ 402
6.2.3 Data Lineage and Cataloging .................................................................................... 405
6.2.4 Data Security and Privacy ....................................................................................... 409
6.2.5 Ensuring Data Accessibility and Management ....................................................... 413
6.2.6 Compliance and Regulatory Considerations .......................................................... 416
6.3 What is Data Lineage? ..................................................................................................... 420
Key Terms ........................................................................................................................... 420
Types of Data Lineage ........................................................................................................... 421
Automating Data Lineage with AI ........................................................................................ 422
Tools That Support AI-Driven Data Lineage ....................................................................... 422
Example Use Case .................................................................................................................. 422
Visualizing Data Lineage ....................................................................................................... 423
Metadata: who created it, when it last ran, and versioning. What is Data Cataloging? ... 423
Key Features of a Data Catalog ............................................................................................. 423
Role of AI in Data Cataloging ................................................................................................ 423
How Data Cataloging Fits into Your Data Ecosystem ......................................................... 423
Real-World Example .............................................................................................................. 424
Quick Steps to Implement Data Cataloging ......................................................................... 424
Assessment Criteria ............................................................................................................... 425
References .............................................................................................................................. 426
Exercise .................................................................................................................................. 426
Multiple Choice Questions: ................................................................................................ 426
True or False Questions ..................................................................................................... 427
Chapter 7 .................................................................................................................................... 429
Advanced Data Management Techniques ............................................................................... 429
7.1 Introduction to Advanced Data Management ............................................................... 429
7.1.1 Definition and Importance of Data Management ................................................... 430
7.1.2 Evolution from Traditional to Advanced Data Management Techniques ............ 432
7.1.3 Role of AI and Big Data in Modern Data Handling ................................................. 435
7.2 Data Governance Frameworks and Implementation ................................................... 439
7.2.1 Understanding Data Governance ............................................................................. 441
7.2.2 Definition and Key Components .............................................................................. 442
7.2.3 Data Security and Compliance ................................................................................. 444
7.2.4 Regulatory Frameworks (GDPR, HIPAA, CCPA) ..................................................... 446
7.2.5 Best Practices for Data Protection ........................................................................... 448
7.2.6 Ensuring Data Consistency and Quality .................................................................. 449
7.2.7 Master Data Management (MDM) ........................................................................... 450
7.2.8 Techniques for Data Validation and Cleansing ....................................................... 451
7.2.9 Case Studies in Data Governance............................................................................. 452
7.2.10 Real-World Examples of Successful Implementations ........................................ 453
7.3 AI-Assisted Data Curation Techniques .......................................................................... 454
7.3.1 Role of Machine Learning in Data Tagging ............................................................. 456
7.3.2 Application of AI in Metadata Generation and Categorization ............................. 458
7.3.3 Enhancing Data Integration with AI ........................................................................ 459
7.4 Big Data Management and Processing ........................................................................... 460
7.4.1 Introduction to Big Data Technologies ................................................................... 462
7.4.2 Big Data Tools for Data Management ...................................................................... 462
7.4.3 Hadoop Ecosystem (HDFS, MapReduce, Hive) ....................................................... 463
7.4.4 Apache Spark for Large-Scale Data Processing ...................................................... 464
7.4.5 NoSQL Databases (MongoDB, Cassandra) .............................................................. 465
7.4.6 Handling Large Datasets in Enterprise Applications ............................................. 466
7.4.7 Performance Optimization in Big Data Processing ................................................ 466
7.5 Implementing Advanced Data Management Strategies ............................................... 466
7.5.1 Integrating AI and Big Data for Efficient Data Management ................................. 467
7.5.2 Challenges in Advanced Data Management and Solutions .................................... 468
7.5.3 Future Trends and Innovations in Data Governance and AI-Assisted Management
............................................................................................................................................. 469
Assessment Criteria ............................................................................................................... 471
References .............................................................................................................................. 471
Exercise .................................................................................................................................. 472
Objective Type Question.................................................................................................... 472
True/False Questions ........................................................................................................ 473
Lab Practice Questions ...................................................................................................... 473
Chapter 8: ................................................................................................................................... 475
Application of Data Curation .................................................................................................... 475
8.1 What is Exploratory Data Analysis? ............................................................................... 475
8.2 Working with IRIS Dataset ............................................................................................. 475
8.3 Image of flowers .............................................................................................................. 475
8.4 About data and output – species .................................................................................... 476
8.4.1 Import data ................................................................................................................ 476
8.4.2 Statistical Summary .................................................................................................. 477
8.4.3 Checking Missing Values .......................................................................................... 478
8.4.4 Checking Duplicates.................................................................................................. 479
8.5 Data Visualization ............................................................................................................ 479
8.6 Histograms ....................................................................................................................... 490
8.7 Heatmaps.......................................................................................................................... 494
8.8 Box Plots ........................................................................................................................... 495
8.9 Outliers ............................................................................................................................. 498
8.10 Special Graphs with Pandas ......................................................................................... 500
Assessment Criteria ............................................................................................................... 502
References .............................................................................................................................. 502
Exercise .................................................................................................................................. 503
Multiple Choice Questions ................................................................................................. 503
True False Questions ......................................................................................................... 504
Fill in the Blanks ................................................................................................................. 505
Lab Practice Questions ...................................................................................................... 506
Syllabus Outline
Assessment Criteria
S. No. | Assessment Criteria for Performance Criteria | Theory Marks | Practical Marks | Project Marks | Viva Marks
PC 1 | Learn to set up Python with an IDE; learn variables and data types, conditional statements and methods of the Python language. | 10 | 5 | - | -
PC 2 | Learn advanced data types and file handling; utilize libraries like Pandas and NumPy for data processing and cleaning. | 10 | 10 | 4 | 4
PC 3 | Know about AI fundamentals, data science concepts, generative AI tools, and ethical considerations, including data handling and statistical techniques. | 10 | 5 | - | -
PC 4 | Understand data curation, its scope and business applications, and the characteristics of different data types. | 10 | 5 | - | -
PC 5 | Know about data collection goals, methods and planning, including ethical considerations and techniques like web scraping and APIs. | 10 | 5 | - | -
PC 6 | Learn data cleaning, transformation and enrichment techniques, deduplication, outlier detection, and leveraging AI tools for automation. | 10 | 5 | 3 | 3
PC 7 | Know about data warehouses, data management systems, distributed databases and cloud storage solutions. | 10 | 5 | - | -
PC 8 | Learn techniques for assessing and ensuring data quality, data lineage, cataloging tools, and governance frameworks. | 10 | 5 | - | -
PC 9 | Learn data presentation methods, open-source visualization tools, interactive dashboards and real-time reporting techniques. | 10 | 5 | 3 | 3
PC 10 | Know about data governance frameworks, AI-assisted data curation tools and big data tools. | 10 | 5 | - | -
PC 11 | Work on real-world data curation problems as a team and present project findings. | 0 | 5 | 10 | 10
Total Marks | | 100 | 60 | 20 | 20
Chapter 1:
Foundation in Python Programming
In today’s computing world, Python is gaining popularity day by day, and big companies including Google, Yahoo, Intel and IBM use it widely. Many reasons exist for Python’s popularity, from its free availability to its ease of use.
a. Free and Open Source: Python is free to use and available to download from its official website: https://www.python.org/
b. Easy-to-learn: Python is comparatively easier to learn and use than many other computer languages. The syntax, structures, keywords etc. used in Python are simple and easy to understand.
c. Extensive Libraries: Libraries are a major strength of Python. When you download Python, it comes with a huge standard library of built-in modules, which makes coding easier and saves valuable time.
d. Portable: A key strength of Python is its portability. Users can run Python programs on various platforms. Suppose you wrote a program on Windows and now want to run it on Linux or macOS; you can easily run your programs on Windows, Mac, Linux, Raspberry Pi, etc. In this sense, Python is a platform-independent programming language.
e. Interpreted: Python is an interpreted language, which means it does not require a separate compilation step before a program can run. Python converts its code into bytecode and executes it immediately, giving instant results. Because Python code is executed line by line, it is also easier to debug.
f. Object-Oriented: Python can be used as an object-oriented language, in which data structures and the functions that operate on them are combined in a single unit. Python supports both the object-oriented and the procedure-oriented approach to development: the object-oriented approach deals with the interaction between objects, whereas the procedure-oriented approach deals with functions only.
g. GUI Programming: Python provides many solutions for developing a Graphical User Interface (GUI) quickly and easily.
h. Database Connectivity: Python supports the databases required for the development of various projects, so programmers can pick the database best suited to their project. A few databases supported by Python are MySQL, PostgreSQL and Microsoft SQL Server.
An interpreter is a program that executes other programs. When you run a Python program, the interpreter converts the source code written by the developer into an intermediate form, which is then translated into the native machine language that is executed.
The Python code you write (stored in files with the extension “.py”) is first compiled into Python bytecode; for imported modules this bytecode may be cached in “.pyc” files inside the __pycache__ folder. The bytecode compilation happens internally and is almost completely hidden from the developer. Compilation is simply a translation step, and bytecode is a lower-level, platform-independent representation of your source code. Roughly, each of your source statements is translated into a group of bytecode instructions. This bytecode translation is performed to speed up execution: bytecode can be run much more quickly than the original source code statements.
Python is called an interpreted language because Python programs are executed by an interpreter rather than compiled ahead of time. In languages like C and C++, the program must be compiled first: the source code is converted into native machine code (a binary) before it can run. Python does not need to be converted into a binary in advance; you simply run the program directly from the source code.
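As an illustration of this behaviour (this snippet is not part of the original text), the standard library module dis can display the bytecode instructions generated for a small function; the function name add is chosen here purely as an example.

import dis

def add(a, b):
    # a tiny function whose bytecode we want to inspect
    return a + b

# dis.dis() prints the bytecode instructions the interpreter executes,
# e.g. LOAD_FAST for the arguments followed by a binary-add instruction
dis.dis(add)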
https://colab.research.google.com/drive/1grPTaO1Lo43Fbpv-A66RkvMUchXPzUGM?usp=sharing
1.2.1 Literals
A literal is a data value written directly in the code and assigned to a variable or used as a constant, such as 29, 1, “Python”, ‘Yes’ etc. The literals supported by Python are described below.
String literals are formed by surrounding text with single or double quotes, for example ‘Python’, “Hello World”, ‘We are learning Python’ etc. Strings are sequences of characters, and even numeric digits are treated as characters once they are enclosed in quotes. Multiline string literals are also allowed and are written with triple quotes, like
"""This is the world of programming
Python is the best to learn
We are working"""
A = 99       # Integer literal
B = 21.98    # Float literal
C = 5.13j    # Complex literal
Y = False    # Boolean literal
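As a quick, runnable illustration (added here; not part of the original listing), the built-in type() function shows the type Python infers from each kind of literal:

A = 99        # integer literal
B = 21.98     # float literal
C = 5.13j     # complex literal
Y = False     # Boolean literal
S = "Python"  # string literal

# print each value together with the type inferred from its literal form
for value in (A, B, C, Y, S):
    print(value, type(value))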
1.2.2 Constants
Constants are items that hold values directly; these values cannot be changed during the execution of the program, which is why they are called constants. For example,
Figure 1: Constants
During the execution of the program, the values 123, 23.56 and “Python World” cannot be modified. When you run the program, the output will be:
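The original program and its output appear only as an image in the source; a minimal sketch of the same idea (an assumption based on the description) is:

# The literal values 123, 23.56 and "Python World" are constants:
# they are written directly in the code and never change while the program runs.
print(123)            # prints 123
print(23.56)          # prints 23.56
print("Python World") # prints Python World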
Variables can store data of different types, and different types can do different things. Python supports the following built-in data types:
Figure 2: Python built-in data types — Numeric (Integer, Float, Complex number), Sequence type (Strings, List, Tuple), Dictionary, Boolean and Set
In the example above, A is assigned the value 10 and B is then assigned the value of A; as a result, B also holds the value 10, as shown in the code.
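The code itself is shown only as an image in the source; a minimal equivalent sketch is:

A = 10    # A is assigned the value 10
B = A     # B is assigned the value of A, so B also holds 10
print(A)  # 10
print(B)  # 10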
1.3.2 Expressions
Expressions are used to obtain desired intermediate or final results. An expression is a combination of values (constants, strings and variables) joined by operators. A few examples of expressions are as follows:
12 + 3
12 / 3 * (1+2)
12 / a
a*b*c
From the examples above, it is clear that expressions are combinations of operands and operators that produce the desired final or intermediate results. In the examples given, 12, 3, 1, 2, a, b and c are operands, while ‘+’, ‘/’ and ‘*’ are operators.
Expressions are written on the right-hand side (RHS) of the assignment operator, and their result is stored in a variable for future reference.
Figure 5: Expressions
In the example above, the value of A is printed as 5 after evaluating the expression 2 + 3; the value of B is 25, assigned after evaluating the expression A * 5, since the value of A is 5. The value of C is evaluated using an expression in which both operands are variables, and D is evaluated using more than one expression.
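Figure 5 is not reproduced here; the sketch below is consistent with the description (the exact expressions used for C and D in the figure may differ, so they are illustrative assumptions):

A = 2 + 3          # A evaluates to 5
B = A * 5          # B evaluates to 25, since A is 5
C = A + B          # both operands are variables (illustrative)
D = A * 2 + B / 5  # built from more than one expression (illustrative)
print(A, B, C, D)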
1.3.3 Operators
In the example above, all the operators are binary operators, i.e. each operator is applied to two operands to obtain the desired output. The modulus (%) operator returns the remainder of the division when 10 is divided by 2 and stores that remainder in the variable D.
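The arithmetic-operator program referred to above appears only as an image; a sketch consistent with the description (the variable names other than D are assumptions) is:

x = 10
y = 2
A = x + y   # addition
B = x - y   # subtraction
C = x * y   # multiplication
D = x % y   # modulus: remainder when 10 is divided by 2, i.e. 0
print(A, B, C, D)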
Figure 8 and Figure 9: Worked examples of operator precedence

Figure 8:
Step 1: (4+5) is evaluated and the result is 9.
Step 2: 9/3 is evaluated and the result is 3.
Step 3: 3*2 is calculated and the result is 6.
Step 4: 6-1 is calculated and the result is 5.

Figure 9:
Step 1: (3+2) is evaluated and the result is 5.
Step 2: 6/3 is evaluated and the result is 2.
Step 3: 2*5 is calculated and the result is 10.
Step 4: 2+10 is calculated and the result is 12.
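Both evaluations can be checked directly in Python; the expressions below are reconstructed from the steps above, and note that / always returns a float, so the results print as 5.0 and 12.0:

# parentheses first, then / and * from left to right, then + and -
print((4 + 5) / 3 * 2 - 1)   # 5.0
print(2 + 6 / 3 * (3 + 2))   # 12.0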
Operator | Name | Example | Description
> | Greater than | x > y | Returns True if x is greater than y; else False
< | Less than | x < y | Returns True if x is less than y; else False
In the example above, the value given to x is 12 and the value given to y is 15. Since x is less than y, x > y returns False whereas x < y returns True.
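The listing for this example is only an image in the source; a minimal equivalent sketch is:

x = 12
y = 15
print(x > y)   # False, because 12 is not greater than 15
print(x < y)   # True, because 12 is less than 15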
In the example above, both x and y are given the value 24. Since x and y are equal, x == y returns True whereas x < y returns False.
Example 5. Program to explain the relational operators >=, <=.
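The listing for Example 5 is likewise only an image in the source; the values below are illustrative assumptions, chosen to exercise both operators:

x = 24
y = 15
print(x >= y)   # True: 24 is greater than or equal to 15
print(x <= y)   # False: 24 is not less than or equal to 15
print(y <= 15)  # True: equality also satisfies <=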
Operator | Example | Description
and | x > y and a > b | Returns True if both statements are True
not | not (x > y or a > b) | Reverses the result; returns False if the result is True
In the example above, the variables x, y, a and b are initialized with the values 5, 10, 7 and 9 respectively. The output of (x > y) is False, since 5 is smaller than 10, and the output of (a < b) is True, since 7 is smaller than 9.
The logical operator ‘or’ is applied to the two statements (x > y) and (a < b), whose outputs are False and True respectively. Since the ‘or’ operator produces True when any one of the statements is True, and in this case one statement, (a < b), is True, the output is True.
Example 7. Program to explain the ‘and’ logical operator.
In the example above, the logical operator 'and' is applied to the two statements (x > y) and (a < b), whose outputs are False and True respectively. Since the 'and' operator produces True only when both statements are True, and only one statement (a < b) is True here, the output is False.
Example 8. Program to explain the ‘not’ logical operator.
In the example above, the logical operator 'not' is applied to the result of the statement ((x > y) and (a > b)), which is False. Since the 'not' operator reverses the input provided, the final output becomes True.
The precedence order of the logical operators, from highest to lowest, is 'not', 'and', then 'or'. This precedence order can be understood from the above example. According to the precedence, 'not (a < b)' is evaluated first and its output is False. In the second step, (x > y) and (a < b) is evaluated, which produces False since (x > y) is False. Finally, 'or' is applied to combine the outputs of step 1 and step 2. The result produced is False since the outputs of both step 1 and step 2 are False.
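The code figures for Examples 6–9 are not shown above, so here is a small hedged sketch using the values given in the text (x = 5, y = 10, a = 7, b = 9) that reproduces the 'or', 'and', 'not' and precedence results just discussed:

x, y, a, b = 5, 10, 7, 9
print((x > y) or (a < b))                    # True  (one operand is True)
print((x > y) and (a < b))                   # False (only one operand is True)
print(not ((x > y) and (a < b)))             # True  (not reverses False)
print((x > y) and (a < b) or not (a < b))    # False ('not' first, then 'and', then 'or')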
Symbol Function
>> Right shift
<< Left shift
& AND
| OR
^ XOR
~ One's Complement
Example 10. Program to explain the working of Right Shift Operator (>>)
In the above example, the variable 'num1' has been initialized to 9. Assume the computer uses eight bits to represent a binary number; then the number 9 is represented as 0000 1001. After applying the right shift operator, the number in binary form becomes 0000 0100, i.e. 4, which is new_num1. Understand that the bits are shifted to the right by 1 position, i.e. 1 bit is lost and the empty position created on the left is filled with a '0' bit.
Similarly, the variable 'num2' has been initialized to 10, and using eight bits the number 10 is represented as 0000 1010. After applying the right shift operator, the number in binary form becomes 0000 0001, i.e. 1, which is new_num2. Understand that the bits are shifted to the right by 3 positions, i.e. 3 bits are lost and the empty positions created on the left are filled with '0' bits.
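A minimal sketch of Example 10, using the values and shift amounts from the explanation above (the original figure is not shown):

num1 = 9              # 0000 1001
num2 = 10             # 0000 1010
new_num1 = num1 >> 1  # shift right by 1 position  -> 0000 0100
new_num2 = num2 >> 3  # shift right by 3 positions -> 0000 0001
print(new_num1)       # 4
print(new_num2)       # 1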
Example 11. Program to explain the working of Left Shift Operator (<<)
In the above example, the variable 'num1' has been initialized to 10. Assume the computer uses eight bits to represent a binary number; then the number 10 is represented as 0000 1010. After applying the left shift operator, the number in binary form becomes 0001 0100, i.e. 20, which is new_num1. Understand that the bits are shifted to the left by 1 position, i.e. the leftmost bit is shifted out and the empty position created on the right is filled with a '0' bit.
Similarly, the variable 'num2' has been initialized to 5, and using eight bits the number 5 is represented as 0000 0101. After applying the left shift operator, the number in binary form becomes 0001 0100, i.e. 20, which is new_num2. Understand that the bits are shifted to the left by 2 positions, i.e. the two leftmost bits are shifted out and the empty positions created on the right are filled with '0' bits.
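A minimal sketch of Example 11, assuming the shift amounts described above:

num1 = 10             # 0000 1010
num2 = 5              # 0000 0101
new_num1 = num1 << 1  # shift left by 1 position  -> 0001 0100
new_num2 = num2 << 2  # shift left by 2 positions -> 0001 0100
print(new_num1)       # 20
print(new_num2)       # 20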
Example 12. Program to explain the working of Bitwise AND Operator (&)
In above example, variables ‘num1’, ‘num2’, ‘num3’ has been initialized to ‘8’, ‘7’ and ‘10’
respectively.
Now in binary format:
num1 = 8 = 0000 1000
num2 = 7 = 0000 0111
res1 = 0 = 0000 0000 (num1 & num2)
Bitwise AND (&) Operator gives output 1 if both the corresponding bits are 1, otherwise 0.
So, it can be noticed that in the variable 'res1' all bits are 0, since there is no position where the corresponding bits of 'num1' and 'num2' are both 1.
The bitwise AND (&) operator gives output 1 if both the corresponding bits are 1, otherwise 0. So, we can notice that in the variable 'res2' a 1 is present only where the corresponding bit is 1 in both 'num1' and 'num3'.
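Since the code figure for Example 12 is not reproduced, here is a hedged sketch using the values from the text:

num1, num2, num3 = 8, 7, 10
res1 = num1 & num2    # 0000 1000 & 0000 0111 = 0000 0000
res2 = num1 & num3    # 0000 1000 & 0000 1010 = 0000 1000
print(res1)           # 0
print(res2)           # 8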
In the above example, the variables 'num1', 'num2' and 'num3' have been initialized to 8, 7 and 10 respectively.
Now in binary format:
num1 = 8 = 0000 1000
num2 = 7 = 0000 0111
res1 = 15 = 0000 1111 (num1 | num2)
The bitwise OR (|) operator gives output 1 if any of the corresponding bits is 1, otherwise 0. So, we can notice that in the variable 'res1' a 1 is present wherever any of the corresponding bits is 1 in 'num1' or 'num2'.
num1 = 8 = 0000 1000
num3 = 10 = 0000 1010
res2 = 10 = 0000 1010 (num1 | num3)
The bitwise OR (|) operator gives output 1 if any of the corresponding bits is 1, otherwise 0. So, we can notice that in the variable 'res2' a 1 is present wherever any of the corresponding bits is 1 in 'num1' or 'num3'.
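A corresponding sketch for the bitwise OR example described above:

num1, num2, num3 = 8, 7, 10
res1 = num1 | num2    # 0000 1000 | 0000 0111 = 0000 1111
res2 = num1 | num3    # 0000 1000 | 0000 1010 = 0000 1010
print(res1)           # 15
print(res2)           # 10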
In above example, variables ‘num1’, ‘num2’, ‘num3’ has been initialized to ‘8’, ‘7’ and ‘10’
respectively.
The Bitwise NOT (~) operation inverts all the bits of the number. It turns 1 into 0 and 0 into
1. For signed integers, the operation is performed on the two's complement representation,
meaning negative numbers are represented using two's complement binary.
Example 14. Program to explain the working of Bitwise One’s complement Operator ( ~ ).
In above example, variables ‘num1’, ‘num2’, ‘num3’ has been initialized to ‘8’, ‘7’ and ‘10’
respectively. Bitwise One’s complement Operator ( ~ ) is a unary operator.
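The figure for Example 14 is not shown; the sketch below illustrates the one's complement operator on the same values. Because Python integers are signed and use two's complement, ~n evaluates to -(n + 1):

num1, num2, num3 = 8, 7, 10
print(~num1)   # -9
print(~num2)   # -8
print(~num3)   # -11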
Commenting your code helps explain your thought process, and helps you and others understand the code and the flow of the program later on. This allows you to find errors more easily, to fix them, to improve the code later on, and to reuse it in other applications as well.
Commenting is important to all kinds of projects, no matter whether they are small, medium,
or large. It is an essential part of your workflow, and is seen as good practice for developers.
Without comments, things can get confusing, real fast.
In above example, single-line comment has been written using (#) statement in the
beginning. By writing comments, the user could easily understand that the program has been
developed for adding two numbers.
To write a multi-line comment, add a multiline string (triple quotes) to your code and place your comment inside it, as explained below:
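A small sketch (the program in the original figure is not reproduced) showing both a single-line comment and a triple-quoted multi-line comment in the addition program described above:

# Program to add two numbers
"""
This is a multi-line comment:
the program adds two fixed numbers
and prints the sum.
"""
a = 10
b = 20
print(a + b)   # Output: 30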
In our life, we often encounter situations where we have to make a decision, be it about a game, our favourite food, a movie, or clothes. Similarly, in programming, conditional statements help us make a decision based on certain conditions. These conditions are specified by a set of conditional statements containing boolean expressions, which are evaluated to a boolean value of True or False.
We have seen in the flowcharts that the flow of the program changes based on the conditions. Thus, decisions based on conditions play an important role in programming.
Figure: Flow of a conditional statement — the condition is tested; if it is True the attached statement is executed, and if it is False the statement is skipped.
1.4.1.1 if STATEMENT
The if statement tests a condition. When the condition is true, a statement or a set of statements is executed and the actions are performed as per the instructions given in those statements; otherwise the statements attached to the if statement are not executed.
Figure 27 if statement
In the above example, the user entered marks as 85 and the if condition is checked; since the marks are greater than 80, the condition is true, so the print statement is executed and the output 'Grade A' is printed on the screen.
Then the second input is asked for and the user entered the name 'Prashant'; the if condition is checked and, since the name entered is not 'Kapil', the condition is false and the print statement is not executed.
So, from the execution of the program we can conclude that the statements attached to an if statement are executed only when the if condition is true; otherwise they are not executed.
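A hedged sketch of the if-statement example described above (Figure 27 is not reproduced, so the exact prompts may differ):

marks = int(input("Enter marks: "))    # user enters 85
if marks > 80:
    print("Grade A")

name = input("Enter name: ")           # user enters Prashant
if name == "Kapil":
    print("Welcome Kapil")             # not executed, the condition is false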
Example 16. Program to check the correct input entered by the user.
Figure 28
Since the user entered number value as 5, so the if statement becomes true and the statement
attached to it gets printed.
In the above example, the user was asked for input and entered marks as 47. Since the marks are less than 50, the if condition is false. Consequently, the block attached to if is not executed, control is transferred to the statements attached to the else block, and those statements are executed.
Figure 30
Example 20. Program to display a menu and calculate area of a square and volume of a cube.
In the above example, the user entered marks as 67. The first if condition is false since the marks are less than 75, so the print statement in block 1 is not executed and control is transferred to the elif condition, which is true since the marks lie in its range, so the print statement in block 2 is executed.
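A sketch of the if–elif–else grading logic described above; the grade labels are illustrative and the exact ranges in the original figure may differ, but 67 falls into the elif branch here:

marks = int(input("Enter marks: "))    # user enters 67
if marks >= 75:
    print("Distinction")               # block 1
elif marks >= 50:
    print("Pass")                      # block 2 - executed for 67
else:
    print("Fail")                      # block 3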
Figure 34
We have noticed in flowcharts that some statements or steps are repeated again and again until a particular condition is met. In daily routine activities, too, we notice that we have to do the same tasks continuously till the target is achieved or we obtain the desired result. Such a concept is called iteration or looping, where steps are repeated to achieve the set target.
Statements of a program are executed in a sequential manner by default, until a condition is introduced and the flow of the program gets modified depending upon the condition. That
means that, based on the conditions, the sequential flow of the program is controlled or decided, i.e. the flow of control (control flow) of a program depends on the set conditions.
Figure: Flowchart for printing the table of 2 — while the condition i <= 10 is True, T = 2 * i is computed and displayed and i is incremented (i = i + 1); when the condition becomes False the loop ends.
A set of statements is repeated again and again until the value of i becomes greater than 10. We can notice that the flow of the program is based on the result of the condition.
The range() function is a built-in function in Python that generates a sequence of numbers.
It's commonly used in for loops to repeat an action a certain number of times.
Syntax:
range(start, stop, step)
start (optional): The starting number of the sequence (inclusive). If not provided, it
defaults to 0.
stop: The end number of the sequence (exclusive). The sequence will stop just
before this number.
step (optional): The step size (how much to increment by). If not provided, it
defaults to 1.
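A quick sketch of the three forms of range() described above:

print(list(range(5)))         # [0, 1, 2, 3, 4]   stop only
print(list(range(3, 7)))      # [3, 4, 5, 6]      start and stop
print(list(range(3, 10, 2)))  # [3, 5, 7, 9]      start, stop and step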
Condition is an expression whose result will be either ‘true’ or ‘false’. The block of statements
will be executed till the condition remains ‘true’ and when the condition becomes ‘false’ loop
terminates.
Example 24. Program to explain the working of while statement.
In the above program, the value of the loop index 'i' is tested and the statements attached to the while loop are executed as long as the value of i remains less than or equal to 10, i.e. as long as the condition remains true. When the value of 'i' becomes greater than 10, the loop terminates and the program ends. The while loop is an entry-controlled loop, i.e. entry into the loop for executing the statements is allowed only when the entry condition is true.
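A minimal sketch of Example 24 as described above (the body of the original figure may differ slightly):

i = 1
while i <= 10:       # entry condition tested before each iteration
    print(i)
    i = i + 1        # the loop terminates when i becomes 11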
In the above example, the loop index 'i' takes the value of each element present in the list, one by one in sequence, and the statements attached to the for loop are executed as many times as there are elements in the list. In the above case, the list has four elements, i.e. 5, 7, 9 and 11, so the attached statement is executed four times.
Example 26. Program to explain the working of for statement.
In this case, loop index ‘i’ will take values of the elements of list which are string values and
the statement attached with the loop is executed 5 times since the number of elements in the
list are 5. With this example, it is clear that the index value can be either string or integer and
loop iteration depends on the number of elements in the list.
Example 27. Program to explain the working of for statement using range () function.
Using the range(n) function, the for loop can be implemented with the index variable 'i' automatically initialized to 0 and taking the values 0, 1, 2, 3, 4, ..., n-1, where n is the upper limit. The index variable is incremented by 1 till it reaches the upper limit minus 1. In this case n is 5, so 'i' is initialized to 0 and the loop is executed 5 times for the values 0, 1, 2, 3, 4, and the same values are printed as output.
Example 28. Program to explain the working of for statement using range() function.
In this case, the range(a, n) function takes two parameters: the index variable 'i' is initialized with the first parameter, and the second parameter is the upper limit. The index variable is incremented by 1 till it reaches the upper limit minus 1. In this example, 'i' is initialized to 3 and takes the values 3, 4, 5, 6, since 7 is the (exclusive) upper limit, and the statements attached to the loop are executed 4 times.
Example 29. Program to explain the working of for statement using range() function.
In this case, the range(a, n, b) function takes three parameters: the first initializes the index variable 'i' of the for loop, the second is the upper limit and the third is the increment (step). The index variable is incremented by 'b' until it reaches the upper limit minus 1. In this example, 'i' is initialized to 3 and takes the values 3, 5, 7, 9, since the index value 'i' is incremented by 2, and the statements attached to the loop are executed 4 times.
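Since the figures for Examples 25–29 are not reproduced, here is a combined sketch of the for-loop variants described above (the list values 5, 7, 9, 11 come from the text):

for i in [5, 7, 9, 11]:     # i takes each list element in turn
    print(i)

for i in range(5):          # i takes 0, 1, 2, 3, 4
    print(i)

for i in range(3, 7):       # i takes 3, 4, 5, 6
    print(i)

for i in range(3, 10, 2):   # i takes 3, 5, 7, 9
    print(i)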
Example 30. Program to count numbers from 1 to 5 using while and for statements.
Figure 45
Figure 46
In the above example, if the 'break' statement had not been written in the program, counting from 1 to 10 would have been the output. However, due to the 'break' statement, when the value of i becomes 5 the 'break' statement plays its role and terminates the loop in between, so the counting from 1 to 5 is printed as output and control is transferred out of the loop.
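A hedged sketch of the counting example with break described above:

for i in range(1, 11):   # would count 1 to 10 without the break
    print(i)
    if i == 5:
        break            # loop terminates after printing 5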
Example 34. Write a program to check whether a number is a prime number.
In the case of the Python continue statement, the next iteration of the loop takes place while the statements after the continue statement are ignored, i.e. the continue statement makes control jump back to the start of the loop to iterate again according to the condition set for entry into the loop.
Example 36. Explaining the working of the continue statement.
In the above example, it can be seen that when the letters 'o' and 'y' occur in 'Python', the if statement is true and the continue statement is executed, due to which the 'print' statement is ignored and control is transferred back to the start of the loop. Thus, the letters 'o' and 'y' are not printed in the output.
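A sketch of Example 36 as described above, skipping the letters 'o' and 'y' in 'Python':

for ch in "Python":
    if ch == "o" or ch == "y":
        continue         # skip the print for 'o' and 'y'
    print(ch)            # prints P, t, h, n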
Write a program to enter the marks of five subjects of a student and if the marks in any
subject is less than 50 don’t print the marks.
Figure 56
Figure 57
The Python pass statement is used when a condition is required to make the code complete but the statements attached to it are not required to be executed.
Syntax:
if condition:
    pass
Write a program that displays 'Great' if the marks entered are greater than 50, and 'Do Hard work' if the marks are less than 50.
In the above example, the user entered marks as 50. Since the pass statement is attached to that condition and no other code is written in that block, nothing appears in the output.
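A hedged sketch of the marks example described above; the messages come from the text, and the pass branch is assumed to cover marks of exactly 50:

marks = int(input("Enter marks: "))   # user enters 50
if marks > 50:
    print("Great")
elif marks < 50:
    print("Do Hard work")
else:
    pass                              # nothing is printed for 50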
In Python, the assert statement is used to check whether a condition or a logical expression is true or false. The assert statement is very useful for tracking errors and terminating the program when an error occurs.
Syntax:
assert condition
In above example, in case the user enters any other password than ‘PYTHON’ the error
message occurs and program is not executed further.
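A minimal sketch of the password check described above:

password = input("Enter password: ")
assert password == "PYTHON", "Wrong password"   # raises AssertionError otherwise
print("Access granted")                         # executed only when the assert passes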
1.5 String Handling and Sequence Types
A string is a series of characters. In Python, anything inside quotes is a string. And you can
use either single or double quotes. It is just like an array in C language and they are stored
and accessed using index.
Creating a string
Strings can be created by enclosing characters inside a single quote or double-quotes i.e
my_string1 = “SCHOOL”
my_string2 = ‘SCHOOL’
In Python by any of the above ways strings can be created.
However, using both single and double quotes simultaneously for creating a string will not
work and will generate error.
If a situation arises where we would like to access the last element of a string but we do not know the length of the string, negative indexing is used, as follows:
In the above example, using negative indexes we are able to access the last element and the second last element easily, without needing to know the length of the string. Negative indexing is as follows:
S C H O O L
-6 -5 -4 -3 -2 -1
1.5.3 String Slicing
The concept of slicing is about obtaining a sub-string, or a part of the given string, from a start index to an end index. Slicing is similar to taking a slice of bread from a bread packet. In the slicing operation, the desired part of the string is obtained using the indexes of the string.
my_string C O M P U T E R
Positive Index 0 1 2 3 4 5 6 7
Negative Index -8 -7 -6 -5 -4 -3 -2 -1
Example: String Slicing Positive and Negative Index
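As the code figure is not reproduced, here is a short sketch of slicing on the string 'COMPUTER' using the positive and negative indexes shown above:

my_string = "COMPUTER"
print(my_string[0:4])    # COMP   (indexes 0 to 3)
print(my_string[3:])     # PUTER  (index 3 to the end)
print(my_string[-3:])    # TER    (last three characters)
print(my_string[:-5])    # COM    (everything except the last five)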
We can combine two or more strings, as explained in the program below:
There may be times when you need to use Python to automate tasks, and one way you may
do this is through repeating a string several times. You can do so with the ‘*’ operator. Like
the ‘+’ operator, the ‘*’ operator has a different use when used with numbers, where it is the
operator for multiplication. When used with one string and one integer, ‘*’ is the string
replication operator, repeating a single string however many times you would like through
the integer you provide.
Let’s print out “COMPUTER” 5 times without typing out “COMPUTER” 5 times with
the ‘*’ operator:
Comparison Operators
To compare two strings, we mean that we want to identify whether the two strings are
equivalent to each other or not, or perhaps which string should be greater or smaller than
the other.
This is done using the following operators:
‘==’ This checks whether two strings are equal
‘<’ This checks if the string on its left is smaller than that on its right
‘<=’ This checks if the string on its left is smaller than or equal to that on its right
‘>’ This checks if the string on its left is greater than that on its right
‘>=’ This checks if the string on its left is greater than or equal to that on its right
Comparison of strings is performed character by character, using the comparison rules for ASCII and Unicode values. The ASCII values of the digits 0 to 9 are 48 to 57, uppercase letters (A to Z) are 65 to 90, and lowercase letters (a to z) are 97 to 122.
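A short sketch of string comparison using the ASCII ordering described above:

print("apple" == "apple")   # True  (identical strings)
print("Apple" < "apple")    # True  ('A' is 65, 'a' is 97)
print("bat" > "ball")       # True  ('t' is greater than 'l' at the first difference)
print("Zoo" >= "Ant")       # True  ('Z' is 90, 'A' is 65)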
Here we write a command, and to execute the command we just press the Enter key; the command is then interpreted. For coding in Python you must know the basics of the console used in Python.
You are free to write the next command on the shell only after the first command has been executed and the prompt has appeared again. The Python console accepts commands in Python, which you write after the prompt.
User enters the values in the Console and that value is then used in the program as it was
required.
To take input from the user we make use of a built-in function input().
Though it is not necessary to pass arguments to the print() function, it requires empty parentheses at the end, which tell Python to execute the function rather than just refer to it by name.
Now, let's explore the optional arguments that can be used with the print() function.
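A sketch of the most commonly used optional print() arguments, sep and end:

print("Data", "Curation", "Python")             # Data Curation Python
print("Data", "Curation", "Python", sep="-")    # Data-Curation-Python
print("Hello", end=" ")
print("World")                                  # Hello World (on one line)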
1.6.2 tuple
A tuple in Python is similar to a list. The difference between the two is that we cannot change
the elements of a tuple once it is assigned whereas we can change the elements of a list.
Tuples are also used to store multiple items in a single variable.
Creating a Tuple
A tuple is created by placing all the items (elements) inside parentheses (), separated by
commas. The parentheses are optional, however, it is a good practice to use them.
A tuple can have any number of items and they may be of different types (integer, float,
list, string, etc.).
Tuple is created using parenthesis () as explained below:
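Since the original figure is not reproduced, a minimal sketch of tuple creation as described above (values are illustrative):

my_tuple = (1, 2.5, "SCHOOL", [7, 8])   # items of different types
empty = ()                              # an empty tuple
single = (5,)                           # a one-item tuple needs a trailing comma
print(my_tuple[2])                      # SCHOOL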
1.6.3 Dictionary
Python dictionary is an unordered collection of items. Each item of a dictionary has
a key/value pair.
Creating a dictionary
Creating a dictionary is as simple as placing items inside curly braces {} separated by
commas.
An item has a key and a corresponding value that is expressed as a pair (key: value).
While the values can be of any data type and can repeat, keys must be of immutable type
(string, number or tuple with immutable elements) and must be unique.
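A small sketch of creating and reading a dictionary as described above (keys and values are illustrative):

student = {"name": "Kapil", "marks": 85, "grade": "A"}   # key: value pairs
print(student["name"])          # Kapil  (access by key)
student["marks"] = 90           # values can be updated
student["city"] = "Delhi"       # new key/value pairs can be added
print(student)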
Slicing in List
Figure 84: get all the items from one position to another position
Slicing in Tuple
In above example, the set () selects the unique values and list () converts the set into list.
We can concatenate a list to another list or simply merge two lists using extend() function.
Concatenation of Tuple.
Concatenation of two separate tuples can be done using ‘+’ operation.
Concatenation of two tuples.
Mutable Definition
Mutable is when something is changeable or has the ability to change. In Python, ‘mutable’ is
the ability of objects to change their values. These are often the objects that store a collection
of data.
Immutable Definition
Immutable means that no change is possible over time. In Python, if the value of an object cannot be changed over time, it is known as immutable. Once created, the value of these objects is permanent.
List of Mutable and Immutable objects
Objects of built-in type that are mutable are:
Lists
Sets
Dictionaries
User-Defined Classes (It purely depends upon the user to define the characteristics)
Objects of built-in type that are immutable are:
Numbers (Integer, Rational, Float, Decimal, Complex & Booleans)
Strings
Tuples
Frozen Sets
User-Defined Classes (It purely depends upon the user to define the characteristics)
Objects in Python
In Python, everything is treated as an object. Every object has these three attributes:
Identity – This refers to the address that the object refers to in the computer’s
memory.
Type – This refers to the kind of object that is created. For example- integer, list, string
etc.
Value – This refers to the value stored by the object. For example – List=[1,2,3] would
hold the numbers 1,2 and 3
Identity and type cannot be changed once an object is created; values can be changed only for mutable objects.
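A short sketch contrasting a mutable list with an immutable tuple and string:

nums = [1, 2, 3]
nums[0] = 10            # allowed: lists are mutable
print(nums)             # [10, 2, 3]

point = (1, 2, 3)
# point[0] = 10         # TypeError: tuples are immutable

word = "SCHOOL"
# word[0] = "X"         # TypeError: strings are immutable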
1.7 Functions
The top-down approach, starting at the general levels to gain an understanding of the system
and gradually moving down to levels of greater detail is done in the analysis stage. In the
process of moving from top to bottom, each component is exploded into more and more
details.
Thus, the problem at hand is analysed or broken down into major components, each of which
is again broken down if necessary.
The top-down process involves working from the most general down to the most specific.
The design of modules is reflected in hierarchy charts such as the one shown in Figure below:
The purpose of procedure Main is to coordinate the three branch operations e.g. Get, Process,
and Put routines. These three routines communicate only through Main. Similarly, Sub1 and
Sub2 can communicate only through the Process routine.
Advantages of Top-down Approach
The advantages of the top-down approach are as follows:
This approach allows a programmer to remain “on top of” a problem and view the developing
solution in context. The solution always proceeds from the highest level downwards.
By dividing the problem into a number of sub-problems, it is easier to share problem
development. For example, one person may solve one part of the problem and the other
person may solve another part of the problem.
Since debugging time grows quickly when the program is longer, it will be to our advantage
to debug a long program divided into a number of smaller segments or parts rather than one
big chunk. The top-down development process specifies a solution in terms of a group of
smaller, individual subtasks. These subtasks thus become the ideal units of the program for
Talking of modularity in terms of files and repositories, modularity can be on different levels
-
o Libraries in projects
o Function in the files
o Files in the libraries or repositories
Modularity is all about making blocks, and each block is made with the help of other blocks.
Every block in itself is solid and testable and can be stacked together to create an entire
application. Therefore, thinking about the concept of modularity is also like building the
whole architecture of the application.
Module
A module is defined as a part of a software program that contains one or more routines.
When we merge one or more modules, they make up a program. Whenever a product is built at the enterprise level, it is built in modules, and each module performs different operations and business functions. Modules are implemented in the program through interfaces. The introduction
of modularity allowed programmers to reuse prewritten code with new applications.
Modules are created and merged with compilers, in which each module performs a business
or routine operation within the program.
For example, SAP (System, Applications, and Products) comprises large modules like finance, payroll, supply chain, etc. In terms of software, an example of a module is Microsoft Word, which uses Microsoft Paint to help users create drawings and paintings.
The general syntax for creating a function in Python looks something like this:
def function_name(parameters):
    function body
Keep in mind that if you forget the parentheses () or the colon (:) when trying to define a
new function, Python will let you know with a Syntax Error.
def hello_world_func():
    print("hello world")
Once you've defined a function, the code will not run on its own. To execute the code inside the function, you have to make a function invocation, also called a function call.
You can then call the function as many times as you want. To call a function you need to do
this:
function_name(arguments)
The function name has to be followed by parentheses. If there are any required
arguments, they have to be passed in the parentheses. If the function doesn't take in
any arguments, you still need the parentheses.
To call the function from the example above, which doesn't take in any arguments, do the
following:
hello_world_func()
#Output
#hello world
def hello_to_you(name):
    print("Hello " + name)
In the example above, there is one parameter, name. We can pass more than one parameter to a function, as shown below:
Figure 107: Code – calling a function with a parameter
The function can be called many times, passing in different values each time.
Figure 108: Code – a function can be called many times
def function_name():
    variable = "var_assign"     # local variable
    logic_statement()

function_name()                 # calling of the function

Syntax 2: Declaration of the Local Variable
Figure 110
The function is declared, and then a variable is assigned inside it, which allocates memory for it and makes it a local variable; after this the function is called, and the logic statement inside it performs the required manipulation and work.
How Local Variables Work in Python
This program demonstrates a local variable defined within a function: the variable is declared inside the function, followed by a statement, and then the function is called, as shown in the output below.
A local variable in Python plays a significant role in the sense that it keeps access to the function's own variables, and their manipulation, simple and easy. In addition, local variables help keep the overall workflow compatible with global variables and less complex. Nested functions and statements also blend very well with local variables.
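A minimal sketch of a local variable as described above (the names are illustrative):

def show_message():
    msg = "Hello from inside the function"   # msg is a local variable
    print(msg)

show_message()
# print(msg)        # NameError: msg is not visible outside the function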
1.7.9 The Return Statement
A return statement is used to end the execution of the function call and "returns" the result (the value of the expression following the return keyword) to the caller. The statements after the return statement are not executed. If the return statement is without any expression, then the special value None is returned.
Note: The return statement cannot be used outside a function.
def fun():
    statements
    return [expression]

Syntax 3: The return statement
In this function, the parameter name does not have a default value and is required
(mandatory) during a call.
On the other hand, the parameter msg has a default value of " Greeting of Day!". So, it is
optional during a call. If a value is provided, it will overwrite the default value.
Any number of arguments in a function can have a default value. But once we have a default
argument, all the arguments to its right must also have default values.
keyword arguments:
When we call a function with some values, these values get assigned to the arguments
according to their position.
For example, in the above function greet(), when we called it as greet("Kapil", "How do you
do?"), the value "Kapil" gets assigned to the argument name and similarly "How do you
do?" to msg.
Python allows functions to be called using keyword arguments. When we call functions in
this way, the order (position) of the arguments can be changed. Following calls to the above
function are all valid and produce the same result.
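A hedged sketch of the greet() function described above, showing the default value for msg and calls using keyword arguments (the exact body of the original figure may differ):

def greet(name, msg="Greeting of Day!"):
    print("Hello", name + ",", msg)

greet("Kapil", "How do you do?")           # positional arguments
greet("Kapil")                             # msg falls back to its default value
greet(msg="How do you do?", name="Kapil")  # keyword arguments, order changed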
VarArgs parameters
Python has *args, which allows us to pass a variable number of non-keyword arguments to a function.
In the function definition, we use an asterisk (*) before the parameter name to accept variable-length arguments. The arguments are passed as a tuple; inside the function they are available as a tuple with the same name as the parameter, excluding the asterisk.
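A small sketch of a variable-length argument function (the function name is illustrative):

def total(*numbers):          # numbers is a tuple of all positional arguments
    result = 0
    for n in numbers:
        result += n
    return result

print(total(1, 2, 3))         # 6
print(total(4, 5, 6, 7, 8))   # 30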
A library is a collection of modules or functions in Python that allows specific tasks to be performed to fulfil the user's needs.
1.8.1 input()
The input() function reads a line from the input (usually from the user), converts the line
into a string by removing the trailing newline, and returns it.
If EOF is read, it raises an EOFError exception.
1.8.2 eval()
The eval() method parses the expression passed to this method and runs python expression
(code) within the program.
The print() function prints the given object to the standard output device (screen) or to the
text stream file.
print() Parameters
objects - object(s) to be printed. The * indicates that there may be more than one object
sep - objects are separated by sep. Default value: ' ' (a space)
end - printed after the last object. Default value: '\n' (a newline)
file - must be an object with write(string) method. If omitted, sys.stdout will be used
which prints objects on the screen.
flush - If True, the stream is forcibly flushed. Default value: False
String functions are built-in operations or methods in many programming languages that
allow you to manipulate, modify, or analyze strings (sequences of characters).
The count() method returns the number of occurrences of a substring in the given string.
count() Parameters : count() method only requires a single parameter for execution.
However, it also has two optional parameters:
substring - string whose count is to be found.
start (Optional) - starting index within the string where search starts.
end (Optional) - ending index within the string where search ends.
Note: Index in Python starts from 0, not 1. count() method returns the number of
occurrences of the substring in the given string.
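A short sketch of count() with and without the optional start and end parameters:

text = "banana"
print(text.count("a"))        # 3
print(text.count("an"))       # 2
print(text.count("a", 2))     # 2  (search starts at index 2)
print(text.count("a", 2, 4))  # 1  (search within indexes 2 to 3)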
The find() method returns the index of first occurrence of the substring (if found). If not
found, it returns -1.
Explanation of find() function.
find() Syntax:
str.find(sub[, start[, end]] )
Syntax 4: The syntax of the find() method
find() Parameters
The find() method takes maximum of three parameters:
sub - It is the substring to be searched in the str string.
start and end (optional) - The range str[start:end] within which substring is
searched.
find() Return Value
The find() method returns an integer value:
If the substring exists inside the string, it returns the index of the first occurrence of the substring.
If a substring doesn't exist inside the string, it returns -1.
The rfind() method returns the highest index of the substring (if found). If not found, it
returns -1.
The syntax of rfind() is:
str.rfind(sub[, start[, end]] )
The Syntax of rfind() method.
rfind() Parameters
rfind() method takes a maximum of three parameters:
sub - It's the substring to be searched in the str string.
start and end (optional) - substring is searched within str[start:end]
Return Value from rfind()
rfind() method returns an integer value.
If substring exists inside the string, it returns the highest index where substring is
found.
If substring doesn't exist inside the string, it returns -1.
1.8.8 Various string functions capitalize(), title(), lower(), upper() and swapcase()
The capitalize() method converts the first character of a string to an uppercase letter
and all other alphabets to lowercase.
The islower() method returns True if all alphabets in a string are lowercase alphabets.
If the string contains at least one uppercase alphabet, it returns False.
The upper() method converts all lowercase characters in a string into uppercase
characters and returns it.
The title() method returns a string with first letter of each word capitalized; a title
cased string.
The swapcase() method returns the string by converting all the characters to their
opposite letter case( uppercase to lowercase and vice versa).
Figure 125: Various string functions capitalize(), title(), lower(), upper() and swapcase()
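Since the figure itself is not reproduced, here is a quick sketch of the listed methods applied to one sample string:

s = "python PROGRAMMING"
print(s.capitalize())   # Python programming
print(s.title())        # Python Programming
print(s.lower())        # python programming
print(s.upper())        # PYTHON PROGRAMMING
print(s.swapcase())     # PYTHON programming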
The islower() method returns True if all alphabets in a string are lowercase alphabets; if the string contains at least one uppercase alphabet, it returns False.
The istitle() method returns True if the string is a titlecased string. If not, it returns False.
The replace() method replaces each matching occurrence of the old character/text in the
string with the new character/text.
The strip() method returns a copy of the string by removing both the leading and the trailing
characters (based on the string argument passed).
The pow() method computes the power of a number by raising the first argument to the power of the second argument.
1.8.13 recursion
When we call this function with a positive integer, it will recursively call itself by decreasing
the number.
Each function multiplies the number with the factorial of the number below it until it is equal
to one. This recursive call can be explained in the following steps.
factorial(5)                   # 1st call with 5
5 * factorial(4)               # 2nd call with 4
5 * 4 * factorial(3)           # 3rd call with 3
5 * 4 * 3 * factorial(2)       # 4th call with 2
5 * 4 * 3 * 2 * factorial(1)   # 5th call with 1
Our recursion ends when the number reduces to 1. This is called the base condition. Every
recursive function must have a base condition that stops the recursion or else the function
calls itself infinitely.
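The factorial function itself appears in a figure that is not reproduced; a standard sketch matching the call trace above is:

def factorial(n):
    if n == 1:                       # base condition stops the recursion
        return 1
    return n * factorial(n - 1)      # recursive call with a smaller number

print(factorial(5))                  # 120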
Advantages of Recursion
Recursive functions make the code look clean and elegant.
A complex task can be broken down into simpler sub-problems using recursion.
Sequence generation is easier with recursion than using some nested iteration.
Disadvantages of Recursion
Sometimes the logic behind recursion is hard to follow through.
Recursive calls are expensive (inefficient) as they take up a lot of memory and time.
Recursive functions are hard to debug.
We don't usually store all of our files on our computer in the same location. We use a well-
organized hierarchy of directories for easier access.
Similar files are kept in the same directory, for example, we may keep all the songs in the
"music" directory. Analogous to this, Python has packages for directories and modules for
files.
As our application program grows larger in size with a lot of modules, we place similar
modules in one package and different modules in different packages. This makes a project
(program) easy to manage and conceptually clear.
Similarly, as a directory can contain subdirectories and files, a Python package can have sub-
packages and modules.
A directory must contain a file named __init__.py in order for Python to consider it as a
package. This file can be left empty but we generally place the initialization code for that
package in this file.
Here is an example. Suppose we are developing a game. One possible organization of
packages and modules could be as shown in the figure below.
Importing Module
We can import the definitions inside a module to another module or the interactive
interpreter in Python. We use the import keyword to do this. To import our previously
defined module example, we type the following in the Python prompt.
Reloading a Module
The Python interpreter imports a module only once during a session. This makes things more
efficient. Here is an example to show how this works.
Suppose we have the following code in a module named my_module.
Figure 143
We can see that our code got executed only once. This goes to say that our module was
imported only once. Now if our module changed during the course of the program, we would
have to reload it. One way to do this is to restart the interpreter. But this does not help much.
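A more practical alternative, assuming a module file named my_module.py as in the text, is the reload() function from the standard importlib module:

import importlib
import my_module            # the module code runs only once per session

# ... edit my_module.py ...
importlib.reload(my_module) # re-executes the module without restarting the interpreter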
File handling in Python allows us to create, read, write, and delete files. It's a vital part of
many applications that need to store or process data from files such as .txt, .csv, or .json.
Python provides built-in functions and a simple syntax for file operations using the open()
function, which gives us a file object to work with. Files can be opened in different modes,
such as:
3. Appending to a File
File handling in Python is simple and powerful. It allows reading and writing to files in
various modes, and using the with statement is a clean and safe way to work with files.
Mastering this concept is essential for working with data, logs, or any persistent storage in
Python.
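A brief sketch of writing, appending and reading with the with statement (the file name is illustrative):

with open("notes.txt", "w") as f:     # "w" creates/overwrites the file
    f.write("First line\n")

with open("notes.txt", "a") as f:     # "a" appends to the end
    f.write("Second line\n")

with open("notes.txt", "r") as f:     # "r" reads the file
    print(f.read())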
A library is a collection of books or is a room or place where many books are stored to be
used later. Similarly, in the programming world, a library is a collection of precompiled codes
that can be used later on in a program for some specific well-defined operations. Other than
pre-compiled codes, a library may contain documentation, configuration data, message
templates, classes, and values, etc.
A Python library is a collection of related modules. It contains bundles of code that can be
used repeatedly in different programs. It makes Python programming simpler and more convenient for the programmer, as we don't need to write the same code again and again for different programs. Python libraries play a very vital role in fields of Machine Learning, Data
Science, Data Visualization, etc.
1. NumPy:
NumPy (Numerical Python) is a library used for working with arrays and performing
numerical computations efficiently. It provides support for multi-dimensional arrays,
linear algebra, Fourier transforms, and random number capabilities.
2. Matplotlib:
Matplotlib is a plotting library used to create static, interactive, and animated
visualizations in Python.
It allows users to generate a wide range of graphs like line plots, bar charts,
histograms, and scatter plots.
3. Pandas:
Pandas is a powerful library for data manipulation and analysis, built on top of
NumPy.
It offers data structures like Series and DataFrame for handling structured data with
ease.
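A tiny hedged sketch using the three libraries together (it assumes they are installed; the data values are illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

arr = np.array([1, 2, 3, 4])                  # NumPy array
df = pd.DataFrame({"x": arr, "y": arr ** 2})  # Pandas DataFrame
plt.plot(df["x"], df["y"])                    # Matplotlib line plot
plt.xlabel("x")
plt.ylabel("y")
plt.show()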
Assessment Criteria
Detailed PC-wise assessment criteria and assessment marks for the NOS are as follows:
Exercise 1
4) Write a program to generate the following output using for and while statement.
1
22
333
4444
55555
5) Write a program to generate the following output using for and while statement.
54321
4321
321
21
1
6) Write a program to display first ten prime numbers using for and while statement.
7) Write a program to find factorial of number using for and while statement.
8) Write a program to count the total number of digits in a number using for and while
statement.
9) Write a program to display Fibonacci series up to 10 terms
10)Write a program to calculate the cube of all numbers from 1 to a given number
Exercise 2
Multiple Choice Questions
c. merge items
d. None of the above
20)Concatenation operator is :
a. ‘+’
b. ‘-’
c. ‘&’
d. ‘%’
State whether statement is true or false
1) Integer values can’t be stored as strings. (T/F)
2) Indexing of string is done manually by the programmer. (T/F)
3) We can use positive and negative indexing to access string elements. (T/F)
4) Comparison operators cannot be used for comparing strings. (T/F)
5) Lists are mutable. (T/F)
6) Tuple are mutable. (T/F)
7) Dictionaries are mutable. (T/F)
8) insert() function is used to add an element at a desired location in a List. (T/F)
9) In dictionary elements are stored as key:value pair. (T/F)
10)Membership operator is not available in dictionaries. (T/F)
Fill in the blanks
1) A string can be traversed using an ________.
2) A string can be accessed using positive and _______ index.
3) A ______ function is used to add element in a List at last location.
4) Concatenation of two Dictionaries can be done using ____ operator.
5) ______ function is used to delete complete elements in a list.
6) Elements in a dictionary are stored as key and ____ pair.
7) The method of extracting part of tuple is called _________.
8) To display the message on monitor _____ function is used.
9) __________ function is used to take input from console.
10) To check two strings are equal _____ operator is used.
Lab Practice Questions
1) Write a program to display index value against each character in string.
2) Write a program to find the maximum and minimum frequency of a character in a string.
3) Write a program to split and join a string.
4) Write a program to find length of list.
5) Write a program for reversing a list.
6) Write a program for finding largest and smallest number in a list.
7) Write a program to print all odd numbers in tuple.
8) Write a program to join two tuples if their first element is the same.
9) Write a program to explain min(), max() and mean() functions using List.
10)Write a program to explain mutability using List, Tuple and Dictionary.
Exercise 3
Multiple choice questions
21)In Top-down approach a problem _______
a. Combined to form bigger modules
b. Divided into smaller modules
26)The _________ function converts the first character of a string to an uppercase letter and all other alphabets to lowercase.
a. capitalize()
b. upper()
c. lower()
d. None of the above
27)Using the today() function we will be able to know
a. Current day
b. Current month
c. Current year
d. All of the above
28)Which of the following functions helps in finding the power of a number:
a. pow()
b. powmin()
c. maxpow()
d. None of the above
29)With the help of which keyword we are able to include modules in our program:
a. Bypass
b. break
c. import
d. None of the above
30)With the help of module reloading we do not need to ______
a. Shutdown interpreter
b. Shutdown compiler
c. Restart interpreter
d. Restart compiler
Chapter 2:
Basics of Artificial Intelligence & Data Science
2.1 Introduction to AI
Artificial Intelligence (AI) is when a computer algorithm does intelligent work. On the other
hand, Machine Learning is a part of AI that learns from the data that also involves the
information gathered from previous experiences and allows the computer program to
change its behavior accordingly. Artificial Intelligence is the superset of Machine
Learning i.e. all Machine Learning is Artificial Intelligence but not all AI is Machine Learning.
Artificial Intelligence: AI is concerned with making machines, frameworks and different gadgets smart by enabling them to think and perform tasks as people generally do.
Machine Learning: What ML does depends on the user input or the query requested by the client; the framework checks whether it is available in the knowledge base or not. If it is available, it returns the outcome related to that query to the user; however, if it isn't stored initially, the machine will take in the user input and enhance its knowledge base, to give a better value to the end user.
Table 1: Artificial Intelligence vs Machine Learning
Future Scope –
Artificial Intelligence and Machine Learning are likely to replace the current model of
technology that we see these days, for example, traditional programming packages
like ERP and CRM are certainly losing their charm.
Firms like Facebook, and Google are investing a hefty amount in AI to get the desired
outcome at a relatively lower computational time.
Artificial Intelligence is something that is going to redefine the world of software and IT
in the near future.
Artificial Intelligence (AI) has been cooperatively created over decades by researchers,
scientists, and organizations worldwide. The achievement is the result of collective
endeavors by numerous pioneers and teams.
Evolution of AI
• 1950: Alan Turing introduced the concept of a machine that can simulate human
intelligence (Turing Test).
• 1956: John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon organized
the Dartmouth Conference, where the term ”Artificial Intelligence” was officially coined.
• 1960s-70s: Development of early AI programs like ELIZA (chatbot) and SHRDLU (language
understanding).
• 1980s: Rise of Expert Systems and rule-based systems in industries.
• 1990s: Advancements in Machine Learning and IBM’s Deep Blue defeated world chess
champion Garry Kasparov.
• 2000s: Emergence of data-driven AI, big data, and faster computation.
• 2010s-Present: Breakthroughs in Deep Learning, NLP (e.g., GPT, BERT), self-driving cars,
and AI in healthcare, finance, etc.
intricate issues that encompass substantial volumes of unstructured data, including photos,
audio, and text. Applications:
o Image and speech recognition
o Autonomous vehicles
o Natural language processing tasks like translation
NLP stands for Natural Language Processing, a field of Artificial Intelligence (AI) that focuses
on enabling computers to understand, interpret, and generate human language. It's a crucial
technology for many applications, including chatbots, search engines, and translation
services. NLP also plays a vital role in analyzing text data from various sources like emails,
social media, and customer feedback.
Applications:
Voice assistants: NLP is used to develop virtual assistants like Siri, Alexa, and Google
Assistant.
Email filters: NLP is used to identify spam in emails.
Translation: NLP is used to translate foreign languages.
Search results: NLP is used to improve search results.
Predictive text: NLP is used to predict what you might type next.
Sentiment analysis: NLP is used to analyze how people feel about something.
Chatbots: NLP is used to create chatbots that can understand and respond to users.
Data Science is a field that derives insights from structured and unstructured data, using different scientific methods and algorithms, and consequently helps in generating insights, making predictions and devising data-driven solutions. It uses a large amount of data to get meaningful insights using statistics and computation for decision making. The data used in
Data Science is usually collected from different sources, such as e-commerce sites, surveys,
social media, and internet searches. All this access to data has become possible due to the
advanced technologies for data collection. This data helps in making predictions and
providing profits to the businesses accordingly. Data Science is the most discussed topic in
today’s time and is a hot career option due to the great opportunities it has to offer.
Data Analyst:
The role of a Data Analyst is quite similar to a Data Scientist in terms of responsibilities, and
skills required. The skills shared between these two roles include SQL and data query
knowledge, data preparation and cleaning, applying statistical and mathematical methods to
find the insights, data visualizations, and data reporting.
The main difference between the two roles is that Data Analysts do not need to be skilled in
programming languages and do not need to perform data modeling or have the knowledge
of machine learning.
The tools used by Data Scientists and Data Analysts are also different. The tools used by Data Analysts are Tableau, Microsoft Excel, SAP, SAS, and Qlik. Data Analysts also perform the task of data mining and data modeling, but they use SAS, RapidMiner, KNIME, and IBM SPSS Modeler. They are provided with the problem statement and the goal. They just have
to perform the data analysis and deliver data reporting to the managers.
When it comes to the problem framing process, there are four key steps to follow once the
problem statement is introduced. These can help you better understand and visualize the
problem as it relates to larger business needs. Using a visual aid to look at a problem can give
your team a bigger picture view of the problem you’re trying to solve. By contextualizing,
prioritizing, and understanding the details on a deeper level, your team can develop a
different point of view when reviewing the problem with stakeholders.
Next, prioritize the pain points based on other issues and project objectives. Questions such
as, “Does this problem prevent objectives from being met?” and, “Will this problem deplete
necessary resources?” are good ones to get you started.
To understand the problem, collect information from diverse stakeholders and department
leaders. This will ensure you have a wide range of data.
Finally, it's time to get your solution approved. Quality assure your solution by testing in one
or more internal scenarios. This way you can be sure it works before introducing it to
external customers.
Before an analyst begins collecting data, they must answer three questions first:
What methods and procedures will be used to collect, store, and process the information?
Additionally, we can break up data into qualitative and quantitative types. Qualitative data
covers descriptions such as colour, size, quality, and appearance. Quantitative data,
unsurprisingly, deals with numbers, such as statistics, poll numbers, percentages, etc. Data
collection could mean a telephone survey, a mail-in comment card, or even some guy with a
clipboard asking passersby some questions. Data collection breaks down into two methods.
Primary: As the name implies, this is original, first-hand data collected by the data
researchers. This process is the initial information gathering step, performed before
anyone carries out any further or related research. Primary data results are highly
accurate provided the researcher collects the information. However, there’s a
downside, as first-hand research is potentially time-consuming and expensive.
Secondary: Secondary data is second-hand data collected by other parties and already
having undergone statistical analysis. This data is either information that the
researcher has tasked other people to collect or information the researcher has
looked up. Simply put, it’s second-hand information. Although it’s easier and cheaper
to obtain than primary information, secondary information raises concerns regarding
accuracy and authenticity. Quantitative data makes up a majority of secondary data.
2.2.3 Processing
Data processing occurs when data is collected and translated into usable information.
Usually performed by a data scientist or team of data scientists, it is important for data
processing to be done correctly as not to negatively affect the end product, or data output.
2.2.4 Six stages of data processing
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources,
including data lakes and data warehouses. It is important that the data sources available are
trustworthy and well-built so the data collected (and later used as information) is of the
highest possible quality.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often
referred to as “pre-processing” is the stage at which raw data is cleaned up and organized
for the following stage of data processing. During preparation, raw data is diligently checked
for any errors. The purpose of this step is to eliminate bad data (redundant, incomplete, or
incorrect data) and begin to create high-quality data for the best business intelligence.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data
warehouse like Redshift), and translated into a language that it can understand. Data input
is the first stage in which raw data begins to take the form of usable information.
4. Processing
During this stage, the data inputted to the computer in the previous stage is actually
processed for interpretation. Processing is done using machine learning algorithms, though
the process itself may vary slightly depending on the source of data being processed (data
lakes, social networks, connected devices etc.) and its intended use (examining advertising
patterns, medical diagnosis from connected devices, determining customer needs, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable to non-data
scientists. It is translated, readable, and often presented in the form of graphs, videos, images,
plain text, etc. Members of the company or institution can now begin to self-serve the data for their
own data analytics projects.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored
for future use. While some information may be put to use immediately, much of it will serve
a purpose later on. Plus, properly stored data is a necessity for compliance with data
protection legislation like GDPR. When data is properly stored, it can be quickly and easily
accessed by members of the organization when needed.
When working with data, your analysis and insights are only as good as the data you use. If
you’re performing data analysis with dirty data, your organization can’t make efficient and
effective decisions with that data. Data cleaning is a critical part of data management that
allows you to validate that you have a high quality of data.
Data cleaning includes more than just fixing spelling or syntax errors. It’s a fundamental
aspect of data science analytics and an important machine learning technique. Today, we’ll
learn more about data cleaning, its benefits, and the issues that can arise with your data.
Data cleaning differs from data transformation because you’re actually removing data
that doesn’t belong in your dataset. With data transformation, you’re changing your data
to a different format or structure. Data transformation processes are sometimes referred to
as data wrangling or data munging. The data cleaning process is what we’ll focus on today.
To determine data quality, you can study its features and weigh them according to what’s
important to your organization and your project.
There are five main features to look for when evaluating your data: validity, accuracy,
completeness, consistency, and uniformity.
Now that we know how to recognize high-quality data, let’s dive deeper into the process of
data science cleaning, why it’s important, and how to do it effectively.
Exploratory Data Analysis refers to the critical process of performing initial investigations on
data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions
with the help of summary statistics and graphical representations. It is a good practice to
understand the data first and try to gather as many insights from it. EDA is all about making
sense of data in hand, before getting them dirty with it.
To understand the concepts and techniques, we’ll take the white variant of the Wine Quality
data set, which is available on the UCI Machine Learning Repository, and try to extract as many
insights from the data set as possible using EDA.
To start with, import the necessary libraries (for this example pandas, numpy, matplotlib and
seaborn) and load the data set.
Note: Whatever inferences are extracted are mentioned with bullet points.
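The import-and-load step is shown as a screenshot in the source; a minimal sketch of it, assuming the white-wine CSV has been downloaded from the UCI repository and saved locally as winequality-white.csv (the UCI file is semicolon-separated), could look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the white variant of the Wine Quality data set (semicolon-separated CSV)
df = pd.read_csv("winequality-white.csv", sep=";")

# First look at the data: dimensions, column types and summary statistics
print(df.shape)
print(df.info())
print(df.describe())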
There is no doubt that data visualization techniques are a most important part of Data Science,
and even in the Data Analytics space data visualization plays a major role. We will discuss this
in detail with the help of Python packages and see how visualization helps during the Data
Science process flow. This is a very interesting topic for every Data Scientist and Data Analyst.
Based on the methods and the way of learning, machine learning is divided into four main
types, which are:
1. Supervised machine learning
2. Unsupervised machine learning
3. Semi-supervised learning
4. Reinforcement learning
As its name suggests, Supervised machine learning is based on supervision. It means in the
supervised learning technique, we train the machines using the "labelled" dataset, and based
on the training, the machine predicts the output. Here, the labelled data specifies that some
of the inputs are already mapped to the output. More precisely, we can say: first, we train
the machine with the input and corresponding output, and then we ask the machine to
predict the output using the test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of
cat and dog images. So, first, we will provide the training to the machine to understand the
images, such as the shape and size of the tail of a cat and a dog, the shape of the eyes, colour,
height (dogs are taller, cats are smaller), etc. After completion of training, we input the picture
of a cat and ask the machine to identify the object and predict the output. Now, the machine
is well trained, so it will check all the features of the object, such as height, shape, colour,
eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is the
process of how the machine identifies the objects in Supervised Learning.
The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
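A minimal sketch of supervised learning with scikit-learn (assumed to be installed): a classifier is trained on labelled examples and then asked to predict the labels of unseen test data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labelled dataset: inputs X are already mapped to outputs y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train on the labelled examples, then predict on unseen test data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))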
Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output without
any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted
dataset according to the similarities, patterns, and differences. Machines are instructed to
find the hidden patterns from the input dataset.
Let's take an example to understand it more precisely; suppose there is a basket of fruit
images, and we input it into the machine learning model. The images are totally unknown to
the model, and the task of the machine is to find the patterns and categories of the objects.
So, now the machine will discover its patterns and differences, such as colour difference,
shape difference, and predict the output when it is tested with the test dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
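A minimal clustering sketch, again assuming scikit-learn: K-Means is given only unlabelled points and groups them by similarity.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data: no output labels are given to the model
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Ask K-Means to discover 3 groups based only on similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])              # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the discovered centroids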
Semi-supervised learning is a machine learning technique that uses both labeled and
unlabeled data to train models. It combines the strengths of supervised and unsupervised
learning, enabling models to learn from a mix of explicit examples and broader data
structure. This approach is particularly useful when labeled data is scarce but unlabeled data
is abundant.
Here's a more detailed breakdown:
Key Concepts:
Labeled data:
Provides explicit examples of what input data corresponds to which labels, allowing the
model to learn to predict the label for new data.
Unlabeled data:
Helps the model understand the overall structure and distribution of the data, improving its
generalization ability.
Benefits:
Reduced labeling cost: Requires less human effort to label data compared to
fully supervised learning.
Improved performance: Leverages the information in unlabeled data to
create more robust and accurate models.
Applicable to various tasks: Can be used for classification, regression, and
other machine learning tasks.
Techniques:
Self-training: Uses a supervised model to predict labels for unlabeled data,
then iteratively retrains the model with the newly labeled data.
Co-training: Trains multiple classifiers independently on different views of
the data, and they then label each other's unlabeled data.
Graph-based methods: Represent data as a graph and use graph structures
to propagate information between nodes, helping the model learn from both
labeled and unlabeled data.
Examples:
Image classification: Using a small amount of labeled images to train a model,
and then using a large amount of unlabeled images to further refine the
model's understanding of visual patterns.
Text classification: Using a small amount of labeled text documents to train
a model, and then using a large amount of unlabeled text documents to
improve the model's ability to understand and classify different types of text.
In essence, semi-supervised learning bridges the gap between supervised and unsupervised
learning, allowing models to learn from both explicit examples and implicit patterns in the
data, ultimately leading to more robust and accurate predictions.
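A minimal sketch of the self-training technique described above, assuming scikit-learn (which provides a SelfTrainingClassifier); unlabelled samples are marked with -1.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pretend most labels are unknown: scikit-learn marks unlabelled samples with -1
rng = np.random.RandomState(42)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.7] = -1

# The base supervised model labels the unlabelled data and is retrained iteratively
base = SVC(probability=True, gamma="auto")
model = SelfTrainingClassifier(base)
model.fit(X, y_partial)
print("Accuracy on the fully labelled set:", model.score(X, y))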
Reinforcement learning (RL) is a subfield of machine learning where an agent learns to make
decisions by interacting with an environment to maximize a reward. It differs from
supervised learning by not relying on labeled data; instead, the agent learns through trial
and error, receiving feedback (rewards or penalties) for its actions. The goal is to develop an
optimal policy, a set of rules, that guides the agent to achieve the desired outcome in a given
environment.
Here's a more detailed explanation:
Key Concepts:
Agent: The system that learns and interacts with the environment.
Environment: The context in which the agent operates, including its state and the
consequences of the agent's actions.
Actions: The choices the agent makes within the environment.
Reward: Feedback received by the agent for each action, indicating whether it was
beneficial or detrimental.
State: The current situation of the environment that the agent observes.
Policy: The agent's strategy or decision-making rule that determines which action to
take in a given state.
How Reinforcement Learning Works:
1. Interaction:
The agent interacts with the environment, taking actions and receiving feedback.
2. Learning:
The agent uses the feedback to update its policy, learning which actions lead to higher
rewards.
3. Iteration:
This process of interaction and learning is repeated iteratively, allowing the agent to refine
its policy over time.
4. Optimization:
The goal is to find a policy that maximizes the cumulative reward received over time.
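A minimal sketch of this loop on a toy problem: an agent in a 1-D world of five states must learn to walk right to reach a goal state, and a Q-table (its policy) is updated from the rewards it receives by trial and error. All numbers here are illustrative.

import numpy as np

n_states, n_actions = 5, 2           # actions: 0 = move left, 1 = move right; goal = state 4
Q = np.zeros((n_states, n_actions))  # estimated value of each action in each state
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Explore occasionally, otherwise exploit the current policy
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(Q[state].argmax())
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: learn from the reward received for this action
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.argmax(axis=1))  # learned greedy action per state (1 = "right" for the non-terminal states)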
This process provides a context in which we can consider the data preparation required for
the project, informed both by the definition of the project performed before data preparation
and the evaluation of machine learning algorithms performed after.
The first subset is known as the training data - it’s a portion of our actual dataset that is fed
into the machine learning model to discover and learn patterns. In this way, it trains our model.
The other subset is known as the testing data. Training data is typically larger than testing data.
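A minimal sketch of creating these two subsets, assuming scikit-learn; an 80/20 split is a common choice.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% of the rows train the model, the remaining 20% are held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)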
1. Linear regression: Linear regression is one of the easiest and most popular Machine
Learning algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as sales, salary,
age, product price, etc.
2. Logistic regression : Logistic regression is one of the most popular Machine Learning
algorithms, which comes under the Supervised Learning technique. It is used for predicting
the categorical dependent variable using a given set of independent variables.
3. Decision tree : Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving Classification
problems. It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the outcome.
4. SVM algorithm : Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
5. Naive Bayes algorithm : Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems. It is mainly used in text
classification that includes a high-dimensional training dataset.
7. K-means : The K-means clustering algorithm computes the centroids and iterates until it finds
the optimal centroids. It assumes that the number of clusters is already known. It is also
called a flat clustering algorithm. The number of clusters identified from the data by the
algorithm is represented by ‘K’ in K-means. In this algorithm, the data points are assigned to a
cluster in such a manner that the sum of the squared distances between the data points and the
centroid is minimum. It is to be understood that less variation within a cluster leads to
more similar data points within the same cluster.
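As a minimal sketch of the first algorithm above (linear regression), assuming scikit-learn; the experience and salary figures are purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: years of experience vs. salary
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([30000, 35000, 41000, 45000, 52000, 56000])

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)   # learned slope and intercept
print(model.predict([[7]]))            # predicted salary for 7 years of experience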
Machine learning is a buzzword for today's technology, and it is growing very rapidly day by
day. We are using machine learning in our daily life even without knowing it such as Google
Maps, Google assistant, Alexa, etc. Below are some most trending real-world applications of
Machine Learning:
While using Google, we get an option of "Search by voice," it comes under speech recognition,
and it's a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known
as "Speech to text", or "Computer speech recognition." At present, machine learning
algorithms are widely used by various applications of speech recognition. Google
assistant, Siri, Cortana, and Alexa are using speech recognition technology to follow the
voice instructions.
Everyone who is using Google Maps is helping this app to get better. It takes information
from the user and sends it back to its database to improve the performance.
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for
some product on Amazon, we start getting advertisements for the same product while surfing
the internet in the same browser, and this is because of machine learning. Google
understands the user’s interest using various machine learning algorithms and suggests
products as per customer interest. Similarly, when we use Netflix, we find recommendations
for entertainment series, movies, etc., and this is also done with the help of machine learning.
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, one of the most popular car
manufacturing companies, is working on self-driving cars. It uses an unsupervised learning
method to train the car models to detect people and objects while driving.
Email providers apply several layers of spam filters, including permission filters. Some
machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As
the name suggests, they help us in finding the information using our voice instruction. These
assistants can help us in various ways just by our voice instructions such as Play music, call
someone, open an email, Scheduling an appointment, etc.
Machine learning algorithms are an important part of these virtual assistants. The assistants
record our voice instructions, send them over the server on the cloud, decode them using ML
algorithms, and act accordingly.
Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways a
fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen
in the middle of a transaction. To detect this, a Feed Forward Neural Network helps us by
checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into some hash values, and these
values become the input for the next round. For each genuine transaction, there is a specific
pattern, which changes for a fraudulent transaction; hence the network detects it and makes
our online transactions more secure.
Machine learning is widely used in stock market trading. In the stock market, there is always
a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM)
neural network is used for the prediction of stock market trends.
In medical science, machine learning is used for disease diagnosis. With this, medical
technology is growing very fast and is able to build 3D models that can predict the exact
position of lesions in the brain.
It helps in finding brain tumors and other brain-related diseases easily.
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all,
as machine learning also helps us here by converting the text into languages we know.
Google's GNMT (Google Neural Machine Translation) provides this feature,
which is a neural machine translation system that translates the text into our familiar
language; this is called automatic translation.
The technology behind the automatic translation is a sequence-to-sequence learning
algorithm, which is used with image recognition and translates the text from one language
to another language.
The function and popularity of Artificial Intelligence are soaring by the day. Artificial
intelligence is the ability of a system or a program to think and learn from experience. AI
applications have evolved significantly over the past few years and have found applications
in almost every business sector. The top artificial intelligence applications in the real world are:
Personalized Shopping
Artificial Intelligence technology is used to create recommendation engines through which
you can engage better with your customers. These recommendations are made in
accordance with their browsing history, preference, and interests. It helps in improving your
relationship with your customers and their loyalty towards your brand.
AI-powered Assistants
Virtual shopping assistants and chatbots help improve the user experience while shopping
online. Natural Language Processing is used to make the conversation sound as human and
personal as possible. Moreover, these assistants can have real-time engagement with your
customers. Did you know that on amazon.com, soon, customer service could be handled by
chatbots?
Although the education sector is the one most influenced by humans, Artificial Intelligence
has slowly begun to seep its roots in the education sector as well. Even in the education
sector, this slow transition of Artificial Intelligence has helped increase productivity among
faculties and helped them concentrate more on students than office or administration work.
AI can also take over administrative tasks such as grading paperwork, arranging and
facilitating parent and guardian interactions, providing routine issue feedback, and managing
enrolment, courses, and HR-related topics.
Artificial Intelligence helps create a rich learning experience by generating and providing
audio and video summaries and integral lesson plans.
Voice Assistants
Without even the direct involvement of the lecturer or the teacher, a student can access extra
learning material or assistance through Voice Assistants. This reduces the printing costs of
temporary handbooks and also provides answers to very common questions easily.
Personalized Learning
Using AI technology, hyper-personalization techniques can be used to monitor students’ data
thoroughly, and habits, lesson plans, reminders, study guides, flash notes, frequency of
revision, etc., can be easily generated.
Artificial Intelligence has a lot of influence on our lifestyle. Let us discuss a few of them.
Autonomous Vehicles
Automobile manufacturing companies like Toyota, Audi, Volvo, and Tesla use machine
learning to train computers to think and evolve like humans when it comes to driving in any
environment and object detection to avoid accidents.
Spam Filters
The email that we use in our day-to-day lives has AI that filters out spam emails sending
them to spam or trash folders, letting us see the filtered content only. The popular email
provider, Gmail, has managed to reach a filtration capacity of approximately 99.9%.
Facial Recognition
Our favorite devices like our phones, laptops, and PCs use facial recognition techniques by
using face filters to detect and identify in order to provide secure access. Apart from personal
usage, facial recognition is a widely used Artificial Intelligence application even in high
security-related areas in several industries.
Recommendation System
Various platforms that we use in our daily lives like e-commerce, entertainment websites,
social media, video sharing platforms, like YouTube, etc., all use the recommendation system
to get user data and provide customized recommendations to users to increase engagement.
This is a very widely used Artificial Intelligence application in almost all industries.
Based on research from MIT, GPS technology can provide users with accurate, timely, and
detailed information to improve safety. The technology uses a combination of Convolutional
Neural Network and Graph Neural Network, which makes lives easier for users by
automatically detecting the number of lanes and road types behind obstructions on the
roads. AI is heavily used by Uber and many logistics companies to improve operational
efficiency, analyze road traffic, and optimize routes.
Robotics is another field where artificial intelligence applications are commonly used.
Robots powered by AI use real-time updates to sense obstacles in their path and pre-plan
their journey instantly, and they are used for tasks such as inventory management.
Did you know that companies use intelligent software to ease the hiring process?
Artificial Intelligence helps with blind hiring. Using machine learning software, you can
examine applications based on specific parameters. AI-driven systems can scan job
candidates' profiles and resumes to give recruiters an understanding of the talent pool
they must choose from.
Artificial Intelligence finds diverse applications in the healthcare sector. AI applications are
used in healthcare to build sophisticated machines that can detect diseases and identify
cancer cells. Artificial Intelligence can help analyze chronic conditions with lab and other
medical data to ensure early diagnosis. AI uses the combination of historical data and
medical intelligence for the discovery of new drugs.
Artificial Intelligence is used to identify defects and nutrient deficiencies in the soil. Using
computer vision, robotics, and machine learning applications, AI can analyze where weeds
are growing. AI bots can help harvest crops at a higher volume and a faster pace than human
laborers.
Another sector where Artificial Intelligence applications have found prominence is the
gaming sector. AI can be used to create smart, human-like NPCs to interact with the players.
It can also be used to predict human behavior using which game design and testing can be
improved. The game Alien: Isolation, released in 2014, uses AI to stalk the player throughout
the game. The game uses two Artificial Intelligence systems: the ‘Director AI’, which frequently
knows your location, and the ‘Alien AI’, driven by sensors and behaviors, which continuously
hunts the player.
Every technology has some disadvantages, and the same goes for Artificial Intelligence.
Despite being such an advantageous technology, it still has some disadvantages which we
need to keep in mind while creating an AI system. The following are the disadvantages of AI:
High Cost: The hardware and software requirements of AI are very costly, as it requires
lots of maintenance to meet current world requirements.
Can't think out of the box: Even though we are making smarter machines with AI, they
still cannot work outside the box, as a robot will only do the work for which it is trained
or programmed.
No feelings and emotions: AI machines can be outstanding performers, but they still do
not have feelings, so they cannot form any kind of emotional attachment with humans,
and may sometimes be harmful to users if proper care is not taken.
Increased dependency on machines: With the advance of technology, people are getting
more and more dependent on devices and hence are losing their mental capabilities.
No original creativity: Humans are creative and can imagine new ideas, but AI machines
cannot match this power of human intelligence and cannot be creative and imaginative.
a) Chatbot
In the past few years, chatbots in Python have become wildly popular in the tech and
business sectors. These intelligent bots are so adept at imitating natural human languages
and conversing with humans, that companies across various industrial sectors are adopting
them. From e-commerce firms to healthcare institutions, everyone seems to be leveraging
this nifty tool to drive business benefits.
When you want to watch a movie or shop online, have you noticed that the items suggested
to you are often aligned with your interests or recent searches? These smart
recommendation systems have learned your behavior and interests over time by following
your online activity. The data is collected at the front end (from the user) and stored and
analyzed through machine learning and deep learning. The system is then able to predict your
preferences and offer recommendations for things you might want to buy or listen to next.
Artificial Intelligence has played a major role in decision making. Not only in the healthcare
industry: AI has also improved businesses by studying customer needs and evaluating
potential risks.
A powerful use case of Artificial Intelligence in decision making is the use of surgical robots
that can minimize errors and variations and eventually help increase the efficiency of
surgeons. One such surgical robot is the aptly named Da Vinci, which allows professional
surgeons to perform complex surgeries with better flexibility and control than conventional
approaches.
NumPy (Numerical Python) is a powerful library in Python that provides support for
handling large, multi-dimensional arrays and matrices, along with a collection of
mathematical functions to operate on these arrays. It is widely used in scientific computing,
data analysis, machine learning, and engineering.
https://colab.research.google.com/drive/12n0C8Fz6Ck8qFMHjzhXb3LFButuE68RC?usp=sharing
Example:
Output:
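The example and its output appear as screenshots in the source; a minimal sketch of creating and inspecting a NumPy array, with the printed output shown in comments, would be:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr)          # [1 2 3 4 5]
print(type(arr))    # <class 'numpy.ndarray'>
print(arr.dtype)    # int64 (platform dependent)
print(arr.shape)    # (5,)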
In NumPy, the core data structure is the ndarray (N-dimensional array), which can hold
elements of a single data type. NumPy provides different types of arrays to accommodate
various needs based on data type, dimensions, and memory usage.
The ndarray is the most basic and widely used array in NumPy. It is a multi-dimensional
container that holds elements of a single data type.
Key Characteristics:
Homogeneous: All elements in a NumPy array are of the same data type (e.g., all
integers, all floats).
Fixed Size: Once created, the size of a NumPy array cannot be changed.
Efficient: NumPy arrays are more efficient in terms of memory and computation
compared to Python's built-in lists.
Example 1
Output:
Example 2
Output:
Example 3
Output:
Slice from the index 3 from the end to index 1 from the end:
Output:
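The code and output for this example are screenshots in the source; a minimal sketch of slicing from index 3 from the end up to (but not including) index 1 from the end, assuming a small sample array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])   # [5 6]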
Example 1
From the second element, slice elements from index 1 to index 4 (not included):
Output:
Example 2
Output:
Up until now, we have been discussing some of the basic nuts and bolts of NumPy; in the next
few sections, we will dive into the reasons that NumPy is so important in the Python data
science world. Namely, it provides an easy and flexible interface to optimized computation
with arrays of data.
Computation on NumPy arrays can be very fast, or it can be very slow. The key to making it
fast is to use vectorized operations, generally implemented through NumPy's universal
functions (ufuncs).
These functions include standard trigonometric functions, functions for arithmetic
operations, handling complex numbers, statistical functions, etc. Universal functions have
various characteristics, which are as follows:
These functions operate on ndarray (N-dimensional array), i.e. NumPy's array class.
They perform fast element-wise array operations.
They support various features like array broadcasting, type casting, etc.
In NumPy, universal functions are objects that belong to the numpy.ufunc class.
Python functions can also be turned into universal functions using the frompyfunc
library function.
Some ufuncs are called automatically when the corresponding arithmetic operator
is used on arrays. For example, when the addition of two arrays is performed element-
wise using the ‘+’ operator, np.add() is called internally.
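A minimal sketch of a universal function in action; adding two arrays with the ‘+’ operator and with np.add() gives the same element-wise result:

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print(a + b)         # [11 22 33 44]  -- np.add() is called internally
print(np.add(a, b))  # [11 22 33 44]
print(np.sqrt(a))    # element-wise square root, another ufunc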
In the Python numpy module, we have many aggregate functions or statistical functions to
work with a single-dimensional or multi-dimensional array. The Python numpy aggregate
functions are sum, min, max, mean, average, product, median, standard deviation, variance,
argmin, argmax and percentile.
To demonstrate these Python numpy aggregate functions, we use the below-shown arrays.
Output:
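The demonstration arrays and their results appear as screenshots in the source; a minimal sketch of a few aggregate functions on a 2-D array, with and without the axis argument:

import numpy as np

arr = np.array([[10, 20, 30],
                [40, 50, 60]])

print(np.sum(arr))           # 210            -- over the whole array
print(np.sum(arr, axis=0))   # [50 70 90]     -- column-wise sums
print(np.min(arr, axis=1))   # [10 40]        -- row-wise minimums
print(np.mean(arr))          # 35.0
print(np.std(arr))           # standard deviation of all elements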
Output:
Output:
Code 11: Numpy average along with axis (with Axis name)
Output:
Output 11: Numpy average along with axis (with Axis name)
Output:
We are finding the numpy array minimum value in the X and Y-axis.
Code 13: Numpy minimum function with and without axis name
Output:
Output 13: Numpy minimum function with and without axis name
Output:
Find the maximum value in the X and Y-axis using numpy max function.
Output:
An ndarray is a (usually fixed-size) multidimensional container of items of the same type and
size. The number of dimensions and items in an array is defined by its shape, which is a tuple
of N positive integers that specify the sizes of each dimension.
The type of items in the array is specified by a separate data-type object (dtype), one of which
is associated with each ndarray.
Like other container objects in Python, the contents of an ndarray can be accessed and
modified by indexing or slicing the array (using, for example, N integers), and via the
methods and attributes of the ndarray.
1. object – Any object exposing the array interface method returns an array, or any
(nested) sequence.
2. dtype – Desired data type of the array (optional).
3. copy – Optional. By default (True), the object is copied.
4. order – C (row major), F (column major) or A (any) (default).
5. subok – By default, the returned array is forced to be a base-class array; if True,
sub-classes are passed through.
6. ndmin – Specifies the minimum dimensions of the resultant array.
Table 3: Numpy array Parameters
Example:
Output:
Example:
More than one dimension:
Output:
2.11.8 Broadcasting
The term broadcasting describes how numpy treats arrays with different shapes during
arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across
the larger array so that they have compatible shapes. Broadcasting is a powerful mechanism
that allows numpy to work with arrays of different shapes when performing arithmetic
operations. Frequently we have a smaller array and a larger array, and we want to use the
smaller array multiple times to perform some operation on the larger array.
For example, suppose that we want to add a constant vector to each row of a matrix. We
could do like this:
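The loop version appears as a screenshot in the source; a minimal sketch of adding a constant vector v to each row of a matrix x with an explicit Python loop:

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)   # result matrix with the same shape as x

# Add the vector v to each row of x with an explicit (and potentially slow) loop
for i in range(x.shape[0]):
    y[i, :] = x[i, :] + v

print(y)
# [[ 2  2  4]
#  [ 5  5  7]
#  [ 8  8 10]
#  [11 11 13]]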
This works; however, when the matrix x is very large, computing an explicit loop in
Python could be slow.
Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix
vv by stacking multiple copies of v vertically, and then performing elementwise summation
of x and vv. We could implement this approach like this:
Using Tile:
Code 19: Adding a constant vector to each row of a matrix using tile() function
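Code 19 appears as a screenshot in the source; a minimal sketch of the same idea using np.tile() to stack copies of v and then add element-wise:

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])

vv = np.tile(v, (4, 1))   # stack 4 copies of v on top of each other -> shape (4, 3)
y = x + vv                # element-wise sum of x and vv
print(y)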
Output:
Output 19: Adding a constant vector to each row of a matrix using tile() function
Broadcasting:
Output:
The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to
broadcasting; this line works as if v actually had shape (4, 3), where each row was a copy
of v, and the sum was performed elementwise.
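A minimal sketch of the broadcast version; no explicit loop and no tiling are needed:

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])   # shape (4, 3)
v = np.array([1, 0, 1])                                          # shape (3,)

y = x + v    # v is broadcast across every row of x
print(y)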
Broadcasting Rules:
The trailing axes of both arrays must either be 1 or have the same size for broadcasting to
occur. Otherwise, a “ValueError: frames are not aligned” exception is thrown.
Broadcasting in Action:
Output:
Case A
a. 1D Array:
For a 1D array, let's suppose we want to access the elements at index positions 0, 4 and -1.
Output:
Output:
Example:
Output:
Example:
Output:
Output:
Assessment Criteria
References:
Exercise
7) Which of the following sets the size of the buffer used in ufuncs?
a. setsize(size)
b. bufsize(size)
c. setbufsize(size)
d. All of the mentioned
8) Which of the following attribute should be used while checking for type combination
input and output?
a. .types
b. .class
c. .type
d. None of the above
10) The ________ function returns its argument with a modified shape, whereas the
________method modifies the array itself.
a. resize, reshape
b. reshape, resize
c. reshape2, resize
d. None of the above
12) Which of the following method creates a new array object that looks at the same
data?
a. copy
b. paste
c. view
d. All of the above
13) Which of the following function take only single value as input?
a. fmin
b. minimum
c. iscomplex
d. None of the above
14) Which of the following set the floating-point error callback function or log object?
a. settercall
b. setter
c. setterstack
d. All of the above
20) Select the most appropriate situation for that a blind search can be used.
a. Real-life situation
b. Small Search Space
c. Complex game
d. All of the above
21) a robot is able to change its own trajectory as per the external conditions, then the
b. Deterministic
c. Partial
d. Stochastic
1. Machine Learning (ML) allows machines to learn from data and improve their
performance without being explicitly programmed. (T/F)
2. The first phase in the Data Science Life Cycle is Model Building. (T/F)
3. Primary data is cheaper and faster to collect compared to secondary data. (T/F)
6. Feature engineering is the process of removing features from the dataset to reduce
complexity. (T/F)
8. Validation data helps in evaluating the model's performance on unseen data before
final testing. (T/F)
10. The function np.linspace(1, 10, 5) creates an array with 5 evenly spaced values
between 1 and 10. (T/F)
2. Data Collection and Processing: Create a Python script that collects user-inputted
data, processes it, and stores it in a structured format (e.g., CSV or JSON).
3. Create a NumPy Array: Write a Python program to create a 1D NumPy array with
values from 10 to 50 and print it.
4. Array Operations: Create a 2D NumPy array with values ranging from 1 to 9 (3×3
matrix).
5. Reshaping and Slicing:
Chapter 3:
Introduction to Data Curation
In the digital age, the volume, variety, and velocity of data generation have increased
exponentially. Organizations, researchers, and industries rely on data-driven insights for
decision-making, innovation, and competitive advantage. However, raw data is often
unstructured, incomplete, or inconsistent, necessitating a systematic approach to manage,
organize, and preserve it. This process, known as data curation, ensures that data remains
accessible, reliable, and meaningful over time.
Data curation involves collecting, cleaning, managing, and archiving data to maintain its
integrity and usability. It is a crucial component in fields such as scientific research,
healthcare, business intelligence, and artificial intelligence, where data quality directly
impacts outcomes.
By implementing robust curation practices, organizations can enhance data interoperability,
compliance, and long-term sustainability.
1. Data Collection and Acquisition: Gathering data from various sources, including
experiments, sensors, databases, and external datasets, ensuring completeness and
relevance.
2. Data Cleaning and Validation: Identifying and rectifying errors, inconsistencies, and
missing values to improve data quality and accuracy.
3. Metadata Management: Creating structured descriptions of data, including provenance,
format, and usage, to enhance discoverability and reusability.
4. Data Storage and Organization: Implementing efficient storage solutions that support
scalability, security, and accessibility.
5. Data Preservation and Archiving: Ensuring long-term availability and integrity
through version control, backups, and compliance with archival standards.
As data continues to be a strategic asset in the digital ecosystem, effective data curation
practices become indispensable for ensuring that data remains a valuable and sustainable
resource for the future.
In the digital age, artificial intelligence (AI) and machine learning (ML) have revolutionized
industries by enabling automation, predictive analytics, and intelligent decision-making.
However, the effectiveness of AI and ML models heavily depends on the quality of data used
for training, validation, and testing. Data curation, the process of collecting, organizing, and
maintaining data for efficient use, plays a crucial role in ensuring high-quality, unbiased, and
reliable datasets for AI and ML applications.
Data curation involves systematic processes such as data cleaning, annotation, integration,
storage, and governance. Poorly curated data can lead to inaccurate models, biased
outcomes, and unreliable insights, making proper curation essential for achieving optimal AI
performance and ethical AI applications.
Data curation is applied in various AI and ML use cases across industries. Some notable
examples include:
1. Healthcare and Medical AI: In medical diagnostics, curated datasets of medical
images (e.g., X-rays, MRIs) are used to train AI models for disease detection and
treatment recommendations. Proper curation ensures data accuracy and compliance
with healthcare regulations.
2. Fraud Detection in Finance: Financial institutions use curated transaction data to
train ML models that identify fraudulent activities by detecting anomalies in spending
patterns.
3. Natural Language Processing (NLP): AI-driven chatbots, language translation tools,
and sentiment analysis models require large, well-annotated text datasets to improve
accuracy and contextual understanding.
4. Retail and Recommendation Systems: E-commerce platforms curate user behavior
data to power AI-based recommendation systems, helping personalize product
suggestions based on browsing history and preferences.
The significance of data curation in AI and ML cannot be overstated. Key reasons why it
is essential include:
As AI and ML technologies continue to evolve, data curation will become even more
critical. Emerging trends in AI data curation include:
1. Automated Data Curation: AI-driven tools for automated data labeling, cleansing,
and augmentation to reduce human effort and improve efficiency.
2. Federated Learning: A decentralized approach to data curation where models learn
from distributed datasets while maintaining privacy and security.
3. Synthetic Data Generation: The creation of artificial yet realistic datasets to
supplement training data and address data scarcity issues.
4. Real-time Data Processing: AI models increasingly rely on real-time curated data
streams for dynamic decision-making in industries such as cybersecurity and IoT.
5. Explainability and Trustworthy AI: Transparent and well-documented curation
practices enhance AI model interpretability and regulatory compliance.
Data curation is a foundational element in AI and ML, ensuring that datasets are accurate,
unbiased, and well-structured. Without proper curation, AI models risk generating
unreliable, unfair, and even harmful outcomes. As AI-driven applications expand,
organizations must invest in robust data curation strategies to build ethical, scalable, and
high-performing AI systems. Looking ahead, advancements in AI-powered curation tools and
methodologies will continue to shape the future of AI, making it more reliable, fair, and
effective for real-world applications.
Data curation is a critical process in the management of data throughout its lifecycle,
ensuring that data is properly collected, organized, preserved, and made available for
analysis. Here’s an overview of the key stages in the data curation process:
1. Data Collection
Definition and Importance: This is the initial step where relevant data is gathered
from various sources. Data can be collected through surveys, experiments, sensors,
transactions, or public datasets.
Best Practices:
o Define clear objectives to ensure relevant data is collected.
o Use standardized methods and tools to ensure consistency.
o Consider ethical implications and compliance with regulations such as GDPR.
2. Data Organization
5. Data Analysis
Techniques and Tools: At this stage, data is analyzed using various statistical,
analytical, and machine learning methods to extract insights and support decision-
making.
Best Practices:
o Select appropriate analytical techniques that align with the research
questions.
o Document the analysis process to ensure reproducibility.
o Use visualization tools to communicate findings effectively.
6. Data Sharing and Publication
Disseminating Results: After analysis, findings should be shared with stakeholders,
published in relevant venues, or made available in data repositories.
Best Practices:
o Adhere to data sharing policies and ethical guidelines.
o Use open data principles when possible to enhance transparency and
collaboration.
7. Continuous Monitoring and Feedback
Iterative Improvement: Data curation is an ongoing process that benefits from
feedback and continuous monitoring, allowing for improvements in data quality and
handling procedures over time.
Best Practices:
o Collect feedback from users to enhance data relevance and accessibility.
o Stay updated on new tools and methods for data curation.
1. Healthcare
Application: Electronic Health Records (EHR) Management
Hospitals and healthcare providers use curated data to maintain accurate patient
records.
Ensures seamless integration of medical history, prescriptions, test results, and
treatment plans.
Helps in predictive analytics for disease prevention and personalized medicine.
Example: AI-driven diagnostic tools like IBM Watson Health use curated data to provide
personalized treatment recommendations.
2. Finance
Application: Fraud Detection and Risk Management
Financial institutions curate transactional data to detect fraud patterns.
Risk assessment models use structured datasets to evaluate creditworthiness.
Regulatory compliance is ensured through accurate data reporting.
Example: Banks use machine learning models trained on curated transaction data to
identify suspicious activities in real-time.
3. E-Commerce
Data curation is essential for ensuring that information is accurate, organized, and usable for
analysis and decision-making. However, it comes with several challenges, especially as data
grows in complexity and volume. Below are some of the most significant challenges in data
curation:
1. Data Quality
Challenge: Ensuring that data is accurate, complete, and consistent.
Inconsistent formats: Data may be collected from multiple sources in different
structures (e.g., dates written as MM/DD/YYYY vs. DD/MM/YYYY).
Incomplete data: Missing values or incorrect entries can affect decision-making.
Duplicate records: Repeated or redundant data can lead to inefficiencies and errors.
Example: In healthcare, incorrect or missing patient records can lead to incorrect diagnoses
or treatments.
2. Data Volume
Challenge: Managing and processing massive amounts of data.
With the rise of big data, organizations collect vast amounts of structured and
unstructured data.
Storing, organizing, and analyzing this data requires powerful infrastructure and
computing resources.
Example: Social media platforms like Facebook handle petabytes of user-generated content
daily, requiring advanced data management strategies.
3. Data Variety
Challenge: Handling different data types and sources.
Data comes in multiple formats, such as structured (databases), semi-structured
(JSON, XML), and unstructured (images, videos, social media posts).
Integrating and standardizing diverse data sources is complex and time-consuming.
Example: In e-commerce, companies collect data from website clicks, reviews, transaction
records, and customer service chats, all requiring different processing methods.
4. Data Integration
Challenge: Combining data from multiple sources into a unified system.
Merging data from different platforms (e.g., CRM systems, social media, IoT devices)
can lead to conflicts in data consistency.
API limitations and data silos make integration more challenging.
Example: Financial institutions need to integrate data from banking systems, credit bureaus,
and third-party sources to assess credit risks.
1. Data Collection
Objective: Gather raw data from various sources.
Data can come from structured sources (databases, APIs) or unstructured sources
(social media, documents, images).
Ensure that data collection methods comply with privacy regulations (e.g., GDPR,
HIPAA).
Example: A healthcare system collects patient records from hospitals, clinics, and wearable
devices.
6. Metadata Management
Objective: Document and categorize data for easy discovery and governance.
Assign metadata (descriptive information) such as data source, date, format, and
ownership.
Enable data lineage tracking to monitor changes over time.
Use metadata standards (e.g., Dublin Core for digital content, FAIR principles for
research data).
Example: A research institution tags datasets with metadata like author, publication date,
and licensing terms.
Data Collection
Data collection is the foundational step in data curation, involving the gathering of raw data
from various sources for analysis, decision-making, and business intelligence. The
effectiveness of data-driven insights depends on the quality, accuracy, and reliability of the
collected data.
Common methods of data collection include: manual data collection, automated data collection,
web scraping & crawling, data logging & tracking, and real-time streaming data collection.
Privacy & Compliance: Regulations like GDPR and HIPAA restrict certain data
collection methods.
Integration Issues: Combining data from different sources requires careful
standardization.
Cost & Resource Constraints: Collecting high-quality data can be expensive and
time-consuming.
Data cleaning is a crucial step in data curation that ensures datasets are accurate, consistent,
and reliable for analysis. Poor data quality can lead to incorrect insights, financial losses, and
operational inefficiencies.
A. Identifying Inconsistencies
Format Checking: Ensure uniform data formats (e.g., date format YYYY-MM-DD).
Value Range Checking: Verify that values fall within expected limits (e.g., age should
be between 0 and 120).
Categorical Consistency: Standardize categories (e.g., "USA" vs. "United States").
Source: https://airbyte.com/data-engineering-resources/data-curation
Data transformation is a crucial step in data processing that involves converting raw data
into a structured and meaningful format for analysis. This process ensures that data is clean,
consistent, and suitable for machine learning models, business intelligence, and decision-
making.
A. Data Normalization
Objective: Convert data into a common scale to prevent biases in analysis.
Min-Max Scaling: Rescales data between 0 and 1.
o Formula: X_scaled = (X − X_min) / (X_max − X_min)
o Example: Rescaling product prices between 0 and 1 for machine learning
models.
Z-score Standardization: Converts data to a normal distribution with a mean of 0
and a standard deviation of 1.
o Formula: z = (x - μ) / σ
o Example: Standardizing customer age data for better statistical comparisons.
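A minimal sketch of both scalings using plain NumPy (scikit-learn's MinMaxScaler and StandardScaler provide equivalent, reusable transformers); the prices are illustrative:

import numpy as np

prices = np.array([120.0, 250.0, 80.0, 310.0, 150.0])

# Min-Max scaling: values rescaled to the [0, 1] range
min_max = (prices - prices.min()) / (prices.max() - prices.min())

# Z-score standardization: mean 0, standard deviation 1
z_scores = (prices - prices.mean()) / prices.std()

print(min_max)
print(z_scores)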
B. Data Encoding
Objective: Convert categorical data into numerical values for analysis.
One-Hot Encoding: Creates separate binary columns for each category.
o Example: Converting a "Color" column (Red, Blue, Green) into separate
columns: [1,0,0], [0,1,0], [0,0,1].
Label Encoding: Assigns a unique integer to each category.
o Example: Converting ["Male", "Female", "Other"] to [0,1,2].
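A minimal sketch of both encodings with pandas; get_dummies performs one-hot encoding, while category codes give a simple label encoding:

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"],
                   "Gender": ["Male", "Female", "Other", "Female"]})

# One-hot encoding: one binary column per colour
one_hot = pd.get_dummies(df["Color"], prefix="Color")

# Label encoding: each gender category mapped to a unique integer code
df["Gender_code"] = df["Gender"].astype("category").cat.codes

print(one_hot)
print(df)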
C. Data Aggregation
Objective: Summarize data by grouping and combining values.
Summing or Averaging:
o Example: Aggregating total monthly sales from daily transaction data.
Grouping by Categories:
o Example: Grouping customer purchases by region for market analysis.
E. Handling Outliers
Objective: Identify and manage extreme values that may skew analysis.
Z-Score Method: Removes data points with a Z-score greater than 3.
Interquartile Range (IQR): Filters values outside the acceptable range.
o Formula: IQR = Q3 − Q1; outliers are values X < Q1 − 1.5 × IQR or X > Q3 + 1.5 × IQR,
where Q1 and Q3 are the first and third quartiles.
o Example: Removing unusually high or low product prices from a sales dataset.
F. Feature Engineering
Objective: Create new variables to improve analysis and predictive modeling.
Extracting Features from Dates:
o Example: Splitting "2024-07-15" into Year = 2024, Month = 7, Day = 15.
Creating Interaction Terms:
o Example: Generating a new feature Revenue = Price × Quantity.
Data storage and organization are critical aspects of data management, ensuring that
information is securely stored, efficiently retrieved, and properly structured for analysis.
Organizations must choose the right storage solutions and structuring techniques to
optimize performance, scalability, and security.
Data curation involves collecting, cleaning, transforming, organizing, and maintaining data
to ensure its accuracy and usability. Various tools help automate and streamline this process,
making data curation more efficient. Below are some of the most widely used tools for data
curation, categorized based on their functionalities.
A. OpenRefine
📌 Best for: Cleaning and transforming messy data (GUI-based).
Helps structure and clean unstructured data (e.g., deduplication, clustering).
Automates repetitive cleaning tasks with scripts.
Key Features:
o Fuzzy matching to detect inconsistent names.
B. Apache Spark
📌 Best for: Big data processing (structured & unstructured).
Supports distributed data processing across clusters.
Works with large-scale datasets in real time.
Example: Using PySpark for big data processing:
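The PySpark example is shown as a screenshot in the source; a minimal sketch, assuming PySpark is installed and a CSV file named sales.csv with region and amount columns exists (all of these names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session
spark = SparkSession.builder.appName("DataCuration").getOrCreate()

# Read a (potentially very large) CSV file in a distributed fashion
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Simple curation steps: drop duplicates, drop rows with nulls, aggregate by region
cleaned = df.dropDuplicates().dropna()
summary = cleaned.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()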
Understanding data types and their sensitivities is crucial for effective data management,
analysis, and security. Different data types require different handling methods, and data
sensitivity levels dictate the necessary protection measures.
Data can be classified based on its structure, format, and usability. The three main types are
structured, semi-structured, and unstructured data.
A. Structured Data
📌 Definition: Data that is highly organized, follows a predefined format, and is stored in
relational databases with clear relationships between records.
Characteristics:
Stored in rows and columns (like an Excel spreadsheet).
Easily searchable using Structured Query Language (SQL).
Has a fixed schema (e.g., predefined data types: integers, strings, dates).
Examples:
Customer databases (Name, Email, Phone Number, Purchase History).
Financial transactions (Account Number, Transaction ID, Amount).
Inventory records (Product ID, Quantity, Price).
Storage & Processing Methods:
Stored in Relational Database Management Systems (RDBMS) such as MySQL,
PostgreSQL, Microsoft SQL Server, Oracle.
Processed using SQL queries, ETL (Extract, Transform, Load) tools, and business
intelligence (BI) software.
B. Semi-Structured Data
📌 Definition: Data that does not conform to a rigid schema but contains elements of
structure, such as labels, tags, or metadata.
Characteristics:
Flexible format (not stored in strict rows and columns).
Organized using markers or identifiers (JSON, XML, key-value pairs).
Easier to process than unstructured data but requires parsing.
Examples:
JSON & XML files used in APIs and web applications.
Email messages (Subject, Sender, Receiver, Timestamp, Message Body).
Sensor data from IoT devices (timestamped readings from smart devices).
C. Unstructured Data
📌 Definition: Data that has no predefined structure or schema and cannot be organized in
traditional rows and columns.
Characteristics:
Cannot be easily stored in traditional databases.
Requires specialized tools for processing (Natural Language Processing, AI, Machine
Learning).
Makes up the majority (80–90%) of the world’s data.
Examples:
Text documents (Word files, PDFs, scanned documents).
Multimedia (Images, Videos, Audio recordings).
Social media data (Tweets, Facebook posts, YouTube comments).
Logs from web servers, security systems, and software applications.
B. Internal Data
📌 Definition: Proprietary business information meant for internal use only.
Examples:
Internal reports, financial forecasts.
Employee handbooks, operational policies.
Business strategies, market research.
Security Measures:
Role-based access control (RBAC) (only authorized employees can view/edit).
Data encryption to prevent unauthorized modifications.
Example: A company stores internal documents on a secured SharePoint or Google
Drive with restricted access.
C. Confidential Data
📌 Definition: Sensitive business or personal information that could cause harm if disclosed.
Examples:
Customer purchase history and payment details.
Employee salary information.
Business contracts, trade secrets.
Security Measures:
Multi-factor authentication (MFA) for system access.
Encryption (AES-256) to protect stored data.
Audit trails to track access and modifications.
D. Personally Identifiable Information (PII)
📌 Definition: Data that can be used to identify an individual.
Examples:
Full name, Date of Birth, Address.
Social Security Number (SSN), Passport Number.
Phone numbers, Email addresses.
Security Measures:
Artificial Intelligence (AI) and Machine Learning (ML) algorithms are designed to process
and analyze various types of data, including structured, semi-structured, and unstructured
data. The way AI handles different data types depends on the nature of the data, the
learning model, and the preprocessing techniques used to convert raw data into a usable
format.
This hands-on exercise will help you recognize different data types in real-world datasets
and classify them as structured, semi-structured, or unstructured data. We will also use
Python to analyze a sample dataset and identify its data types.
Let's explore how data from different industries fits into structured, semi-structured, or
unstructured categories.
We'll use Python to analyze a sample dataset and classify its data types.
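As a sketch, assuming a local copy of the well-known Titanic passenger dataset saved as titanic.csv (a hypothetical path), the dataset can be loaded and its column data types inspected with Pandas:

import pandas as pd

# Load the sample dataset (titanic.csv is a hypothetical local copy)
df = pd.read_csv("titanic.csv")

# Inspect the first rows and the data type inferred for each column
print(df.head())
print(df.dtypes)

# Count missing values in each column
print(df.isnull().sum())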
In the expected output, we notice missing values in the "Cabin" and "Age" columns. We can handle them as follows:
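One minimal approach (one of several reasonable strategies) is to impute the median age and drop the sparsely populated Cabin column:

import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy, as above

# Fill missing ages with the median and drop the mostly empty Cabin column
df["Age"] = df["Age"].fillna(df["Age"].median())
df = df.drop(columns=["Cabin"])

print(df.isnull().sum())  # verify that the missing values have been handled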
Data sensitivity refers to the level of confidentiality, security, and protection required for
different types of data. Organizations must classify data correctly to ensure compliance with
regulations, prevent security breaches, and protect individuals' privacy. Below are three key
types of data sensitivities with real-world scenarios.
4. Key Takeaways
🔒 Confidential Data: Requires strict protection (Trade secrets, business plans).
🔑 Private Data: Requires privacy laws compliance (PII, health records, bank details).
🌍 Public Data: Freely accessible with minimal security concerns (open datasets, public
reports).
Handling sensitive data requires strict adherence to legal regulations and ethical
guidelines to protect individuals' privacy, prevent misuse, and ensure data security. Failure
to comply with these principles can result in legal penalties, reputational damage, and
loss of trust.
✅ Provide users with full control over their data (delete, modify, or opt-out).
✅ Conduct regular audits to check compliance with ethical and legal standards.
Financial institutions handle highly sensitive information such as bank account numbers, credit card details, and investment records. Exposure of this data leads to fraud, money laundering, and financial loss.
Examples of Sensitive Financial Data
✅ Bank Account & Transaction Data: Account numbers, balances, deposits.
✅ Credit Card Details: Card numbers, CVVs, expiration dates.
✅ Customer PII: Social Security Numbers, tax records.
✅ Investment & Trading Data: Stock trades, cryptocurrency holdings.
Legal & Compliance Standards in Finance
PCI DSS (Payment Card Industry Data Security Standard) – Global
o Mandates encryption for payment processing.
GLBA (Gramm-Leach-Bliley Act) – USA
o Requires banks to disclose how they protect customer data.
SOX (Sarbanes-Oxley Act) – USA
o Ensures corporate transparency in financial reporting.
PSD2 (Payment Services Directive 2) – EU
o Requires strong authentication for digital payments.
Example: Data Breach in a Banking App
🔹 Incident: A mobile banking app had a security flaw that allowed hackers to access user
accounts and withdraw funds.
🔹 Impact:
Customers lost millions of dollars in fraudulent transactions.
The bank faced regulatory fines for weak security.
Trust in the institution declined, causing stock value to drop.
🔹 Preventative Measures:
✅ End-to-end encryption for financial transactions.
✅ AI-driven fraud detection (detects unusual spending patterns).
✅ Two-factor authentication (2FA) for all logins.
Data compromised in the Anthem breach included:
Names, Birthdates, Addresses
Social Security Numbers (SSN)
Medical Identification Numbers
Employment Information & Income Data
Note: No financial information or medical diagnoses were leaked, but identity theft risks
remained high.
Conclusion
The Anthem breach highlights the critical importance of HIPAA compliance in
healthcare. Organizations must implement strong cybersecurity measures, train
employees on data security, and encrypt patient records to prevent similar breaches.
Additionally, using advanced data curation, encryption, and compliance monitoring
tools enhances data security and regulatory adherence, ensuring patient confidentiality
and trust in digital healthcare systems.
Introduction
To effectively manage and protect sensitive data, organizations must leverage advanced
tools and technologies for data curation, encryption, monitoring, and compliance
tracking. The following tools are essential for ensuring data sensitivity in various industries,
particularly healthcare.
Open-source tools provide flexible and cost-effective solutions for handling sensitive data in
compliance with HIPAA and other regulations.
Pandas (Python)
Pandas is a widely used open-source library for data manipulation and analysis. It allows
healthcare organizations to:
Clean and preprocess large datasets efficiently.
Handle missing values, duplicate records, and inconsistencies in patient records.
Merge multiple data sources to create structured datasets for analysis.
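For illustration, a minimal sketch of these steps on hypothetical patient-record extracts (the file names and columns are assumptions):

import pandas as pd

# Hypothetical extracts from two source systems
records = pd.read_csv("patient_records.csv")
insurance = pd.read_csv("insurance_claims.csv")

# Clean the records: drop exact duplicates and fill a missing numeric field
records = records.drop_duplicates()
records["age"] = records["age"].fillna(records["age"].median())

# Merge the two sources on a shared patient identifier
merged = records.merge(insurance, on="patient_id", how="left")
print(merged.head())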
OpenRefine
OpenRefine is a powerful tool designed for cleaning messy data. It is particularly useful
in healthcare for:
Detecting and fixing errors in medical records.
Standardizing terminology and formatting across datasets.
Identifying and removing duplicate records in insurance claims.
Talend is a leading ETL (Extract, Transform, Load) tool that enables organizations to:
Integrate data from various sources (e.g., electronic health records, insurance
databases).
Perform data transformations to ensure consistency and compliance.
Automate data validation processes to detect anomalies and errors.
Apache NiFi
Apache NiFi is an open-source data integration tool that automates and monitors real-
time data flow across multiple systems. Its benefits include:
Secure transmission of sensitive healthcare data.
Automated data tracking and lineage for regulatory compliance.
Scalability for handling large volumes of medical data.
Cloud-based data curation solutions provide scalable, flexible, and cost-effective means of
preparing, cleaning, and transforming data for analytical and operational use. These
solutions leverage the power of cloud computing to automate data ingestion, integration, and
enrichment while reducing the overhead associated with traditional on-premises data
management. They offer real-time collaboration, security, and high availability, making them
a preferred choice for modern enterprises. Two prominent cloud-based data curation
solutions are AWS Glue and Google DataPrep.
AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that simplifies the
process of preparing and managing data for analytics. It supports a serverless environment,
enabling users to clean and catalog data without provisioning or managing infrastructure.
Key Features:
Automated Data Discovery: AWS Glue crawlers automatically scan data sources and
infer schemas, creating metadata in the AWS Glue Data Catalog.
ETL Capabilities: It provides a visual interface for designing ETL workflows and
supports Python and Scala-based transformations.
Serverless Execution: Runs ETL jobs on a fully managed infrastructure,
automatically scaling resources as needed.
Integration with AWS Services: Seamlessly integrates with Amazon S3, Redshift,
Athena, and other AWS analytics services.
Data Governance and Security: Supports encryption, role-based access control, and
integration with AWS IAM for data protection.
Job Scheduling and Monitoring: AWS Glue allows users to schedule jobs, monitor
execution, and troubleshoot errors efficiently.
Machine Learning Integration: Enables predictive data transformation by
integrating with AWS SageMaker and other ML tools.
Google DataPrep
Source: https://airbyte.com/data-engineering-resources/data-curation
Both AWS Glue and Google DataPrep offer powerful cloud-based data curation solutions,
catering to different user needs. AWS Glue is ideal for developers and data engineers who
require automated, serverless ETL pipelines with deep integration into the AWS ecosystem.
Google DataPrep, on the other hand, is best suited for business analysts and data scientists
who need an intuitive, no-code interface for data cleaning and transformation. Organizations
should select the appropriate solution based on their infrastructure, expertise, and specific
data processing requirements. As cloud-based data curation tools continue to evolve, they
will play an increasingly critical role in enhancing data quality, governance, and analytics
capabilities.
Ensuring data security is a critical aspect of modern data management, particularly when
dealing with sensitive information such as personally identifiable information (PII), financial
data, and proprietary business records. Various tools offer robust encryption, secure key
management, and data protection mechanisms to prevent unauthorized access and data
breaches. Two widely used solutions for handling sensitive data are AWS Key Management
Service (KMS) and Python Cryptography.
AWS KMS is a fully managed encryption service that helps users create and control
cryptographic keys to secure their applications and data.
Key Features:
Access Control: Uses AWS Identity and Access Management (IAM) to enforce fine-
grained permissions on key usage.
Automatic Key Rotation: Periodic key rotation enhances security without requiring
user intervention.
Logging and Auditing: Integrated with AWS CloudTrail to track key usage and
monitor security events.
Compliance and Certifications: Meets compliance requirements for security
standards such as HIPAA, GDPR, and FedRAMP.
Python Cryptography
Python Cryptography is a robust library that provides secure encryption, decryption, and
cryptographic operations for software applications.
Key features include a high-level recipes layer (such as Fernet for authenticated symmetric encryption) as well as lower-level primitives for hashing, key derivation, and asymmetric cryptography.
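A minimal sketch using Fernet (the key is generated in memory purely for illustration; real deployments load keys from a secure key-management service):

from cryptography.fernet import Fernet

# Generate a symmetric key and create a Fernet cipher (illustrative only --
# in practice the key comes from a secure key store and is rotated regularly)
key = Fernet.generate_key()
cipher = Fernet(key)

token = cipher.encrypt(b"patient_id=12345; diagnosis=confidential")
print(token)                  # encrypted, URL-safe token
print(cipher.decrypt(token))  # the original bytes are recovered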
To effectively safeguard sensitive data, organizations should implement the following best
practices:
1. Use Strong Encryption: Encrypt data at rest and in transit using industry-standard
cryptographic algorithms.
2. Enforce Access Controls: Implement role-based access control (RBAC) to limit data
access to authorized personnel.
3. Enable Key Rotation: Regularly rotate encryption keys to minimize risks associated
with compromised credentials.
4. Monitor and Audit Key Usage: Use logging and auditing tools to detect unauthorized
access attempts and security anomalies.
5. Secure Data Transmission: Use Transport Layer Security (TLS) and other secure
protocols for data communication.
6. Implement Multi-Factor Authentication (MFA): Add an extra layer of security for
accessing sensitive data and key management services.
7. Comply with Regulations: Ensure data handling practices align with legal and
industry-specific compliance requirements.
Both AWS KMS and Python Cryptography provide essential security tools for protecting
sensitive data. AWS KMS is a fully managed encryption service ideal for enterprises using
AWS infrastructure, ensuring seamless integration and compliance with security
regulations. On the other hand, Python Cryptography offers greater flexibility for developers
who require custom encryption implementations at the application level. By selecting the
appropriate tools and following best practices, organizations can effectively safeguard their
data assets and maintain compliance with regulatory standards.
Data curation is a critical process in managing and maintaining high-quality datasets for
analysis, decision-making, and artificial intelligence (AI) model training. Traditionally, data
curation involves a combination of manual data cleaning, classification, and validation, which
can be time-consuming and prone to human error. AI and machine learning (ML) have
revolutionized this process by introducing automation that enhances efficiency, accuracy,
and scalability.
AI and ML algorithms can automate several key aspects of data curation, including:
1. Data Cleaning – Machine learning models can identify and correct inconsistencies,
missing values, and anomalies in datasets. Techniques such as outlier detection,
imputation algorithms, and automated data deduplication improve data quality.
Advanced AI-driven pipelines can preprocess data by normalizing, standardizing, and
detecting errors at scale.
2. Data Classification and Labeling – AI-driven classification algorithms can
categorize data into relevant groups based on predefined criteria. Natural Language
Processing (NLP) and deep learning models enable automatic annotation of text,
images, and audio data, reducing the need for manual labeling. This is particularly
useful in industries such as healthcare, finance, and customer service, where vast
amounts of unstructured data require processing.
3. Data Integration – AI-powered systems can merge datasets from different sources,
detecting redundancies and aligning mismatched records. Entity resolution
techniques and knowledge graphs help in linking related data points across
heterogeneous data sources. AI can also facilitate schema matching and automatic
transformation of datasets to fit predefined formats.
4. Metadata Generation – Machine learning algorithms can extract and generate
metadata automatically, ensuring proper documentation and traceability of data. This
improves searchability, usability, and compliance with data governance policies. AI-
based metadata generation enhances interoperability between datasets, making data
more accessible and reusable.
5. Data Validation and Quality Assurance – AI models can continuously monitor data
streams to detect inconsistencies and enforce data quality rules. Automated anomaly
detection helps in identifying and resolving discrepancies before they affect
downstream analytics. AI-driven quality assessment frameworks can implement
adaptive rules that evolve with the dataset's complexity.
6. Automated Data Governance – AI can help enforce governance policies by monitoring data access and ensuring compliance with regulatory standards such as GDPR.
Automating these aspects of curation brings several benefits:
Efficiency – Reduces the time and effort required for manual data processing,
allowing data teams to focus on higher-value tasks.
Accuracy – Minimizes human errors and enhances data consistency, leading to more
reliable insights.
Scalability – Handles large volumes of data across multiple domains, making it ideal
for big data applications.
Cost Reduction – Lowers operational costs associated with manual data curation by
reducing labor-intensive tasks.
Improved Decision-Making – Ensures that high-quality, well-curated data is
available for analytics and AI applications, leading to better strategic decisions.
Real-Time Processing – AI models can curate data in real-time, enhancing
responsiveness in dynamic environments such as financial markets and autonomous
systems.
At the same time, automated curation raises several challenges and considerations:
Bias and Fairness – Machine learning models must be trained on diverse and unbiased datasets to prevent skewed outcomes that could impact decision-making.
Interpretability – Understanding how AI models curate data is crucial for trust and
regulatory compliance. Explainable AI techniques are necessary to provide
transparency.
Data Privacy and Security – Automated curation systems must adhere to data
protection laws and ensure secure handling of sensitive information. Encryption and
access control mechanisms must be implemented.
Continuous Monitoring – AI models require regular updates and monitoring to
maintain data quality over time. Without ongoing oversight, model drift can reduce
effectiveness.
Ethical Concerns – The automation of data curation raises ethical questions about
job displacement and the responsible use of AI in handling personal data.
By leveraging AI and ML, organizations can streamline data curation processes, making them
more robust and efficient. As technology advances, AI-driven data curation will continue to
evolve, shaping the future of data management and analytics while ensuring compliance,
security, and ethical considerations are met.
3.19 Hands-On Exercise: Using Python Pandas for Data Cleaning and Transformation
Data cleaning and transformation are essential steps in preparing raw data for analysis.
Python's Pandas library provides powerful tools to automate these processes efficiently. In
this hands-on exercise, we will explore how to use Pandas to clean, transform, and prepare
datasets for further analysis.
Pandas supports multiple file formats, including CSV, Excel, and JSON. For this exercise, we
will use a sample CSV dataset:
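A small sample dataset can be created and saved as a CSV for the exercise (the file name, columns, and values below are assumptions):

import pandas as pd

# Create a small sample dataset with some deliberate quality problems
data = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena", "John"],
    "age": [25, 32, 32, None, 41],
    "city": ["Pune", "Delhi", "Delhi", "Mumbai", "Chennai"],
    "salary": [50000, 62000, 62000, 58000, 1000000],  # the last value is an outlier
})
data.to_csv("sample_data.csv", index=False)

# Load it back, as we would with any raw CSV file
df = pd.read_csv("sample_data.csv")
print(df)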
The data cleaning steps include handling missing values, removing duplicates, and handling outliers.
The data transformation steps include renaming columns, filtering data, and aggregating data, as sketched below.
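A minimal sketch of these cleaning and transformation steps on the sample dataset above (column names as assumed earlier):

import pandas as pd

df = pd.read_csv("sample_data.csv")

# Cleaning: fill missing values, drop duplicates, and remove extreme outliers
df["age"] = df["age"].fillna(df["age"].median())
df = df.drop_duplicates()
upper = df["salary"].quantile(0.95)
df = df[df["salary"] <= upper]

# Transformation: rename columns, filter rows, and aggregate
df = df.rename(columns={"salary": "annual_salary"})
over_30 = df[df["age"] > 30]
avg_by_city = df.groupby("city")["annual_salary"].mean()

print(over_30)
print(avg_by_city)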
Assessment criteria
References:
Exercise
Multiple Choice Questions
4. Which tool is commonly used in Python for data cleaning and transformation?
a. TensorFlow
b. Pandas
c. Matplotlib
d. Jupyter
9. Which industry is particularly concerned with HIPAA compliance for data sensitivity?
a. Retail
b. Financial
c. Healthcare
d. Education
True/False Questions:
1. Data curation only involves collecting data from various sources. (T/F)
2. In AI and Machine Learning, the quality of data is just as important as the algorithm
used.(T/F)
3. Handling missing, duplicate, and inconsistent data is a part of the data cleaning
process. (T/F)
4. Data transformation is not required if the dataset is already cleaned. (T/F)
5. Pandas is a commonly used Python library for data cleaning and transformation. (T/F)
6. Data curation tools are only available as paid software solutions. (T/F)
7. Legal and ethical considerations are not important when handling sensitive data. (T/F)
8. Cloud-based data curation solutions can improve scalability and accessibility. (T/F)
9. Automating data curation with AI reduces human intervention and increases
efficiency. (T/F)
10. All industries have the same data sensitivity and security requirements. (T/F)
1. The process of collecting, cleaning, organizing, and maintaining data for use in
analysis is called __________.
2. __________ is a Python library widely used for data manipulation and cleaning.
3. The process of converting raw data into a structured format suitable for analysis is
known as __________.
4. In data cleaning, missing values can be filled using techniques such as mean,
median, or __________.
5. A __________ is used to uniquely identify and retrieve records efficiently in a
database.
6. Duplicate records in a dataset can be removed using the __________ method in
Pandas.
7. The __________ industry is governed by HIPAA regulations for data sensitivity and
privacy.
8. __________ data refers to data that does not follow a pre-defined data model or
structure.
9. One of the key benefits of cloud-based data curation solutions is __________, allowing
data to be accessed from anywhere.
10. Ethical and __________ considerations are crucial when handling sensitive or
personal data.
1. Load the Iris dataset using Pandas and display the first 5 rows.
a. Check if there are any missing values in the dataset.
b. Create a new column called sepal_area (sepal_length × sepal_width).
2. Find the average (mean) of petal_length for each flower species.
3. What do you understand by Data Curation? Mention any two areas where it is
useful.
4. Why is Data Cleaning important before analyzing any dataset? Give two simple
examples.
5. What is Data Transformation? How does it help in making data more useful?
6. Mention any two challenges in data collection and explain them briefly.
7. What is the difference between Structured and Unstructured Data? Give one
example of each.
8. List any two tools used in Data Curation and explain how they help in managing
data.
Chapter 4 :
Data Collection & Acquisition Methods
Data collection is the systematic process of gathering, measuring, and recording information
from various sources for analysis, decision-making, and research purposes. The data
collected can take many forms, including numerical, textual, or visual, and it may come from
various mediums such as surveys, experiments, observations, sensors, databases, or online
platforms. Data collection is the first step in the data analysis process and is crucial for
ensuring that the resulting analysis is based on accurate and relevant information.
o Accurate data collection processes help ensure the reliability and validity of
the information used in analyses. Inconsistent or biased data collection can
lead to erroneous conclusions, while well-structured data collection
methodologies help maintain the integrity of the data.
4.1.2 Steps Involved in Data Collection:
Data collection is a structured process that involves several key steps to ensure that the
data gathered is accurate, reliable, and useful for analysis. Below are the primary steps
involved in the data collection process:
Identify which members of the population will be included in the study. This step is crucial when it is not feasible to collect data from the entire population.
Importance: Defining the population and sample ensures that the data collected is
representative of the larger group and can be generalized accurately.
Example: If you're conducting a survey about consumer preferences, you may
choose to sample 500 people from a target demographic rather than surveying the
entire population.
Goal-setting for data collection is a crucial step in ensuring that the data gathered is both
relevant and useful for the purpose at hand. Without clearly defined objectives, the data
collection process can become unfocused and inefficient. Establishing clear and measurable
objectives helps ensure that the data collected directly addresses the research or business
questions, aligns with broader goals, and can be used effectively for analysis and decision-
making.
The first step in goal-setting for data collection is identifying the purpose of the collection.
This involves understanding the problem or question that needs to be addressed. The
purpose could be to explore a particular trend, measure specific behaviors, assess an
outcome, or solve a problem. A well-defined purpose ensures that the data collected is
relevant and aligned with the research or business objectives.
Once the purpose is clear, the next step is to define specific research or business
questions. These questions will form the basis of the data collection process. They help to
clarify what exactly needs to be measured or observed and guide the choice of data
collection methods. For example, if the goal is to measure customer satisfaction, the specific
questions might include, "How satisfied are customers with our product?" or "What aspects
of the product need improvement?" Defining these questions ensures that the data
collected is targeted and directly relevant to the overall goal.
After defining the research questions, it’s essential to establish clear and measurable
goals. Setting measurable goals means that the success of the data collection effort can be
assessed objectively. These goals should quantify what needs to be achieved, such as
"collect responses from at least 200 customers" or "increase customer satisfaction by 10%
in six months." Measurable goals provide a benchmark for evaluating whether the data
collection process has been successful and whether the objectives have been met.
The next step involves determining the scope and boundaries of the data collection. This
includes identifying the population to be studied, the time period for which data will be
collected, and any geographic limits, if applicable. It also involves deciding what specific
data points or variables will be measured. Defining the scope ensures that the data
collected is manageable and relevant. For example, if you're studying employee
satisfaction, the scope might be limited to full-time employees in a particular department
over the past year. This prevents the collection of unnecessary data and helps focus efforts
on the most relevant information.
In addition to the scope, it’s important to consider the resources needed for the data
collection process. This includes budget, time, tools, technology, and personnel.
Understanding what resources are available ensures that the data collection process is
realistic and feasible within the constraints of the project. For instance, conducting an
extensive survey may require specific survey software, while interviewing employees may
require trained staff to conduct and analyze interviews. Knowing the resources available
helps set realistic objectives and avoid over-promising.
Next, choosing the appropriate data collection methods is vital. Based on the research
questions, purpose, and resources, the right methods should be selected. These methods
could include surveys, interviews, observations, or experiments, depending on the type of
data needed. For example, if the goal is to understand customer preferences in-depth, a
combination of surveys and focus groups might be chosen. If the objective is to track a
behavior over time, observational data collection might be more suitable. The chosen
methods should be capable of answering the specific research questions and achieving the
set goals.
Setting a timeline is another crucial step in defining objectives. A clear timeline ensures
that data collection is completed within the required time frame and helps in managing the
project efficiently. Timelines should include milestones for different stages of data
collection, such as when surveys will be distributed, when data will be gathered, and when
analysis will begin. A timeline also ensures that the objectives are achieved within the
constraints of time, which is especially important for projects with tight deadlines.
Finally, it’s important to consider how the data will be used once it’s collected.
Understanding the end purpose of the data—whether it will be used to inform business
decisions, improve a product, or validate a hypothesis—helps ensure that the data
collection process is designed to support those outcomes. The way data will be used can
impact decisions regarding the level of detail needed, the format in which the data should
be collected, and how it will be analyzed.
In summary, goal-setting for data collection involves a systematic approach to defining the
research or business objectives. Clear and measurable goals help guide the entire process,
ensuring that the data collected is relevant, accurate, and useful for achieving the intended
outcomes. Through careful planning of the purpose, scope, resources, methods, and
timeline, the data collection process can be aligned with the overall objectives, leading to
valuable insights and informed decision-making.
Data collection plays a pivotal role across various industries, providing organizations with
the insights needed to make informed decisions and improve outcomes. Different fields
utilize data collection methods tailored to their specific needs, allowing them to enhance
operational efficiency, develop targeted strategies, and achieve their goals. Some of the
most prominent applications of data collection can be seen in areas such as market
research, healthcare, and finance.
In the healthcare sector, data collection is crucial for improving patient care, advancing
medical research, and optimizing operational workflows. Medical professionals collect data
from patient records, diagnostic tests, and clinical trials to monitor patient health, diagnose
conditions, and evaluate treatment efficacy. For example, during clinical trials, researchers
gather data on the effects of new drugs or medical devices to assess their safety and
effectiveness before they are introduced to the market. Healthcare organizations also
collect data from routine procedures, hospital visits, and patient feedback to improve
service delivery and patient satisfaction. This data is not only valuable for direct patient
care but also for long-term public health initiatives, where patterns in diseases and health
behaviors can inform policy decisions and resource allocation.
Despite its importance, data collection faces several common challenges:
1. Bias:
Sampling Bias: Occurs when certain groups are overrepresented or
underrepresented, leading to a skewed view of the population.
Response Bias: Happens when respondents’ answers are influenced by social
desirability, misunderstanding, or fear of judgment.
Impact: Leads to inaccurate or distorted conclusions.
Mitigation: Use randomized sampling methods, ask neutral and clear questions, and
ensure diversity in the sample.
2. Volume:
Large Datasets: In today’s digital age, enormous amounts of data are generated,
especially in fields like healthcare and finance.
Challenges: Managing, storing, and analyzing large volumes of data requires
substantial resources and computational power.
Impact: Information overload can occur if data isn’t properly structured or
analyzed.
Mitigation: Implement data cleaning, machine learning algorithms, and advanced
analytics tools to manage and process big data effectively.
3. Variety:
Different Data Types: Data comes in various forms such as structured (numbers,
dates), unstructured (text, images), and semi-structured (logs, emails).
Challenges: Combining data from different sources with varying formats can be
complex, especially when dealing with unstructured data.
Impact: Difficulty in extracting meaningful insights when data isn’t properly
integrated.
Mitigation: Use data integration tools, data warehouses, and data lakes to handle
and unify diverse datasets.
4. Privacy Concerns:
Sensitive Data: Privacy regulations, especially in sectors like healthcare and
finance, dictate strict handling and protection of personal data.
Challenges: Ensuring compliance with privacy laws while collecting and managing
data.
Mitigation: Adhere to privacy regulations like GDPR and HIPAA, and use encryption
and secure data storage methods.
5. Data Quality Issues:
Incorrect or Incomplete Data: Poor-quality data can lead to misleading insights.
Challenges: Ensuring the accuracy, consistency, and reliability of data collected.
Mitigation: Regular data cleaning, validation, and verification processes to maintain
high-quality data.
6. Cost of Data Collection:
Expenses: Large-scale surveys, experiments, or acquiring data from third-party
sources can be costly.
Challenges: Balancing the cost of data collection with the potential benefits and
ensuring sustainability.
Mitigation: Assess the costs versus expected outcomes, and optimize data collection
methods to be efficient and cost-effective.
Pandas is an open-source library made mainly for working with relational or labeled data easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series. The library is built on top of NumPy and offers high performance and productivity for users.
Data analysis requires a lot of processing, such as restructuring, cleaning, or merging. Several tools are available for fast data processing, such as NumPy, SciPy, Cython, and Pandas, but Pandas is preferred here because it is fast, simple, and more expressive than the alternatives.
Because Pandas is built on top of the NumPy package, NumPy is required for Pandas to operate.
History:
Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He convinced AQR to allow him to open-source the library, and another AQR employee, Chang She, joined as the second major contributor in 2012. Many versions of pandas have been released over time; at the time of writing, the latest version was 1.4.4.
A Pandas Series is one-dimensional indexed data that can hold datatypes such as integer, string, boolean, float, and Python objects. A Pandas Series can hold only one data type at a time. The axis labels of the data are called the index of the series. The labels need not be unique but must be of a hashable type. The index of a series can consist of integers, strings, or even time-series data. In general, a Pandas Series is like a single column of an Excel sheet, with the row labels acting as the index of the series.
The parameters for the constructor of a Pandas Series are detailed below:
data : array-like, Iterable, dict, or scalar value – Contains the data stored in the Series. Changed in version 0.23.0: if data is a dict, argument order is maintained for Python 3.6 and later.
index : array-like or Index (1d) – Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If both a dict and an index sequence are used, the index will override the keys found in the dict.
dtype : str, numpy.dtype, or ExtensionDtype, optional – Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.
copy : bool, default False – Copy input data.
Table 4: Pandas Series Parameters
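A short sketch of creating a Series from a list and from a dict (the values are illustrative):

import pandas as pd

# From a list: a default RangeIndex (0, 1, 2, ...) is assigned
marks = pd.Series([87, 91, 78])
print(marks)

# From a dict: the keys become the index labels
prices = pd.Series({"pen": 10, "book": 150, "bag": 700})
print(prices)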
A Pandas DataFrame is a two-dimensional, labelled data structure. It can be constructed from lists, dictionaries, Series, NumPy ndarrays, or another DataFrame.
The parameters for the constructor of a Pandas DataFrame are detailed below:
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame – A dict can contain Series, arrays, constants, or list-like objects. Changed in version 0.23.0: if data is a dict, column order follows insertion order for Python 3.6 and later. Changed in version 0.25.0: if data is a list of dicts, column order follows insertion order for Python 3.6 and later.
index : Index or array-like – Index to use for the resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.
columns : Index or array-like – Column labels to use for the resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
dtype : default None – Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool, default False – Copy data from inputs. Only affects DataFrame / 2d ndarray input.
Table 5: Parameters for Pandas DataFrame
You can create an empty Pandas DataFrame using pandas.DataFrame(), then supply a list of column names and append rows to it.
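A minimal sketch of this approach (the column names and row values are illustrative):

import pandas as pd

# Create an empty DataFrame with named columns, then append rows by label
df = pd.DataFrame(columns=["name", "marks"])
df.loc[0] = ["Asha", 87]
df.loc[1] = ["Ravi", 91]
print(df)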
A Pandas DataFrame can also be created from a 2-dimensional NumPy array using the following code:
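A minimal sketch (the array values and column names are illustrative):

import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# Wrap the 2-D array in a DataFrame, supplying column labels
df = pd.DataFrame(arr, columns=["a", "b", "c"])
print(df)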
Indexing in Pandas:
Indexing in pandas means simply selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some of
the rows and all of the columns, or some of each of the rows and columns. Indexing can
also be known as Subset Selection.
Indexing with .loc[]: this indexer selects data by the labels of the rows and columns. The df.loc indexer selects data in a different way from the plain indexing operator: it can select subsets of rows or columns, and it can also select subsets of rows and columns simultaneously.
Selecting a single row
To select a single row using .loc[], we pass a single row label to it.
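A minimal sketch on a small hypothetical DataFrame of student marks indexed by name:

import pandas as pd

df = pd.DataFrame(
    {"maths": [87, 91, 78], "science": [92, 85, 80], "english": [75, 88, 93]},
    index=["Asha", "Ravi", "Meena"],
)

# Each single-label lookup returns a Series
first = df.loc["Asha"]
second = df.loc["Ravi"]
print(first)
print(second)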
As shown in the output, two Series are returned, since a single row label was passed in each case.
To select multiple rows, we put all the row labels in a list and pass that list to .loc[] (see the sketch below).
A list of row labels can also be combined with a list of column labels to select, for example, two rows and three columns at once, and a colon in the row position selects all rows for some chosen columns.
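Continuing the same hypothetical student DataFrame:

import pandas as pd

df = pd.DataFrame(
    {"maths": [87, 91, 78], "science": [92, 85, 80], "english": [75, 88, 93]},
    index=["Asha", "Ravi", "Meena"],
)

# Multiple rows: pass a list of row labels
print(df.loc[["Asha", "Meena"]])

# Two rows and three columns at once
print(df.loc[["Asha", "Ravi"], ["maths", "science", "english"]])

# All rows, some columns
print(df.loc[:, ["maths", "english"]])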
Indexing with .iloc[]: this indexer allows us to retrieve rows and columns by position. To do that, we specify the positions of the rows and the positions of the columns we want. The df.iloc indexer is very similar to df.loc but uses only integer locations to make its selections.
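A short sketch, reusing the same hypothetical student DataFrame:

import pandas as pd

df = pd.DataFrame(
    {"maths": [87, 91, 78], "science": [92, 85, 80], "english": [75, 88, 93]},
    index=["Asha", "Ravi", "Meena"],
)

# Select the first row by position (returns a Series)
print(df.iloc[0])

# Select the first two rows and the first two columns by position
print(df.iloc[0:2, 0:2])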
Missing data can occur when no information is provided for one or more items, or for a whole unit. Missing data is a very common problem in real-life scenarios, and it is also referred to as NA (Not Available) values in Pandas. Many datasets simply arrive with missing data, either because it existed but was not collected or because it never existed. For example, some users being surveyed may choose not to share their income, while others may choose not to share their address; in this way, values go missing from the dataset.
In Pandas, missing data is represented by two values:
None: a Python singleton object that is often used for missing data in Python code.
NaN: an acronym for Not a Number, a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To support this convention, there are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
isnull()
notnull()
dropna()
fillna()
replace()
interpolate()
To check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both functions help check whether a value is NaN or not, and they can also be used on a Pandas Series to find null values in a series.
Checking missing values using isnull()
To check null values in a Pandas DataFrame, we use the isnull() function. It returns a DataFrame of Boolean values which are True for NaN values.
Code #1:
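A minimal sketch with a small dictionary of scores containing missing entries (the values are illustrative):

import numpy as np
import pandas as pd

# A small dictionary with some missing (NaN) entries
scores = {
    "first": [100, 90, np.nan, 95],
    "second": [30, 45, 56, np.nan],
    "third": [np.nan, 40, 80, 98],
}
df = pd.DataFrame(scores)

# isnull() returns a Boolean DataFrame: True wherever a value is missing
print(df.isnull())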
Before manipulating a DataFrame with Pandas, we have to understand what data manipulation is. Real-world data is often messy and unordered, so we perform certain operations to make it understandable according to our requirements; this process of converting unordered data into meaningful information is called data manipulation.
Pandas is an open-source library used for everything from data manipulation to data analysis. It is a powerful, flexible, and easy-to-use tool that can be imported using import pandas as pd. Pandas deals essentially with data in 1-D and 2-D form, although it handles the two differently: in Pandas, 1-D data is represented as a Series, and a DataFrame is simply a 2-D table.
We can also view the DataFrame using the head() function, which takes an argument n, the number of rows to be displayed.
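A short sketch (student_records.csv is a hypothetical file):

import pandas as pd

df = pd.read_csv("student_records.csv")

# head(n) returns the first n rows (here, the first 3)
print(df.head(3))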
We can count the rows and columns in a DataFrame using the shape attribute, which returns the number of rows and columns enclosed in a tuple.
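For example, continuing with the same hypothetical file:

import pandas as pd

df = pd.read_csv("student_records.csv")
print(df.shape)  # prints (number_of_rows, number_of_columns)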
We can also create another column in the DataFrame. Here we create a column named percentage, which calculates each student's percentage score using the aggregate function sum().
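A minimal sketch with assumed subject columns:

import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "maths": [87, 91, 78],
    "science": [92, 85, 80],
    "english": [75, 88, 93],
})

# Sum the subject marks across each row and convert to a percentage out of 300
subjects = ["maths", "science", "english"]
df["percentage"] = df[subjects].sum(axis=1) / 300 * 100
print(df)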
Grouping is done with the groupby() method, which splits the DataFrame into groups based on one or more keys so that an aggregation can be applied to each group. Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, ...)
Example #1: Use groupby() function to group the data based on the “Team”.
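A minimal sketch on a small hypothetical table of players:

import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Kings", "Riders", "Kings"],
    "Points": [876, 741, 812, 788],
})

# Group rows by the "Team" column and inspect the groups
grouped = df.groupby("Team")
print(grouped.groups)           # mapping of team name -> row labels
print(grouped["Points"].sum())  # total points per team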
Example #2: Use groupby() function to form groups based on more than one category (i.e.
Use more than one column to perform the splitting).
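A sketch of grouping by two columns (the hypothetical Year column is added for illustration):

import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Kings", "Riders", "Kings"],
    "Year": [2014, 2014, 2015, 2015],
    "Points": [876, 741, 812, 788],
})

# Use more than one column to perform the splitting
print(df.groupby(["Team", "Year"])["Points"].sum())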
groupby() is a very powerful function with a lot of variations. It makes the task of splitting
the dataframe over some criteria really easy and efficient.
4.2.6 Filtering
Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric python packages. Pandas is one of those packages and makes
importing and analyzing data much easier.
Filtering is done with the filter() method, which subsets the rows or columns of a DataFrame according to the specified labels. Syntax: DataFrame.filter(items=None, like=None, regex=None, axis=None)
Parameters:
items: list of labels to keep.
like: keep labels that contain this string.
regex: keep labels that match this regular expression.
axis: the axis to filter on. By default, this is the info axis: 'index' for Series, 'columns' for DataFrame.
The items, like, and regex parameters are enforced to be mutually exclusive. axis defaults to the info axis that is used when indexing with [].
Example #1: Use filter() function to filter out any three columns of the dataframe.
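A minimal sketch on a small hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ravi"],
    "Age": [25, 32],
    "City": ["Pune", "Delhi"],
    "Salary": [50000, 62000],
})

# Keep only the three listed columns
print(df.filter(items=["Name", "Age", "City"]))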
Example #2: Use the filter() function to subset all columns in a dataframe that have the letter 'a' or 'A' in their names.
Note: the filter() function also accepts a regular expression as one of its parameters.
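Continuing the same hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Ravi"],
    "Age": [25, 32],
    "City": ["Pune", "Delhi"],
    "Salary": [50000, 62000],
})

# Keep columns whose names contain a lowercase or uppercase letter 'a'
print(df.filter(regex="[aA]"))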
The regular expression '[aA]' matches all column names that contain an 'a' or an 'A'.
4.2.7 Slicing
With the help of Pandas, we can perform many operations on a data set, such as slicing, indexing, manipulating, and cleaning a data frame.
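A minimal sketch of slicing rows and then columns on a hypothetical student DataFrame:

import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "John"],
    "maths": [87, 91, 78, 66],
    "science": [92, 85, 80, 70],
})

# Slice rows: positions 1 up to (but not including) 3
print(df[1:3])

# Slice columns: keep only the two subject columns
print(df.loc[:, "maths":"science"])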
In the examples above, we first sliced rows and then sliced columns from the data frame.
4.2.8 Sorting
DataFrame values can be sorted with the sort_values() method. Parameters:
By: str or list of str
Name or list of names to sort by.
if axis is 0 or ‘index’ then by may contain index levels and/or column labels.
if axis is 1 or ‘columns’ then by may contain column levels and/or index labels.
Axis: {0 or ‘index’, 1 or ‘columns’}, default 0
Axis to be sorted.
Ascending: bool or list of bool, default True
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools,
must match the length of the by.
Inplace: bool, default False
If True, perform operation in-place.
Kind: {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}, default ‘quicksort’
Choice of sorting algorithm. See also numpy.sort() for more
information. mergesort and stable are the only stable algorithms. For Data Frames, this
option is only applied when sorting on a single column or label.
na_position: {‘first’, ‘last’}, default ‘last’
Puts NaNs at the beginning if first; last puts NaNs at the end.
ignore_index: bool, default False
If True, the resulting axis will be labelled 0, 1, …, n - 1.
Key: callable, optional
Apply the key function to the values before sorting. This is similar to the key argument in
the built-in sorted() function, with the notable difference that this key function should
be vectorised. It should expect a Series and return a Series with the same shape as the
input. It will be applied to each column in by independently.
Table 9: Sorting parameters
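A minimal sketch (hypothetical marks table):

import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "maths": [87, 91, 78],
})

# Sort by a single column, highest marks first
print(df.sort_values(by="maths", ascending=False))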
4.2.9 Ufunc
Universal functions in Numpy are simple mathematical functions. It is just a term that
we gave to mathematical functions in the Numpy library. Numpy provides various
universal functions that cover a wide variety of operations.
These functions include standard trigonometric functions, functions for arithmetic operations, functions for handling complex numbers, statistical functions, and more. Universal functions have the following characteristics:
They operate on ndarray (N-dimensional array), i.e. NumPy's array class.
They perform fast element-wise array operations.
They support features such as array broadcasting and type casting.
In NumPy, universal functions are objects that belong to the numpy.ufunc class.
Python functions can also be turned into universal functions using the frompyfunc() library function.
Some ufuncs are called automatically when the corresponding arithmetic operator is used on arrays. For example, when two arrays are added element-wise using the '+' operator, np.add() is called internally.
import numpy as np
A, B = np.array([1, 2, 3]), np.array([4, 5, 6])
print(A); print(B)
print(np.add(A, B))  # called internally when A + B is evaluated
Web scraping is a technique used to extract data from websites. It involves using a script or
software to automatically gather information from the web by accessing and extracting
data from web pages in a structured format. Web scraping is valuable for collecting large
amounts of information from the internet, which can then be used for a variety of purposes,
including data analysis, research, and business intelligence.
1. Sending a request: The scraper sends an HTTP request to the target website's server for a specific page.
2. Retrieving the web page: The server responds by sending back the content of the web page, often in HTML format.
3. Parsing the HTML content: The web scraper parses the HTML content of the web
page, searching for the relevant data within the tags (e.g., <div>, <table>, <span>)
and attributes.
4. Extracting the data: After parsing the HTML, the scraper extracts the required data
(e.g., text, images, links, etc.).
5. Storing the data: Finally, the extracted data is stored in a structured format such as
CSV, JSON, or a database, making it easier to analyze or use in other applications.
Several tools and libraries are available for web scraping, each with its own
strengths:
BeautifulSoup: A Python library used for parsing HTML and XML documents. It
helps extract data from web pages by navigating the HTML structure and selecting
elements.
Scrapy: An open-source web crawling framework for Python that allows you to
build complex web scrapers. It is designed for large-scale web scraping tasks and
includes built-in support for handling various data extraction and storage needs.
Selenium: While often used for automating browsers for testing purposes, Selenium
can also be used for web scraping, particularly when the website relies on JavaScript
for content rendering.
Puppeteer: A Node.js library that provides a high-level API to control Chrome or
Chromium, enabling the scraping of dynamic content generated by JavaScript.
Legal and Ethical Considerations
While web scraping is a powerful tool, there are several important legal and ethical
considerations to keep in mind:
Terms of Service: Many websites have terms of service that explicitly prohibit
scraping, and violating these terms could result in legal action. It's essential to
review the website’s terms before scraping data.
Copyright Issues: Some data on websites may be copyrighted, and scraping such
data without permission could violate intellectual property rights.
Rate Limiting and Blocking: Websites may impose rate limits or block IP
addresses that engage in excessive scraping. Ethical scraping involves respecting
these limitations to avoid disrupting a website's functionality.
Data Privacy: If scraping involves personal data, such as user details or sensitive
information, it’s crucial to comply with data privacy regulations, such as the General
Data Protection Regulation (GDPR).
Challenges of Web Scraping
While web scraping offers many advantages, there are several challenges that users
may encounter:
Dynamic Content: Many websites now use JavaScript to load content dynamically,
which can make it difficult for traditional web scrapers to extract data. This may
require using tools like Selenium or Puppeteer, which can interact with JavaScript-
rendered pages.
Website Structure Changes: Websites often update their design or structure,
which can break existing scraping scripts. Maintaining scrapers can be time-
consuming and may require frequent adjustments.
Anti-Scraping Mechanisms: Websites may deploy anti-scraping techniques such as
CAPTCHAs, bot detection systems, or IP blocking. Overcoming these mechanisms
may require advanced techniques or services designed to handle them, such as
rotating IP addresses or using CAPTCHA-solving services.
Example: Webscrapping.glitch.me
1. BeautifulSoup
BeautifulSoup is one of the most widely used Python libraries for web scraping,
particularly for beginners due to its simplicity and ease of use. It allows you to parse HTML
and XML documents and extract data from them by navigating the document's structure.
Key Features:
o Easy to use for parsing HTML content.
o Supports different parsers such as lxml and html5lib, which allows for
flexible HTML parsing.
o Can be used in conjunction with requests to fetch web pages and extract
specific data points like tags, attributes, and text.
Use Cases: extracting titles, links, tables, and other elements from static HTML pages; smaller scraping tasks and learning projects.
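A minimal sketch using requests together with BeautifulSoup (the URL is the public quotes.toscrape.com practice site):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML
response = requests.get("http://quotes.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every quote on the page
for span in soup.find_all("span", class_="text"):
    print(span.get_text())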
Pros:
Simple and intuitive API.
Ideal for smaller scraping tasks and learning.
Cons:
Can be slow for scraping large datasets or handling JavaScript-heavy sites.
3. Scrapy
Scrapy is an open-source and powerful Python framework used for building web scrapers
and web crawlers. It is more advanced than BeautifulSoup and is designed for handling
larger, more complex scraping tasks, such as crawling multiple pages, managing requests,
and storing scraped data in various formats.
Key Features:
o Handles both simple and complex web scraping tasks.
o Built-in support for handling multiple requests concurrently.
o Handles various data storage options like JSON, CSV, XML, and databases.
o Can crawl websites and follow links to scrape data across multiple pages.
o Offers built-in support for handling AJAX, cookies, and sessions.
Use Cases:
o Large-scale web scraping projects, such as scraping entire websites or
specific sections of a website.
o Crawling multiple pages or following pagination to gather extensive datasets.
Example:
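A minimal sketch of a Scrapy spider (it crawls the public quotes.toscrape.com practice site; the file and item fields are illustrative):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

The spider can be run and its items saved with, for example, scrapy runspider quotes_spider.py -o quotes.json.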
Pros:
Highly efficient for large scraping projects.
Excellent support for concurrent requests and handling multiple pages.
Cons:
Steeper learning curve compared to simpler tools like BeautifulSoup.
Requires more setup and configuration, making it less beginner-friendly.
Web scraping is a powerful technique for extracting valuable data from websites, but it
raises important ethical considerations. As with any data collection method, it’s essential to
approach web scraping with respect for privacy, legality, and the integrity of the websites
being scraped. The following ethical concerns are crucial to keep in mind when engaging in
web scraping:
One of the first ethical concerns in web scraping is the terms of service (TOS) or terms of
use of the website being scraped. Many websites have explicit clauses that prohibit
automated scraping or bots. Ignoring these terms could lead to legal consequences,
including fines or being permanently banned from the website.
Ethical Approach: Always check a website's TOS before scraping. If scraping is
prohibited, consider contacting the website owner for permission or explore
alternative ways to access the data, such as using an API if one is provided.
2. Impact on Website Performance
Web scraping can place a significant load on a website’s servers, especially when scraping
large volumes of data in a short amount of time. If done aggressively, it can slow down the
site for other users, or even cause crashes, leading to downtime or a degraded user
experience.
Ethical Approach: To minimize the impact, scrape data at a reasonable rate by
implementing rate limiting, which involves adding delays between requests.
Additionally, avoid scraping during peak traffic times. Some websites may provide a
robots.txt file, which indicates which pages or sections of the site are open to
scraping. Respect these guidelines to avoid overloading servers.
3. Respecting Privacy and Data Protection Laws
When scraping data, it’s important to be aware of privacy laws and data protection
regulations, such as the General Data Protection Regulation (GDPR) in the European
Union or the California Consumer Privacy Act (CCPA). If scraping involves collecting
personal data, it’s essential to comply with these laws to avoid violating individuals'
privacy rights.
Ethical Approach: Avoid scraping personally identifiable information (PII) unless
you have explicit permission from users or the data is publicly available. Ensure that
any data collected is stored securely and used in compliance with data protection
laws.
4. Copyright and Intellectual Property
Web scraping can sometimes infringe upon the intellectual property (IP) rights of
content creators. Websites often host copyrighted material such as text, images, and other
media. Scraping and using this material without permission could violate copyright laws.
Ethical Approach: Ensure that the data you scrape is either publicly available or
falls under fair use. If you're scraping content for commercial purposes, it's critical
to obtain permission from the content owner or rely on publicly available data that
does not violate copyright.
5. Avoiding Data Manipulation and Misuse
Once data is scraped from a website, how it’s used can present ethical challenges. Scraped
data can be misrepresented, manipulated, or used to mislead others, especially when it
comes to aggregating data from multiple sources to create a misleading narrative.
Ethical Approach: Use the data responsibly, ensuring that any insights or analyses
drawn from it are accurate and truthful. If publishing or sharing the data, be
transparent about its source and how it was collected. Avoid using the data in ways
that could mislead or harm others, such as manipulating product reviews or social
media content.
What is an API?
An API is a set of rules, protocols, and tools that allow one software application to interact
with another. APIs define the methods and data formats that allow different systems to
work together, facilitating communication between a server and a client (such as a browser
or mobile app).
When you use an API to access data, you make a request to a server, and the server sends
back the requested data, usually in formats like JSON or XML.
Common HTTP request methods include:
GET: Retrieves data from the server without modifying it.
POST: Sends data to the server, often used for creating new resources or submitting
data.
PUT: Updates existing data on the server.
DELETE: Removes a specific resource or data from the server.
How to Access Data from an API
To access data from an API, you typically need to follow these steps:
1. Find the API Endpoint: The endpoint is the URL that specifies where the data is
located. For example, a weather API might have an endpoint like
https://api.weather.com/current.
2. Obtain an API Key: Many APIs require an API key, a unique identifier for your
requests. API keys help the service track usage, limit requests, and ensure that only
authorized users can access the data.
3. Send an HTTP Request: Once you have the endpoint and API key, you send an
HTTP request using tools like requests in Python or directly through browser-based
tools like Postman. The request will usually include parameters such as the type of
data you want, filters, or search queries.
4. Process the Response: The server responds with data, typically in JSON or XML
format. You can then parse this data and use it for your application or analysis.
Example: Using Python to Access an API
Let’s consider an example where we want to access weather data using an API. We'll use
the requests library in Python to make the API call.
Define the API endpoint and key (Replace with actual API endpoint and key):
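A minimal sketch with the requests library is shown below. The endpoint matches the illustrative URL above, while the apikey and city parameter names are placeholder assumptions that would be replaced with the names documented by the actual provider.

import requests

# Illustrative endpoint and key -- replace with a real provider's values
API_ENDPOINT = "https://api.weather.com/current"
API_KEY = "YOUR_API_KEY"

# Parameter names are assumptions; real APIs document their own query parameters
params = {"city": "Mumbai", "apikey": API_KEY}

response = requests.get(API_ENDPOINT, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()  # parse the JSON payload into a Python dictionary
    print(data)
else:
    print("Request failed with status code:", response.status_code)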
Common Types of APIs
1. Social Media APIs: Many social media platforms (such as Twitter, Facebook,
Instagram) provide APIs that allow users to retrieve posts, followers, and
engagement metrics programmatically.
2. Weather APIs: Weather services offer APIs to get real-time weather conditions,
forecasts, and historical data.
3. Payment APIs: Services like Stripe, PayPal, and Square offer APIs to handle online
payments, subscriptions, and transaction data.
4. E-commerce APIs: E-commerce platforms such as Shopify and Amazon provide
APIs to retrieve product listings, inventory status, and pricing information.
5. Geolocation APIs: APIs such as Google Maps and OpenCage provide geolocation
and mapping services, allowing developers to access maps, coordinates, and address
data.
Benefits of Using APIs
1. Legality: APIs are provided by companies to give developers controlled access to
their data, making them a legal and authorized method for retrieving information,
unlike web scraping.
2. Data Quality: Data retrieved from APIs is typically clean, structured, and up-to-date,
unlike web scraping, where raw data may require significant processing.
3. Efficiency: APIs allow you to get only the data you need, reducing the overhead of
scraping entire web pages.
4. Reliability: APIs are designed to be stable, with documented endpoints and reliable
data delivery mechanisms, which ensures more consistent results than scraping.
Challenges with API Usage
While APIs provide many advantages, there are a few challenges associated with
their use:
1. Rate Limiting: Many APIs limit the number of requests you can make in a given
time period. If you exceed this limit, you may face delays or temporary access
restrictions.
2. Authentication: Some APIs require complex authentication mechanisms, such as
OAuth or API keys, which may need to be refreshed periodically.
3. Data Restrictions: Some APIs restrict the type or amount of data you can access,
limiting your ability to gather comprehensive data for analysis.
4. Dependency on External Services: If the API provider changes its data structure,
introduces new rate limits, or discontinues the service, your access to the data may
be disrupted.
Best Practices for API Usage
1. Respect Rate Limits: Always check the API documentation for rate limits and make
sure to adhere to them to avoid being blocked.
2. Use Authentication Securely: Keep API keys and authentication tokens secure.
Never expose them in your code or public repositories.
3. Monitor API Usage: Track how often you access an API to ensure you stay within
usage limits and avoid unnecessary requests.
4. Error Handling: Always include error handling in your code to gracefully handle
API downtimes or unexpected responses.
5. Check API Documentation: Before using an API, carefully read its documentation
to understand the request methods, endpoints, and available data.
Scalability: REST APIs are scalable because they can handle large numbers of
requests and are stateless, meaning the server doesn’t need to store information
about previous interactions.
Advantages of REST:
Simplicity: RESTful APIs are straightforward to understand and implement.
Flexibility: REST allows developers to interact with any type of resource and
supports multiple formats, making it versatile for different kinds of applications.
Performance: Due to its lightweight nature, REST can perform faster than other
protocols, especially when using JSON.
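For example, a client might retrieve a single user resource with a GET request such as the sketch below; the domain api.example.com is an illustrative placeholder.

import requests

# Request the user resource with ID 12345 from an illustrative REST endpoint
response = requests.get("https://api.example.com/users/12345", timeout=10)
user = response.json()  # e.g. {"id": 12345, "name": ...}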
This request would fetch the data for the user with ID 12345 from the server.
4.4 Data Quality Issues and Techniques for Cleaning and Transforming Data
Data quality is a critical factor for the success of data-driven decision-making and
analytics. Poor data quality can lead to inaccurate insights, incorrect predictions, and
potentially costly mistakes. As organizations collect vast amounts of data from various
sources, ensuring that data is clean, accurate, and in a usable format is essential. This
involves identifying and resolving data quality issues through cleaning and
transformation techniques.
1. Missing Data:
o Definition: Missing data occurs when certain values in a dataset are absent,
leading to gaps in the information. This issue can arise due to errors during
data entry, system glitches, or data corruption.
o Impact: Missing data can distort statistical analyses, lead to inaccurate
conclusions, and affect machine learning model training.
2. Inconsistent Data:
o Definition: Inconsistencies arise when data is recorded in different formats,
units, or conventions, leading to discrepancies.
o Examples: A date field may contain dates in different formats (e.g.,
MM/DD/YYYY vs. DD/MM/YYYY), or a column for country names may
contain abbreviations (USA, U.S., United States).
o Impact: Inconsistent data prevents accurate comparisons, analysis, and
integration with other datasets.
3. Duplicate Data:
o Definition: Duplicate data occurs when identical records are repeated within
a dataset.
o Impact: Duplicates can artificially inflate data counts, distort analyses, and
lead to incorrect conclusions in reports and models.
4. Outliers:
o Definition: Outliers are extreme values that differ significantly from the rest
of the data. They can result from errors in data collection or from rare but
legitimate occurrences.
o Impact: Outliers can skew analyses, create misleading trends, and negatively
affect machine learning model accuracy, especially if the model is sensitive to
extreme values.
5. Data Entry Errors:
o Definition: Data entry errors occur when incorrect or invalid values are
input into a dataset. Common examples include typographical errors,
incorrect formatting, or misclassification of data.
o Impact: These errors can distort analysis results and lead to incorrect
insights.
4.4.1 Common Data Quality Issues
Data quality issues can significantly impact the effectiveness of data analysis and decision-
making. It's crucial to recognize the common types of data quality issues to address them
appropriately. Below are some of the most prevalent data quality issues:
i. Missing Data
Definition: Missing data refers to the absence of values in one or more fields of a dataset.
This issue can occur in both structured and unstructured data due to various reasons such
as data corruption, errors in data entry, or unavailability of data at the time of collection.
Common Causes:
Data entry errors
Sensor or system failures
Incomplete forms or surveys
Data not collected or recorded in certain instances
Impact:
Missing data can lead to biased analyses and inaccurate insights.
Statistical methods and machine learning models may produce unreliable
predictions or conclusions when handling incomplete datasets.
Incomplete datasets may result in reduced sample sizes, affecting the validity of
analyses.
Solutions:
Imputation: Replace missing values with estimated values using statistical methods
like mean, median, or mode imputation or predictive models.
Deletion: Remove rows or columns with missing data if the missing portion is small
and doesn't significantly impact the analysis.
Forward/Backward Filling: In time series data, use previous or subsequent data
points to fill missing values.
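All three approaches can be sketched with pandas; the DataFrame and column names below are illustrative.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [42000, 55000, np.nan, 61000, 58000],
})

# Imputation: replace missing ages with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Deletion: drop any rows that still contain missing values
df_clean = df.dropna()

# Forward filling: carry the previous observation forward (useful for time series)
df_filled = df.ffill()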
ii. Duplicate Data
Definition: Duplicate data refers to repeated records within a dataset, where the same
information appears more than once. This issue can occur due to multiple entries of the
same record, errors in data merging, or data imports from multiple sources.
Common Causes:
Manual data entry errors
Merging datasets from different sources without proper checks
Lack of primary keys or identifiers for uniqueness
Impact:
Duplicates can artificially inflate counts and distort analytical results, leading to
skewed reports or predictions.
Repeated records can create redundancy, increasing storage requirements and
processing time.
Incorrect conclusions or decisions may be drawn if duplicate records are treated as
separate, unique entities.
Solutions:
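Deduplication is typically done programmatically; the pandas sketch below (with an illustrative DataFrame) removes exact duplicates and, where a key column defines uniqueness, duplicates based on that key.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
})

# Remove rows that are identical across all columns
df_exact = df.drop_duplicates()

# Remove duplicates based on a key column, keeping the first occurrence
df_by_key = df.drop_duplicates(subset="customer_id", keep="first")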
iii. Inconsistent Data
Definition: Inconsistent data occurs when there are discrepancies in how information is
recorded across different sources or within a dataset. This could be due to variations in
formatting, units of measurement, or spelling.
Common Causes:
Different systems or departments using varied formats for the same data.
Manual data entry errors leading to variations in data representation.
Lack of standardized data collection processes.
Examples:
Date formats: Dates might be recorded as MM/DD/YYYY in one column and
DD/MM/YYYY in another.
Spelling differences: A product name might appear as “Apple” in one instance and
“apple” in another, or “USA” vs. “United States.”
Unit discrepancies: One dataset might use pounds while another uses kilograms
for weight, creating inconsistencies.
Impact:
Inconsistent data prevents accurate data integration and analysis, making it difficult
to compare or aggregate information.
It may lead to errors in reporting, as different formats or values are interpreted
differently.
Inconsistent data complicates decision-making and reduces the trustworthiness of
insights derived from the data.
Solutions:
Standardization: Implement data standardization processes, such as converting all
date fields to a single format (e.g., YYYY-MM-DD) and ensuring consistent use of
units.
Data Mapping: Use automated tools to map different formats or values to a
standard convention across all datasets.
Data Validation: Apply consistency checks to ensure that data entries adhere to
predefined rules (e.g., standard country codes or consistent abbreviations).
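A short pandas sketch of standardization and data mapping, using illustrative country and date columns:

import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "U.S.", "United States", "India"],
    "join_date": ["01/31/2024", "02/15/2024", "03/20/2024", "04/01/2024"],
})

# Data mapping: collapse country variants to one canonical value
df["country"] = df["country"].replace({"U.S.": "USA", "United States": "USA"})

# Standardization: convert MM/DD/YYYY strings to the YYYY-MM-DD format
df["join_date"] = pd.to_datetime(df["join_date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")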
4.4.2 Outliers
Outliers are data points that significantly differ from other observations in a dataset. These
values are unusually high or low compared to the rest of the data and can skew analysis
and statistical results if not handled properly. Outliers can arise from various sources, such
as errors in data collection, natural variations, or rare but valid occurrences.
Common Causes of Outliers:
1. Data Entry Errors: Mistakes made during manual data entry, such as typing errors
or incorrect values, can result in outliers.
2. Measurement Errors: Faulty equipment, malfunctions in sensors, or incorrect
readings can produce outliers.
3. Sampling Issues: Sometimes, outliers may arise due to issues with the sampling
method, such as including a non-representative sample.
4. Rare Events or True Variations: Outliers might represent rare but legitimate
events or natural variations in the data, such as an extremely high income or an
unusually low temperature.
5. Data Integration: Merging datasets from different sources might introduce
discrepancies that result in outliers.
Impact of Outliers:
Skewed Results: Outliers can distort statistical measures like the mean, leading to
incorrect analyses and predictions. For example, a few extremely high values can
pull the average up, making it unrepresentative of the general trend.
Influencing Machine Learning Models: Outliers can disproportionately influence
models, especially those that rely on distances (e.g., k-nearest neighbors) or
regression models, leading to biased predictions.
Distorting Visualizations: In graphs and charts, outliers can create misleading
visualizations, making it difficult to identify trends or patterns in the data.
Inaccurate Decision-Making: If outliers are not handled properly, they can lead to
wrong conclusions and impact decisions based on faulty insights.
Techniques for Handling Outliers:
1. Identification of Outliers:
o Visual Inspection: Use box plots, scatter plots, or histograms to visually
detect outliers in the data. A box plot shows data spread and identifies values
outside of the "whiskers," which are potential outliers.
o Statistical Methods:
Z-Score: The Z-score measures how far a data point is from the mean,
in terms of standard deviations. A Z-score greater than 3 or less than -
3 is typically considered an outlier.
IQR (Interquartile Range): Outliers can also be identified using the
IQR method. Data points outside the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]
are considered outliers, where Q1 and Q3 are the first and third
quartiles, respectively (see the sketch after this list).
2. Handling Outliers:
o Transformation: Apply mathematical transformations (such as logarithmic
or square root transformations) to reduce the influence of outliers. This is
particularly effective when the data follows an exponential distribution.
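Both identification methods above can be sketched in a few lines of pandas; the Series below is illustrative.

import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95, 11, 14])

# Z-score method: flag points more than 3 standard deviations from the mean
# (on very small samples the Z-score rarely exceeds 3, so this works best on larger data)
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]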
Data quality is crucial for the effectiveness of AI and machine learning models. Poor
data quality can significantly impact model performance, leading to inaccurate
predictions, biased outcomes, and inefficient learning. Here’s how data quality
influences AI/ML models:
2. Model Performance
Overfitting and Underfitting: Noisy or outlier data can lead to overfitting, where
the model fits the training data too closely but struggles with new data. Conversely,
missing or insufficient data can lead to underfitting, where the model fails to capture
patterns in the data.
Training Time: Low-quality data increases the time it takes to train a model, as it
may require additional cleaning or preprocessing.
Upon inspecting the dataset, the following data quality issues were identified:
1. Missing Data: Several columns, including customer age, usage frequency, and
customer service interactions, contain missing values for a large number of records.
o For example, 15% of records have missing values for the "Age" column, and
10% lack data in the "Monthly Spend" column.
2. Duplicate Data: Some customer records are duplicated, where the same customer
is listed multiple times with slightly different details. This is particularly evident in
the "Subscription Plan" and "Usage Data" columns.
o Multiple entries for customers with the same "Customer ID" but different
usage statistics can distort the analysis and model training.
5. Bias in Data: Upon examining the customer demographics, it is found that the
dataset is skewed towards a specific geographic region and customer age group,
with overrepresentation of customers between 30-40 years old. This can lead to
biased predictions when the model encounters data from underrepresented groups.
Step 2: Addressing the Data Quality Issues
To address these issues, the following steps are taken:
o dplyr: This R package offers functions for data manipulation and cleaning,
including dealing with missing values, standardizing data, and removing
duplicates.
o tidyr: Helps to clean and transform datasets by reshaping data, handling
missing values, and ensuring consistent formatting.
4.4.6 Techniques for Imputing Missing Data
Imputing missing data is a critical step in data preprocessing as missing values can reduce
the accuracy and effectiveness of machine learning models and statistical analyses. The
method chosen for imputing missing data depends on the type of data, the proportion of
missing values, and the potential impact of the imputation on model performance. Below
are some of the most common techniques for imputing missing data:
1. Mean, Median, and Mode Imputation
Mean Imputation: This involves replacing missing numerical values with the mean
(average) of the available data for that feature. It works well when the data is
approximately normally distributed.
o Use Case: Numeric data like age, income, or scores that do not have extreme
outliers.
o Limitation: It can distort the distribution of the data and is sensitive to
outliers.
Median Imputation: Instead of using the mean, the median (the middle value when
data is sorted) is used to fill in missing values. This is particularly useful when the
data is skewed or contains outliers, as the median is more robust to extreme values.
o Use Case: Data with skewed distributions, such as income or house prices.
o Limitation: Does not preserve the variance of the data.
Mode Imputation: For categorical data, the most frequent category (mode) is used
to replace missing values.
o Use Case: Categorical data like gender, region, or product type.
o Limitation: May lead to an overrepresentation of the most common
category, which can introduce bias.
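A brief pandas sketch of median and mode imputation on illustrative columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [42000, np.nan, 51000, 300000, np.nan],       # skewed numeric feature
    "region": ["North", "South", np.nan, "North", "North"],  # categorical feature
})

df["income"] = df["income"].fillna(df["income"].median())   # median imputation (robust to the outlier)
df["region"] = df["region"].fillna(df["region"].mode()[0])  # mode imputation for categories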
2. K-Nearest Neighbors (KNN) Imputation
KNN Imputation: This method uses the K nearest neighbors to estimate the missing
value based on other similar observations. The "nearest" neighbors are determined
by distance metrics such as Euclidean distance.
o Use Case: Useful when data points have relationships with each other. For
instance, predicting missing values in customer demographics based on
other similar customer profiles.
o Limitation: Computationally expensive, especially with large datasets, and
requires the selection of an optimal "k" value.
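A minimal sketch of KNN imputation using scikit-learn's KNNImputer on an illustrative DataFrame:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],
    "income": [40000, 52000, 49000, np.nan, 61000],
})

# Estimate each missing value from the two most similar rows (Euclidean distance)
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)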
3. Regression Imputation
Regression Imputation: In this method, a regression model is built to predict the
missing values based on other observed variables. A regression equation is used to
predict the missing data points.
o Use Case: When there is a strong relationship between the feature with
missing values and other features. For example, predicting missing income
values based on age, education, and occupation.
o Limitation: Assumes that the relationship between variables is linear, which
may not always be true.
4. Multiple Imputation
Multiple Imputation: This technique generates multiple imputed values for each
missing data point to reflect the uncertainty of the imputation. Each dataset with
imputed values is then analyzed, and the results are combined to provide a more
reliable estimate.
o Use Case: Suitable for datasets with a large proportion of missing data or
when imputations need to capture uncertainty, such as in medical or survey
data.
o Limitation: Requires more computational resources and careful handling to
avoid bias in combining the results from different imputed datasets.
5. Last Observation Carried Forward (LOCF)
LOCF Imputation: This technique is primarily used in time-series data. The missing
value is replaced by the most recent observed value from previous time steps.
o Use Case: In datasets where observations are taken sequentially over time,
such as customer activity or sensor readings.
o Limitation: Assumes that the missing values are close to the last known
values, which might not be true, leading to inaccurate imputation in some
cases.
6. Interpolation
Linear Interpolation: This technique estimates the missing value by drawing a
straight line between the two adjacent values and filling in the missing value based
on this line. It assumes that the change between two consecutive points is linear.
o Use Case: Time-series data with missing values that lie in between known
observations (e.g., stock prices, weather data).
o Limitation: May not work well for non-linear or highly fluctuating data.
Spline Interpolation: This method fits a smooth curve (spline) through the
available data points and estimates missing values along this curve. It is more
flexible than linear interpolation.
o Use Case: Data with non-linear patterns, such as scientific measurements or
sensor data.
o Limitation: Computationally more intensive and may overfit the data.
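Both interpolation variants can be sketched with pandas on an illustrative series of sensor readings (spline interpolation additionally requires SciPy):

import numpy as np
import pandas as pd

readings = pd.Series([21.0, 21.5, np.nan, np.nan, 23.0])  # e.g. hourly temperature readings

linear_filled = readings.interpolate(method="linear")           # straight line between known points
spline_filled = readings.interpolate(method="spline", order=2)  # smooth curve; needs SciPy installed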
7. Random Forest Imputation
Random Forest Imputation: This technique uses a random forest model to predict
missing values. The missing value is predicted using multiple decision trees, which
are trained on the available data, and the final prediction is averaged over all trees.
o Use Case: Works well when there are complex relationships between the
features and is suitable for both numerical and categorical data.
o Limitation: Computationally expensive, especially for large datasets, and
requires careful tuning.
8. Deep Learning (Autoencoders)
Definition: Duplicates refer to identical or nearly identical rows in the dataset, which can
skew the analysis and negatively impact model training. These may arise due to data entry
errors, merging datasets, or improper data collection methods.
Why Removing Duplicates Is Important:
Redundancy: Duplicate records introduce unnecessary redundancy in the dataset,
leading to overrepresentation of certain values, which can bias statistical analyses
and machine learning models.
Model Overfitting: In machine learning, duplicates can cause overfitting, where the
model learns to memorize the repeated data points, resulting in poor generalization
on unseen data.
Distortion of Analysis: Statistical analysis and results (such as averages or
medians) can be skewed due to duplicated data, leading to incorrect conclusions.
Partial Duplicates: Sometimes, duplicates may exist only in certain columns (e.g., two
rows with the same name but different addresses). In such cases, duplicates should be
identified based on specific columns rather than the entire row.
Definition: Outliers are data points that deviate significantly from the rest of the data.
These values can be much higher or lower than the majority of the data points. While
outliers can sometimes provide valuable insights, they may also distort statistical models
and lead to inaccurate predictions.
Why Handling Outliers Is Important:
Influence on Statistical Measures: Outliers can distort statistical measures like the
mean, leading to inaccurate conclusions about the data.
Impact on Machine Learning Models: In machine learning, especially in
algorithms like linear regression, decision trees, or clustering, outliers can heavily
influence the model’s performance, making it less accurate.
Data Quality: Outliers might represent errors in data collection or recording, in
which case they need to be handled before conducting any analysis.
Techniques for Identifying and Removing Outliers:
Statistical Methods:
o Z-Score Method: A Z-score measures how many standard deviations a data
point is away from the mean. Typically, a Z-score greater than 3 or less than -
3 is considered an outlier. If data points have Z-scores beyond this threshold,
they can be removed.
Example:
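A sketch of the Z-score method with pandas and NumPy; the purchase_amount column and the threshold of 3 follow the description above, while the data itself is synthetic.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
amounts = np.append(rng.normal(loc=130, scale=5, size=50), 3000.0)  # one extreme value
df = pd.DataFrame({"purchase_amount": amounts})

col = df["purchase_amount"]
z_scores = (col - col.mean()) / col.std()

# Keep only the rows whose Z-score lies within +/- 3 standard deviations
df_no_outliers = df[z_scores.abs() <= 3]
print(len(df), "rows before,", len(df_no_outliers), "rows after removing outliers")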
Data transformation is a crucial step in data preprocessing that prepares the data for
analysis or machine learning models. Transformation techniques modify the structure or
scale of the data to improve the quality of insights, enhance model performance, and meet
the assumptions of certain algorithms. Key data transformation techniques include
normalization, standardization, and encoding, which address different aspects of data
processing.
Definition: Normalization (also known as Min-Max scaling) is the process of rescaling the
values of a numerical feature so that they fall within a specific range, typically [0, 1] or [-1,
1]. This is done to eliminate any potential bias due to varying magnitudes of different
features.
Why Normalization Is Important:
Uniform Range: When features have different ranges (e.g., one feature ranges from
1 to 1000, and another from 0 to 1), some features may dominate others, affecting
the model's performance.
Gradient Descent: Many machine learning algorithms, particularly those that rely
on gradient descent (such as neural networks), benefit from normalization since it
ensures faster convergence.
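A minimal sketch of Min-Max scaling applied to an illustrative salary column:

import pandas as pd

df = pd.DataFrame({"salary": [30000, 45000, 60000, 90000, 120000]})

# Rescale values into the [0, 1] range: (x - min) / (max - min)
min_val, max_val = df["salary"].min(), df["salary"].max()
df["salary_scaled"] = (df["salary"] - min_val) / (max_val - min_val)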
3. Binary Encoding:
Definition: Binary encoding is a combination of label encoding and one-hot
encoding. It first assigns an integer to each category and then converts these
integers into binary code. This method can be more efficient than one-hot encoding
when dealing with high-cardinality features (i.e., features with many unique
categories).
Use Case: For categorical features with many categories (e.g., product IDs).
4. Target Encoding:
Definition: Target encoding involves replacing each category with the mean of the
target variable for that category. This technique is often used in supervised learning
and is especially useful when dealing with high-cardinality categorical features.
Use Case: When there is a strong relationship between the categorical variable and
the target variable.
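The sketch below illustrates one-hot encoding and target encoding with pandas; the city and churn columns are illustrative.

import pandas as pd

df = pd.DataFrame({
    "city":  ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
    "churn": [1, 0, 1, 0, 0],   # illustrative binary target
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Target encoding: replace each category with the mean of the target for that category
city_means = df.groupby("city")["churn"].mean()
df["city_target_encoded"] = df["city"].map(city_means)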
Definition of Data Enrichment: Data enrichment is the process of enhancing existing data
by adding additional information or attributes from external or internal sources. This
process is aimed at improving the value, accuracy, and completeness of the dataset. Data
enrichment typically involves supplementing raw data with information such as
demographic details, behavioral insights, geographic data, or other relevant variables that
are not present in the original dataset.
For example, if a company has customer names and email addresses, they may enrich their
data by adding details such as the customer’s age, location, buying history, or social media
activity. This additional information can come from various external databases, public
records, third-party data providers, or internal sources.
Importance of Data Enrichment:
1. Improved Decision Making:
o Enriched data allows organizations to make more informed and accurate
decisions. By incorporating additional context, businesses can gain a deeper
understanding of their customers, products, or services, which leads to more
effective strategies and outcomes.
o For instance, enriching customer data with demographic information can
help a business segment its audience more effectively, improving marketing
strategies and sales targeting.
Another major benefit of data augmentation is the ability to unlock deeper insights.
By incorporating data from external sources, organizations can better understand the
factors influencing their operations. For instance, adding weather data to sales
information could reveal patterns in how weather affects consumer buying behavior,
allowing for more targeted marketing campaigns. Furthermore, external data can
provide valuable insights into customer preferences, competitive dynamics, or
emerging market trends that are difficult to observe in isolation.
Examples of Augmenting Datasets with External Data: There are many ways
businesses can augment their datasets with external data. For instance, geographic
data can be used to complement customer datasets, such as adding regional economic
information, demographic profiles, or details on population density. This helps
organizations better understand their customers and tailor their services or products
to specific regions or markets. Similarly, social media data can be incorporated to
track customer sentiment or analyze public opinions on specific products, helping
businesses adapt their marketing strategies and improve customer engagement.
External data comes from various sources that provide additional information to
enhance existing datasets. These sources are invaluable for organizations seeking to
gain deeper insights, improve data quality, or expand their dataset to include
variables that were not initially captured.
Public Datasets are one of the most widely available sources of external data. They
are typically released by governments, research institutions, non-profit organizations,
or international bodies, and cover a wide range of topics, including economic data,
public health, environmental conditions, and education. Examples include census
data, economic indicators (such as GDP, unemployment rates), climate data
(temperature, rainfall), and health statistics. These datasets are often freely accessible
to the public and can be downloaded in various formats like CSV or Excel. Public
datasets are valuable for researchers, businesses, and policymakers as they provide a broad,
reliable foundation of information that can be combined with internal datasets.
In addition to public datasets and APIs, commercial data providers also play a
significant role in sourcing external data. These providers offer specialized datasets,
often for a fee, that are tailored to specific industries or business needs. For example,
businesses might purchase data from credit scoring agencies, market research firms,
or advertising platforms to obtain customer insights, purchasing behaviors,
demographic data, or industry trends. These commercial data sources are often
enriched and highly curated, making them a valuable resource for organizations
looking for more specific, high-quality data that may not be readily available through
public channels. For example, companies like Nielsen provide consumer behavior
data, while financial data providers like Bloomberg and Reuters offer stock market,
economic, and company performance data.
Crowdsourced Data is another emerging source of external data, often obtained from
platforms where users contribute information voluntarily. Examples include user-
generated content on platforms like Wikipedia or open-source projects that gather
data from contributors globally. This type of data can be particularly useful for real-
time analytics or when seeking insights into large, diverse datasets.
Finally, Private Data Providers and Data Brokers sell data that is highly specialized
and often aggregated from a variety of sources. These providers might collect and sell
consumer data, online activity data, or business performance data. While this type of
data can be costly, it often provides highly specific insights that are useful for targeted
marketing, customer segmentation, and competitive analysis.
Text enrichment techniques are methodologies applied to unstructured text data in order
to extract deeper insights, add meaningful context, and make it more useful for analysis.
These techniques allow businesses and organizations to gain valuable information from
large volumes of text, such as customer reviews, social media posts, articles, or any other
textual content. The most common and impactful text enrichment techniques include
Sentiment Analysis, Named Entity Recognition (NER), and Topic Modeling.
1. Sentiment Analysis:
o Sentiment analysis, also known as opinion mining, involves analyzing text to
determine the sentiment or emotional tone behind it. The text is classified as
positive, negative, or neutral, and can even detect emotions such as joy,
anger, sadness, or fear. This technique is especially useful in understanding
public perception of products, services, or brands, especially in social media,
customer feedback, and product reviews.
o For example, a company analyzing product reviews can use sentiment
analysis to gauge whether customer feedback is generally positive or
negative, and identify areas of improvement or satisfaction. By analyzing
customer sentiment, businesses can tailor their strategies, improve customer
relations, and enhance product offerings.
o Applications: Customer service, brand monitoring, social media listening,
product feedback analysis.
2. Named Entity Recognition (NER):
o Named Entity Recognition (NER) is a technique used to identify and classify
specific entities in text, such as names of people, organizations, locations,
dates, and other key terms. This helps in extracting structured data from
unstructured text. For example, in a news article, NER can identify names of
people, places, dates, and events, making it easier to index, categorize, and
analyze the content.
o NER is particularly useful for structuring raw text into meaningful
components that can be used for further processing, like building knowledge
graphs, creating search engine optimizations, or enhancing chatbots with
more contextual understanding.
o Applications: Document categorization, information extraction, content
tagging, enhancing search functionality.
3. Topic Modeling:
o Topic modeling is a technique used to discover hidden thematic structures or
topics within a set of documents. It helps in identifying the underlying
themes that appear frequently in a collection of text. One of the most
common methods for topic modeling is Latent Dirichlet Allocation (LDA),
which groups words that frequently co-occur in documents into distinct
topics.
o Topic modeling is particularly useful for categorizing large volumes of text
into themes, enabling organizations to quickly identify what subjects are
being discussed. For example, in a large set of customer reviews, topic
modeling might reveal recurring themes like "product quality," "shipping
experience," or "customer service," helping businesses focus on key areas of
concern or success.
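The first two techniques above can be sketched briefly in Python. VADER and spaCy are common library choices rather than ones prescribed by this text, and the review sentence is illustrative (spaCy additionally needs its small English model downloaded).

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import spacy

review = "The delivery from Acme Corp to Mumbai was late, but the support team was wonderful."

# Sentiment analysis with VADER: the compound score ranges from -1 (negative) to +1 (positive)
analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores(review))

# Named Entity Recognition with spaCy's small English model
nlp = spacy.load("en_core_web_sm")
for ent in nlp(review).ents:
    print(ent.text, ent.label_)  # e.g. "Acme Corp" ORG, "Mumbai" GPE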
Assessment Criteria
S. No. | Assessment Criteria for Performance Criteria | Theory Marks | Practical Marks | Project Marks | Viva Marks
PC1 | Knowledge of the steps involved in data collection, goal-setting, and choosing appropriate methods for different scenarios. | 30 | 20 | 7 | 7
PC2 | Understanding web scraping, API usage, and data feeds as methods of acquiring data. Awareness of tools for data collection, metadata management, and dataset publishing. | 30 | 20 | 7 | 7
PC3 | Understanding of common data quality issues and techniques for cleaning and transforming data, such as normalization, standardization, and encoding. Familiarity with automated data cleaning and AI tools to manage missing data and outliers. | 20 | 10 | 3 | 3
PC4 | Understanding of data enrichment methods like augmenting datasets with external data and text enrichment techniques. | 20 | 10 | 3 | 3
Total | | 100 | 60 | 20 | 20
Total Marks: 200
References:
Exercise
1) Which of the following is the first step in the data collection process:
a. Data cleaning
b. Goal setting
c. Data analysis
d. Data visualization
2) What method involves extracting data from websites automatically:
a. API usage
b. Data feeds
c. Web scraping
d. Metadata management
3) Normalization is a technique used for:
a. Data visualization
b. Data cleaning
c. Data enrichment
d. Data analysis
4) Augmenting datasets with external data is a method of:
a. Data cleaning
b. Data enrichment
c. Data visualization
d. Data analysis
5) Choosing appropriate data collection methods depends on:
a. The size of the dataset only
b. The budget available only
c. The specific scenario and goals
d. The software used only
6) APIs (Application Programming Interfaces) are used to:
a. Scrape websites
True/False Questions
8. Text enrichment can involve adding sentiment analysis results to text data. (T/F)
10. Automated data cleaning tools are not effective for removing outliers. (T/F)
LAB Exercise
1. Describe the steps you would take to extract data from a specific website using
Python's BeautifulSoup library. Provide a code snippet demonstrating how to
extract a specific piece of information (e.g., product names from an e-commerce
site).
2. Given a dataset with missing values, write Python code using pandas to impute the
missing values with the mean of the respective columns.
3. Using a sample text dataset, demonstrate how to perform text enrichment by adding
sentiment analysis scores using a Python library like VADER.
4. Design a data collection plan for a project that requires gathering social media data
for sentiment analysis. Outline the steps, methods, and tools you would use.
5. Explain how you would use an API to retrieve data from a public dataset (e.g.,
weather data from an open API). Provide a conceptual outline of the code and the
data you would expect to receive.
Chapter 5 :
Data Integration, Storage and Visualization
5.1 Introduction to ETL Processes and Data Consolidation
In today's data-driven world, organizations generate vast amounts of data from multiple
sources. To derive meaningful insights, data must be efficiently extracted, transformed, and
loaded (ETL) into a centralized repository. ETL processes play a critical role in data
management by ensuring data integrity, consistency, and accessibility for analytics and
decision-making.
1. Extraction – Raw data is retrieved from multiple sources such as databases, APIs, flat
files, and cloud applications.
2. Transformation – The extracted data is cleansed, standardized, and enriched so that it
meets business and analytical requirements.
3. Loading – The processed data is loaded into a target system, such as a data
warehouse, data lake, or analytical platform, where it is stored and made available
for reporting and analysis. Depending on business needs, the loading process can be
done in batch mode or real-time streaming.
Apache Spark – A big data processing engine that supports scalable and distributed
ETL workflows, ideal for large-scale data transformations.
Challenges in ETL and Data Consolidation
Data Complexity – Handling diverse data formats and structures can be difficult,
requiring advanced parsing and normalization techniques.
Scalability Issues – Large datasets require efficient resource management to prevent
performance bottlenecks and ensure fast processing times.
Data Latency – Real-time data processing can be challenging when dealing with
batch-based ETL workflows that introduce delays.
Security and Compliance – Ensuring data privacy and regulatory compliance
requires robust governance measures, such as encryption and access controls.
Cost Considerations – Cloud-based ETL solutions can become expensive with high
data volumes, requiring organizations to optimize resource usage.
Maintaining Data Consistency – Synchronizing data across different sources and
destinations while maintaining accuracy is a significant challenge.
The ETL landscape is evolving with advancements in technology. Key trends include:
Cloud-Based ETL – Increasing adoption of cloud-native ETL solutions for scalability,
cost-effectiveness, and easier integration with cloud storage solutions.
AI-Driven Automation – Leveraging machine learning for intelligent data
processing, anomaly detection, and predictive analytics.
Real-Time Data Integration – Transitioning from traditional batch processing to
real-time streaming ETL using tools like Apache Kafka and AWS Kinesis.
Data Mesh and Data Fabric Architectures – Modernizing data management
through decentralized and interconnected data frameworks that enhance data
accessibility.
No-Code and Low-Code ETL Solutions – Providing business users with tools to build
and manage ETL pipelines without deep technical expertise.
Data Lineage and Observability – Improving data transparency and trust by
tracking data movement, transformations, and usage across systems.
ETL processes and data consolidation are fundamental to modern data management. By
leveraging the right tools and strategies, organizations can streamline data integration,
improve decision-making, and drive business success. As technology continues to evolve,
automated and intelligent ETL solutions will play an increasingly important role in managing
complex data ecosystems, ensuring efficiency, security, and compliance.
Effective data integration strategies help organizations break down data silos, improve
interoperability, and enhance data usability.
ETL processes, data consolidation, and integration are fundamental to modern data
management. By leveraging the right tools, organizations can streamline data workflows,
improve analytics, and enhance decision-making capabilities. As technology advances, AI-
driven automation, real-time data processing, and cloud-based integration solutions will
play an even more significant role in shaping the future of data management. Investing in
robust data integration strategies will enable organizations to stay competitive in an
increasingly data-driven economy.
ETL, which stands for Extract, Transform, Load, is a data integration process that moves data
from multiple sources into a centralized system for analysis and reporting. This process is
widely used in data warehousing, business intelligence, and analytics applications.
Extract (E)
The extraction phase involves retrieving raw data from various sources, including relational
databases, cloud storage, APIs, logs, and third-party applications. The key considerations in
extraction include:
Handling different data formats (structured, semi-structured, unstructured).
Managing data latency (real-time vs. batch extraction).
Ensuring data integrity during retrieval.
Transform (T)
The transformation phase cleanses, structures, and enriches the extracted data to meet
business requirements. Key transformation tasks include:
Data Cleansing – Removing duplicates, correcting inconsistencies, and handling
missing values.
Data Normalization – Standardizing formats and data types to ensure uniformity.
Aggregation – Summarizing data for analytical reporting.
Business Rule Application – Implementing logic such as currency conversion,
category classification, or predictive modeling.
Load (L)
The final step in the ETL process is loading the transformed data into a target system, such
as a data warehouse, data lake, or business intelligence platform. Depending on business
needs, data can be loaded in:
Batch Mode – Data is processed in bulk at scheduled intervals.
Incremental Mode – Only new or updated records are processed to optimize
performance.
Real-Time Streaming – Data is continuously updated as new information becomes
available.
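The three phases can be prototyped in a few lines of Python; the sketch below uses pandas and SQLite, and the file name, table name, and column names are illustrative assumptions.

import sqlite3
import pandas as pd

# Extract: read raw data from a source file
raw = pd.read_csv("sales_raw.csv")

# Transform: cleanse and standardize the extracted data
raw = raw.drop_duplicates()
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["amount"] = raw["amount"].fillna(0)

# Load: write the prepared data into a target database table
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()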
Importance of ETL
ETL plays a critical role in:
Data Consolidation – Integrating multiple data sources into a single repository.
Data Quality Enhancement – Ensuring clean and structured data for analysis.
Optimized Decision-Making – Enabling businesses to gain actionable insights from
well-prepared datasets.
Scalability – Handling increasing data volumes and complexity as organizations
grow.
ETL processes rely on various data sources, each serving different purposes based on
business requirements. Understanding these sources helps organizations design efficient
ETL workflows for data consolidation.
Databases
Databases are one of the most common sources for ETL pipelines. They store structured data
in organized tables, making them ideal for transactional and analytical processing. Common
types include:
Relational Databases (RDBMS) – MySQL, PostgreSQL, SQL Server, and Oracle store
structured data with relationships between tables.
NoSQL Databases – MongoDB, Cassandra, and DynamoDB handle semi-structured or
unstructured data, offering scalability and flexibility.
Cloud Databases – Services like Google BigQuery, Amazon RDS, and Azure SQL
Database provide managed database solutions for scalability and ease of use.
APIs (Application Programming Interfaces)
APIs enable data exchange between applications and systems, making them crucial for real-
time data integration. Key types include:
REST APIs – Most commonly used for data access over HTTP, supporting JSON and
XML formats.
SOAP APIs – Used for secure and structured data exchange, typically in enterprise
environments.
GraphQL APIs – Allow flexible and efficient querying, reducing the amount of data
transferred between systems.
APIs are essential for integrating data from external services such as CRM systems, financial
platforms, and IoT devices into ETL workflows.
Flat Files (CSV)
CSV (Comma-Separated Values) files and other flat files serve as simple and widely used data
exchange formats for moving data between systems and applications.
Although CSV files are easy to handle, they require careful preprocessing in ETL workflows
to address issues like missing values, incorrect delimiters, and inconsistent formatting.
The ETL process involves several systematic steps to ensure seamless data integration and
transformation. Below is a step-by-step guide to executing an effective ETL pipeline.
Determine the various data sources such as relational databases, APIs, flat files, or
cloud storage.
Define the frequency of data extraction (real-time, batch, or scheduled).
Ensure proper connectivity to these sources through database connectors, API
endpoints, or data streaming frameworks.
Extract raw data from identified sources while ensuring minimal disruption to live
systems.
Use query-based extraction for databases, API calls for external services, or file
parsing for structured/unstructured data.
Store extracted data in a staging area before transformation.
Load transformed data into the target system such as a data warehouse, data lake, or
analytics platform.
Choose an appropriate loading strategy:
o Full Load – Replaces all existing data with a new dataset.
o Incremental Load – Updates only the changed or new records, improving
efficiency.
o Real-Time Load – Streams data continuously into the target system for real-
time analysis.
Continuously monitor ETL pipelines using automated logging and alerting systems.
Maintain data lineage tracking to ensure transparency and auditability.
Periodically refine transformation rules to accommodate evolving business
requirements.
ETL processes, data consolidation, and integration are fundamental to modern data
management. By following a structured ETL workflow, organizations can streamline data
workflows, improve analytics, and enhance decision-making capabilities. As technology
advances, AI-driven automation, real-time data processing, and cloud-based integration
solutions will play an even more significant role in shaping the future of data management.
Investing in robust data integration strategies will enable organizations to stay competitive
in an increasingly data-driven economy.
Several tools facilitate ETL processes by automating data extraction, transformation, and
loading. Some widely used ETL tools include:
1. Apache NiFi – A powerful, open-source ETL tool that automates the movement and
transformation of data between systems. It is known for its real-time data processing
capabilities and intuitive interface.
2. Talend – A comprehensive ETL platform offering drag-and-drop functionalities,
advanced data transformation, and cloud-based integration.
3. Python Pandas – A flexible, open-source library in Python that enables data
manipulation, cleansing, and transformation through DataFrames.
4. Microsoft SQL Server Integration Services (SSIS) – A Microsoft-based ETL tool
designed for enterprise-level data integration and automation.
5. AWS Glue – A cloud-based, serverless ETL service that simplifies big data processing
and integration within Amazon Web Services.
6. Google Cloud Dataflow – A scalable ETL solution for streaming and batch data
processing within the Google Cloud ecosystem.
Each of these tools provides unique benefits, allowing organizations to choose the best
option based on scalability, cost, ease of use, and integration capabilities.
ETL processes, data consolidation, and integration are fundamental to modern data
management. By following a structured ETL workflow and leveraging powerful ETL tools,
organizations can enhance data quality, improve analytics, and drive better decision-making.
As technology advances, AI-driven automation, real-time data processing, and cloud-based
ETL solutions will continue to revolutionize data management and integration.
Data cleaning and transformation are crucial steps in the ETL process, ensuring high-quality,
accurate, and usable data for analytics. The following techniques are commonly used:
1. Normalization – Scaling numeric data to fit within a specified range (e.g., Min-Max
scaling, Z-score normalization).
2. Encoding Categorical Variables – Converting categorical data into numerical form
using one-hot encoding or label encoding.
3. Aggregation – Summarizing data to generate meaningful insights, such as calculating
average sales by region.
4. Merging and Joining Data – Combining multiple datasets using joins (inner, outer,
left, right) to create comprehensive datasets.
5. Data Binning – Grouping continuous numerical data into discrete intervals to
improve interpretability.
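A short pandas sketch combining aggregation, merging, and binning on illustrative sales data:

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "amount": [250, 310, 190, 420],
})
targets = pd.DataFrame({"region": ["North", "South"], "target": [800, 900]})

# Aggregation: average sales per region
avg_by_region = sales.groupby("region", as_index=False)["amount"].mean()

# Merging: combine the aggregate with another dataset on a shared key
combined = avg_by_region.merge(targets, on="region", how="left")

# Binning: group continuous amounts into discrete, labelled intervals
sales["amount_band"] = pd.cut(sales["amount"], bins=[0, 200, 300, 500],
                              labels=["low", "medium", "high"])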
By applying these data cleaning and transformation techniques, organizations can ensure
high data quality and improved decision-making capabilities.
Conclusion
ETL processes, data consolidation, and integration are fundamental to modern data
management. By following a structured ETL workflow and leveraging powerful ETL tools,
organizations can enhance data quality, improve analytics, and drive better decision-making.
As technology advances, AI-driven automation, real-time data processing, and cloud-based
ETL solutions will continue to revolutionize data management and integration.
Once data is extracted, transformed, and cleaned, the final step in the ETL process is
consolidating it into a unified dataset. Data consolidation ensures that all relevant
information is integrated into a single, accurate, and reliable source, enabling businesses to
make informed decisions and enhance operational efficiency.
1. Data Warehouses – Centralized repositories optimized for structured, processed data
and analytical reporting, such as Amazon Redshift, Google BigQuery, and Snowflake.
2. Data Lakes – Large-scale repositories that store structured and unstructured data,
enabling advanced analytics and machine learning. Popular data lakes include
Microsoft Azure Data Lake and AWS Lake Formation.
3. Master Data Management (MDM) – A framework that ensures consistency,
accuracy, and control over key business data, creating a single source of truth across
an organization.
1. ETL Pipelines – Using automated workflows to extract, transform, and load data into
a target system ensures that data from multiple sources is standardized and merged
seamlessly.
2. Data Virtualization – Allows access to disparate data sources in real-time without
requiring physical data movement.
3. Data Replication – Copying data from one source to another, ensuring redundancy
and backup while maintaining a consistent dataset.
4. Schema Integration – Harmonizing different data structures and formats to create a
unified schema that supports seamless querying and reporting.
Challenges in Data Consolidation
1. Data Redundancy – Duplicate records may arise when integrating multiple data
sources, leading to inconsistencies and bloated storage.
2. Data Quality Issues – Inaccurate, incomplete, or outdated data can hinder effective
consolidation, requiring continuous validation and cleansing.
3. Scalability – Large volumes of data must be efficiently managed to prevent
performance bottlenecks.
4. Security and Compliance – Ensuring data governance policies, such as GDPR and
HIPAA, are followed to maintain data privacy and regulatory adherence.
Retail companies use ETL to consolidate data from various sales channels, inventory
management systems, and customer interactions to optimize business operations.
Example: A global e-commerce company extracts sales data from multiple sources
(websites, mobile apps, and in-store purchases), transforms it by standardizing product
categories, and loads it into a centralized data warehouse for real-time inventory tracking
and sales analysis.
Example: A hospital network extracts patient data from electronic health record (EHR)
systems, transforms it by anonymizing sensitive information, and loads it into a secure
database for disease trend analysis and predictive healthcare analytics.
Example: A bank extracts transaction data from various sources, transforms it by detecting
suspicious activity patterns, and loads it into a fraud monitoring system to prevent
unauthorized transactions.
Example: A digital marketing agency extracts customer engagement data from email
campaigns and social media, transforms it by segmenting customers based on behavior, and
loads it into a marketing dashboard for targeted advertising.
Example: A national statistics agency extracts demographic data from multiple sources,
transforms it by standardizing region-based metrics, and loads it into an open data portal for
policy-making and urban planning.
These real-world examples illustrate how ETL processes drive efficiency, enhance decision-
making, and support business growth. As industries continue to generate vast amounts of
data, ETL will remain a critical component of data integration and analytics strategies.
5.2 Understanding Modern Data Storage Architectures - Data Lakes vs. Data Warehouses
As organizations generate vast amounts of data, selecting the right storage architecture
becomes crucial for efficient data management, analytics, and decision-making. Two of the
most widely used storage solutions are Data Lakes and Data Warehouses, each serving
distinct purposes. Understanding their differences, benefits, and use cases helps
organizations optimize their data strategies.
A Data Warehouse is a centralized repository designed for structured data that has been
processed, cleaned, and formatted for analysis. It follows a schema-on-write approach,
meaning data is structured before being stored. This makes it ideal for business intelligence
(BI) and reporting applications.
Amazon Redshift – A cloud-based, fully managed data warehouse optimized for big
data analytics.
Google BigQuery – A serverless, highly scalable warehouse with machine learning
capabilities.
Snowflake – A flexible, cloud-native data warehouse with strong security and sharing
capabilities.
Microsoft Azure Synapse Analytics – Integrates big data and traditional data
warehousing for enterprise analytics.
A Data Lake is a scalable storage repository that holds structured, semi-structured, and
unstructured data in its raw form. Unlike data warehouses, data lakes follow a schema-on-
read approach, meaning data is stored as-is and structured when queried.
1. Schema-On-Read – Data is ingested in its raw form and structured only when
accessed.
2. Supports All Data Types – Can store structured (relational databases), semi-
structured (JSON, XML), and unstructured (videos, images, logs) data.
3. Big Data Processing – Supports real-time analytics, artificial intelligence (AI), and
machine learning (ML) applications.
4. Cost-Effective Storage – Stores massive amounts of data at lower costs compared to
structured databases.
5. Flexible Access Methods – Supports batch processing, real-time streaming, and
interactive analytics.
Amazon S3 + AWS Lake Formation – Scalable cloud storage with data lake
management tools.
Microsoft Azure Data Lake Storage – Provides high-performance data lake services
with security and compliance.
Google Cloud Storage + BigLake – Unifies data lake and warehouse capabilities for
large-scale analytics.
Apache Hadoop + HDFS – Open-source ecosystem for distributed big data storage
and processing.
Both Data Lakes and Data Warehouses play critical roles in modern data management.
Choosing the right architecture depends on data types, processing needs, analytics
requirements, and cost considerations. By leveraging the right storage solution—or a hybrid
Data Lakehouse approach—organizations can maximize data usability, enhance analytics,
and drive business innovation.
Data storage refers to the collection, management, and retention of digital information in
various formats. As data continues to grow exponentially, businesses and organizations rely
on different storage solutions to ensure accessibility, security, and efficiency. Traditional
data storage methods include databases, file storage systems, and cloud-based solutions,
each serving specific purposes in data management.
Data storage has evolved from physical storage devices, such as hard drives and tapes, to
cloud-based and distributed storage solutions. The need for efficient data handling has led
to the development of modern storage architectures, including:
A Data Lake is a storage repository designed to hold vast amounts of raw data in its native
format until it is needed. Unlike traditional databases that require structured data with
predefined schemas, data lakes allow for storing structured, semi-structured, and
unstructured data without requiring transformation at the time of ingestion.
1. Scalability: Data lakes can store petabytes of data efficiently, accommodating growing
data volumes.
2. Flexibility: Supports multiple data formats, including JSON, CSV, images, videos, and
logs.
3. Schema-on-Read: Data is stored in its raw form and structured only when needed,
providing agility in data analysis.
4. Cost-Effective: Leveraging cloud storage, data lakes offer a cost-efficient way to store
large datasets.
5. Advanced Analytics and AI Integration: Data lakes support machine learning, big data
analytics, and real-time processing.
Supports Advanced Use Cases: Facilitates data science, artificial intelligence, and
real-time analytics.
Data Governance: Without proper management, data lakes can become “data
swamps,” making retrieval and organization difficult.
Security Risks: Storing sensitive data in a centralized repository increases the risk
of unauthorized access.
Complex Data Processing: Requires robust tools and strategies for effective data
analysis and transformation.
Data lakes have revolutionized the way organizations store and manage data, offering a
flexible and scalable solution for handling massive datasets. While they provide numerous
advantages, proper governance and security measures are essential to maximizing their
potential. By integrating data lakes into their infrastructure, businesses can harness the
power of big data and drive innovation through advanced analytics and AI applications.
While both Data Lakes and Data Warehouses are data storage solutions, they serve
different purposes and use different approaches to storing, processing, and analyzing data.
Understanding their key differences helps businesses select the right solution based on their
analytical and operational needs.
Key Differences
Use a Data Lake if: Your organization deals with large amounts of raw data, requires
flexibility, and focuses on advanced analytics and machine learning.
Use a Data Warehouse if: Your business needs well-structured, processed data for
quick insights, reports, and business intelligence.
Hybrid Approach: Many organizations use a combination of both, storing raw data
in a data lake and moving refined data to a data warehouse for reporting.
Data Lakes and Data Warehouses complement each other in modern data architectures.
While data lakes provide flexibility and scalability for diverse data types, data warehouses
offer structured, high-performance querying for business intelligence. Selecting the right
solution depends on an organization’s analytical needs, data processing capabilities, and cost
considerations.
Big Data Analytics: Data lakes allow organizations to store and analyze massive volumes of
unstructured and semi-structured data, enabling insights through AI and machine learning
models.
Internet of Things (IoT) Data Management: IoT devices generate vast amounts of sensor
data that require a scalable storage solution like a data lake to process real-time analytics.
Fraud Detection and Risk Analysis: Financial institutions use data lakes to detect
anomalies in transactions and assess risks by analyzing raw historical data.
Healthcare and Genomics Research: Medical institutions utilize data lakes to store diverse
patient records, genomic sequences, and medical images for predictive analysis and
research.
Cybersecurity and Threat Detection: Organizations analyze logs and real-time data in data
lakes to detect security threats and prevent cyberattacks.
Media and Entertainment: Streaming services store large amounts of raw content, user
preferences, and viewing history in data lakes to provide personalized recommendations.
Business Intelligence and Reporting: Data warehouses store structured data optimized
for reporting and dashboards, enabling quick decision-making for enterprises.
Sales and Financial Analysis: Organizations use data warehouses to track revenue,
expenses, and sales performance, generating structured reports for stakeholders.
Supply Chain and Inventory Management: Businesses analyze structured logistics and
inventory data to optimize supply chain operations.
Retail and E-commerce Analysis: Businesses use data warehouses to track customer
purchases, optimize pricing strategies, and analyze buying behaviors to enhance marketing
efforts.
Both data lakes and data warehouses provide essential capabilities for different use cases.
Data lakes excel in storing and analyzing vast amounts of raw data for AI-driven applications,
while data warehouses offer structured, high-performance querying for business
intelligence and compliance. Choosing the right solution depends on an organization’s needs,
data types, and analytical goals.
Fault Tolerance: Ensures data availability even if some nodes fail, improving
reliability.
Data Replication: Copies of data are stored across multiple locations to prevent data
loss.
Global Applications: Companies with global user bases use distributed databases to
reduce latency and improve data access speed.
Financial Services: Banks and financial institutions use distributed databases for
real-time transaction processing and fraud detection.
Distributed databases offer a scalable, fault-tolerant solution for organizations dealing with
large-scale data across multiple locations. Their ability to ensure high availability and
efficient processing makes them essential in modern cloud-based architectures.
5.2.5 Cloud Storage Solutions (AWS S3, Azure Data Lake, Google Cloud Storage)
Cloud storage solutions provide scalable, secure, and cost-effective ways to store, manage,
and access data over the internet. Leading cloud providers offer specialized storage services
to cater to different business needs.
1. Amazon S3 (Simple Storage Service): A highly scalable object storage service from
AWS that provides durability, security, and integration with various AWS services.
2. Azure Data Lake Storage: A cloud storage solution optimized for big data analytics,
offering hierarchical namespace, security, and integration with Microsoft Azure
services.
3. Google Cloud Storage: A multi-tiered storage service that supports object storage,
lifecycle management, and seamless integration with Google Cloud’s AI and analytics
tools.
Cloud storage solutions play a vital role in modern data management, enabling organizations
to handle vast amounts of data efficiently while ensuring security, reliability, and scalability.
5.3 Interactive Data Visualization: Building Dashboards with Plotly and Matplotlib
Interactive data visualization is a powerful technique that allows users to explore and
analyze data dynamically. Unlike static charts, interactive visualizations enable users to
zoom, filter, and manipulate data points, making them valuable for business intelligence,
scientific research, and operational monitoring. Two popular Python libraries for building
interactive dashboards are Plotly and Matplotlib.
Plotly is a high-level library that supports a wide range of chart types, including line charts,
scatter plots, bar charts, heatmaps, and 3D visualizations. It integrates well with Python, R,
and JavaScript and is widely used for building interactive web-based dashboards.
1. Interactivity: Users can hover over data points, zoom, pan, and filter data dynamically.
3. Integration with Dash: Dash, a Python framework, enables users to create fully
interactive web applications powered by Plotly visualizations.
4. Support for Multiple Data Formats: Plotly can process data from CSV files, databases,
and APIs seamlessly.
5. Cloud and Offline Capabilities: It offers both cloud-based and offline rendering modes,
making it flexible for various deployment scenarios.
Matplotlib is a widely used visualization library that provides static, animated, and
interactive plots. While it is not as interactive as Plotly, it is highly customizable and
integrates well with Jupyter Notebooks and scientific computing tools.
1. Wide Variety of Charts: Supports histograms, scatter plots, bar charts, pie charts,
and more.
2. Customization: Allows users to modify colors, fonts, grid lines, and axes properties.
3. Compatibility: Works seamlessly with NumPy, Pandas, and SciPy for data analysis.
4. Static and Animated Plots: Enables the creation of animated visualizations and
time-series data charts.
5. Embedding in Applications: Can be integrated into GUI applications using Tkinter,
PyQt, and other frameworks.
Interactive data visualization is crucial for making data-driven decisions, and libraries like
Plotly and Matplotlib offer robust solutions for different use cases. Plotly is ideal for web-
based dashboards and interactive data exploration, while Matplotlib provides powerful
customization for static visualizations. Combining both libraries in a data analytics workflow
allows for greater flexibility in presenting and analyzing data effectively.
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools make complex data more
accessible, understandable, and actionable. Effective data visualization enables
organizations to identify trends, outliers, and patterns, facilitating better decision-making.
Modern tools like Tableau, Power BI, and Python libraries (Matplotlib, Seaborn) enhance
data-driven storytelling, making insights clearer for stakeholders.
In today’s data-driven world, raw data can be overwhelming and difficult to interpret. Data
visualization transforms complex datasets into meaningful insights by:
Data analysis involves extracting insights from raw data, but without proper visualization,
understanding these insights can be challenging. Visualization translates numbers and
patterns into graphical formats that help analysts and decision-makers grasp trends and outliers at a glance.
Without visualization, data analysis can become cumbersome and ineffective.
Introduction to Matplotlib
Matplotlib is one of the most widely used Python libraries for data visualization. It provides
a flexible and powerful way to create static, animated, and interactive visualizations.
Whether you need to plot simple line graphs or complex multi-panel figures, Matplotlib
offers extensive customization options.
Installing Matplotlib
To start using Matplotlib, you need to install it. You can install it using pip:
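pip install matplotlib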
Once installed, you can create a simple plot using the pyplot module:
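A minimal sketch with hypothetical x and y values, using the pyplot elements summarized in the list that follows:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.plot(x, y, label="y = 2x")   # plot the data series with a legend label
plt.xlabel("X values")           # x-axis label
plt.ylabel("Y values")           # y-axis label
plt.title("A Simple Line Plot")  # chart title
plt.legend()                     # show the legend
plt.grid(True)                   # add grid lines for readability
plt.show()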
1. Figure & Axes: The Figure is the overall container, while Axes represent the
plotting area.
2. Labels & Titles: xlabel(), ylabel(), and title() help in providing context.
3. Legend: legend() is used to label different data series.
4. Grid: grid() enhances readability by adding grid lines.
Matplotlib allows extensive customization, such as changing colors, line styles, and adding
annotations:
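A minimal sketch of such customization; the data values and annotation text are hypothetical:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# Change colour, line style, marker, and line width
plt.plot(x, y, color="green", linestyle="--", marker="o", linewidth=2)

# Add an annotation pointing at a specific data point
plt.annotate("Peak value", xy=(5, 25), xytext=(3, 20),
             arrowprops=dict(arrowstyle="->"))

plt.title("Customized Plot")
plt.show()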
Data visualization is essential for analyzing and interpreting data effectively. In this section,
we will learn how to create basic charts—Line Chart, Bar Chart, and Pie Chart—using
Matplotlib in Python.
1. Line Chart
A line chart is used to display trends over time. It is useful for showing changes in data at
equal intervals.
Example Code:
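A minimal sketch using hypothetical monthly sales figures:
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 150, 170, 160, 200]

plt.plot(months, sales, marker="o")   # line chart with point markers
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly Sales Trend")
plt.show()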
2. Bar Chart
A bar chart is used to compare different categories. It represents data with rectangular
bars.
Example Code:
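A minimal sketch using hypothetical category totals:
import matplotlib.pyplot as plt

categories = ["Electronics", "Clothing", "Groceries", "Books"]
units_sold = [250, 180, 300, 90]

plt.bar(categories, units_sold, color="skyblue")   # one bar per category
plt.xlabel("Category")
plt.ylabel("Units Sold")
plt.title("Sales by Category")
plt.show()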
3. Pie Chart
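A pie chart shows how a whole is divided into proportional parts. A minimal sketch with hypothetical category shares:
import matplotlib.pyplot as plt

labels = ["Electronics", "Clothing", "Home Appliances"]
shares = [45, 30, 25]

plt.pie(shares, labels=labels, autopct="%1.1f%%")   # show each slice's percentage
plt.title("Share of Items by Category")
plt.show()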
1. What is Plotly?
Plotly is a powerful Python library used for creating interactive visualizations. Unlike
Matplotlib and Seaborn, which generate static images, Plotly allows users to interact with
graphs by zooming, panning, hovering, and toggling data points.
Supports interactive charts like line, bar, scatter, and pie charts.
Works seamlessly in Jupyter Notebooks and web applications.
Supports multiple languages, including Python, JavaScript, and R.
Can be integrated with Dash to create full-fledged web applications.
2. Installing Plotly
Before using Plotly, you need to install it. Run the following command:
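pip install plotly
With Plotly installed, a minimal interactive line chart can be built with plotly.express; the month and sales values below are hypothetical:
import plotly.express as px

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 150, 170, 160, 200]

fig = px.line(x=months, y=sales, markers=True,
              labels={"x": "Month", "y": "Sales"},
              title="Monthly Sales Trend")
fig.show()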
Output:
An interactive line chart where users can hover over data points to see values.
Users can zoom in/out and pan across the graph.
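A minimal interactive bar chart sketch with hypothetical values:
import plotly.express as px

categories = ["Electronics", "Clothing", "Groceries", "Books"]
units_sold = [250, 180, 300, 90]

fig = px.bar(x=categories, y=units_sold,
             labels={"x": "Category", "y": "Units Sold"},
             title="Sales by Category")
fig.show()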
Output:
An interactive bar chart where hovering over bars displays exact values.
Example Code:
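As an illustration in the same style, a minimal interactive pie chart sketch with hypothetical shares:
import plotly.express as px

labels = ["Electronics", "Clothing", "Home Appliances"]
shares = [45, 30, 25]

fig = px.pie(names=labels, values=shares,
             title="Share of Items by Category")
fig.show()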
Output:
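An interactive pie chart in which hovering over a slice shows its category name, value, and percentage of the whole.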
The plotly Python library is an interactive, open-source plotting library that supports over
40 unique chart types covering a wide range of statistical, financial, geographic, scientific,
and 3-dimensional use-cases.
Built on top of the Plotly JavaScript library (plotly.js), plotly enables Python users to create
beautiful interactive web-based visualizations that can be displayed in Jupyter notebooks,
saved to standalone HTML files, or served as part of pure Python-built web applications
using Dash. The plotly Python library is sometimes referred to as "plotly.py" to differentiate
it from the JavaScript library.
1. Introduction to Dashboards
A dashboard is a user interface that visually represents key metrics and data insights. It
allows users to interact with and explore data dynamically.
2. What is Dash?
Dash is a Python framework that allows users to build web-based interactive dashboards
using Plotly and Flask.
Installing Dash
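pip install dash
A minimal sketch of a Dash dashboard with a region dropdown follows; the region names and sales figures are hypothetical, and the callback redraws the line chart whenever the dropdown selection changes. The feature list after the code describes this behaviour.
from dash import Dash, dcc, html, Input, Output
import plotly.express as px
import pandas as pd

# Hypothetical monthly sales for two regions
df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr"] * 2,
    "Sales": [100, 120, 140, 160, 90, 110, 95, 130],
    "Region": ["North"] * 4 + ["South"] * 4,
})

app = Dash(__name__)

app.layout = html.Div([
    html.H2("Regional Sales Dashboard"),
    dcc.Dropdown(
        id="region-dropdown",
        options=[{"label": r, "value": r} for r in df["Region"].unique()],
        value="North",
    ),
    dcc.Graph(id="sales-graph"),
])

@app.callback(Output("sales-graph", "figure"),
              Input("region-dropdown", "value"))
def update_chart(region):
    # Filter the data for the selected region and redraw the line chart
    filtered = df[df["Region"] == region]
    return px.line(filtered, x="Month", y="Sales",
                   title=f"Sales Trend - {region}")

if __name__ == "__main__":
    app.run(debug=True)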
i. Interactive Dropdown – Users can switch between North and South regions.
ii. Dynamic Graph Updates – The sales trend updates instantly.
iii. User-Friendly Web Interface – No coding needed for interaction.
Real-time data visualization is crucial for monitoring and analyzing dynamic datasets, such
as:
i. Stock market prices
ii. Live sensor data
iii. Website traffic analytics
iv. IoT device monitoring
Unlike static charts, real-time visualizations update dynamically without refreshing the
page. In Python, we can achieve this using Dash, Plotly, and WebSockets.
To visualize live data, we will use Dash with a periodic callback that updates the chart at
regular intervals.
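A minimal sketch of such a live chart; a random walk stands in for a real stock-price feed, and the component and function names match the description that follows:
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go
import pandas as pd
import random
from datetime import datetime

# DataFrame that accumulates the simulated prices
df = pd.DataFrame(columns=["time", "price"])
price = 100.0

app = Dash(__name__)

app.layout = html.Div([
    html.H2("Live Stock Price"),
    dcc.Graph(id="live-graph"),
    dcc.Interval(id="interval", interval=1000, n_intervals=0),  # fires every second
])

@app.callback(Output("live-graph", "figure"), Input("interval", "n_intervals"))
def update_graph(n_intervals):
    global price
    # Simulate a small price movement and append it to the DataFrame
    price += random.uniform(-1, 1)
    df.loc[len(df)] = [datetime.now(), price]
    fig = go.Figure(go.Scatter(x=df["time"], y=df["price"], mode="lines"))
    fig.update_layout(title="Simulated Stock Price",
                      xaxis_title="Time", yaxis_title="Price")
    return fig

if __name__ == "__main__":
    app.run(debug=True)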
The dcc.Graph() component displays the live chart, and dcc.Interval() triggers updates
every second.
The update_graph() function simulates stock price changes and appends new data to the
DataFrame.
5.4 Cloud Storage Solutions: Security, Scalability, and Compliance for Data Management
Cloud storage solutions have revolutionized data management by offering scalable, secure,
and cost-effective alternatives to traditional on-premise storage. Leading cloud providers
such as Amazon Web Services (AWS) S3, Microsoft Azure Data Lake, and Google Cloud
Storage provide businesses with robust data storage capabilities tailored for different use
cases, including big data analytics, backup and recovery, and enterprise data sharing.
Security is a critical aspect of cloud storage solutions, ensuring data is protected from
unauthorized access, breaches, and cyber threats. Cloud providers implement various
security measures, including encryption, identity and access management (IAM), and threat detection.
Hot Storage: High-performance storage optimized for frequently accessed data (e.g.,
AWS S3 Standard, Azure Hot Blob Storage).
Cold Storage: Cost-efficient storage for infrequently accessed data (e.g., Google
Coldline Storage, AWS Glacier).
Archival Storage: Long-term data retention for compliance and backup needs (e.g.,
AWS Glacier Deep Archive, Azure Archive Storage).
Access Logs and Monitoring: Tracks and logs all data access activities to ensure
compliance with internal and external policies.
Automated Compliance Reports: Cloud providers offer built-in tools to generate
reports for regulatory audits.
Cloud storage solutions provide businesses with unparalleled security, scalability, and
compliance capabilities. Security measures like encryption, IAM, and threat detection
safeguard data from breaches. Scalability ensures that businesses can dynamically expand
their storage capacity without additional infrastructure costs. Compliance features help
organizations adhere to industry regulations while maintaining proper data governance.
With the ever-increasing volume of data, cloud storage solutions like AWS S3, Azure Data
Lake, and Google Cloud Storage offer enterprises a secure, efficient, and compliant way to
manage their data, ensuring long-term sustainability in the digital era.
Cloud storage is a data management solution that allows users to store, access, and manage
data over the internet instead of relying on local storage devices or on-premise servers. It
provides scalable, flexible, and cost-effective storage solutions for businesses, organizations,
and individuals. Cloud storage is widely used for data backup, disaster recovery, and big data
analytics, ensuring data availability and accessibility from anywhere in the world.
Cloud storage is categorized into several types based on storage architecture and data access
patterns:
Object Storage – Used for unstructured data storage, such as images, videos,
backups, and large datasets. Examples include Amazon S3, Azure Blob Storage, and
Google Cloud Storage.
File Storage – Provides a hierarchical file system similar to traditional network-
attached storage (NAS). Examples include Amazon EFS, Azure Files, and Google Cloud
Filestore.
Block Storage – Used for applications requiring low-latency and high-performance
data access, such as databases and virtual machines. Examples include Amazon EBS,
Azure Managed Disks, and Google Persistent Disks.
Flexibility – Supports multiple data formats and workloads, from simple file storage
to big data processing.
Automated Backup and Recovery – Cloud storage solutions include built-in backup
and versioning features to protect against accidental data loss.
Cost Saving – Cloud storage follows a pay-as-you-go model, so organizations pay only for the storage they actually use rather than investing in on-premise hardware.
Scalable – You can upgrade the service plan if the storage included in the current plan is insufficient; the additional space is added to your existing storage environment, so you won’t need to migrate any data from one place to another.
Cloud storage has transformed the way organizations store and manage data. With its scalability, security, and cost-effectiveness, cloud storage solutions provide an essential foundation for modern digital infrastructure. Whether for small businesses or large enterprises, adopting cloud storage ensures flexibility, accessibility, and data resilience in an increasingly data-driven world.
5.4.2 Overview of Cloud Providers (AWS, Azure, Google Cloud)
Cloud computing has become the backbone of modern IT infrastructure, offering scalable
storage, computing power, and a range of managed services. The three leading cloud
providers—Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform
(GCP)—dominate the industry, each providing unique features and capabilities tailored to
various business needs.
AWS is the market leader in cloud computing, offering a vast range of services, including
computing, storage, databases, networking, analytics, machine learning, and security.
Key Features:
Advantages:
Largest global cloud infrastructure with data centers in multiple regions.
Extensive service catalog for enterprise, startup, and developer needs.
Strong integration with artificial intelligence and machine learning tools.
Microsoft Azure
Azure is a powerful cloud computing platform that integrates seamlessly with Microsoft
products and enterprise applications. It is widely adopted by businesses using Windows-
based solutions.
Key Features:
Azure Blob Storage – Highly scalable object storage for unstructured data.
Azure Virtual Machines – Flexible virtual machine deployment for various
workloads.
Azure Synapse Analytics – Big data analytics and data warehousing solutions.
Azure Active Directory – Enterprise identity and access management for enhanced
security.
Azure Kubernetes Service (AKS) – Managed Kubernetes for deploying
containerized applications.
Advantages:
Strong enterprise adoption due to seamless integration with Microsoft tools (e.g.,
Office 365, SQL Server, Windows Server).
Advanced hybrid cloud capabilities for on-premise and cloud integration.
High compliance standards suitable for regulated industries (finance, healthcare,
government).
GCP is known for its data analytics, AI, and machine learning capabilities. It provides cost-
effective storage and computing solutions, making it a popular choice for startups and AI-
driven applications.
Key Features:
Advantages:
Scalability
Scalability is one of the most significant advantages of cloud storage, allowing businesses to
expand their storage needs dynamically without investing in physical infrastructure.
Global Data Distribution: Cloud storage solutions replicate data across multiple
geographically distributed data centers to enhance performance, redundancy, and
availability.
Serverless Storage Solutions: Cloud platforms like AWS S3, Azure Blob Storage, and
Google Cloud Storage provide scalable, serverless storage models where users pay
only for the storage they use.
Security:
Cloud storage providers implement robust security measures to protect data from cyber
threats, unauthorized access, and data breaches.
Data Encryption: Cloud storage encrypts data at rest and in transit using advanced
encryption standards (AES-256, TLS/SSL) to prevent unauthorized access.
Identity and Access Management (IAM): Role-based access control (RBAC) and
multi-factor authentication (MFA) enhance security by restricting data access to
authorized users.
Threat Detection & Prevention: AI-driven security monitoring tools detect
anomalies, unauthorized access attempts, and potential cyber threats in real-time.
Disaster Recovery & Backup: Cloud storage solutions provide built-in redundancy,
ensuring automatic backups and disaster recovery strategies to minimize data loss.
DDoS Protection & Network Security: Leading cloud providers integrate firewall
services, virtual private networks (VPNs), and Distributed Denial of Service (DDoS)
protection mechanisms to safeguard data.
Compliance:
Audit Logs & Monitoring: Cloud providers maintain detailed logs of access history,
data modifications, and security incidents to ensure accountability.
Data Retention & Lifecycle Policies: Organizations can define automated policies to
manage data storage duration, archival, and deletion based on compliance
requirements.
Geographic Data Residency: Many cloud storage providers allow organizations to
specify where their data is stored to comply with regional data sovereignty laws.
Cloud storage offers a powerful combination of scalability, security, and compliance, making
it an essential component of modern IT infrastructure. Organizations benefit from elastic
storage growth, robust security protections, and regulatory adherence, ensuring their data
remains accessible, protected, and compliant with industry standards. By leveraging cloud
storage solutions, businesses can optimize their data management strategies and focus on
innovation while minimizing operational risks.
With the growing reliance on cloud storage, ensuring data security has become a top priority
for organizations. Cloud data security encompasses encryption, access control, and proactive
threat management to prevent unauthorized access, data breaches, and cyber threats.
Encryption at Rest: Cloud providers use strong encryption algorithms such as AES-
256 to protect stored data from unauthorized access.
Encryption in Transit: Data transmitted over the internet is protected using TLS
(Transport Layer Security) or SSL (Secure Sockets Layer) to prevent interception
by malicious actors.
End-to-End Encryption: Some cloud services offer encryption where only the data
owner holds the decryption key, ensuring maximum security.
Key Management Services (KMS): Cloud providers like AWS, Azure, and Google
Cloud offer managed key storage solutions, ensuring encryption keys are securely
stored and rotated regularly.
Access control ensures that only authorized users and applications can interact with cloud-
stored data.
Identity and Access Management (IAM): Cloud providers offer IAM frameworks to
define user roles, permissions, and authentication requirements.
Role-Based Access Control (RBAC): Organizations can assign permissions based on
user roles, minimizing unnecessary access to sensitive data.
Multi-Factor Authentication (MFA): Adds an extra layer of security by requiring
users to verify their identity through multiple authentication methods (e.g., password
and OTP).
Zero Trust Security Model: Modern cloud security adopts the Zero Trust approach,
which requires verification at every access point instead of assuming internal
network users are automatically trusted.
Proactively detecting threats and responding to security incidents is critical for maintaining
cloud data security.
1. Enable Strong Encryption – Use encryption for both stored and transmitted data.
2. Implement Strict Access Controls – Apply the principle of least privilege (PoLP) to
restrict unnecessary access.
3. Regularly Rotate Security Keys – Ensure encryption keys are updated periodically
to prevent compromise.
4. Monitor and Audit Activity Logs – Set up automated alerts for suspicious activities.
5. Adopt Multi-Factor Authentication (MFA) – Strengthen identity verification for
user access.
6. Secure APIs and Endpoints – Protect data interactions by securing APIs and access
points.
Data security in the cloud is essential for protecting sensitive information from cyber threats
and unauthorized access. By implementing strong encryption, robust access control
mechanisms, and proactive threat detection, organizations can enhance their cloud security
posture. Cloud providers offer a range of security tools and frameworks to help businesses
safeguard their data while maintaining compliance with industry regulations.
In today’s digital landscape, data privacy and security regulations play a crucial role in
safeguarding personal and sensitive information. Organizations that handle personal data
must adhere to compliance requirements such as the General Data Protection Regulation
(GDPR) and the Health Insurance Portability and Accountability Act (HIPAA).
Understanding and implementing these regulations ensures legal compliance, protects
consumer rights, and enhances cybersecurity measures.
The GDPR is a data protection law enacted by the European Union (EU) in May 2018. It
applies to any organization worldwide that processes the personal data of individuals
residing in the EU. GDPR focuses on giving individuals greater control over their data and
requires organizations to implement stringent data protection measures. Key principles of
GDPR include:
Failure to comply with GDPR can result in hefty fines of up to €20 million or 4% of the
company’s annual global revenue, whichever is higher.
1. Privacy Rule – Establishes national standards for the protection of PHI and defines
patients' rights over their health information.
2. Security Rule – Requires the implementation of administrative, physical, and
technical safeguards to protect electronic PHI (ePHI).
3. Breach Notification Rule – Mandates organizations to notify affected individuals
and authorities of data breaches involving PHI.
4. Enforcement Rule – Outlines penalties for non-compliance and investigation
procedures.
5. Omnibus Rule – Expands HIPAA’s requirements to business associates and
strengthens enforcement.
Non-compliance with HIPAA can result in penalties ranging from $100 to $50,000 per
violation, with a maximum annual fine of $1.5 million per provision.
Both GDPR and HIPAA present unique challenges for organizations, including:
Data Mapping and Classification – Identifying and categorizing data to ensure
proper handling.
Cross-Border Data Transfers – Navigating international data transfer restrictions
under GDPR.
Third-Party Risk Management – Ensuring vendors and partners comply with
regulations.
Incident Response Planning – Developing protocols for responding to data
breaches.
By adhering to GDPR and HIPAA, organizations not only avoid legal consequences but also
build trust with customers and stakeholders. Investing in compliance safeguards data
privacy and enhances overall cybersecurity resilience.
1. Storage Type and Class – Different cloud providers offer various storage classes,
such as standard, infrequent access, and archival storage. Choosing the right class
based on data usage patterns can significantly reduce costs.
2. Data Transfer and Egress Fees – Moving data between cloud regions or retrieving
it from cloud storage can incur substantial fees. Minimizing unnecessary data
transfers helps control costs.
5.4.7 Hands-On Project: Storing and Retrieving Data from the Cloud
In this hands-on project, we will walk through the process of storing and retrieving data
using a cloud storage service, such as AWS S3, Google Cloud Storage, or Azure Blob Storage.
This project aims to provide practical experience in managing cloud storage efficiently and
securely.
Objectives:
Learn how to upload, retrieve, and manage data in a cloud storage system.
Understand access control and data security settings.
Prerequisites:
Before starting this project, ensure you have:
1. AWS S3:
o Sign up at AWS Console.
o Navigate to S3 in the AWS Management Console.
o Click Create Bucket, provide a unique name, and choose a region.
2. Google Cloud Storage:
o Sign up at Google Cloud Console.
o Go to Cloud Storage and click Create Bucket.
o Assign a globally unique name and select a storage class.
3. Azure Blob Storage:
o Sign up at Azure Portal.
o Create a Storage Account and select Blob Storage.
o Navigate to Containers and create a new container for storing objects.
AWS S3 Upload:
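A minimal boto3 sketch, assuming AWS credentials are already configured and using a hypothetical bucket name and file:
import boto3

s3 = boto3.client("s3")

bucket_name = "my-demo-bucket"              # hypothetical bucket created earlier
s3.upload_file("local_data.csv",            # local file to upload
               bucket_name,
               "datasets/local_data.csv")   # object key (path) inside the bucket
print("Upload complete")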
AWS S3 Retrieval:
Python SDK:
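A matching boto3 sketch that downloads the same hypothetical object back to a local file:
import boto3

s3 = boto3.client("s3")

bucket_name = "my-demo-bucket"              # hypothetical bucket name
s3.download_file(bucket_name,
                 "datasets/local_data.csv",  # object key inside the bucket
                 "downloaded_data.csv")      # local destination path
print("Download complete")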
Managing access and security ensures data safety in the cloud. Here are some best practices:
AWS S3: Set up IAM roles, bucket policies, and enable encryption.
Google Cloud Storage: Use IAM permissions and signed URLs for controlled access.
Azure Blob Storage: Configure access policies and enable private/public access
restrictions.
Lifecycle policies help manage storage costs and automate data retention.
AWS S3: Set up lifecycle rules to move infrequently accessed data to Glacier.
Google Cloud Storage: Use Object Lifecycle Policies to automatically delete or
transition data.
Azure Blob Storage: Enable blob tiering to move data to cold storage.
By completing this hands-on project, you have gained real-world experience in storing,
retrieving, securing, and optimizing data in cloud storage. These skills are essential for
managing cloud infrastructure efficiently and ensuring cost-effective, secure, and scalable
data management practices. You can further explore cloud automation tools like AWS
Lambda, Google Cloud Functions, or Azure Automation to enhance storage workflows.
5. Performance Optimization
By following these best practices, organizations can achieve efficient cloud data
management, strengthen security, optimize costs, and maintain compliance with industry
standards. Effective data governance and automation further ensure a scalable and resilient
cloud storage infrastructure, minimizing risks and maximizing performance in cloud
operations.
To understand the impact and application of cloud storage in real-world scenarios, let’s
examine case studies from various industries. These case studies illustrate how
organizations use cloud storage to enhance performance, security, and cost efficiency.
Challenge: Netflix needed a scalable and highly available storage solution to manage its vast
content library while ensuring smooth content delivery to millions of users worldwide.
Solution: Netflix adopted Amazon S3 as its primary storage service, utilizing AWS’s global
infrastructure and content delivery networks (CDNs) to store and stream high-resolution
video content efficiently.
Outcome:
Seamless streaming experience with low latency.
Scalable storage that adapts to fluctuating user demands.
Cost optimization through intelligent storage tiering and lifecycle policies.
Challenge: Dropbox initially relied on third-party cloud providers but needed greater
control over its data storage infrastructure to optimize performance and reduce costs.
Solution: Dropbox developed its in-house storage infrastructure called Magic Pocket,
which allows the company to store and manage vast amounts of user data while integrating
with cloud-based redundancy solutions.
Outcome:
Improved storage efficiency and cost savings.
Enhanced data security and privacy controls.
Greater flexibility in handling massive volumes of user-generated files.
Challenge: A leading healthcare provider needed a cloud storage solution that complied
with HIPAA regulations while securely managing patient records and medical imaging.
Solution: The provider implemented Microsoft Azure Blob Storage with built-in encryption,
access controls, and compliance tools to protect sensitive health information.
Outcome:
Enhanced data security and regulatory compliance.
Case Study 4: Financial Sector – Secure Cloud Storage for Banking Data
Challenge: A multinational bank required a secure, scalable, and resilient storage system to
manage transactional data and customer records while meeting strict compliance
requirements.
Solution: The bank utilized Google Cloud Storage with multi-region replication, encryption
at rest, and IAM (Identity and Access Management) policies to ensure secure and compliant
data storage.
Outcome:
High availability and data durability.
Strong access controls to prevent unauthorized data breaches.
Compliance with financial regulations like GDPR and PCI-DSS.
Case Study 5: NASA – Cloud Storage for Research and Space Exploration
Challenge: NASA needed a robust storage solution to handle vast datasets from space
missions, telescopes, and research projects.
Solution: NASA adopted AWS cloud storage to archive massive datasets, leveraging Amazon
S3 and Glacier for long-term data retention and easy access to research teams worldwide.
Outcome:
Efficient data management for large-scale scientific research.
Cost-effective archival solutions with pay-as-you-go pricing.
Increased collaboration among global research institutions.
These case studies demonstrate the versatility and benefits of cloud storage across
industries. Whether for streaming services, healthcare, finance, or scientific research,
organizations leverage cloud storage to improve efficiency, security, scalability, and cost-
effectiveness. As cloud technology evolves, its role in data management will continue to
expand, offering new opportunities for innovation and optimization.
AI and machine learning will play a crucial role in automating storage management,
improving data indexing, and optimizing storage performance.
Predictive analytics will help organizations allocate resources efficiently and reduce
unnecessary storage costs.
AI-powered threat detection will enhance security by identifying and mitigating risks
in real time.
The rise of IoT (Internet of Things) and edge computing will drive the need for
decentralized storage solutions.
Data processing will increasingly occur closer to the source, reducing latency and
improving real-time decision-making.
Hybrid cloud and multi-cloud storage strategies will become more prominent to
balance performance, cost, and compliance.
Cloud providers will continue strengthening encryption and implementing Zero Trust
security models to mitigate cyber threats.
Confidential computing and secure multi-party computation (SMPC) will enable more
secure data processing in shared environments.
More organizations will adopt immutable storage solutions to protect against
ransomware and unauthorized modifications.
With growing concerns about environmental impact, cloud providers will focus on
energy-efficient data centers powered by renewable energy.
Organizations will adopt carbon footprint tracking tools to monitor and optimize
storage-related energy consumption.
Advances in cold storage and tape archival solutions will help reduce energy
consumption for infrequently accessed data.
Projects like IPFS (InterPlanetary File System) and decentralized cloud storage
providers will offer enhanced security and data integrity.
Peer-to-peer storage networks will enable cost-effective and censorship-resistant
data management.
The future of cloud storage will be defined by AI-driven efficiency, decentralized storage
solutions, enhanced security, and sustainable practices. Organizations must stay ahead of
these trends to ensure their data management strategies align with emerging technologies
and industry standards. As cloud storage evolves, it will continue to provide innovative
solutions for businesses and individuals worldwide.
Assessment Criteria
References:
Exercise
8) Which of the following is NOT a key component of the ETL (Extract, Transform, Load)
process?
a. Extraction
b. Transformation
c. Loading
d. Encryption
9) What is the primary purpose of a data warehouse?
a. To store unstructured data
b. To provide a platform for real-time analytics
c. To integrate data from multiple sources for reporting and analysis
d. To manage transactional databases
10) Which of the following technologies is widely used for data visualization on the web?
a. Tableau
b. Excel
c. D3.js
d. MongoDB
2. Interactive dashboards help users explore data more effectively than static charts. (T/F)
3. Data redundancy is a major concern in data storage and should be minimized. (T/F)
4. A distributed file system is commonly used for handling large-scale data storage in big
data applications. (T/F)
7. Data silos help improve data integration and sharing across departments. (T/F)
8. Cloud storage systems are not commonly used in data integration due to security
concerns. (T/F)
9. Structured data is highly organized, typically stored in rows and columns in relational
databases. (T/F)
10. A pie chart is the best visualization tool to show the trend of a variable over time. (T/F)
Q1. Write a Python program that integrates data from multiple JSON and CSV files. The files
contain customer information (name, address, email, and phone number). Merge these data
sources into a single, unified data structure while ensuring that there are no duplicate
entries.
Q2. Given a list of raw data containing various measurements (e.g., temperature, humidity,
pressure), write a function that normalizes the data into a range from 0 to 1 using min-max
scaling. This function should handle missing or invalid data gracefully.
Q3. Write a Python script that integrates data from two different APIs: one providing user
details and the other providing posts made by those users. Merge the data into a single
dataset with each user’s name, email, and the posts they’ve made.
Q4. Write a program that connects to a MySQL or PostgreSQL database, fetches data from
multiple tables (such as customer and order tables), and performs a join operation to
integrate the data into a single dataset.
Q5. Build a simple ETL pipeline using Python. The program should extract data from a CSV
file, transform it by cleaning the data and load the data into a SQL database.
Q6. Write a Python script that reads sales data (CSV or JSON format) containing product
name, sales amount, and date of sale. Use a library like Matplotlib or Seaborn to visualize
the total sales per product over time (e.g., bar chart, line graph).
Q7. Write a Python program that reads data from a CSV file containing the number of items
in different categories (e.g., Electronics, Clothing, Home Appliances). Visualize the
distribution of items across categories using a pie chart.
Q8. Write a Python program that connects to a SQLite database and performs CRUD
(Create, Read, Update, Delete) operations on a table storing product information
(product_id, product_name, price, quantity).
Q9. Implement a basic system for data sharding (splitting data across multiple storage
systems). Write a program that distributes customer records across two databases based
on the customer’s geographic region, ensuring even distribution of the records.
Q10. Write a Python program to implement a simple key-value store that allows you to
insert, retrieve, and delete data. Use a hash table or dictionary to store the data, and ensure
that the operations are efficient.
Chapter 6 :
Data Quality and Governance
1. Accuracy
Definition: Data must be correct and free from errors. It should represent the real-
world scenario it is intended to model.
Example: If customer information is stored, the phone number, email, and address
should be up-to-date and valid.
2. Completeness
Definition: All necessary data should be present. Missing data can affect analysis
and insights.
Example: If a survey is being analyzed, every participant’s responses should be
included in the dataset. Missing answers can skew results.
3. Consistency
Definition: Data should be consistent within itself and across datasets. If the same
data appears in multiple places, it should be identical.
Example: If the same customer is recorded in two systems, their name and address
should match exactly.
4. Timeliness
Definition: Data must be up-to-date and available when needed. Old or outdated data
can lead to incorrect decisions.
Example: For stock market analysis, data must be collected in real-time to be relevant
for trading decisions.
5. Relevance
Definition: Data must be pertinent to the task at hand. Irrelevant data can distract
from key insights.
Example: A retail store analyzing customer purchase habits doesn't need data on
weather patterns unless it's part of the analysis.
6. Uniqueness
Definition: Duplicate entries should be avoided. Each piece of data should
represent a unique entity or event.
Example: A customer database should not have multiple entries for the same
individual unless they are separate records (e.g., in case of multiple purchases).
7. Integrity
Definition: Data should have logical relationships and adhere to predefined rules or
structures, maintaining its validity over time.
Example: In a database for a school, the enrollment year should be a valid number
within a certain range, not something out of context like 1980 for students enrolled
in 2025.
8. Accessibility
Definition: Data should be easily accessible to the right people, with proper
permissions in place.
Example: Company financial reports should be accessible to stakeholders but not to
unauthorized personnel.
9. Traceability
Definition: Data should be traceable to its origin, allowing you to verify where it
came from and how it was processed.
Example: In healthcare, patient records need to have an audit trail to ensure they
are properly handled and any changes are documented.
(Figure: Key aspects of data quality – understanding data quality dimensions, importance of data quality, data quality in AI, tools and techniques for data quality management, and ensuring and maintaining data quality.)
Ensuring high data quality is crucial in all fields, especially in areas like analytics, decision-
making, and regulatory compliance.
(Figure: Practices for ensuring data quality – data cleansing, standardization, automation, and data governance.)
Data quality metrics help assess how well your data meets the standards of accuracy,
consistency, completeness, and other quality attributes. These metrics provide valuable
insights into the health of your data and can help identify areas that need improvement.
1. Completeness
o Definition: Measures the extent to which data is missing or incomplete.
o Metric Example: Percentage of missing values or incomplete records in a
dataset.
2. Consistency
o Definition: Measures whether data is consistent within itself and across
different sources.
o Metric Example: Number of data inconsistencies between records or
between systems.
3. Timeliness
o Definition: Measures whether the data is up-to-date and available when
needed.
o Metric Example: Percentage of data records that are outdated or stale.
4. Uniqueness
o Definition: Measures how often duplicate data appears in the dataset.
o Metric Example: Number of duplicate records within a dataset.
5. Integrity
o Definition: Measures how well data adheres to relationships and business
rules (e.g., valid data types, valid ranges, referential integrity).
o Metric Example: Percentage of records that violate integrity constraints,
such as invalid IDs or incorrect foreign key relationships.
6. Relevance
o Definition: Measures whether the data is useful for the intended analysis or
task.
o Metric Example: Percentage of data that contributes to decision-making
versus irrelevant data.
(Figure: Data quality metrics – accuracy, completeness, consistency, timeliness, uniqueness, and integrity.)
2. Data Comparison
o Method: Compare data across different systems, datasets, or sources to
ensure consistency. For instance, if a customer record is present in multiple
systems, compare the data points (e.g., name, address) to see if they match.
o Tools: Apache Nifi, Informatica, DQLabs.
3. Cross-Validation Techniques
o Method: Split the data into subsets and validate consistency and accuracy by
running them through separate validation processes.
o Tools: Python libraries (e.g., pandas, NumPy) and R packages for statistical
analysis.
4. Outlier Detection
o Method: Use statistical methods or machine learning models to identify data
points that deviate significantly from the rest, which may indicate errors or
inconsistencies.
o Tools: Anaconda, DataRobot, RapidMiner.
5. Rule-based Checks
o Method: Implement business rules that define the conditions for data
accuracy (e.g., "Age cannot be negative," "Date of birth must be in the past").
o Tools: Talend, Oracle Data Quality, SAS Data Quality.
2. Trifacta
o Description: Trifacta focuses on data wrangling and profiling, allowing you
to clean, enrich, and validate data. It provides tools for checking data
completeness and consistency and detecting anomalies.
o Features: Automated data profiling, data transformation, and advanced
anomaly detection.
4. Apache Nifi
o Description: Apache Nifi is an open-source data integration tool that can
help in automating data flow, including validation, monitoring, and profiling.
o Features: Real-time data monitoring, validation workflows, and data routing.
6. DataRobot
o Description: DataRobot offers automated machine learning and data
validation tools that allow you to profile and clean data before model
building, ensuring accurate insights.
o Features: Data quality checks, profiling, and automated machine learning
pipelines.
8. OpenRefine
o Description: OpenRefine is an open-source tool for working with messy
data. It provides features for cleaning and transforming data, including
deduplication and validation.
o Features: Data clustering, data transformation, and schema validation.
9. Ataccama
o Description: Ataccama offers a suite of data quality tools for data profiling,
cleansing, and monitoring. It can help ensure data consistency across
systems.
o Features: Automated data quality assessments, monitoring, and data
profiling.
Example:
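As an illustration, a minimal pandas sketch of the kinds of checks described above (completeness, uniqueness, and a rule-based integrity check) on hypothetical customer records; its printed result appears under Output below:
import pandas as pd

# Hypothetical customer records with common quality problems
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "c@example"],
    "age": [34, -5, 41, 29],
})

# Completeness: missing values per column
print("Missing values per column:")
print(df.isna().sum())

# Uniqueness: duplicate customer IDs
print("Duplicate customer_id rows:", df["customer_id"].duplicated().sum())

# Integrity: rule-based check ("age cannot be negative")
print("Rows violating age >= 0:", (df["age"] < 0).sum())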
Output:
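Missing values per column:
customer_id    0
email          1
age            0
dtype: int64
Duplicate customer_id rows: 1
Rows violating age >= 0: 1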
Data Integrity refers to the accuracy, consistency, and reliability of data stored in a
database. It ensures that the data is accurate, consistent, and safeguarded against any type
of corruption. Ensuring data integrity is crucial because data is often the foundation for
decision-making in organizations, businesses, and systems.
(Figure: Importance of data integrity in decision-making – accurate decision-making, trustworthiness of data, legal and compliance requirements, and operational efficiency.)
Example:
Data Validation:
Input Validation: Validating data as it enters the database ensures that only
accurate and relevant data is accepted. This can be done via forms, data entry
applications, or triggers within the database.
Application-Level Validation: Before inserting or updating data, applications
should ensure it meets all the necessary conditions and is consistent with the
system requirements.
Example:
Ensuring that a DateOfBirth field contains only dates in the past, or that an Email
field follows a valid email format.
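A minimal sketch of such application-level validation; the field names and the simplified email pattern are illustrative, not a full validator:
from datetime import date
import re

def validate_record(record):
    """Apply simple input-validation rules before accepting a record."""
    errors = []

    # DateOfBirth must be a date in the past
    if record["DateOfBirth"] >= date.today():
        errors.append("DateOfBirth must be in the past")

    # Email must match a basic pattern
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", record["Email"]):
        errors.append("Email format is invalid")

    return errors

record = {"DateOfBirth": date(2030, 1, 1), "Email": "user@example"}
print(validate_record(record))
# ['DateOfBirth must be in the past', 'Email format is invalid']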
Database Triggers:
Triggers are automatic actions that are triggered by specific changes to the
database. These can be used to maintain data integrity by performing checks before
data is inserted, updated, or deleted.
For example, a trigger might automatically check if a change to an employee's
department is valid before committing it to the database.
Example:
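A minimal SQLite sketch of such a trigger, run from Python with a hypothetical employees table:
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, department TEXT)")
cur.execute("INSERT INTO employees VALUES (1, 'Asha', 'Sales')")

# Trigger: reject updates that move an employee into an unknown department
cur.execute("""
CREATE TRIGGER check_department
BEFORE UPDATE OF department ON employees
FOR EACH ROW
WHEN NEW.department NOT IN ('Sales', 'HR', 'IT')
BEGIN
    SELECT RAISE(ABORT, 'Invalid department');
END;
""")

try:
    cur.execute("UPDATE employees SET department = 'Space Travel' WHERE emp_id = 1")
except sqlite3.DatabaseError as e:
    print("Update rejected:", e)   # the trigger blocks the invalid change

conn.close()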
Referential Integrity:
Ensuring that relationships between tables remain intact. For example, if an
employee record is deleted, you may want to ensure that their corresponding
department ID is updated or deleted as well.
This is often enforced using Foreign Keys.
Database Auditing:
Maintaining logs that record changes made to the database. These logs are essential
for tracking changes, identifying errors, and verifying the integrity of the data.
Example: Audit logs that track who updated a record, what change was made, and
when the change occurred.
Orders table:
Constraints:
Constraints are used to enforce data integrity within the database. They ensure that
data meets specific rules or conditions, preventing incorrect or inconsistent data
from being entered.
Types of Constraints:
o Primary Key: Ensures uniqueness and identifies each record in the table.
o Foreign Key: Ensures that relationships between tables remain valid.
o Check Constraint: Ensures that a column’s values meet specific conditions.
o Not Null: Ensures that a column cannot have null values, maintaining
completeness.
Example:
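A minimal SQLite sketch showing these constraints in action, with hypothetical customers and orders tables:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

conn.execute("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- uniquely identifies each record
    name        TEXT NOT NULL          -- completeness: a name is required
)""")

conn.execute("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    amount      REAL CHECK (amount > 0)                              -- check constraint
)""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (10, 1, 250.0)")        # valid row

try:
    conn.execute("INSERT INTO orders VALUES (11, 99, 100.0)")   # unknown customer
except sqlite3.IntegrityError as e:
    print("Rejected:", e)

try:
    conn.execute("INSERT INTO orders VALUES (12, 1, -5.0)")     # violates CHECK
except sqlite3.IntegrityError as e:
    print("Rejected:", e)

conn.close()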
Importance of Data Integrity: Ensuring data integrity is crucial for making reliable,
accurate, and legally compliant decisions. It enhances trust, reduces errors, and ensures
operational efficiency.
Techniques for Maintaining Integrity:
Constraints such as Primary Key, Foreign Key, and Check constraints ensure that
data adheres to predefined rules.
Normalization helps organize data and reduces redundancy, improving accuracy
and consistency.
Triggers can enforce data rules before changes are made.
Auditing and Backup help track changes and protect against data loss.
Role of Normalization and Constraints:
Normalization minimizes redundancy and ensures the logical organization of data.
Constraints enforce rules that guarantee data integrity, consistency, and accuracy
across the database.
Deduplication Techniques
Deduplication refers to the process of identifying and removing duplicate records from a
dataset. Duplicate data can distort analysis, cause inefficiencies, and reduce data quality.
Deduplication Techniques:
1. Exact Match Deduplication:
o Definition: Identifies and removes duplicate records that are exactly the
same across all or selected fields.
o Method: Compare records by matching all fields (or a subset of fields) for exact equality, then remove one of the duplicate records (see the sketch after this list).
Challenges:
o May miss cases where the duplicates are not exact (e.g., if there’s a slight
difference in spelling or formatting).
2. Fuzzy Matching Deduplication:
o Definition: Identifies duplicates that are not exactly the same but are likely
to be the same due to typographical errors, name variations, or other
inconsistencies.
o Method: Use fuzzy matching algorithms (such as Levenshtein distance, Jaro-
Winkler, or cosine similarity) to find similar records and flag them for review
or merging.
Challenges:
o Fuzzy matching can sometimes generate false positives or negatives.
3. Clustering-Based Deduplication:
o Definition: Uses machine learning techniques (e.g., clustering algorithms
such as k-means or DBSCAN) to group similar records and identify
duplicates.
o Method: Cluster records based on their similarity and remove duplicates
within each cluster.
Challenges:
o Requires careful tuning of similarity measures and parameters for accurate
results.
4. Rule-Based Deduplication:
o Definition: Uses predefined business rules to identify duplicate records.
o Method: For example, a rule might state that records with the same name
and email address are duplicates, even if they differ slightly in other fields.
Challenges:
o Rules must be carefully designed to balance strictness and flexibility in
identifying duplicates.
5. Probabilistic Deduplication:
o Definition: Uses statistical models to determine the probability that two
records represent the same entity.
o Method: Compare fields across records and assign a probability score based
on the likelihood of duplication. Records with a high score are flagged as
duplicates.
Challenges:
o Requires a well-trained model to make accurate predictions and might
involve complex calculations.
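The sketch below illustrates exact-match deduplication with pandas and a simple fuzzy comparison using the standard library's difflib; the sample records and the 0.85 similarity threshold are arbitrary choices for illustration.

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({
    "name":  ["John Doe", "John Doe", "Jon Doe"],
    "email": ["john@example.com", "john@example.com", "john@example.com"],
})

# 1. Exact-match deduplication: drop rows identical on the selected fields.
exact = df.drop_duplicates(subset=["name", "email"])

# 2. Fuzzy matching: flag remaining pairs whose names are very similar.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = exact["name"].tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = similarity(names[i], names[j])
        if score >= 0.85:
            print(f"Possible duplicate: {names[i]!r} vs {names[j]!r} (score {score:.2f})")

Fuzzy candidates would normally be reviewed or merged rather than deleted automatically, because of the false positives mentioned above.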
Data Transformation
Common Data Transformation Techniques:
1. Normalization and Scaling:
o Standardizes numerical data to fit within a particular range (e.g., [0, 1]) or adjusts the scale (e.g., z-scores, min-max scaling); a pandas sketch of these transformations follows this list.
o This is especially important for machine learning models that are sensitive to
the scale of features.
2. Pivoting and Unpivoting:
o Pivoting: Converts rows into columns to create a more aggregated or
summarized view of the data.
o Unpivoting: Converts columns back into rows to normalize data for easier
processing.
3. Aggregation:
o Summarizes or consolidates data by combining multiple rows into a single
value (e.g., sum, average, max, count).
o Useful for reporting or creating higher-level summaries from detailed data.
4. Filtering and Subsetting:
o Involves selecting relevant data from a large dataset by applying conditions
(e.g., selecting records that meet certain criteria like date ranges, specific
categories, etc.).
5. Type Conversion:
o Converts data types to ensure consistency (e.g., converting strings to
datetime objects or numeric values).
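A minimal pandas sketch of several of these transformations; the sales data below is invented for illustration.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "revenue": ["100", "150", "80", "120"],   # deliberately stored as strings
})

# 5. Type conversion: strings to numeric values.
sales["revenue"] = pd.to_numeric(sales["revenue"])

# 1. Normalization: min-max scaling of revenue into the [0, 1] range.
rev = sales["revenue"]
sales["revenue_scaled"] = (rev - rev.min()) / (rev.max() - rev.min())

# 2. Pivoting turns months into columns; unpivoting (melt) reverses it.
wide = sales.pivot(index="region", columns="month", values="revenue")
long = wide.reset_index().melt(id_vars="region", var_name="month", value_name="revenue")

# 3. Aggregation and 4. filtering.
totals = sales.groupby("region")["revenue"].sum()
north_only = sales[sales["region"] == "North"]
print(totals)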
Data Enrichment
Data enrichment is the process of adding new, external data to existing datasets to improve
their quality, completeness, and relevance.
Common Data Enrichment Techniques:
1. Geospatial Enrichment:
o Adding location-based data to enrich records with geospatial attributes such
as latitude, longitude, or addresses. For example, enriching customer data
with city, state, or country information based on postal codes.
2. Third-Party Data Integration:
o Enhancing data with external datasets, such as demographic information,
market trends, or financial data. For instance, adding economic indicators or
social media metrics to customer records.
3. Data Synthesis:
o Deriving new insights or features from existing data through calculations or
algorithms. For example, calculating a customer’s lifetime value (LTV) or
predicting future behavior using historical data.
4. Data Merging and Joining:
o Combining data from different sources or tables to enrich a dataset with
complementary information. For instance, merging customer data with
transaction data to get a complete view of customer activity.
5. Text Enrichment:
o Enhancing textual data by extracting key information, sentiments, or topics
using Natural Language Processing (NLP) techniques. This could involve
tagging text with keywords, sentiment scores, or entities (e.g., names,
locations, dates).
6. Categorization:
o Assigning records to predefined categories or segments based on shared attributes (e.g., grouping customers by age band, region, or purchase behavior). A pandas sketch of merge-based enrichment (technique 4) follows this list.
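A minimal pandas sketch of merge-based enrichment, combined with a simple geospatial lookup; all three tables are invented for illustration.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name":        ["Asha", "Ravi"],
    "postal_code": ["560001", "400001"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount":      [250.0, 120.0, 75.0],
})
# Hypothetical lookup table used for geospatial enrichment.
regions = pd.DataFrame({
    "postal_code": ["560001", "400001"],
    "city":        ["Bengaluru", "Mumbai"],
})

# Enrich customer records with total spend and with city information.
spend = transactions.groupby("customer_id", as_index=False)["amount"].sum()
enriched = (customers
            .merge(spend, on="customer_id", how="left")
            .merge(regions, on="postal_code", how="left"))
print(enriched)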
o After feedback, the system evolves by adding more checks, refining existing
ones, and reducing manual corrections over time.
o Example: Implementing machine learning-based models to automatically
predict missing values with greater accuracy.
4. Text Enrichment and Normalization:
o NLP models can clean and normalize textual data, standardizing formats and extracting valuable information.
o Techniques:
Named Entity Recognition (NER) models to identify and normalize
entities (e.g., cities, dates, product names) in free-text fields.
Sentiment analysis to determine the sentiment of user-generated
content or feedback.
Use Case:
o Standardizing product reviews or customer feedback into predefined
categories (e.g., "positive," "neutral," "negative").
5. Pattern Recognition for Consistency Checks:
o Machine learning models can be used to identify patterns or relationships in
the data, helping to identify inconsistencies across large datasets.
o Techniques:
Supervised models can predict expected patterns or relationships
between data columns.
Unsupervised models can uncover hidden structures and
inconsistencies that were previously unnoticeable.
Use Case:
o Detecting discrepancies between customer order data and inventory records
to prevent inconsistencies in stock management.
6. Predictive Data Quality:
o Machine learning can forecast potential data quality issues before they
become critical, based on historical patterns.
o Techniques:
Time-series forecasting models can predict when a data set is likely to
go out of sync or deviate from quality standards.
Classification models can predict the likelihood of data errors based
on incoming data.
Use Case:
o Predicting when sensor data from equipment might degrade, allowing for
preventive maintenance actions to avoid data quality issues.
In summary, machine learning supports data quality through anomaly detection, data deduplication, data imputation, text enrichment and normalization, pattern recognition for consistency checks, and predictive data quality.
Example:
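A minimal sketch of ML-assisted anomaly detection, using scikit-learn's IsolationForest; the transaction amounts are made up, and the scikit-learn package is assumed to be installed.

import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly ordinary transaction amounts, plus two suspicious outliers.
amounts = np.array([[25.0], [30.0], [28.5], [27.0], [31.0], [26.5], [950.0], [0.01]])

model = IsolationForest(contamination=0.25, random_state=42)
labels = model.fit_predict(amounts)        # -1 = anomaly, 1 = normal

for value, label in zip(amounts.ravel(), labels):
    if label == -1:
        print("Flagged as possible anomaly:", value)

With this contamination setting the model flags roughly a quarter of the records, which here should correspond to the two outlying amounts; in practice the threshold is tuned and flagged records are reviewed before any correction is applied.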
Data Governance refers to the overall management of the availability, usability, integrity,
and security of data used in an organization. It includes the processes, policies, standards,
and technologies that ensure data is well-managed, trusted, and used properly throughout
its lifecycle. Data governance helps organizations ensure that their data is accurate,
consistent, secure, and used in compliance with relevant laws and regulations.
Effective data governance enables organizations to extract maximum value from their data
while mitigating risks related to privacy, security, and compliance.
1. Data Quality:
o Data governance ensures that data is accurate, consistent, and of high quality.
It establishes rules and standards for data entry, storage, and retrieval, which
reduces errors and inconsistencies.
2. Compliance and Legal Requirements:
o With increasing data privacy regulations (e.g., GDPR, CCPA), organizations
must ensure they follow legal requirements when collecting, storing, and
processing data. Data governance ensures adherence to these rules, reducing
the risk of legal penalties.
3. Risk Management:
o Data governance helps mitigate risks associated with data security breaches,
misuse, or loss. By enforcing access controls, data security protocols, and
policies for data handling, organizations can reduce exposure to data-related
risks.
4. Improved Decision-Making:
o With consistent, high-quality data governed by clear policies, organizations
can make better, more informed decisions. It eliminates ambiguity and
reduces the chances of errors in analysis or reporting.
5. Data Access and Availability:
o Well-defined data governance frameworks ensure the right people have
access to the right data at the right time, enhancing operational efficiency. It
also ensures that data is available when needed for reporting or analytics.
6. Operational Efficiency:
o Effective governance reduces redundancy, improves data sharing, and
enhances collaboration across departments. It eliminates data silos and
facilitates better coordination of data assets across the organization.
7. Data Stewardship and Accountability:
o Data governance establishes clear roles and responsibilities for data
stewardship. It ensures that individuals or teams are accountable for
managing data quality, security, and compliance, leading to better
governance practices.
3. Data Ownership and Stewardship
Establishing clear ownership and stewardship of data ensures that individuals are
accountable for maintaining data quality, compliance, and security. Data owners are
responsible for making key decisions about the data, while data stewards are
responsible for the day-to-day management.
Example: Assigning a data steward to monitor the quality of customer data and
make corrections as needed.
4. Data Security and Privacy
Data security frameworks define measures to protect data from unauthorized
access, use, or corruption. This includes data encryption, access control,
authentication, and compliance with data privacy laws (e.g., GDPR, CCPA).
Example: Implementing role-based access control (RBAC) to restrict access to
sensitive data only to authorized users.
5. Data Quality Management
This component focuses on ensuring that the data is accurate, consistent, complete,
and timely. It involves continuous data profiling, data cleansing, and validation to
maintain data quality.
Example: Running regular data quality checks and automatically flagging or
correcting incomplete or inconsistent data entries.
6. Data Governance Processes
These are the procedures used to manage the flow of data through the organization.
This includes data collection, transformation, storage, and disposal. Proper
governance processes ensure that the data is handled appropriately at every stage
of its lifecycle.
Example: Defining a process for archiving old data, ensuring that outdated
customer records are safely archived and deleted according to compliance policies.
7. Tools and Technologies
The tools and technologies used to support data governance include data
management platforms, data quality tools, metadata management systems, and
compliance software. These tools automate and enforce governance policies,
enabling efficient data management.
Example: Tools like Informatica, Collibra, or Alation for data cataloging, data
quality monitoring, and workflow automation.
8. Data Governance Framework Implementation and Monitoring
A framework must include regular monitoring and auditing to ensure compliance
with governance policies and the effectiveness of governance practices. Establishing
metrics and KPIs to measure the success of data governance initiatives is also
essential.
Example: Implementing regular audits of data access logs to ensure that sensitive
data is only accessed by authorized personnel.
Data Lineage and Data Cataloging are essential components of effective data governance.
They provide transparency into the data's journey through an organization and ensure that
data assets are well-documented, easily accessible, and properly managed. These practices
play a critical role in maintaining data quality, traceability, and compliance.
1. Apache Atlas:
o An open-source metadata management and governance framework that
provides data lineage visualization and tracking capabilities. It allows
organizations to model, govern, and track the flow of data across multiple
systems and applications.
o Key Features: Data classification, lineage tracking, and metadata
governance.
2. Collibra:
o A popular data governance and data management platform that includes data
lineage tracking. It provides a comprehensive view of data flows, helping
organizations monitor the origins and usage of data across complex systems.
o Key Features: Data cataloging, data governance, and automated lineage
tracking for compliance and analytics.
3. Alation:
o Alation is a data cataloging platform that includes data lineage visualization.
It helps users understand the data flow and transformations across
databases, data lakes, and other data storage systems.
o Key Features: Data cataloging, lineage tracking, and collaborative data
usage.
4. Talend:
o Description: Talend's data integration and data governance tools can capture both static and dynamic lineage, ensuring that data sources and their transformations are always traceable.
5. Cloud-Based Lineage Tracking:
o For organizations using cloud services (e.g., AWS, Azure, Google Cloud),
cloud-native tools can track data flow across cloud services, data lakes, and
databases. These tools can automatically track how data is processed and
transformed in cloud environments.
Data Cataloging refers to the process of organizing and classifying data across an
organization. It provides a comprehensive inventory of data assets, including metadata,
data definitions, and their relationships. Data catalogs facilitate easier access, discovery,
and governance of data.
Key benefits of data cataloging include making data assets easier to discover, understand, trust, and use across the organization.
Data security and privacy are fundamental aspects of managing data in any organization.
They are critical to protecting sensitive information from unauthorized access and
ensuring that data is handled in compliance with privacy regulations.
In this section, we'll explore the key security principles, as well as techniques for
ensuring data privacy and security, such as data encryption, masking, and
anonymization.
Availability, for example, can be supported by using cloud services with high uptime guarantees so that data is accessible whenever it is needed.
1. Data Encryption
Data encryption is the process of converting readable data (plaintext) into an unreadable
format (ciphertext) to prevent unauthorized access. The data can only be decrypted and
read by someone who possesses the decryption key.
Encryption at Rest: Protects stored data (e.g., in databases, data warehouses, or
cloud storage).
Encryption in Transit: Protects data being transmitted across networks, such as
when sending sensitive information via email or over the internet.
Examples of Data Encryption:
AES (Advanced Encryption Standard) is commonly used to encrypt files, disks,
and databases.
TLS (Transport Layer Security) is used to secure communications over the
internet (e.g., HTTPS for secure web browsing).
Steps for Implementing Encryption:
1. Select an encryption standard: AES, RSA, etc.
2. Generate encryption keys: Ensure that keys are stored securely and only
accessible to authorized users.
3. Encrypt data: Apply the encryption algorithm to protect sensitive information.
4. Ensure key management: Store and manage keys properly to avoid data breaches.
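A minimal sketch of symmetric encryption in Python using the third-party cryptography package (which must be installed separately); it illustrates the encrypt/decrypt cycle rather than a production key-management setup.

from cryptography.fernet import Fernet

# Steps 1-2: choose a standard and generate a key (Fernet uses AES under the hood).
key = Fernet.generate_key()               # store this securely, e.g. in a key vault
cipher = Fernet(key)

# Step 3: encrypt sensitive data before storing or transmitting it.
token = cipher.encrypt(b"4111-1111-1111-1111")
print(token)                              # unreadable ciphertext

# Step 4: only holders of the key can decrypt the data.
print(cipher.decrypt(token).decode())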
2. Data Masking
Data masking involves creating a version of the data that looks and behaves like the
original data but has sensitive information replaced with fake or scrambled data. This
ensures that sensitive information can be used in non-production environments (like
testing) without exposing real data.
Static Data Masking: Data is replaced in a stored environment, such as creating a
masked version of a production database.
Dynamic Data Masking: Data is masked in real-time as it is accessed by users or
applications, ensuring sensitive data is only shown when appropriate.
Examples of Data Masking:
Replacing credit card numbers like 4111-1111-1111-1111 with XXXX-XXXX-XXXX-
1111 in non-production environments.
Masking social security numbers by displaying only the last four digits: XXX-XX-
1234.
Steps for Implementing Data Masking:
1. Identify sensitive data: Determine which data fields require masking (e.g., credit
card numbers, personal identification numbers).
2. Choose a masking method: Decide between static or dynamic masking based on
the use case.
3. Implement masking tools: Use data masking software or scripts to automatically
mask data.
4. Test the masked data: Ensure the masked data preserves functionality but does
not expose sensitive information.
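A minimal sketch of static masking for the two examples above, using only the standard library; real masking tools apply such rules consistently across whole databases.

import re

def mask_credit_card(number):
    """Keep only the last four digits of a credit card number."""
    digits = re.sub(r"\D", "", number)
    return "XXXX-XXXX-XXXX-" + digits[-4:]

def mask_ssn(ssn):
    """Show only the last four digits of a social security number."""
    return "XXX-XX-" + ssn[-4:]

print(mask_credit_card("4111-1111-1111-1111"))   # XXXX-XXXX-XXXX-1111
print(mask_ssn("123-45-1234"))                   # XXX-XX-1234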
3. Data Anonymization
Data anonymization is the process of removing or modifying personally identifiable
information (PII) from data, making it impossible to trace the data back to an individual or
entity. Unlike masking, anonymization is usually irreversible, meaning the original data
cannot be recovered.
K-anonymity: Ensures that each individual record cannot be distinguished from at
least k-1 other records in the dataset.
Differential Privacy: Adds noise to the data in a way that preserves privacy while
still allowing for useful statistical analysis.
Examples of Data Anonymization:
Anonymizing a dataset containing customer names and addresses by removing the
name and replacing the address with a generalized region (e.g., city or postal code).
Applying differential privacy to a dataset of health records so that individual
patients' data cannot be re-identified when the data is used for research.
Steps for Implementing Data Anonymization:
1. Determine the level of anonymization: Decide how much information needs to be
anonymized based on the data's sensitivity.
2. Choose an anonymization method: For example, replace PII with pseudonyms or
aggregate data to generalize it.
3. Test the anonymized data: Verify that the anonymized data is still useful for
analysis and cannot be traced back to individuals.
4. Monitor for re-identification risks: Regularly review anonymization techniques to
ensure they prevent re-identification of individuals.
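A minimal sketch of anonymization by removing direct identifiers and generalizing quasi-identifiers with pandas; the records are invented, and a real k-anonymity assessment requires more rigorous checks than shown here.

import pandas as pd

people = pd.DataFrame({
    "name":        ["Asha Rao", "Ravi Kumar", "Meera Shah"],
    "postal_code": ["560001", "560034", "400001"],
    "age":         [34, 37, 52],
    "diagnosis":   ["A", "B", "A"],
})

anonymized = people.drop(columns=["name"])                              # remove direct identifiers
anonymized["postal_code"] = anonymized["postal_code"].str[:3] + "XXX"   # generalize to a region
anonymized["age_band"] = pd.cut(anonymized["age"], bins=[0, 40, 60, 120],
                                labels=["<=40", "41-60", ">60"])        # generalize exact ages
anonymized = anonymized.drop(columns=["age"])
print(anonymized)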
Data retention: implement and enforce data retention policies to ensure data is not kept longer than necessary. When data is no longer required, ensure it is properly deleted or anonymized.
Data accessibility and management are crucial for ensuring that authorized users can
access the data they need, while also protecting sensitive information and ensuring
compliance with security regulations. Effective data governance frameworks incorporate
proper access control mechanisms, permission management strategies, and the
consideration of deployment options (cloud vs. on-premise).
In this section, we will cover the following topics:
1. Role-based access control (RBAC) and permission management
2. Strategies for balancing accessibility with security
3. Cloud vs. on-premise data governance considerations
Role-Based Access Control (RBAC) grants permissions to roles rather than to individual users, and users inherit the permissions of the roles they are assigned. Typical roles might include:
Marketing Analyst: Can read product data and generate reports but cannot modify
product information.
Employee: Can only view their own profile and sales data.
Example:
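A minimal sketch of RBAC-style permission checks in Python; the roles and permissions shown are illustrative only.

# Map each role to the actions it may perform on each resource.
ROLE_PERMISSIONS = {
    "marketing_analyst": {"product_data": {"read"}, "reports": {"read", "create"}},
    "employee":          {"own_profile": {"read"}, "own_sales": {"read"}},
}

def is_allowed(role, resource, action):
    """Return True if the role may perform the action on the resource."""
    return action in ROLE_PERMISSIONS.get(role, {}).get(resource, set())

print(is_allowed("marketing_analyst", "product_data", "read"))    # True
print(is_allowed("marketing_analyst", "product_data", "update"))  # False

In a real system the role assignments and permissions live in the database or an identity provider rather than in code, but the lookup logic is essentially the same.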
2. Segmentation of Data:
o Divide your data into tiers or categories based on its sensitivity. For
example:
Public Data: Accessible by all employees.
Internal Data: Accessible by specific teams or departments (e.g.,
marketing or finance).
Confidential Data: Restricted to top-level management or a small
group of authorized personnel.
3. Context-Aware Access:
o Leverage context-aware access controls that adjust permissions based on
factors such as:
User location: Only allow access to sensitive data when the user is on
a secure corporate network or a trusted location.
Time of access: Restrict access to sensitive data during off-hours or
weekends to reduce the risk of unauthorized access.
4. Audit Trails and Monitoring:
o Regularly monitor who is accessing data and how. Use audit trails to log
every data access event (e.g., user name, action taken, time of access). This
ensures that data access is being used appropriately and can help detect
suspicious activity.
o Set up automated alerts for unusual access patterns (e.g., accessing sensitive
data from a non-employee IP address).
5. Multi-Factor Authentication (MFA):
o Require multi-factor authentication for accessing sensitive data to add an
additional layer of security. This ensures that even if a password is
compromised, unauthorized users cannot access the data without providing
another factor of authentication (e.g., a temporary code sent to a mobile
phone).
Compliance with legal and regulatory frameworks is a critical aspect of data governance.
Organizations must navigate a complex landscape of laws and regulations that govern how
data is collected, stored, processed, and shared. Non-compliance can lead to legal
repercussions, financial penalties, and damage to an organization’s reputation.
In this section, we will discuss:
1. Understanding legal and regulatory frameworks
2. Strategies for maintaining compliance
3. Audit and reporting mechanisms
o Ensure that third-party vendors and partners comply with data protection
standards. This can be achieved through due diligence (e.g., auditing
vendors' data practices) and having data processing agreements (DPAs) in
place.
6. Data Retention and Deletion Policies:
o Establish data retention policies that specify how long data will be kept and
when it will be deleted. This is especially important for compliance with
regulations like GDPR, which mandates that personal data should not be
kept longer than necessary.
o Regularly audit and remove obsolete or non-compliant data from your
systems.
7. Training and Awareness:
o Regularly train employees on the importance of data privacy and security.
This can include mandatory training on the organization’s data handling
practices, as well as specific regulations like GDPR, HIPAA, and CCPA.
o Employees should be aware of the potential consequences of data breaches
and non-compliance.
8. Incident Response Plan:
o Develop and implement an incident response plan that includes clear steps
for responding to data breaches or privacy incidents.
o Ensure that your plan complies with notification requirements under
regulations like GDPR, which mandates notifying authorities and affected
individuals within a specific time frame.
Data linking is used to bring together information from different sources in order to create
a new, richer dataset.
This involves identifying and combining information from corresponding records on each
of the different source datasets. The records in the resulting linked dataset contain some
data from each of the source datasets.
Most linking techniques combine records from different datasets if they refer to the same
entity. (An entity may be a person, organisation, household or even a geographic region.)
However, some linking techniques combine records that refer to a similar, but not
necessarily the same, person or organisation – this is called statistical linking. For
simplicity, this series does not cover statistical linking, but rather focuses on deterministic
and probabilistic linking.
Key Terms
o Confidentiality – the legal and ethical obligation to maintain and protect the
privacy and secrecy of the person, business, or organisation that provided their
information.
o Data linking – creating links between records from different sources based on
common features present in those sources. Also known as ‘data linkage’ or ‘data
matching’, data are combined at the unit record or micro level.
o Deterministic (exact) linking – using a unique identifier to link records that refer to the same entity (a small sketch follows this list of key terms).
o Identifier – for the purpose of data linking, an identifier is information that
establishes the identity of an individual or organisation. For example, for individuals
it is often name and address. Also see Unique identifier.
o Source dataset – the original dataset as received by the data provider.
o Unique identifier – a number or code that uniquely identifies a person, business or
organisation, such as passport number or Australian Business Number (ABN).
o Unit record level linking – linking at the unit record level involves information
from one entity (individual or organisation) being linked with a different set of
information for the same person (or organisation), or with information on an
individual (or organisation) with the same characteristics. Micro level includes
spatial area data linking.
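A minimal sketch of deterministic (exact) linking on a unique identifier using pandas; the datasets and the ABN-style identifier values are made up.

import pandas as pd

businesses = pd.DataFrame({
    "abn":  ["11111111111", "22222222222"],
    "name": ["Acme Pty Ltd", "Globex Pty Ltd"],
})
tax_records = pd.DataFrame({
    "abn":      ["11111111111", "33333333333"],
    "turnover": [1_200_000, 450_000],
})

# Records from the two source datasets are combined only when their unique identifier matches.
linked = businesses.merge(tax_records, on="abn", how="inner")
print(linked)   # only the ABN present in both datasets is linked

Probabilistic linking follows the same idea but compares several partially identifying fields and keeps the pairs whose overall similarity score passes a threshold.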
Benefits of data lineage include:
Trust & Transparency: know exactly where data came from and how it was modified.
Root Cause Analysis: quickly trace errors back to the source.
Impact Analysis: understand what downstream systems or reports are affected by a change.
Regulatory Compliance: prove data handling and flow meet standards (e.g., GDPR, HIPAA).
Collaboration: business users and data teams can speak a common language around data assets.
Types of data lineage:
1. Business Lineage
o High-level overview: source to report.
o Used by business analysts, compliance officers.
2. Technical Lineage
o Includes code logic, joins, transformations, and scripts.
o Essential for developers, engineers, and auditors.
3. Operational Lineage
o Includes data movement timing, scheduling, and frequency.
o Supports monitoring and performance tuning.
Metadata: who created it, when it last ran, and versioning.
What is Data Cataloging?
Data cataloging is the process of organizing, indexing, and managing metadata (information
about data) so that your data assets are easy to discover, understand, trust, and use.
It’s like creating a searchable inventory for all your data — structured, unstructured, and
semi-structured — across your entire organization.
AI features commonly found in modern data catalogs include:
Auto-tagging: detects PII, addresses, and emails in columns automatically.
Semantic Search: understands user intent and ranks relevant datasets.
Data Similarity Detection: identifies duplicate or redundant datasets.
Usage-based Recommendations: suggests popular or trusted data sources.
Schema Change Detection: alerts users about schema drift or column changes.
Real-World Example
Problem: An analyst is spending hours trying to find a trusted dataset for monthly sales
reporting.
Assessment Criteria
Exercise
Multiple Choice Questions:
1. What is the primary goal of data quality management?
a. To increase the volume of data stored.
b. To ensure data is fit for its intended use.
c. To reduce the cost of data storage.
d. To accelerate data processing speed.
State whether the following statements are True or False:
1. Data quality primarily focuses on the speed at which data is processed. True or False?
2. Data governance involves establishing policies for data usage. True or False?
3. Data consistency means data is stored in only one location. True or False?
4. Data cleansing is the process of removing or correcting inaccurate data. True or False?
6. "Data lineage" refers to the current physical location of data storage. True or False?
7. Good data governance helps organizations comply with regulations. True or False?
8. Poor data quality can lead to incorrect business decisions. True or False?
9. Data quality metrics are unnecessary for effective data governance. True or False?
Chapter 7
Advanced Data Management Techniques
Data management is the systematic process of collecting, storing, processing, and securing
data to ensure its accuracy, accessibility, and usability. It encompasses various disciplines,
including data governance, data integration, data security, data storage, and data
analytics.
Effective data management ensures that organizations can make data-driven decisions,
enhance operational efficiency, and comply with industry regulations while
maintaining data quality and security.
Modern data management goes beyond simple storage and retrieval. It involves:
1. Data Collection – Gathering data from multiple sources such as databases, APIs, IoT
devices, and social media.
5. Data Processing & Analytics – Transforming raw data into actionable insights
using Big Data, AI, and Business Intelligence (BI) tools.
The core components of data management are data collection, data storage, data integration, data processing and analytics, data governance, and data security.
Effective data management is crucial for achieving business success, boosting operational
efficiency, and ensuring compliance with regulations. Companies that adopt robust data
management practices enjoy better decision-making, increased security, and more efficient
data workflows.
Governments and regulatory bodies enforce strict data protection laws such as:
o General Data Protection Regulation (GDPR) – Governs data privacy in
Europe.
o Health Insurance Portability and Accountability Act (HIPAA) – Protects
patient data in healthcare.
o California Consumer Privacy Act (CCPA) – Ensures transparency in data
collection practices.
Organizations that fail to comply with these laws face heavy fines and legal
consequences.
Example: In 2019, British Airways faced a proposed GDPR penalty of roughly $230 million (later reduced to £20 million) after a major data breach exposed customer information.
Data management has evolved beyond being merely an IT task; it’s now a vital strategic
asset that fuels business success. Companies that put their resources into strong data
management systems enjoy enhanced efficiency, security, and innovation.
With the ever-increasing volume, variety, and speed of data, cutting-edge AI-driven data
management tools are set to be essential for maintaining data integrity, ensuring
compliance, and enabling real-time decision-making. Organizations that focus on
effective data management will find themselves ahead of the curve in today’s digital
economy.
Data management has really evolved over the years, moving away from those old-school
file-based systems to cutting-edge, AI-driven, cloud-based solutions that process
information in real time. With businesses churning out huge amounts of both structured
and unstructured data, new techniques in data management have popped up to tackle
challenges like scalability, efficiency, security, and automation.
In this section, we’ll take a closer look at the major milestones in the evolution of data
management, showcasing the journey from manual record-keeping to the innovative world
of AI and big data solutions.
Early data storage relied on physical files, punch cards, and magnetic tapes.
Data retrieval was manual, slow, and inefficient.
Limitations:
o No structured organization (difficult indexing).
o High risk of data loss (no backups, vulnerable to damage).
o Lack of multi-user accessibility (only one user could access data at a time).
Example: Banks stored customer records in ledgers and physical files, leading to
errors and slow processing.
The rise of the internet, social media, and IoT devices generated massive,
unstructured datasets.
NoSQL databases like MongoDB, Cassandra, and Redis were introduced to handle
scalable, schema-less data storage.
Advantages:
o Designed for high-volume, high-velocity data processing.
o Capable of handling semi-structured (JSON, XML) and unstructured
(videos, logs, IoT data) formats.
o Horizontal scalability (adding more servers instead of upgrading one).
Limitations:
o Weaker data consistency compared to relational databases.
o Complex data integration between structured and unstructured formats.
Example:
o Facebook uses NoSQL databases for real-time messaging and notifications.
o Amazon stores product catalogs in DynamoDB for fast search capabilities.
The rise of AWS, Google Cloud, and Microsoft Azure enabled on-demand,
scalable data storage.
Companies transitioned from on-premise databases to cloud platforms, reducing
hardware costs.
Advantages:
o Scalability – Businesses only pay for what they use.
o High availability – Redundant data centers prevent data loss.
o Real-time processing – Cloud computing enables streaming analytics.
Limitations:
o Security concerns – Data is stored on third-party servers.
o Compliance challenges – Ensuring GDPR, HIPAA, and CCPA compliance.
Example:
o Netflix migrated to AWS cloud to handle massive video streaming demands.
o Healthcare providers use Google Cloud Healthcare API for secure patient
data sharing.
Data management has thus evolved from file-based systems through relational databases (RDBMS) and NoSQL to cloud-based and AI-powered solutions.
The shift from traditional databases to advanced AI-driven solutions has been fueled by the ever-increasing volume, variety, and velocity of data and by the need for real-time, intelligent processing.
The rise of Artificial Intelligence (AI) and Big Data has transformed modern data
management. Traditional methods struggle to handle the volume, velocity, and variety of
data generated today. AI-driven automation and Big Data technologies provide scalable,
intelligent, and real-time solutions for data storage, processing, security, and
analytics.
In this section, we explore how AI and Big Data work together to optimize data handling,
governance, and decision-making.
AI refers to machine learning (ML), natural language processing (NLP), and deep
learning algorithms that automate data processing, classification, anomaly detection,
and predictive analytics.
Big Data refers to massive datasets that cannot be processed using traditional database
systems. It is characterized by the 5 Vs:
Volume – Large amounts of data generated daily (e.g., social media, IoT,
transactions).
Velocity – High-speed data generation and processing (e.g., stock market data, real-
time analytics).
Variety – Different data types (structured, unstructured, semi-structured).
Veracity – Ensuring data accuracy and reliability.
Value – Extracting meaningful insights for business growth.
Big Data technologies such as Hadoop, Apache Spark, and NoSQL databases
handle high-volume, high-velocity data with efficiency.
Example:
Amazon uses AI-driven predictive analytics for personalized recommendations.
Healthcare providers use AI to predict disease outbreaks based on patient data.
Example:
Banks use AI-powered fraud detection systems to flag suspicious transactions.
AI-driven data masking protects sensitive information in cloud databases.
Big Data platforms like Hadoop and Apache Spark process petabytes of data
efficiently.
NoSQL databases like MongoDB and Cassandra store unstructured data (e.g.,
emails, images, social media).
Example:
Netflix uses Big Data to process millions of user interactions in real-time.
Financial institutions analyze Big Data to detect fraud patterns.
Example:
Uber analyzes real-time traffic data to optimize routes and pricing.
Stock markets use real-time data analytics for high-frequency trading.
IoT sensors generate real-time data streams, which Big Data platforms analyze.
Example:
Smart cities use Big Data to analyze traffic patterns and optimize transportation.
Wearable devices track user health and predict potential health issues.
Example:
Self-driving cars use Big Data to train AI on millions of driving scenarios.
Google’s BERT model (for NLP) processes massive text datasets for language
understanding.
AI and Big Data work together to enhance modern data management strategies.
AI and Big Data are revolutionizing modern data management by providing automation,
scalability, real-time analytics, and predictive capabilities. AI enables automated data
governance, fraud detection, and decision-making, while Big Data handles massive
volumes of structured and unstructured information.
Key Takeaways
AI automates data classification, cleaning, and anomaly detection.
Big Data enables real-time analytics and large-scale storage.
AI + Big Data improve security, fraud detection, and business insights.
Future trends include AI-driven edge computing and self-healing databases.
Data governance is a structured approach to managing data assets effectively. A robust data
governance framework ensures data quality, security, privacy, and compliance while
fostering collaboration among stakeholders. Implementing a data governance framework involves several key components and a phased rollout, as outlined below.
4. Metadata Management
o Maintain a metadata repository to document data definitions, lineage, and
usage.
o Ensure consistency in data interpretation across the organization.
Implementation Strategies
Successfully implementing a data governance framework involves a phased approach:
3. Develop a Roadmap
o Establish a step-by-step plan for implementation.
o Define milestones, timelines, and success criteria.
Common Challenges:
Resistance to change from stakeholders.
Complexity in integrating governance policies with existing systems.
Balancing data accessibility with security and compliance requirements.
Best Practices:
Secure executive sponsorship to drive governance initiatives.
Start with a pilot project before scaling up governance implementation.
Foster collaboration between IT and business teams for holistic governance.
Introduction
Data governance is a comprehensive approach that comprises the principles, practices and
tools to manage an organization’s data assets throughout their lifecycle. By aligning data-
related requirements with business strategy, data governance provides superior data quality.
Data security is a critical component of data governance, ensuring that sensitive and
valuable information is protected against unauthorized access, breaches, and cyber threats.
Organizations must implement comprehensive security measures to maintain trust,
prevent data loss, and comply with regulatory requirements.
General Data Protection Regulation (GDPR)
The GDPR is a European Union regulation that came into effect on May 25, 2018. It governs
how organizations collect, store, process, and share personal data of EU citizens. GDPR
emphasizes transparency, accountability, and control for data subjects. Key principles of
GDPR include lawfulness, fairness and transparency, purpose limitation, data minimization, accuracy, storage limitation, integrity and confidentiality, and accountability.
GDPR grants individuals rights such as access to their data, the right to be forgotten, data
portability, and the right to object to processing. Non-compliance can result in heavy
fines—up to €20 million or 4% of annual global turnover.
Health Insurance Portability and Accountability Act (HIPAA)
Enacted in 1996 in the United States, HIPAA is designed to protect sensitive patient health
information (PHI). It applies to healthcare providers, insurers, and any entity handling PHI.
Privacy Rule: Regulates the use and disclosure of PHI by covered entities.
Security Rule: Requires safeguards to protect electronic PHI (ePHI) from breaches.
Breach Notification Rule: Mandates that affected individuals and authorities be
notified in case of data breaches.
Enforcement Rule: Defines penalties and enforcement mechanisms for non-
compliance.
HIPAA compliance involves ensuring patient data confidentiality, integrity, and availability.
Violations can result in fines ranging from $100 to $50,000 per violation, depending on the
severity and negligence involved.
California Consumer Privacy Act (CCPA)
The CCPA, effective January 1, 2020, enhances privacy rights for California residents and
imposes strict obligations on businesses that collect and process their personal data. Key
provisions include:
Right to Know: Consumers can request details about the personal data a business
collects and how it is used.
Right to Delete: Consumers can request the deletion of their personal data, subject
to certain exceptions.
Right to Opt-Out: Consumers can opt out of the sale of their personal data to third
parties.
Non-Discrimination: Businesses cannot discriminate against consumers for
exercising their privacy rights.
Businesses subject to CCPA must provide clear disclosures, maintain data security, and
comply with consumer requests within specified timeframes. Non-compliance can result in
fines of up to $7,500 per intentional violation.
Comparative Overview
Ensuring data security is a critical aspect of modern digital operations. Organizations must
adopt best practices to protect sensitive information, mitigate risks, and maintain
compliance with regulatory requirements. Below are essential strategies for effective data
protection:
1. Data Encryption
Encrypting data, both at rest and in transit, ensures that unauthorized parties cannot
access or misuse sensitive information. Implementing strong encryption standards, such as
AES-256 and TLS, enhances security.
3. Data Minimization
Collecting only the necessary data reduces exposure to breaches and compliance risks.
Organizations should regularly review their data collection practices and eliminate
redundant or obsolete data.
Ensuring data consistency and quality is crucial for accurate decision-making, operational
efficiency, and compliance with regulatory requirements. Poor data quality can lead to
incorrect analysis, inefficiencies, and reputational damage. Organizations must adopt robust
data governance strategies to maintain high standards of data integrity.
Standardizing data formats, naming conventions, and validation rules ensures uniformity
across systems. Implementing validation mechanisms helps prevent errors, missing values,
and inconsistencies.
6. Metadata Management
Establishing comprehensive metadata standards ensures that data definitions, lineage, and
usage guidelines are well-documented and maintained.
Defining and enforcing governance policies ensures that data quality and consistency are
maintained throughout the data lifecycle. Clearly assigning roles and responsibilities
fosters accountability.
Master Data Management (MDM) is a critical approach to ensuring data consistency and
quality across an organization. It involves the creation and maintenance of a central
repository that serves as the authoritative source for key business data. MDM helps
eliminate discrepancies, enhances collaboration, and improves overall data reliability.
Define Clear Objectives: Establish goals and expected outcomes for MDM adoption.
Select the Right Tools: Use MDM software solutions that align with organizational
needs.
Develop Standardized Workflows: Ensure consistent data handling practices
across departments.
Monitor and Maintain: Continuously audit and refine MDM processes to sustain
data quality.
Master Data Management is essential for maintaining a high level of data integrity within
an organization. By centralizing and standardizing data, businesses can improve efficiency,
enhance decision-making, and ensure compliance with regulatory frameworks. Successful
MDM implementation requires strategic planning, the right technology, and ongoing
governance to achieve sustainable data quality improvements.
Data validation and cleansing are essential techniques to maintain high data quality,
accuracy, and consistency. These processes help organizations eliminate errors,
redundancies, and inconsistencies, ensuring that data remains reliable for business
operations and decision-making.
Effective data validation and cleansing are crucial for maintaining high data quality. By
implementing systematic validation rules and cleansing techniques, organizations can
ensure accurate, reliable, and useful data, leading to better decision-making, operational
efficiency, and compliance with regulatory standards.
Examining real-world case studies provides valuable insights into effective data
governance strategies, common challenges, and best practices. The following case studies
highlight how organizations have successfully implemented data governance frameworks.
A leading healthcare provider faced challenges in maintaining accurate patient records due
to inconsistent data entry and duplication across multiple hospitals. Through Master Data
Management (MDM), they consolidated patient data into a single, authoritative source. This
improved care coordination, reduced errors in medical prescriptions, and led to a 20%
decrease in redundant medical tests, saving costs and improving patient outcomes.
Data governance ensures data quality, security, compliance, and usability across
organizations. Many leading companies have successfully implemented data governance
frameworks, resulting in better decision-making, regulatory compliance, and
improved efficiency.
In this section, we will explore real-world examples of successful data governance
implementations across different industries.
Mastercard, a leading player in the global payment technology arena, processes millions of
financial transactions every single day. With the rise of cyber threats and the need to
comply with strict regulations like GDPR and PCI DSS, the company recognized the urgent
need for a solid data governance framework.
There was a 40% drop in security incidents tied to data breaches. Compliance with GDPR
and other regulations improved, helping them dodge hefty fines. Fraud detection became
quicker, saving millions by preventing fraudulent transactions.
Walmart manages one of the biggest supply chains in the world. With a vast network of
suppliers and countless transactions, the inconsistency in data across various systems
resulted in inefficiencies, inventory shortages, and inaccurate demand forecasting.
Achieved a 20% decrease in inventory errors and stockouts. Saved millions by enhancing
supply chain efficiency. Boosted supplier collaboration through standardized data-sharing
practices.
AI-assisted data curation leverages machine learning (ML) and artificial intelligence (AI) to
automate and enhance the curation process. AI techniques help organizations:
Improve data accuracy and consistency
Reduce manual effort in data processing
Enhance data discoverability and usability
Ensure compliance with data governance policies
This section explores key AI-driven techniques used in data curation and real-world
applications.
AI can automatically identify and correct errors in datasets, such as missing values,
duplicate records, and inconsistent formats.
🔹 Machine Learning for Anomaly Detection:
Uses AI models to detect outliers and inconsistencies.
Example: AI flags incorrect entries in financial transactions (e.g., a salary of
$1,000,000 instead of $10,000).
🔹 Natural Language Processing (NLP) for Text Data Cleaning:
AI can correct spelling errors, remove irrelevant text, and standardize
terminology.
Example: Standardizing "NYC" and "New York City" as the same entity in customer
data.
🔹 Automated Deduplication:
AI matches and merges duplicate records in large datasets.
Example: Customer databases where "John Doe" and "J. Doe" refer to the same
person.
🔹 Named Entity Recognition (NER):
Identifies key entities (e.g., names, locations, dates) from unstructured text.
Example: Extracting medical terms from patient records in healthcare datasets.
🔹 AI-Based Metadata Generation:
AI automatically assigns descriptive metadata to files and documents.
Example: A document about "AI Ethics" is automatically tagged with "Technology,"
"Governance," and "Regulations."
🔹 Semantic Tagging with NLP:
AI understands the context of words and categorizes data accordingly.
Example: AI differentiates between “Apple” (the company) and “apple” (the fruit) in
business reports.
Data tagging is the process of assigning labels or metadata to data, making it easier to organize, retrieve, and analyze. It plays a critical role in data discovery, classification, and governance.
Metadata is data about data, providing essential context for organizing, searching, and
managing information efficiently. AI-powered metadata generation and categorization help
automate and enhance the process by:
It automatically creates descriptive metadata for various formats like text, images, audio,
and video. It boosts searchability and makes data discovery easier in extensive datasets.
It maintains consistency in how data is classified across different systems.
It helps ensure compliance with data governance standards. AI-driven approaches are
transforming how businesses handle metadata, reducing manual effort, errors, and
inconsistencies.
AI-powered Natural Language Processing (NLP) techniques analyze text and generate
metadata such as:
Keywords and tags – Identifying key terms for searchability.
Named Entity Recognition (NER) – Extracting entities like names, dates, and locations.
Topic Modeling – Categorizing text into meaningful groups.
Sentiment Analysis – Determining emotional tone (positive, neutral, negative).
Example Applications:
1. Publishing Industry – Automatically tagging news articles based on topics.
2. Legal Sector – AI-driven classification of legal contracts.
3. Education – Metadata tagging for research papers and academic resources.
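A minimal sketch of keyword-style metadata generation using only the Python standard library; a production pipeline would use dedicated NLP libraries and a much larger stop-word list.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "are", "how", "across"}

def extract_keywords(text, top_n=5):
    """Return the most frequent non-stop-word tokens as candidate tags."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

article = ("Data governance policies define how data quality, data security and "
           "metadata are managed across the organization.")
print(extract_keywords(article))   # ['data', ...] - candidate tags for the article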
Computer vision and deep learning models automatically generate metadata for images
and videos, enabling:
Object Recognition – Identifying people, objects, and scenes.
Facial Recognition – Tagging individuals in images/videos.
Activity Detection – Categorizing actions in surveillance footage.
Example Applications:
1. E-commerce – Auto-tagging product images with attributes (e.g., "red dress,"
"running shoes").
2. Media & Entertainment – AI categorizing movies and shows based on genres,
actors, and themes.
3. Security & Surveillance – Tagging video footage for faster searchability.
AI models use speech recognition and audio analysis to generate metadata for spoken
content.
Example Applications:
Podcast Platforms – AI-generated metadata for podcast transcripts.
Customer Support – Categorizing recorded calls based on topics and customer sentiment.
Music Streaming – Auto-tagging songs based on mood, genre, and lyrics.
AI can also categorize structured and transactional data automatically. Example Applications:
Banking & Finance – Categorizing transactions as “Salary,” “Shopping,” or “Bills.”
Healthcare – Tagging medical records based on patient history.
Retail Analytics – Classifying customer purchase data for trend analysis.
In the fast-paced world of today’s digital age, companies are constantly creating and
managing huge volumes of data from a variety of sources—think databases, IoT devices,
social media, and cloud applications. Unfortunately, traditional methods of data integration
often find it tough to keep up with the sheer volume, diversity, and speed of modern data.
That’s where Artificial Intelligence (AI) comes into play, acting as a game-changer for
improving data integration processes. It allows businesses to automate tasks, optimize
workflows, and extract valuable insights from their data environments.
AI-driven data integration leverages machine learning (ML), natural language processing
(NLP), and automation to improve the efficiency and accuracy of data consolidation. Key
ways AI enhances data integration include automated schema matching, entity resolution, and continuous data quality monitoring.
In an era of rapid digital transformation, organizations must efficiently manage and process
vast amounts of data—commonly referred to as Big Data—to gain valuable insights,
improve decision-making, and optimize operations. Big Data management and processing
involve collecting, storing, analyzing, and visualizing large datasets using advanced
technologies and frameworks.
2. Data Storage
Distributed storage systems (e.g., Hadoop Distributed File System, HDFS) ensure scalable and cost-effective data storage.
3. Data Processing and Analytics
Big Data processing involves transforming raw data into structured insights using
frameworks such as Apache Spark, Hadoop MapReduce, and Flink. These
technologies enable real-time and batch processing to support advanced analytics,
including machine learning and predictive modeling.
4. Data Governance and Security
Managing access control, compliance, and data integrity is essential for safeguarding
sensitive information. Data governance frameworks, such as GDPR and CCPA, guide
organizations in handling personal data responsibly. Encryption, role-based access
control (RBAC), and anomaly detection help enhance security.
Big Data processing relies on robust frameworks that enable organizations to analyze
massive datasets efficiently:
Apache Hadoop: A widely used open-source framework that provides distributed
storage and processing capabilities.
Apache Spark: A high-speed, in-memory data processing engine suitable for real-
time analytics and machine learning applications.
Google BigQuery: A serverless, highly scalable cloud data warehouse that enables
interactive SQL queries.
Flink and Storm: Stream-processing frameworks designed for real-time data
analytics and event-driven applications.
Big Data technologies refer to the tools, frameworks, and methodologies designed to
handle large-scale data processing, storage, and analysis. These technologies enable
organizations to extract actionable insights from complex and voluminous datasets
efficiently.
Managing Big Data requires specialized tools that facilitate data collection, storage,
processing, and analysis. Some of the most widely used include the Hadoop ecosystem, Apache Spark, and NoSQL databases such as MongoDB and Cassandra, discussed below.
The Hadoop ecosystem is one of the most widely used frameworks for managing and
processing large-scale data. It consists of several core components that facilitate
distributed storage and computation.
1. Hadoop Distributed File System (HDFS)
HDFS is a scalable and fault-tolerant distributed storage system that allows organizations
to store massive amounts of data across multiple nodes. Key features of HDFS include:
High Fault Tolerance: Data is replicated across multiple nodes to prevent data loss
in case of hardware failures.
Scalability: Supports horizontal scaling by adding more nodes to accommodate
growing data volumes.
2. MapReduce
MapReduce is a programming model used for processing large datasets in a parallel and
distributed manner. It consists of two main phases:
Map Phase: Data is broken down into key-value pairs and processed in parallel
across multiple nodes.
Reduce Phase: The output from the Map phase is aggregated and combined to
generate the final result.
MapReduce is particularly useful for batch processing large-scale data efficiently. However,
newer frameworks like Apache Spark offer faster, in-memory alternatives to MapReduce.
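The classic word-count example below sketches the Map and Reduce phases in plain Python. A real MapReduce job would distribute these steps across many nodes, but the logic is the same:

from collections import defaultdict

documents = ["big data needs big tools", "data tools process data"]

# Map phase: emit (word, 1) pairs from every document
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle step: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the counts for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)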
3. Apache Hive
Apache Hive is a data warehousing and SQL-like querying tool built on top of Hadoop. It
allows users to query large datasets using Hive Query Language (HQL), which is similar to
SQL. Key features of Hive include:
SQL-Like Querying: Enables analysts and data engineers to work with Big Data
without requiring deep programming knowledge.
Integration with HDFS: Queries can be executed directly on data stored in HDFS.
Scalability and Performance Optimization: Supports indexing and partitioning
for improved query performance.
The Hadoop ecosystem remains a foundational technology for Big Data management,
enabling organizations to efficiently store, process, and analyze large datasets. While newer
technologies like Apache Spark and cloud-based solutions offer enhanced performance,
Hadoop continues to be widely used for batch processing and cost-effective storage
solutions.
Apache Spark is a powerful open-source framework designed for fast and efficient large-
scale data processing. Unlike traditional batch processing frameworks like MapReduce,
Spark utilizes in-memory computing, making it significantly faster for iterative and real-
time analytics.
Key features of Apache Spark include:
Integration with Big Data Tools: Works seamlessly with Hadoop, HDFS, Apache Hive, and NoSQL databases.
Supports Multiple Workloads: Can handle batch processing, real-time streaming, machine learning, and graph analytics.
Apache Spark's ability to handle diverse workloads and process vast amounts of data at
high speed makes it a preferred choice for modern data analytics and AI-driven
applications.
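A minimal PySpark sketch of such a workload is shown below. It assumes PySpark is installed and uses a hypothetical sales.csv file with region and amount columns:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# read a (hypothetical) CSV file into a distributed DataFrame
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# aggregate total sales and order counts per region
summary = (sales.groupBy("region")
                .agg(F.sum("amount").alias("total_amount"),
                     F.count("amount").alias("num_orders")))
summary.show()

spark.stop()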
MongoDB
MongoDB is a document-oriented NoSQL database that stores records as flexible, JSON-like documents and scales horizontally through sharding, making it well suited to semi-structured and rapidly changing data.
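A minimal sketch of working with MongoDB from Python through the pymongo driver is shown below; the connection string, database, and collection names are hypothetical:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["curation_demo"]          # hypothetical database
orders = db["orders"]                 # hypothetical collection

# insert one JSON-like document and query by a field value
orders.insert_one({"order_id": 1, "item": "sensor", "qty": 3})
for doc in orders.find({"qty": {"$gt": 1}}):
    print(doc)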
Apache Cassandra
Cassandra is a distributed NoSQL database designed for handling large-scale data across
multiple data centers. Key features include:
Decentralized Architecture: Eliminates single points of failure.
High Availability: Ensures continuous uptime and fault tolerance.
Scalability: Handles massive datasets with linear scalability.
Tunable Consistency: Balances consistency and availability based on application
needs.
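A short sketch using the DataStax Python driver (cassandra-driver) is shown below; it assumes a locally running cluster, and the keyspace and table names are hypothetical:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])                 # contact point of the cluster
session = cluster.connect("sensor_keyspace")     # hypothetical keyspace

rows = session.execute("SELECT sensor_id, reading FROM sensor_data LIMIT 10")
for row in rows:
    print(row.sensor_id, row.reading)

cluster.shutdown()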
Enterprises deal with massive datasets that require efficient management, storage, and processing. Handling large datasets in enterprise applications therefore draws on the techniques described above: scalable distributed storage, parallel processing frameworks, and NoSQL databases.
As data continues to grow in volume and complexity, organizations must adopt advanced
data management strategies to stay competitive. These strategies focus on improving data
quality, security, governance, and real-time analytics to support business objectives
effectively.
As data continues to grow in complexity, volume, and variety, organizations face significant
challenges in managing and utilizing their data effectively. Advanced data management
encompasses various aspects, including storage, security, scalability, integration, and
analytics. This section explores key challenges and corresponding solutions in advanced
data management.
1. Data Volume and Scalability
Challenge: With the exponential growth of data, organizations struggle to store, process, and analyze vast amounts of structured and unstructured data efficiently.
Solution: Implementing scalable cloud-based storage solutions, such as data lakes and distributed databases (e.g., Apache Hadoop, Amazon S3), helps manage large volumes of data. Additionally, using parallel processing frameworks like Apache Spark ensures efficient handling of large-scale data analytics.
2. Data Integration Complexity
Challenge: Organizations often deal with diverse data sources, formats, and storage systems, making integration complex and time-consuming.
Solution: Utilizing Extract, Transform, Load (ETL) tools and middleware platforms like Apache NiFi, Talend, or Informatica can streamline data integration, as sketched below. Adopting standardized data exchange formats (e.g., JSON, XML, or APIs) ensures smooth interoperability between systems.
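A minimal pandas-based ETL sketch is shown below; the file names, column names, and the customer_id join key are hypothetical:

import json
import pandas as pd

# Extract: read two hypothetical sources in different formats
customers = pd.read_csv("customers.csv")
with open("orders.json") as f:
    orders = pd.DataFrame(json.load(f))

# Transform: standardize column names and join the two sources
customers.columns = [c.strip().lower() for c in customers.columns]
orders.columns = [c.strip().lower() for c in orders.columns]
merged = orders.merge(customers, on="customer_id", how="left")

# Load: write the integrated dataset to a single target file
merged.to_csv("integrated_orders.csv", index=False)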
3. Data Quality and Consistency
Challenge: Poor data quality, including missing, duplicate, or inconsistent data, can lead to incorrect insights and faulty decision-making.
4. Data Security and Privacy
Challenge: With increasing cyber threats and stringent data privacy regulations (e.g., GDPR, CCPA), organizations must ensure data protection and compliance.
5. Real-Time Data Processing
Challenge: Many applications require low-latency insights that traditional batch processing cannot deliver.
Solution: Implementing real-time data processing frameworks like Apache Kafka, Apache Flink, or Google Dataflow allows for low-latency data streaming and analytics. Edge computing also helps process data closer to the source, reducing transmission delays.
6. Data Governance and Compliance
Challenge: Managing regulatory compliance and enforcing data governance policies across an organization is complex.
7. Advanced Analytics and AI Integration
Challenge: Extracting actionable insights from large datasets while integrating artificial intelligence (AI) and machine learning (ML) models can be challenging.
Solution: Using advanced analytics platforms like Google BigQuery, Azure Synapse
Analytics, or Snowflake facilitates efficient data processing. AI-driven tools like TensorFlow
and AutoML can automate predictive modeling and enhance decision-making.
8. Cost Management
Challenge: The increasing cost of data storage, processing, and analytics poses a financial
burden on organizations.
7.5.3 Future Trends and Innovations in Data Governance and AI-Assisted Management
Assessment Criteria
References:
Exercise
True/False Questions
1. Poor data management can result in inconsistent and duplicate data, which affects
business outcomes. (T/F)
2. The General Data Protection Regulation (GDPR) applies only to companies located
within the United States. (T/F)
3. Cloud storage and AI-driven automation help businesses scale their data
management systems efficiently. (T/F)
4. AI-powered self-healing databases require human intervention to optimize performance. (T/F)
5. Big Data technologies like Hadoop and Apache Spark help process large volumes of
structured and unstructured data. (T/F)
6. Quantum computing has no impact on AI and Big Data processing. (T/F)
7. Role-Based Access Control (RBAC) allows only specific users to access certain data
based on their roles within an organization. (T/F)
8. Data validation ensures that all data entered into a system is accurate, consistent,
and formatted correctly. (T/F)
9. Master Data Management (MDM) increases data inconsistencies by allowing
multiple sources of truth. (T/F)
10. GDPR, HIPAA, and CCPA are regulatory frameworks that primarily focus on
managing financial transactions. (T/F)
Chapter 8:
Application of Data Curation
Exploratory Data Analysis (EDA) is a technique for analyzing data using visual methods. With this technique, we can obtain a detailed statistical summary of the data, deal with duplicate values and outliers, and spot trends or patterns present in the dataset.
Iris is a flowering plant; researchers have measured various features of different iris flowers and recorded them digitally. Each record in the dataset contains:
Petal Length,
Petal Width,
Sepal Length,
Sepal Width, and
Species Type.
8.3 Image of flowers
To work with the Iris dataset, the data can be imported using sklearn.datasets or loaded from a CSV file.
If the dataset is stored in a CSV file, Pandas can be used to load the data into a DataFrame:
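A minimal sketch of both loading options is shown below. The Iris.csv path is hypothetical, and the CSV column names assumed in the later snippets (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species) follow the common Kaggle layout:

import pandas as pd

# Option 1: the copy bundled with scikit-learn
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
df_sklearn = iris.frame               # columns: sepal length (cm), ..., target

# Option 2: a CSV file (used in the examples that follow)
df = pd.read_csv("Iris.csv")
print(df.head())                      # first five rows of the DataFrame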
Describing Data
The Iris dataset is a well-known dataset in machine learning and statistics, often used for
classification tasks. It consists of 150 samples of iris flowers, categorized into three species:
Setosa
Versicolor
Virginica
Each sample has four numerical features, measured in centimetres, that describe the physical characteristics of the flower:
Sepal Length,
Sepal Width,
Petal Length, and
Petal Width.
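Assuming the DataFrame df loaded above, a quick statistical summary can be obtained as follows:

print(df.describe())   # count, mean, std, min, quartiles and max of each numeric column
df.info()              # column names, non-null counts and data types
print(df.shape)        # number of rows and columns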
We will check if our data contains any missing values or not. Missing values can occur when
no information is provided for one or more items or for a whole unit. We will use
the isnull() method.
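For example, assuming the DataFrame df loaded earlier:

print(df.isnull().sum())   # missing values per column; all zeros means no missing data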
Let's see whether our dataset contains any duplicates. The Pandas drop_duplicates() method helps in removing duplicates from the data frame.
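For example, keeping only the first row of each species shows how many unique species the data contains (the Species column name is assumed from the CSV layout):

data = df.drop_duplicates(subset="Species")
print(data)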
From the output it is learnt that there are only three unique species.
Let's check whether the dataset is balanced, i.e., whether every species contains an equal number of rows.
For that we use the Series.value_counts() function. This function returns a Series
containing counts of unique values.
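For example:

print(df["Species"].value_counts())   # equal counts for every species mean the dataset is balanced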
Our target column will be the Species column, because the final results are reported per species. Let's draw a countplot for the species.
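A minimal Seaborn sketch of such a countplot:

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x="Species", data=df)
plt.show()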
Google Colab link for the code described below :
https://colab.research.google.com/drive/1TTJlwNTYVHZfHR4-
UtxTEuXUpxOcqap9#scrollTo=HDmwktl9psug
Inference –
Species Setosa has smaller sepal lengths but larger sepal widths.
Species Versicolor lies between the other two species in terms of sepal length and width.
Species Virginica has larger sepal lengths but smaller sepal widths.
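These observations are typically read off a scatter plot of sepal length against sepal width, coloured by species, as in the sketch below (column names assumed from the CSV layout):

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="SepalLengthCm", y="SepalWidthCm", hue="Species", data=df)
plt.show()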
Inference –
Species Setosa has smaller petal lengths and widths.
Species Versicolor lies between the other two species in terms of petal length and width.
Species Virginica has the largest petal lengths and widths.
D) Pairplot:
A pairplot draws pairwise scatter plots for every combination of numeric features, with the distribution of each feature shown on the diagonal, which makes it a quick way to compare the three species across all features at once.
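A minimal pairplot sketch:

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue="Species", height=2)
plt.show()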
E) Relplot:
relplot() stands for "relational plot". It is a figure-level function in Seaborn used to plot relationships between two variables (such as scatter plots or line plots), with extra options for faceting, grouping, and more.
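A minimal relplot sketch that facets the scatter plot by species (column names assumed from the CSV layout):

import seaborn as sns
import matplotlib.pyplot as plt

sns.relplot(x="SepalLengthCm", y="PetalLengthCm", hue="Species", col="Species", data=df)
plt.show()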
8.6 Histograms
Histograms show the distribution of data for the various columns. They can be used for univariate as well as bivariate analysis.
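A minimal sketch that draws one histogram per numeric column of the DataFrame:

import matplotlib.pyplot as plt

df.hist(figsize=(10, 8))   # one subplot per numeric column
plt.tight_layout()
plt.show()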
Interpretation of Histograms
The highest frequency of sepal length is between 30 and 35, occurring for values between 5.5 and 6.
The highest frequency of sepal width is around 70, occurring for values between 3.0 and 3.5.
The highest frequency of petal length is around 50, occurring for values between 1 and 2.
The highest frequency of petal width is between 40 and 50, occurring for values between 0.0 and 0.5.
Distplot is used for a univariate set of observations and visualizes them through a histogram, i.e., it looks at only one variable, so we choose one particular column of the dataset. (In newer versions of Seaborn, distplot() has been superseded by displot() and histplot().)
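A minimal sketch for a single column, using the newer displot() function:

import seaborn as sns
import matplotlib.pyplot as plt

sns.displot(df["SepalLengthCm"], kde=True)   # histogram with a KDE curve overlaid
plt.show()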
Handling Correlation
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. Any NA values are automatically excluded, and non-numeric columns are ignored.
The strength of a relationship is usually judged by the Pearson coefficient and its p-value: we can say there is a strong correlation between two variables when the Pearson correlation coefficient is close to either 1 or -1 and the p-value is less than 0.0001.
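A minimal sketch is shown below; recent pandas versions require numeric_only=True when non-numeric columns such as Species are present, and scipy's pearsonr() returns both the coefficient and the p-value:

from scipy.stats import pearsonr

print(df.corr(numeric_only=True))   # pairwise correlation matrix of the numeric columns

coef, p_value = pearsonr(df["PetalLengthCm"], df["PetalWidthCm"])
print(coef, p_value)                # coefficient close to 1 with a tiny p-value indicates strong correlation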
8.7 Heatmaps
Importance of a Heatmap
A heatmap of the correlation matrix makes it easy to see, at a glance, which features are strongly related to one another and which are largely independent.
https://colab.research.google.com/drive/1TTJlwNTYVHZfHR4-
UtxTEuXUpxOcqap9#scrollTo=1Gg9AXXqGi8d&line=1&uniqifier=1
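A minimal heatmap sketch of the correlation matrix:

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()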
8.8 Box Plots
A box plot summarizes each feature for every species (a code sketch follows this list):
Shows distribution: Displays the median, quartiles, and range of each feature.
Identifies outliers: Points outside the whiskers indicate potential outliers.
Compares species differences: Helps visualize how each feature varies among Setosa, Versicolor, and Virginica.
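A minimal sketch that draws one box plot per feature, grouped by species (column names assumed from the CSV layout):

import seaborn as sns
import matplotlib.pyplot as plt

features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feature in zip(axes.flat, features):
    sns.boxplot(x="Species", y=feature, data=df, ax=ax)   # one panel per feature
plt.tight_layout()
plt.show()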
Species Setosa has the smallest feature values and the least spread, with some outliers.
Species Versicolor has intermediate feature values.
Species Virginica has the largest feature values.
8.9 Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis carried out to detect them is referred to as outlier mining. There are many ways to detect outliers, and removing one is the same as removing any other row from the pandas DataFrame.
Let's consider the Iris dataset and plot a boxplot for the SepalWidthCm column.
Observations
Removing Outliers
To remove an outlier, one follows the same process as removing any other entry from the dataset, using its exact position, because every outlier-detection method ultimately produces a list of the data items that satisfy its definition of an outlier.
Example: We will detect the outliers using the IQR (interquartile range) method and then remove them. We will also draw the boxplot to check whether the outliers have been removed.
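A minimal sketch of IQR-based detection and removal for the SepalWidthCm column (column name assumed from the CSV layout):

import seaborn as sns
import matplotlib.pyplot as plt

q1 = df["SepalWidthCm"].quantile(0.25)
q3 = df["SepalWidthCm"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # whisker limits

# keep only the rows whose sepal width lies within the whiskers
df_clean = df[(df["SepalWidthCm"] >= lower) & (df["SepalWidthCm"] <= upper)]
print("rows before:", len(df), "rows after:", len(df_clean))

sns.boxplot(x=df_clean["SepalWidthCm"])              # box plot after removal
plt.show()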
Assessment Criteria
References:
Exercise
Multiple Choice Questions
2. In the context of data curation, what does data cleaning refer to?
a. Encrypting data for security purposes
b. Removing, correcting, or replacing inaccurate or corrupt data
c. Archiving data for future access
d. Formatting data to fit the system's requirements
4. Which of the following best describes the role of metadata in data curation?
a. Metadata is used to ensure the data is in a human-readable format
b. Metadata describes the structure, content, and context of the data to enhance
discoverability and reuse
c. Metadata only helps with data cleaning
d. Metadata is only useful for archiving purposes
5. What is one of the challenges associated with data curation in scientific research?
a. Ensuring data is properly encrypted
b. Making data available without any public access restrictions
c. Maintaining data quality and consistency over long periods of time
d. Reducing the size of datasets to make them more manageable
6. In data curation, which of the following actions is most closely associated with
ensuring data integrity?
a. Proper documentation and version control
b. Creating machine learning models
c. Data mining
d. Generating random samples of data
8. In a data curation workflow, which phase involves checking the data for consistency,
accuracy, and completeness?
a. Data Acquisition
b. Data Cleaning
c. Data Storage
d. Data Visualization
9. Which of the following industries most commonly relies on data curation for
decision-making and research?
a. Retail and ecommerce
b. Healthcare and pharmaceuticals
c. Social media platforms
d. All of the above
2. _________ is the practice of enhancing datasets by integrating additional data from external
sources, such as APIs or other databases
3. The process of standardizing data into a specific range, such as scaling numerical data to
the range [0, 1], is known as _________.
4. _________ refers to the metadata or documentation that describes the structure, content,
and other properties of a dataset.
5. In data curation, _________ refers to the task of organizing, validating, and ensuring the
consistency of data before it is stored or used for analysis.
6. A key challenge in data curation is dealing with _________, where certain values or fields
are missing in a dataset, which can affect the quality and reliability of analysis.
7. In the context of data curation, the process of tracking changes to a dataset over time
using identifiers like timestamps is called _________.
8. ________ is the term used to describe a dataset that has been cleaned, documented, and
transformed for reuse, often stored in a system that allows easy retrieval and analysis.
9. Data ________ refers to the practice of ensuring that the stored data complies with legal,
ethical, and security standards.
10. To ensure long-term usability, data curation involves storing datasets in ________ that
allow for future access, sharing, and analysis.