
T.Y.C.S.

SEM-VI
DATA SCIENCE

Compiled By: MEGHA SHARMA


https://www.youtube.com/@omega_teched

Chapter-1
What is Data Science? Definition and scope of Data Science, Applications
and domains of Data Science, Comparison with other fields like Business
Intelligence (BI), Artificial Intelligence (AI), Machine Learning (ML), and
Data Warehousing/Data Mining (DW-DM)

Data Science:

Data science is the in-depth study of large volumes of data. It involves extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data so
that you can find something new and meaningful.

Applications of Data Science:

o Image recognition and speech recognition:


Data science is currently used for image and speech recognition. When you
upload an image on Facebook, you start getting suggestions to tag your
friends; this automatic tagging suggestion uses an image recognition
algorithm, which is part of data science.
When you speak to "Ok Google", Siri, Cortana, etc., these devices respond
to voice commands, which is made possible by speech recognition algorithms.
o Gaming
In the gaming world, the use of machine learning algorithms is increasing day
by day. EA Sports, Sony, and Nintendo are widely using data science to
enhance the user experience.
o Internet:
When we want to search for something on the internet, we use different
search engines such as Google, Yahoo, Bing, Ask, etc. All these
search engines use data science technology to make the search experience
better, and you can get a search result within a fraction of a second.


o Transport:
Transport industries are also using data science technology to create self-
driving cars. With self-driving cars, it will be easy to reduce the number of
road accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science
is being used for tumor detection, drug discovery, medical image analysis,
virtual medical bots, etc.
o Recommendation systems:
Companies such as Amazon, Netflix, Google Play, etc., use data science
technology to create a better user experience with personalized
recommendations. For example, when you search for something on Amazon,
you start getting suggestions for similar products because of data science
technology.
o Risk detection:
The finance industry has always faced issues of fraud and risk of losses, but with
the help of data science, these risks can be reduced.
Most finance companies look for data scientists to help them avoid risk and
losses while increasing customer satisfaction.

BI stands for Business Intelligence, which is also used for the analysis of business
information.

Differences between BI and Data Science:

Data Source: Business intelligence deals with structured data, e.g., a data warehouse. Data science deals with structured and unstructured data, e.g., weblogs, feedback, etc.

Method: Business intelligence is analytical (works on historical data). Data science is scientific (goes deeper to know the reason behind the data report).

Skills: Statistics and visualization are the two skills required for business intelligence. Statistics, visualization, and machine learning are the required skills for data science.

Focus: Business intelligence focuses on both past and present data. Data science focuses on past data, present data, and also future predictions.

Difference between Data Science and Machine Learning:

o Data science deals with understanding and finding hidden patterns or useful insights from the data, which helps to make smarter business decisions. Machine learning is a subfield of data science that enables the machine to learn from past data and experiences automatically.
o Data science is used for discovering insights from the data. Machine learning is used for making predictions and classifying the results for new data points.
o Data science is a broad term that includes the various steps to create a model for a given problem and deploy the model. Machine learning is used in the data modeling step of the data science process.
o A data scientist needs skills in big data tools like Hadoop, Hive, and Pig, statistics, and programming in Python, R, or Scala. A machine learning engineer needs skills such as computer science fundamentals, programming skills in Python or R, and statistics and probability concepts.
o Data science can work with raw, structured, and unstructured data. Machine learning mostly requires structured data to work on.
o Data scientists spend lots of time handling the data, cleansing the data, and understanding its patterns. ML engineers spend a lot of time managing the complexities that occur during the implementation of algorithms and the mathematical concepts behind them.

Difference between Data Science and AI

Basics: Data science is a detailed process that mainly involves pre-processing, analysis, visualization, and prediction. AI is the implementation of a predictive model to forecast future events and trends.

Goals: Identifying the patterns that are concealed in the data is the main objective of data science. Automation of the process and the granting of autonomy to the data model are the main goals of artificial intelligence.

Types of data: Data science works with a variety of data types, including structured, semi-structured, and unstructured data. AI uses standardized data in the form of vectors and embeddings.

Scientific processing: Data science involves a high degree of scientific processing. AI involves a very high level of complex processing.

Tools used: The tools utilized in data science are far more extensive than those used in AI, because data science entails several procedures for analyzing data and developing insights from it. The tools used in AI are less extensive compared to data science.

Build: Using data science, we can build complex models about statistics and facts about data. Using AI, we emulate cognition and human understanding to a certain level.

Technique used: Data science uses the techniques of data analysis and data analytics. AI uses many machine learning techniques.

Use: Data science makes use of graphical representation. Artificial intelligence makes use of algorithms and network node representation.

Knowledge: Data science knowledge is about finding hidden patterns and trends in the data. AI knowledge is about imparting some autonomy to a data model.

Data Warehousing

A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to
a particular group of users.
It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various


applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of


information in support of management's decisions."


Characteristics:

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the
organization's global ongoing operations. This is done by excluding data that is not useful
concerning the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat
files, and online transaction records. It requires performing data cleaning and
integration during data warehousing to ensure consistency in naming conventions,
attribute types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older from a data
warehouse. This contrasts with a transaction system, where often only the most
current data is kept.


Non-Volatile
The data warehouse is a physically separate data store, which is transformed from
the source operational RDBMS. Operational updates of data do not occur in the
data warehouse, i.e., update, insert, and delete operations are not performed. It
usually requires only two procedures for data access: the initial loading of data and
access to data. Therefore, the DW does not require transaction processing, recovery,
and concurrency capabilities, which allows for a substantial speedup of data retrieval.
Non-volatile means that once data has entered the warehouse, it should not change.
Goals of Data Warehousing

o To help reporting as well as analysis


o Maintain the organization's historical information.
o Be the foundation for decision making.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.


2. Data Warehouses are designed to store enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate,
understand, and query.
4. Queries that would be complex in many normalized databases could be easier
to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of
information from lots of users.
6. Data warehousing provides the capabilities to analyze a large amount of
historical data.


Difference between a database and a data warehouse:

1. A database is used for Online Transaction Processing (OLTP), though it can also serve other purposes such as data warehousing; it records data from clients as transactions happen. A data warehouse is used for Online Analytical Processing (OLAP); it reads historical information about customers for business decisions.
2. In a database, the tables and joins are complicated because they are normalized; this reduces redundant data and saves storage space. In a data warehouse, the tables and joins are simpler because they are denormalized; this minimizes the response time for analytical queries.
3. Database data is dynamic; data warehouse data is largely static.
4. Entity-Relationship modeling techniques are used for database design, whereas data modeling techniques are used for data warehouse design.
5. A database is optimized for write operations; a data warehouse is optimized for read operations.
6. A database gives low performance for analytical queries; a data warehouse gives high performance for analytical queries.
7. A database is where data is stored and managed to provide fast and efficient access; a data warehouse is where data is consolidated for analysis and reporting purposes.

ETL (Extract, Transform, and Load) Process


The mechanism of extracting information from source systems and bringing it into
the data warehouse is commonly called ETL, which stands for Extraction,
Transformation and Loading.
The ETL process requires active input from various stakeholders, including
developers, analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, the data warehouse needs
to change as the business changes. ETL is a recurring activity (daily, weekly, or
monthly) of a data warehouse system and needs to be agile, automated, and well
documented.

Extraction

o Extraction is the operation of extracting information from a source system for


further use in a data warehouse environment. This is the first stage of the ETL
process.
o Extraction process is often one of the most time-consuming tasks in the ETL.
o The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all
the changed data to the warehouse and keep it up-to-date.

Cleansing
The cleansing stage is crucial in a data warehouse system because it is supposed
to improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing
mistakes and to recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between values.
Transformation
Transformation is the core of the reconciliation phase. It converts records from its
operational source format into a particular data warehouse format. If we implement
a three-layer architecture, this phase outputs our reconciled data layer.
Loading
The load is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
Loading can be carried in two ways:

1. Refresh: Data Warehouse data is completely rewritten. This means that older
files are replaced. Refresh is usually used in combination with static extraction
to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the
Data Warehouse. An update is typically carried out without deleting or
modifying pre-existing data. This method is used in combination with
incremental extraction to update data warehouses regularly.
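As an illustration of the two loading modes above, here is a minimal sketch using pandas and SQLite; the database, table, and column names are hypothetical, not from any specific warehouse.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")   # placeholder target database
new_batch = pd.DataFrame({"order_id": [101, 102], "amount": [250.0, 99.5]})

# Refresh: completely rewrite the target table (older rows are replaced)
new_batch.to_sql("fact_orders", conn, if_exists="replace", index=False)

# Update: append only the newly extracted or changed rows (incremental load)
new_batch.to_sql("fact_orders", conn, if_exists="append", index=False)
```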

Data Mining:
The process of extracting information from huge sets of data to identify patterns,
trends, and useful insights that allow a business to take data-driven decisions is
called data mining.
We can say that data mining is the process of investigating hidden patterns in
information from various perspectives and categorizing them into useful data. This
data is collected and assembled in areas such as data warehouses, and it supports
efficient analysis and data mining algorithms, helps decision-making, and
ultimately cuts costs and generates revenue.
Data mining is the act of automatically searching large stores of information to
find trends and patterns that go beyond simple analysis procedures. Data mining
utilizes complex mathematical algorithms to segment the data and evaluate the
probability of future events. Data mining is also called Knowledge Discovery in
Data (KDD).
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized by
tables, records, and columns from which data can be accessed in various ways
without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources
within the organization to provide meaningful business insights. The huge amount
of data comes from multiple places such as Marketing and Finance. The extracted
data is utilized for analytical purposes and helps in decision- making for a business
organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However,
many IT professionals utilize the term more clearly to refer to a specific kind of setup
within an IT structure. For example, a group of databases, where an organization has
kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model
is called an object-relational model. It supports Classes, Objects, Inheritance, etc.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can
undo a database transaction if it is not performed appropriately. Although this was a
unique capability long ago, today most relational database systems support
transactional database activities.

Advantages of Data Mining

o The Data Mining technique enables organizations to obtain knowledge-based


data.


o Data mining enables organizations to make lucrative modifications in


operation and production.
o Compared with other statistical data applications, data mining is cost-
efficient.
o Data Mining helps the decision-making process of an organization.
o It Facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
o It can be induced in the new system as well as the existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous
amounts of data in a short time.

Disadvantages of Data Mining

o There is a probability that the organizations may sell useful data of customers
to other organizations for money. As per the report, American Express has
sold credit card purchases of their customers to other organizations.
o Many data mining analytics software products are difficult to operate and need
advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data
mining tools is a very challenging task.
o Data mining techniques are not 100% precise, so they may lead to serious
consequences in certain conditions.

Data Mining Applications

Data mining is primarily used by organizations with intense consumer demands,
such as retail, communication, financial, and marketing companies, to determine
prices, consumer preferences, product positioning, and their impact on sales,
customer satisfaction, and corporate profits. Data mining enables a retailer to use
point-of-sale records of customer purchases to develop products and promotions
that help the organization attract customers.


Data Mining Techniques

Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can
incorporate statistical models, machine learning techniques, and mathematical
algorithms, such as neural networks or decision trees. Thus, data mining incorporates
analysis and prediction.
Drawing on various methods and technologies from the intersection of machine
learning, database management, and statistics, professionals in data mining have
devoted their careers to better understanding how to process and draw conclusions
from huge amounts of data. But what methods do they use to make it happen?
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.

Chapter Ends…


Chapter-2
Data Types and Sources
Data Types and Sources: Different types of data: structured, unstructured,
semi-structured, Data sources: databases, files, APIs, web scraping, sensors,
social media

Data can be Structured data, Semi-structured data, and Unstructured data.

1. Structured data –
Structured data is data whose elements are addressable for effective
analysis. It has been organized into a formatted repository, typically a
database. It concerns all data that can be stored in a SQL database in a
table with rows and columns. Such data have relational keys and can
easily be mapped into pre-designed fields. Today, structured data is the
most commonly processed kind and the simplest to manage.
Example: relational data.
2. Semi-structured data –
Semi-structured data is information that does not reside in a relational
database but has some organizational properties that make it easier
to analyze. With some processing, it can be stored in a relational
database (which can be very hard for some kinds of semi-structured data),
but the semi-structured form exists to save that effort. Example: XML data.
3. Unstructured data – Unstructured data is data that is not organized in
a predefined manner or does not have a predefined data model; thus, it
is not a good fit for a mainstream relational database. For
unstructured data, there are alternative platforms for storing and
managing it. It is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications. Examples: Word documents, PDFs, text, media logs.


Differences between Structured, Semi-structured and Unstructured data:

Technology: Structured data is based on relational database tables. Semi-structured data is based on XML/RDF (Resource Description Framework). Unstructured data is based on character and binary data.

Transaction management: Structured data has matured transaction handling and various concurrency techniques. For semi-structured data, transaction management is adapted from the DBMS and is not matured. Unstructured data has no transaction management and no concurrency.

Version management: Structured data supports versioning over tuples, rows, and tables. Semi-structured data supports versioning over tuples or graphs. Unstructured data is versioned as a whole.

Flexibility: Structured data is schema-dependent and less flexible. Semi-structured data is more flexible than structured data but less flexible than unstructured data. Unstructured data is the most flexible, as there is no schema.

Scalability: It is very difficult to scale a structured database schema. Scaling semi-structured data is simpler than structured data. Unstructured data is the most scalable.

Robustness: Structured data technology is very robust, while semi-structured data technology is newer and not yet widespread.

Data Types Based on Its Collection


Based on how data is collected, it can be divided into two categories - Primary and
Secondary data. Let’s review the key differences between these two types in the
following table -

Definition: Primary data refers to first-hand data collected by the team itself, based on the researcher's needs. Secondary data has been collected by other teams in the past and does not necessarily align with the researcher's requirements.

Data: Primary data is real-time data; secondary data is historical data.

Process: Collecting primary data is time-consuming; obtaining secondary data is quick and easy.

Collection time: Long for primary data; short for secondary data.

Available in: Primary data is available in raw and crude form; secondary data is available in refined form.

Accuracy and reliability: Very high for primary data; relatively lower for secondary data.

Examples: Primary data includes personal interviews, surveys, observations, etc. Secondary data includes websites, articles, research papers, historical data, etc.

Types of Data:


The data in statistics is classified into four categories:


• Nominal data
• Ordinal data
• Discrete data
• Continuous data

These four types describe the nature of the data being collected or analyzed, and
they help determine the appropriate statistical tests to use.

Qualitative Data (Categorical Data)


As the name suggests, qualitative data describes the qualities or features of the data.
Qualitative data is also called categorical data because it sorts the data into
various categories. Qualitative data includes attributes such as the gender of people,
their family name, and so on in a sample of population data.
Qualitative data is further categorized into two categories that includes,
• Nominal Data
• Ordinal Data

Nominal Data
Nominal data is a type of data that consists of categories or names that cannot be
ordered or ranked. Nominal data is often used to categorize observations into groups,
and the groups are not comparable. In other words, nominal data has no inherent
order or ranking. Examples of nominal data include gender (Male or female), race
(White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood
type (A, B, AB, O).
Nominal data can be represented using frequency tables and bar charts, which
display the number or proportion of observations in each category. For example, a
frequency table for gender might show the number of males and females in a sample
of people.
Nominal data is analyzed using non-parametric tests, which do not make any
assumptions about the underlying distribution of the data. Common non-parametric
tests for nominal data include Chi-Squared Tests and Fisher’s Exact Tests. These


tests are used to compare the frequency or proportion of observations in different


categories.
Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked.
However, the distance between categories is not necessarily equal. Ordinal data is
often used to measure subjective attributes or opinions, where there is a natural order
to the responses. Examples of ordinal data include education level (Elementary,
Middle, High School, College), job position (Manager, Supervisor, Employee), etc.
Ordinal data can be represented using bar charts, line charts. These displays show
the order or ranking of the categories, but they do not imply that the distances
between categories are equal.
Ordinal data is analyzed using non-parametric tests, which make no assumptions
about the underlying distribution of the data. Common non-parametric tests for
ordinal data include the Wilcoxon Signed-Rank test and Mann-Whitney U test.

Quantitative Data (Numerical Data)


Quantitative Data is the type of data that represents the numerical value of the data.
They are also called Numerical Data. This data type is used to represent the height,
weight, length, and other things of the data. Quantitative data is further classified
into two categories that are,
• Discrete Data
• Continuous Data

Discrete Data
Discrete data type is a type of data in statistics that only uses Discrete Value or Single
Values. These data types have values that can be easily counted as whole numbers.
The example of the discrete data types is,
• Height of Students in a class
• Marks of the students in a class test
• Weight of different members of a family, etc.

Continuous Data


Continuous data is the type of quantitative data that represents data over a continuous
range. The variable in the data set can have any value within the range of the data
set. Examples of continuous data types are:
• Temperature range
• Height and weight of people
• Salary range of workers in a factory, etc.

Difference between Quantitative and Qualitative Data


Quantitative and qualitative data differ significantly; the basic differences between them are summarized below:

Quantitative data is depicted in numerical terms, whereas qualitative data is not depicted in numerical terms.
Quantitative data can be shown with numbers and variables such as ratios, percentages, and more, whereas qualitative data describes the behavioral attributes of a person or thing.
Examples of quantitative data: 100%, 1:3, 123. Examples of qualitative data: loud behavior, fair skin, soft quality, and more.

Difference between Discrete and Continuous Data


Discrete data and continuous data both come under quantitative data; the differences between them are summarized below:

Discrete data has clear spaces between values, whereas continuous data falls into a continuous series.
Discrete data is countable, whereas continuous data is measurable.
Discrete data takes distinct or separate values, whereas continuous data includes every value within a range.
Discrete data is depicted using bar graphs, whereas continuous data is depicted using histograms.
Ungrouped frequency distribution of discrete data is performed against a single value, whereas grouped frequency distribution of continuous data is performed against a value group.

Data Sources:


A data source is the location where the data being used originates. A data
source may be the initial location where data is born or where physical information
is first digitized; however, even highly refined data may serve as a source, as long
as another process accesses and utilizes it.
Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by a
database management system (DBMS).
Types:
Relational Database
NoSQL Database

Files:
Data stored in files, which can be in various formats such as text files, CSV, Excel
Spreadsheets, and more.
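A brief, hedged example of reading file-based sources with pandas; the file names are placeholders, and reading Excel files additionally requires the openpyxl package.

```python
import pandas as pd

df_csv = pd.read_csv("sales.csv")                     # comma-separated text file
df_excel = pd.read_excel("sales.xlsx", sheet_name=0)  # Excel spreadsheet
```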

APIs (Application Programming Interface)


API stands for Application Programming Interface. In the context of APIs, the word
Application refers to any software with a distinct function. Interface can be thought
of as a contract of service between two applications. This contract defines how the
two communicate with each other using requests and responses.
Types:
Web APIs: Allow access to data over HTTP (eg. RESTful APIs) and usually return
data in JSON or XML format.
Library APIs: APIs provided by programming libraries to access specific functions
and data.
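A minimal sketch of calling a web API with the requests library; the URL and parameters are hypothetical, and real APIs usually require an API key or token.

```python
import requests

response = requests.get("https://api.example.com/v1/items",   # placeholder endpoint
                        params={"limit": 10},
                        timeout=10)
response.raise_for_status()      # raise an error for non-2xx responses
data = response.json()           # most RESTful APIs return JSON
```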

Web Scraping
Web scraping is the process of using bots to extract content and data from a website.
Unlike screen scraping, which only copies pixels displayed on screen, web scraping
extracts underlying HTML code, and, with it, data stored in a database. The scraper
can then replicate entire website content elsewhere.
Usage: Extracting news articles, product information, reviews, and more from
websites.
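A hedged sketch of web scraping with requests and BeautifulSoup; the URL is a placeholder, the assumption that headlines sit in h2 tags is purely illustrative, and any real scraper should respect the site's robots.txt and terms of use.

```python
import requests
from bs4 import BeautifulSoup    # pip install beautifulsoup4

html = requests.get("https://example.com/news", timeout=10).text   # placeholder URL
soup = BeautifulSoup(html, "html.parser")

# Assume each headline on this particular page is inside an <h2> tag
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headlines)
```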


Sensors

A sensor is a device that detects and responds to some type of input from the physical
environment. The input can be light, heat, motion, moisture, pressure, or any number
of other environmental phenomena. Sensors collect data from the environment or
devices, providing valuable information for various applications and IoT projects.
In the context of data science, sensor data is valuable for IoT applications,
environmental monitoring, healthcare, manufacturing, and more.

Social Media

Social Media platforms generate vast amounts of data daily including text messages,
videos, and user engagement metrics.
Usage: Analyzing trends, sentiments, user behavior, and engagement patterns.

__________________________________________________________________

Chapter Ends…


Chapter-3
Data Preprocessing
Data Preprocessing: Data cleaning: handling missing values, outliers, duplicates;
Data transformation: scaling, normalization, encoding categorical variables;
Feature selection: selecting relevant features/columns;
Data merging: combining multiple datasets.

Data cleaning: Data cleaning is one of the important parts of machine learning. It
plays a significant part in building a model.
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML
model. Professional data scientists usually invest a very large portion of their time
in this step because of the belief that “Better data beats fancier algorithms”.
Data cleaning is essential because raw data is often noisy, incomplete, and
inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.
Data cleaning involves the systematic identification and correction of errors,
inconsistencies, and inaccuracies within a dataset, encompassing tasks such as
handling missing values, removing duplicates, and addressing outliers. This
meticulous process is essential for enhancing the integrity of analyses, promoting
more accurate modeling, and ultimately facilitating informed decision-making based
on trustworthy and high-quality data.

Steps to Perform Data Cleaning


Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset.
• Removal of Unwanted Observations: Identify and eliminate irrelevant
or redundant observations from the dataset. This step involves scrutinizing
data entries for duplicate records, irrelevant information, or data points that
do not contribute meaningfully to the analysis. Removing unwanted
observations streamlines the dataset, reducing noise and improving the
overall quality.
• Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity
in data representation. Fixing structure errors enhances data consistency
and facilitates accurate analysis and interpretation.
• Managing Unwanted outliers: Identify and manage outliers, which are
data points significantly deviating from the norm. Depending on the
context, decide whether to remove outliers or transform them to minimize
their impact on analysis. Managing outliers is crucial for obtaining more
accurate and reliable insights from the data.
• Handling Missing Data: Devise strategies to handle missing data
effectively. This may involve imputing missing values based on statistical
methods, removing records with missing values, or employing advanced
imputation techniques. Handling missing data ensures a more complete
dataset, preventing biases and maintaining the integrity of analyses.

Handling missing values:

Identify the Missing Data Values


Most analytics projects will encounter three possible types of missing data values,
depending on whether there’s a relationship between the missing data and the other
data in the dataset:

• Missing completely at random (MCAR): In this case, there may be no
pattern as to why a column's data is missing. For example, survey data may be
missing because someone could not make it to an appointment, or an
administrator misplaced the test results they were supposed to enter into the
computer. The reason for the missing values is unrelated to the data in the dataset.
• Missing at random (MAR): In this scenario, the reason the data is missing
in a column can be explained by the data in other columns. For example, a
school student who scores above the cutoff is typically given a grade. So a
missing grade for a student can be explained by the column that has scores
below the cutoff. The reason for these missing values can be described by data
in another column.
• Missing not at random (MNAR): Sometimes, the missing value is related to
the value itself. For example, higher-income people may not disclose their
incomes. Here, there is a correlation between the missing values and the actual
income. The missing values are not dependent on other variables in the
dataset.

Handling Missing Data Values

The first common strategy for dealing with missing data is to delete the rows with
missing values. Typically, any row which has a missing value in any cell gets
deleted. However, this often means many rows will get removed, leading to loss of
information and data. Therefore, this method is typically not used when there are
few data samples.
We can also impute the missing data. This can be based solely on information in the
column that has missing values, or it can be based on other columns present in the
dataset.
Finally, we can use classification or regression models to predict missing values.

1. Missing Values in Numerical Columns


The first approach is to replace the missing value with one of the following
strategies:

• Replace it with a constant value. This can be a good approach when used in
discussion with the domain expert for the data we are dealing with.
• Replace it with the mean or median. This is a decent approach when the data
size is small—but it does add bias.
• Replace it with values by using information from other columns.
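A minimal pandas sketch of the replacement strategies above; the column names and the constant are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 32, 41, None],
                   "salary": [50000, 62000, None, 58000, 61000]})

df["age"] = df["age"].fillna(30)                           # constant agreed with a domain expert
df["salary"] = df["salary"].fillna(df["salary"].median())  # mean() is an alternative, but adds bias
```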

2. Predicting Missing Values Using an Algorithm


Another way to predict missing values is to create a simple regression model. The
column to predict here is the Salary, using other columns in the dataset. If there are
missing values in the input columns, we must handle those conditions when creating
the predictive model. A simple way to manage this is to choose only the features that
do not have missing values, or to take the rows that do not have missing values in any
of the cells.
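A hedged sketch of this idea with scikit-learn; the experience/salary columns are illustrative, not taken from a specific dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"experience": [1, 3, 5, 7, 9],
                   "salary": [30000, 45000, None, 72000, None]})

known = df[df["salary"].notna()]     # rows where the target is present
missing = df[df["salary"].isna()]    # rows whose salary we want to predict

model = LinearRegression().fit(known[["experience"]], known["salary"])
df.loc[df["salary"].isna(), "salary"] = model.predict(missing[["experience"]])
```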

3. Missing Values in Categorical Columns


Dealing with missing data values in categorical columns is a lot easier than in
numerical columns. Simply replace the missing value with a constant value or the
most popular category. This is a good approach when the data size is small, though
it does add bias.
For example, suppose we have a column for Education with two possible values: High School
and College. If there are more people with a college degree in the dataset, we can
replace the missing values with "College", as in the sketch below.
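A short pandas sketch of filling a categorical column with its most frequent value; the Education column and its values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"education": ["College", "High School", None, "College", None]})

# mode() returns the most frequent category; take the first entry in case of ties
df["education"] = df["education"].fillna(df["education"].mode()[0])
```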

2. Handling Duplicates:
The simplest and most straightforward way to handle duplicate data is to delete it.
This can reduce the noise and redundancy in our data, as well as improve the
efficiency and accuracy of our models. However, we need to be careful and make
sure that we are not losing any valuable or relevant information by removing
duplicate data. We also need to consider the criteria and logic for choosing which
duplicates to keep or discard. For example, we can use the df.drop_duplicates()
method in pandas to remove duplicate rows, specifying the subset, keep,
and inplace arguments.
Removing duplicates:
In python using Pandas: df.drop_duplicates()
In SQL: Use DISTINCT keyword in SELECT statement.
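A small pandas sketch of the drop_duplicates() call mentioned above; the DataFrame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3],
                   "city": ["Pune", "Mumbai", "Mumbai", "Delhi"]})

deduped = df.drop_duplicates(keep="first")                  # drop fully duplicated rows
by_city = df.drop_duplicates(subset=["city"], keep="last")  # duplicates judged on one column only
```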

3. Outliers Detection and Treatment


Outlier detection is the process of detecting outliers, i.e., data points that are far away
from the average; how we treat them depends on what we are trying to accomplish.
Detecting and appropriately dealing with outliers is essential in data science to
ensure that statistical analysis and machine learning models are not unduly
influenced, and the results are accurate and reliable.
Techniques used for outlier detection.


A data scientist can use several techniques to identify outliers and decide if they are
errors or novelties.
Numeric outlier (IQR method)
This is the simplest nonparametric technique, used when data is in a one-dimensional
space. The data is divided into quartiles, and the range limits are set as the upper and
lower whiskers of a box plot (typically 1.5 times the interquartile range beyond the
first and third quartiles). Data that falls outside those limits can be removed.
Z-score
This parametric technique indicates how many standard deviations a certain point of
data is from the sample’s mean. This assumes a gaussian distribution (a normal, bell-
shaped curve). However, if the data is not normally distributed, data can be
transformed by scaling it, and giving it a more normal appearance. The z-score of
data points is then calculated, placed on the bell curve, and then using heuristics (rule
of thumb) a cut-off point for thresholds of standard deviation can be decided. Then,
the data points that lie beyond that standard deviation can be classified as outliers
and removed from the equation. The Z-score is a simple, powerful way to remove
outliers, but it is only useful with medium to small data sets. It can’t be used for
nonparametric data.
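A minimal sketch of z-score-based outlier removal with pandas and NumPy; the data and the cutoff of 2 are illustrative (3 is the more common rule of thumb on larger samples).

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])   # 95 looks suspicious

z = (s - s.mean()) / s.std()              # how many standard deviations from the mean
cutoff = 2                                # heuristic threshold
outliers = s[np.abs(z) > cutoff]
cleaned = s[np.abs(z) <= cutoff]
```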

DBSCAN
This is Density Based Spatial Clustering of Applications with Noise, which is
basically a graphical representation showing density of data. Using complex
calculations, it clusters data together in groups of related points. DBSCAN groups
data into core points, border points, and outliers. Core points are main data groups,
border points have enough density to be considered part of the data group, and
outliers are in no cluster at all, and can be disregarded from data.
Isolation forest
This method is effective for finding novelties and outliers. It uses binary decision
trees which are constructed using randomly selected features and a random split
value. The forest trees then form a tree forest, which is averaged out. Then, outlier
scores can be calculated, giving each node, or data point, a score from 0 to 1, 0 being
normal and 1 being more of an outlier.
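A hedged scikit-learn sketch of Isolation Forest on synthetic data; note that scikit-learn reports anomaly scores on its own scale rather than strictly 0 to 1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.concatenate([rng.normal(0, 1, size=(100, 2)),   # dense cluster of normal points
                    [[8.0, 8.0], [-9.0, 7.0]]])        # two obvious outliers

iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)               # +1 = inlier, -1 = outlier
scores = iso.decision_function(X)     # lower scores mean more anomalous
```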


Visualization for Outlier Detection

We can use the box plot, or the box and whisker plot, to explore the dataset and
visualize the presence of outliers. The points that lie beyond the whiskers are
detected as outliers.

Handling Outliers
a. Removing Outliers
i) Listwise deletion: Remove the rows that contain outliers.
ii) Trimming: Remove extreme values, cutting a certain percentage (e.g., 1%
or 5%) of the data from the tails.
b. Transforming Outliers
i) Winsorization: Cap or replace outliers with values at a specified percentile.
ii) Log Transformation: Apply a log transformation to reduce the impact of
extreme values.
c. Imputation
Impute outliers with a value derived from statistical measures (mean, median)
or more advanced imputation methods.
d. Treating as an Anomaly: Treat outliers as anomalies and analyze them
separately. This is common in fraud detection or network security. (A short
sketch of the removal and transformation options follows below.)
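A short sketch combining the box-plot (IQR) rule with the treatments above: trimming, capping, and a log transform; the data is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # box-plot whisker limits

trimmed = s[(s >= lower) & (s <= upper)]        # remove the outliers
capped = s.clip(lower=lower, upper=upper)       # cap them at the fences (a winsorization-style fix)
log_scaled = np.log1p(s)                        # shrink the impact of extreme values
```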

Data Transformation:

Data transformation is the process of converting data from one format or


structure into another format or structure. It is a fundamental aspect of most
data integration and data management tasks such as data wrangling, data
warehousing, data integration and application integration.
Data transformation is part of an ETL process and refers to preparing data for
analysis and modeling. This involves cleaning (removing duplicates, fill-in
missing values), reshaping (converting currencies, pivot tables), and
computing new dimensions and metrics.

Data transformation techniques include scaling, normalization, and encoding


categorical variables.


1. Scaling: Scaling is the process of transforming the features of a dataset


so that they fall within a specific range. Scaling is useful when we want
to compare two different variables on equal grounds. This is especially
useful with variables which use distance measures. For example,
models that use Euclidean Distance are sensitive to the magnitude of
distance, so scaling helps even with the weight of all the features. This
is important because if one variable is more heavily weighted than the
other, it introduces bias into our analysis.

Min-Max Scaling:
Min-Max scaling rescales the values of a column to a fixed range, usually [0, 1] or
[-1, 1], by subtracting the column minimum from each value and dividing by the
range (maximum minus minimum). A drawback of bounding the data to a small,
fixed range is that we end up with smaller standard deviations, which suppresses
the weight of outliers in our data.

Standardization (Z-Score Normalization):

Standardization is used to compare features that have different units or scales. Each
value is transformed by subtracting a measure of location (the mean, x̅) and dividing
by a measure of scale (the standard deviation, σ): z = (x - x̅) / σ.
This transforms the data so that the resulting distribution has a mean of 0 and a
standard deviation of 1. This method is useful (in comparison to normalization) when
we have important outliers in our data and we don't want to remove them and lose
their impact.
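A brief scikit-learn sketch of both transforms on a toy matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 900.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # each column squeezed into [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column rescaled to mean 0, std 1
```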

2. Normalization

Data normalization is a technique used in data mining to transform the values of a


dataset into a common scale. This is important because many machine learning
algorithms are sensitive to the scale of the input features and can produce better
results when the data is normalized.

There are several different normalization techniques that can be used in data mining,
including:

1. Min-Max normalization: This technique scales the values of a feature to


a range between 0 and 1. This is done by subtracting the minimum value
of the feature from each value, and then dividing it by the range of the
feature.
2. Z-score normalization: This technique scales the values of a feature to
have a mean of 0 and a standard deviation of 1. This is done by subtracting
the mean of the feature from each value, and then dividing it by the
standard deviation.
3. Decimal Scaling: This technique scales the values of a feature by dividing
the values of a feature by a power of 10.
4. Logarithmic transformation: This technique applies a logarithmic
transformation to the values of a feature. This can be useful for data with
a wide range of values, as it can help to reduce the impact of outliers.


5. Root transformation: This technique applies a square root transformation


to the values of a feature. This can be useful for data with a wide range of
values, as it can help to reduce the impact of outliers.
Note: Normalization should be applied only to the input features, not the target
variable, and different normalization techniques may work better for different
types of data and models.

Note: The main difference between normalizing and scaling is that in normalization
you are changing the shape of the distribution and in scaling you are changing the
range of your data. Normalizing is a useful method when you know the distribution
is not Gaussian. Normalization adjusts the values of your numeric data to a common
scale without changing the range whereas scaling shrinks or stretches the data to fit
within a specific range.
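A small NumPy/pandas sketch of the decimal-scaling, logarithmic, and root transformations listed above; the values are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, 9.0, 100.0, 9000.0])

log_t = np.log1p(s)        # logarithmic transformation (log(1 + x) avoids log(0))
root_t = np.sqrt(s)        # square-root transformation
j = int(np.ceil(np.log10(s.abs().max())))   # smallest power of 10 exceeding the max value
decimal_t = s / 10 ** j    # decimal scaling: all values now fall within (-1, 1)
```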

3. Encoding Categorical Variables

The process of encoding categorical data into numerical data is called


“categorical encoding.” It involves transforming categorical variables into a
numerical format suitable for machine learning models.

1. Label Encoding: Label Encoding is a technique that is used to convert


categorical columns into numerical ones so that they can be fitted by
machine learning models which only take numerical data. It is an
important preprocessing step in a machine-learning project.

Example Of Label Encoding


Suppose we have a column Height in some dataset that has elements as Tall,
Medium, and short. To convert this categorical column into a numerical column we
will apply label encoding to this column. After applying label encoding, the Height
column is converted into a numerical column having elements 0,1, and 2 where 0 is
the label for tall, 1 is the label for medium, and 2 is the label for short height.


Height (category)    Height (encoded)

Tall                 0

Medium               1

Short                2
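A short sketch of label encoding in pandas and scikit-learn; note that scikit-learn's LabelEncoder assigns integers alphabetically, so its codes differ from the hand-made mapping in the table above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Height": ["Tall", "Medium", "Short", "Tall"]})

# Explicit mapping reproduces the table above (Tall=0, Medium=1, Short=2)
df["Height_mapped"] = df["Height"].map({"Tall": 0, "Medium": 1, "Short": 2})

# LabelEncoder sorts categories alphabetically: Medium=0, Short=1, Tall=2
df["Height_encoded"] = LabelEncoder().fit_transform(df["Height"])
```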

2. One-Hot Encoding: One hot encoding is a technique that we use to represent


categorical variables as numerical values in a machine learning model. One-
hot encoding is used when there is no ordinal relationship among the
categories, and each category is treated as a separate independent feature. It
creates a binary column for each category, where a “1” indicates the presence
of the category and “0” its absence. This method is suitable for nominal
variables.

Advantages:
• It allows the use of categorical variables in models that require numerical
input.
• It can improve model performance by providing more information to the
model about the categorical variable.
• It can help to avoid the problem of ordinality, which can occur when a
categorical variable has a natural ordering (e.g. "small", "medium", "large").

One Hot Encoding Examples


In One Hot Encoding, the categorical parameters will prepare separate columns for
both Male and Female labels. So, wherever there is a Male, the value will be 1 in the
Male column and 0 in the Female column, and vice-versa. Let’s understand with an
example: Consider the data where fruits, their corresponding categorical values, and
prices are given.

Fruit Categorical value of fruit Price

apple 1 5

mango 2 10

apple 1 15

orange 3 20

The output after applying one-hot encoding on the data is given as follows,


apple mango orange price

1 0 0 5

0 1 0 10

1 0 0 15

0 0 1 20
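A minimal pandas sketch that reproduces the table above; pandas prefixes the new columns with the original column name (Fruit_apple, Fruit_mango, Fruit_orange).

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["apple", "mango", "apple", "orange"],
                   "Price": [5, 10, 15, 20]})

# One 0/1 column per category
encoded = pd.get_dummies(df, columns=["Fruit"], dtype=int)
print(encoded)
```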

The disadvantages of using one hot encoding include:


1. It can lead to increased dimensionality, as a separate column is created for
each category in the variable. This can make the model more complex and
slower to train.

2. It can lead to sparse data, as most observations will have a value of 0 in


most of the one-hot encoded columns.

3. It can lead to overfitting, especially if there are many categories in the


variable and the sample size is relatively small.


Note: One-hot encoding is a powerful technique for treating categorical data, but it
can lead to increased dimensionality, sparsity, and overfitting. It is important
to use it cautiously and to consider other methods such as ordinal encoding or
binary encoding.

3. Binary Encoding:
Binary encoding combines elements of label encoding and one-hot encoding. It first
assigns unique integer labels to each category and then represents these labels in
binary form. It’s especially useful when we have many categories, reducing the
dimensionality compared to one hot encoding.

4. Frequency Encoding (Count Encoding)


Frequency encoding replaces each category with the count of how often it appears
in the dataset. This can be useful when we suspect that the frequency of a category
is related to the target variable.

5. Target Encoding (Mean Encoding)


Target encoding is used when we want to encode categorical variables based on their
relationship with the target variable. It replaces each category with the mean of the
target variable for that category.
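A compact pandas sketch of frequency and target encoding; the city and price columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi", "Pune"],
                   "price": [10, 20, 12, 30, 11]})

# Frequency (count) encoding: each category replaced by how often it occurs
df["city_freq"] = df["city"].map(df["city"].value_counts())

# Target (mean) encoding: each category replaced by the mean target value for it
df["city_target"] = df["city"].map(df.groupby("city")["price"].mean())
```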

Feature Selection

A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection.
Each machine learning process depends on feature engineering, which mainly
contains two processes, which are Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective,
both are completely different from each other. The main difference between them is
that feature selection is about selecting the subset of the original feature set, whereas
feature extraction creates new features.
Feature selection is a way of reducing the input variable for the model by using only
relevant data to reduce overfitting in the model.


We can define feature Selection as, "It is a process of automatically or manually


selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.

Feature Selection Techniques


There are mainly two types of Feature Selection techniques, which are:

o Supervised Feature Selection technique


Supervised Feature selection techniques consider the target variable and can
be used for the labeled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can
be used for the unlabeled dataset.

There are mainly three techniques under supervised feature Selection:


1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search
problem, in which different combinations are made, evaluated, and compared with
other combinations. It trains the algorithm by using the subset of features iteratively.
Based on the output of the model, features are added or subtracted, and with this
feature set, the model has trained again.

Some techniques of wrapper methods are:

o Forward selection - Forward selection is an iterative process, which begins


with an empty set of features. After each iteration, it keeps adding on a feature
and evaluates the performance to check whether it is improving the
performance or not. The process continues until the addition of a new
variable/feature does not improve the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach,
but it is the opposite of forward selection. This technique begins the process
by considering all the features and removes the least significant feature. This
elimination process continues until removing the features does not improve
the performance of the model.


o Exhaustive Feature Selection - Exhaustive feature selection is one of the most
thorough feature selection methods; it evaluates each feature set by brute force.
That is, the method tries every possible combination of features and returns
the best-performing feature set.
o Recursive Feature Elimination -
Recursive feature elimination is a recursive greedy optimization approach in
which features are selected by recursively considering smaller and smaller
subsets of features. An estimator is trained on each set of features, and the
importance of each feature is determined using the coef_ attribute or the
feature_importances_ attribute.
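A hedged scikit-learn sketch of recursive feature elimination on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Repeatedly drop the weakest feature until only 4 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 means the feature was kept
```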

2. Filter Methods
In the Filter Method, features are selected based on statistics measures. This method
does not depend on the learning algorithm and chooses the features as a pre-
processing step.
The filter method filters out the irrelevant features and redundant columns from the
model by using different metrics through ranking.
The advantage of using filter methods is that it needs low computational time and
does not overfit the data.


Some common techniques of Filter methods are as follows:

o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio

Information Gain: Information gain determines the reduction in entropy while


transforming the dataset. It can be used as a feature selection technique by
calculating the information gain of each variable with respect to the target variable.
Chi-square Test: Chi-square test is a technique to determine the relationship between
the categorical variables. The chi-square value is calculated between each feature
and the target variable, and the desired number of features with the best chi-square
value is selected.
Fisher's Score:
Fisher's score is one of the popular supervised techniques of feature selection. It
returns the rank of the variable on the fisher's criteria in descending order. Then we
can select the variables with a large fisher's score.


Missing Value Ratio:


The value of the missing value ratio can be used for evaluating the feature set against
the threshold value. The formula for obtaining the missing value ratio is the number
of missing values in each column divided by the total number of observations. The
variable having more than the threshold value can be dropped.
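A brief scikit-learn sketch of two of these filter-method scores on the Iris dataset (the chi-square test requires non-negative features, which holds here).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square statistic with respect to the target
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
X_selected = selector.transform(X)
print(selector.scores_)

# Information gain can be estimated with mutual information
info_gain = mutual_info_classif(X, y, random_state=0)
print(info_gain)
```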
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering the interaction of features along with low computational cost. They are
fast, like filter methods, but more accurate than filter methods.

These methods are also iterative, which evaluates each iteration, and optimally finds
the most important features that contribute the most to training in a particular
iteration. Some techniques of embedded methods are:

o Regularization- Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty is applied to the coefficients; L1 regularization can shrink some coefficients all the way to zero, and those features with zero coefficients can be removed from the dataset. Common regularization techniques are L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net (a combination of L1 and L2).


o Random Forest Importance - Different tree-based methods of feature


selection help us with feature importance to provide a way of selecting
features. Here, feature importance specifies which feature has more
importance in model building or has a great impact on the target variable.
Random Forest is such a tree-based method, which is a type of bagging
algorithm that aggregates a different number of decision trees. It automatically
ranks the nodes by their performance or decrease in the impurity (Gini
impurity) over all the trees. Nodes are arranged as per the impurity values,
and thus it allows pruning of trees below a specific node. The remaining nodes
create a subset of the most important features.
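As a hedged illustration of both embedded approaches, the sketch below fits a Lasso model (whose zeroed coefficients mark removable features) and a random forest (whose impurity-based importances rank features); the diabetes dataset and hyperparameters are arbitrary choices.

```python
# A hedged sketch of the two embedded approaches described above:
# L1 (Lasso) regularization that can shrink coefficients to zero, and
# random-forest feature importances. Dataset and hyperparameters are
# illustrative assumptions, not a prescribed recipe.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # Lasso works best on scaled inputs

# Features whose Lasso coefficient is exactly zero can be dropped.
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
print("Zeroed-out feature indices:", np.where(lasso.coef_ == 0)[0])

# Random forest ranks features by their mean decrease in impurity.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("Importances:", forest.feature_importances_.round(3))
```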

Data Merging: Combining Multiple Datasets


The most common method for merging data is through a process called “joining”.
There are several types of joins.

• Inner Join: Uses a comparison operator to match rows from two tables that
are based on the values in common columns from each table.
• Left join/left outer join.
Returns all the rows from the left table that are specified in the left outer join
clause, not just the rows in which the columns match.
• Right join/right outer join
Returns all the rows from the right table that are specified in the right outer
join clause, not just the rows in which the columns match.
• Full outer join
Returns all the rows in both the left and right tables.
• Cross joins (cartesian join)
Returns all possible combinations of rows from two tables.
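In pandas, all of these joins are expressed through the how argument of merge. The toy employee/department tables below are invented purely to show the calls.

```python
# A small pandas sketch of the join types listed above. The two toy tables
# (employees and departments) are made up for illustration.
import pandas as pd

employees = pd.DataFrame({"emp": ["Asha", "Bela", "Chirag"], "dept_id": [1, 2, 4]})
departments = pd.DataFrame({"dept_id": [1, 2, 3], "dept": ["HR", "IT", "Sales"]})

inner = employees.merge(departments, on="dept_id", how="inner")   # only matching dept_id
left  = employees.merge(departments, on="dept_id", how="left")    # all employees
right = employees.merge(departments, on="dept_id", how="right")   # all departments
full  = employees.merge(departments, on="dept_id", how="outer")   # all rows from both
cross = employees.merge(departments, how="cross")                 # cartesian product

print(full)
```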
____________________________________________________________

Chapter Ends…


Chapter-4
Data Wrangling and Feature Engineering
Data Wrangling and Feature Engineering: Data wrangling techniques: reshaping, pivoting, aggregating; Feature engineering: creating new features, handling time-series data; Dummification: converting categorical variables into binary indicators; Feature scaling: standardization, normalization

Data Wrangling and Feature Engineering

Data Wrangling:
A data wrangling process, also known as a data munging process, consists of
reorganizing, transforming, and mapping data from one "raw" form into
another to make it more usable and valuable for a variety of downstream uses
including analytics.
Data wrangling can be defined as the process of cleaning, organizing, and
transforming raw data into the desired format for analysts to use for prompt
decision-making. Also known as data cleaning or data munging, data
wrangling enables businesses to tackle more complex data in less time,
produce more accurate results, and make better decisions.

Data Wrangling Tools


• Spreadsheets / Excel Power Query
• OpenRefine
• Tabula
• Google Dataprep
• Datawrangler

Reshaping Data:
Reshaping data involves changing the structure of the dataset. The shape of a data
set refers to the way in which a data set is arranged into rows and columns, and
reshaping data is the rearrangement of the data without altering the content of the
data set. Reshaping data sets is a very frequent and cumbersome task in the process
of data manipulation and analysis.


Common reshaping techniques include:

• Merging (Joining): Combining multiple datasets by a common key or


identifier. This is useful when we have data in different tables or sources.
• Melting (Unpivoting): Transforming a dataset from wide format (many
columns) to long format (fewer columns but more rows). This is useful when
we have data with multiple variables in separate columns.

Pivoting
Data pivoting enables us to rearrange the columns and rows in a report so we
can view data from different perspectives. Common pivoting techniques
include:
• Pivot Tables: A PivotTable is an interactive way to quickly summarize
large amounts of data. You can use a PivotTable to analyze numerical
data in detail and answer unanticipated questions about your data.
PivotTable is especially designed for: Querying large amounts of data
in many user-friendly ways.
• Crosstabs (Contingency Tables): A contingency table (also known as
a cross tabulation or crosstab) is a type of table in a matrix format that
displays the multivariate frequency distribution of the variables. They
are heavily used in survey research, business intelligence, engineering,
and scientific research.
• Transpose: This simple operation flips rows and columns, making the
data easier to work with in some cases.
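A brief pandas sketch of these pivoting operations, using a made-up sales table, is shown below.

```python
# A brief pandas illustration of the pivoting techniques above: a pivot table,
# a crosstab, and a transpose. The sales data frame is a made-up example.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East"],
    "product": ["Pen", "Book", "Pen", "Book", "Pen"],
    "amount":  [100, 250, 80, 300, 120],
})

# Pivot table: summarize amount by region (rows) and product (columns).
pivot = sales.pivot_table(index="region", columns="product",
                          values="amount", aggfunc="sum", fill_value=0)

# Crosstab: frequency counts of region vs. product.
counts = pd.crosstab(sales["region"], sales["product"])

# Transpose: flip rows and columns of the pivot table.
print(pivot.T)
```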

Data Aggregation
Data aggregation is the process of compiling typically [large] amounts of
information from a given database and organizing it into a more consumable
and comprehensive medium. A common statistical data aggregation is reducing
a distribution of values to a mean and standard deviation. Another example of
data reduction is frequency tables.
A histogram is an example of aggregation for exploration. Histograms count
(aggregate) the number of observations that fall into bins. While some data is
lost in this aggregation, it also provides a very useful visualization of the
distribution of a set of values.
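The following small sketch shows both kinds of aggregation mentioned above with pandas: reducing groups to a mean and standard deviation, and counting observations per bin as a histogram would. The scores table is invented.

```python
# A short aggregation sketch: reduce each group of a (made-up) data frame to a
# mean and standard deviation, and bin a numeric column as a histogram would.
import pandas as pd

scores = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "marks": [56, 72, 88, 64, 90],
})

# Group-wise mean and standard deviation.
summary = scores.groupby("class")["marks"].agg(["mean", "std"])
print(summary)

# Frequency table over bins (the aggregation behind a histogram).
bins = pd.cut(scores["marks"], bins=[0, 60, 80, 100])
print(bins.value_counts().sort_index())
```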


Feature Engineering: Creating New features handling Time-Series data.

Feature engineering involves a set of techniques that enable us to create new


features by combining or transforming the existing ones. These techniques
help to highlight the most important patterns and relationships in the data,
which in turn helps the machine learning model to learn from the data more
effectively.

Feature engineering is the pre-processing step of machine learning, which is


used to transform raw data into features that can be used for creating a
predictive model using Machine learning or statistical Modeling. Feature
engineering in machine learning aims to improve the performance of models.
Feature engineering in ML contains mainly four processes: Feature
Creation, Transformations, Feature Extraction, and Feature Selection.

These processes are described as below:

1. Feature Creation: Feature creation is finding the most useful variables to be


used in a predictive model. The process is subjective, and it requires human
creativity and intervention. The new features are created by mixing existing
features using addition, subtraction, and ratios, and these new features have
great flexibility.
2. Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model can flexibly take a variety of data as input, and that all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and ensures that all the features are within an acceptable range to avoid computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering
process that generates new variables by extracting them from the raw data.
The main aim of this step is to reduce the volume of data so that it can be
easily used and managed for data modeling. Feature extraction methods
include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
4. Feature Selection: While developing the machine learning model, only a few
variables in the dataset are useful for building the model, and the rest features
are either redundant or irrelevant. If we input the dataset with all these
redundant and irrelevant features, it may negatively impact and reduce the
overall performance and accuracy of the model. Hence it is very important to
identify and select the most appropriate features from the data and remove the
irrelevant or less important features, which is done with the help of feature
selection in machine learning. "Feature selection is a way of selecting the
subset of the most relevant features from the original features set by removing
the redundant, irrelevant, or noisy features."

Feature Engineering Techniques


Some of the popular feature engineering techniques include:

1. Imputation

Feature engineering deals with inappropriate data, missing values, human


interruption, general errors, insufficient data sources, etc. Missing values within the
dataset highly affect the performance of the algorithm, and to deal with them an
"Imputation" technique is used. Imputation is responsible for handling irregularities
within the dataset.
For example, rows or columns with a huge percentage of missing values can be removed entirely. But at the same time, to maintain the data size, it is often necessary to impute the missing data, which can be done as follows:


o For numerical data imputation, a default value can be imputed in a column,


and missing values can be filled with means or medians of the columns.
o For categorical data imputation, missing values can be interchanged with the
maximum occurred value in a column.
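A minimal pandas sketch of both imputation rules, on an invented data frame with a few missing entries:

```python
# A minimal imputation sketch with pandas: means/medians for numeric columns
# and the most frequent value for a categorical column. The data frame and the
# missing entries are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 31, 40, np.nan],
    "city": ["Pune", "Mumbai", None, "Mumbai", "Pune"],
})

# Numerical imputation: fill with the column mean (or median).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical imputation: fill with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```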

2. Handling Outliers

Outliers are the deviated values or data points that are observed too away from other
data points in such a way that they badly affect the performance of the model.
Outliers can be handled with this feature engineering technique. This technique first
identifies the outliers and then removes them.
Standard deviation can be used to identify outliers. Each value lies at some distance from the average; if that distance is greater than a chosen threshold (for example, three standard deviations), the value can be considered an outlier. The z-score can also be used to detect outliers.

3. Log transform
Logarithm transformation or log transform is one of the commonly used
mathematical techniques in machine learning. Log transform helps in handling the
skewed data, and it makes the distribution more approximate to normal after
transformation. It also reduces the effects of outliers on the data, as because of the
normalization of magnitude differences, a model becomes much more robust.

4. Binning
In machine learning, overfitting is one of the main issues that degrades the
performance of the model, and which occurs due to a greater number of parameters
and noisy data. However, one of the popular techniques of feature engineering,
"binning", can be used to normalize the noisy data. This process involves segmenting
different features into bins.
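The sketch below ties the last three techniques together on a made-up income column: a z-score filter for outliers, a log transform for skewness, and binning with pd.cut. The cut points and the z threshold of 2 are arbitrary choices for this tiny sample.

```python
# A combined sketch of z-score outlier removal, a log transform for skewed
# values, and binning. The income column is made up.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [22000, 25000, 27000, 30000, 32000, 900000]})

# Handling outliers: keep rows whose z-score magnitude is below a threshold
# (2 here, chosen for this tiny sample; 3 is a common default on larger data).
z = (df["income"] - df["income"].mean()) / df["income"].std()
df_no_outliers = df[z.abs() < 2]

# Log transform: log1p compresses the long right tail of skewed data.
df["log_income"] = np.log1p(df["income"])

# Binning: segment the numeric values into labelled ranges.
df["income_band"] = pd.cut(df["income"], bins=[0, 26000, 40000, np.inf],
                           labels=["low", "mid", "high"])
print(df)
```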

5. Feature Split
As the name suggests, feature split is the process of splitting a feature into two or more parts to make new features. This technique helps the algorithms to better understand and learn the patterns in the dataset.


The feature splitting process enables the new features to be clustered and binned,
which results in extracting useful information and improving the performance of the
data models.

6. One hot encoding


One hot encoding is the popular encoding technique in machine learning. It is a
technique that converts categorical data in a form so that they can be easily
understood by machine learning algorithms and hence can make a good prediction.
It enables grouping of categorical data without losing any information.
Handling time-series data: a typical approach to feature engineering in time-series forecasting involves the following types of features (a small sketch follows the list).

• Lagged variables: A lag variable is a variable based on the past values of the
time series. By incorporating previous time series values as features, patterns
such as seasonality and trends can be captured. For example, if we want to
predict today's sales, using lagged variables like yesterday’s sales can provide
valuable information about the ongoing trend.
• Moving window statistics: Moving statistics can also be called moving
window statistics, rolling statistics, or running statistics. A predefined window
around each dimension value is used to calculate various statistics before
moving to the next.
• Time-based features: features such as the day of the week, the month of the year, holiday indicators, seasonality, and other time-related patterns can be valuable for prediction. For instance, if certain products tend to have higher average sales on weekends, incorporating the day of the week as a feature can improve the accuracy of the forecasting model.
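Here is the sketch referred to above: lag, rolling-window, and calendar features built with pandas on a synthetic daily sales series.

```python
# A hedged sketch of lag, rolling-window, and calendar features on a daily
# sales series. The series itself is synthetic.
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=14, freq="D")
sales = pd.DataFrame({"date": dates, "sales": np.random.randint(80, 120, size=14)})

# Lagged variable: yesterday's sales as a predictor for today.
sales["sales_lag_1"] = sales["sales"].shift(1)

# Moving window statistic: 7-day rolling mean of sales.
sales["sales_roll_mean_7"] = sales["sales"].rolling(window=7).mean()

# Time-based features: day of week and a weekend indicator.
sales["day_of_week"] = sales["date"].dt.dayofweek
sales["is_weekend"] = (sales["day_of_week"] >= 5).astype(int)

print(sales.tail())
```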

Dummification: Converting categorical variables into binary indicators.


The word “dummy” means the act of replication. In the field of data science,
it holds the same meaning. The whole art of dummifying variables in data
science is the process of “transforming the variables into a numerical
representation”.
Example: One-Hot-Encoding
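A minimal example of dummification with pandas get_dummies, on a toy colour column:

```python
# A minimal dummification example with pandas get_dummies; the colour column
# is a toy categorical variable.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Each category becomes a binary indicator column.
dummies = pd.get_dummies(df["colour"], prefix="colour")
print(dummies)

# drop_first=True keeps k-1 indicators and avoids redundant (collinear) columns.
print(pd.get_dummies(df["colour"], prefix="colour", drop_first=True))
```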


Feature Scaling: Standardization, Normalization

Feature scaling is a data preprocessing technique used to transform the values of


features or variables in a dataset to a similar scale. The purpose is to ensure that all
features contribute equally to the model and to avoid the domination of features with
larger values.
Feature scaling becomes necessary when dealing with datasets containing features
that have different ranges, units of measurement, or orders of magnitude. In such
cases, the variation in feature values can lead to biased model performance or
difficulties during the learning process.
There are several common techniques for feature scaling, including standardization,
normalization, and min-max scaling. These methods adjust the feature values while
preserving their relative relationships and distributions.
By applying feature scaling, the dataset’s features can be transformed to a more
consistent scale, making it easier to build accurate and effective machine learning
models. Scaling facilitates meaningful comparisons between features, improves
model convergence, and prevents certain features from overshadowing others based
solely on their magnitude.

Normalization: Normalization, a vital aspect of Feature Scaling, is a data


preprocessing technique employed to standardize the values of features in a dataset,
bringing them to a common scale. This process enhances data analysis and modeling
accuracy by mitigating the influence of varying scales on machine learning models.
Normalization is a scaling technique in which values are shifted and rescaled so that
they end up ranging between 0 and 1. It is also known as Min-Max scaling.
Here's the formula for normalization (Min-Max scaling):
X' = (X − Xmin) / (Xmax − Xmin)
Here, Xmax and Xmin are the maximum and the minimum values of the
feature, respectively.

• When the value of X is the minimum value in the column, the numerator will
be 0, and hence X’ is 0


• On the other hand, when the value of X is the maximum value in the column,
the numerator is equal to the denominator, and thus the value of X’ is 1
• If the value of X is between the minimum and the maximum value, then the
value of X’ is between 0 and 1

Standardization: Standardization is another Feature scaling method where the


values are centered around the mean with a unit standard deviation. This means that
the mean of the attribute becomes zero, and the resultant distribution has a unit
standard deviation.
Here's the formula for standardization:
X' = (X − μ) / σ
Here, μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that, in this case, the values are not restricted to a particular range.

Normalization vs. Standardization

• Range: Normalization rescales values to a range between 0 and 1; standardization centers data around the mean and scales it to a standard deviation of 1.
• When to use: Normalization is useful when the distribution of the data is unknown or not Gaussian; standardization is useful when the distribution is Gaussian or unknown.
• Outliers: Normalization is sensitive to outliers; standardization is less sensitive to outliers.
• Distribution shape: Normalization retains the shape of the original distribution; standardization changes the shape of the original distribution.
• Relationships: Normalization may not preserve the relationships between the data points; standardization preserves the relationships between the data points.
• Equation: Normalization uses (x – min)/(max – min); standardization uses (x – mean)/standard deviation.
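As a quick, hedged illustration, the snippet below applies both scalers from scikit-learn to the same toy column so the two output ranges can be compared.

```python
# A short sketch applying both scalers from scikit-learn to the same toy column.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Normalization (min-max scaling): values end up between 0 and 1.
print(MinMaxScaler().fit_transform(X).ravel())      # [0.   0.25 0.5  0.75 1.  ]

# Standardization: zero mean, unit standard deviation.
print(StandardScaler().fit_transform(X).ravel())
```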

__________________________________________________________________

Chapter Ends…


Chapter-5
Tools and Libraries
Tools and Libraries: Introduction to popular libraries and technologies used,
in Data Science like Pandas, NumPy, Sci-kit Learn, etc.

1. TensorFlow: TensorFlow is a free and open-source software library for


machine learning and artificial intelligence. It can be used across a range of
tasks but has a particular focus on training and inference of deep neural
networks. It was developed by the Google Brain team for Google's internal
use in research and production.

2. Matplotlib: Matplotlib is a multi-platform data visualization library built on


NumPy arrays and designed to work with the broader SciPy stack. It was
introduced by John Hunter in 2002. One of the greatest benefits of
visualization is that it allows us visual access to huge amounts of data in easily
digestible visuals. Matplotlib consists of several plots like line, bar, scatter,
histogram, etc.

3. Pandas: Pandas is built on top of two core Python libraries: matplotlib for data visualization and NumPy for mathematical operations. Pandas acts as a wrapper over these libraries, allowing you to access many of matplotlib's and NumPy's methods with less code.

4. Numpy: NumPy (Numerical Python) is an open-source Python library that's


used in almost every field of science and engineering. It's the universal
standard for working with numerical data in Python, and it's at the core of the
scientific Python and PyData ecosystems.


5. Scipy: SciPy is an open-source Python library that's used in almost every field of science and engineering for optimization, statistics, and signal processing. Like NumPy, SciPy is open source, so we can use it freely. SciPy was created by NumPy's creator, Travis Oliphant.
6. Scrapy: Scrapy is a comprehensive open-source framework and is among the
most powerful libraries used for web data extraction. Scrapy natively
integrates functions for extracting data from HTML or XML sources using
CSS and XPath expressions.

7. Scikit-learn: Scikit-Learn, also known as sklearn, is a Python library for implementing machine learning models and statistical modelling. Through scikit-learn, we can implement various machine learning models for regression, classification, and clustering, as well as statistical tools for analyzing these models.

8. PyGame: Pygame is a cross-platform set of Python modules designed for


writing video games. It includes computer graphics and sound libraries
designed to be used with the Python programming language.

__________________________________________________________

Chapter Ends…


Chapter 6

Exploratory data Analysis (EDA)

Exploratory Data Analysis (EDA): Data visualization techniques:


histograms, scatter plots, box plots, etc., Descriptive statistics: mean, median,
mode, standard deviation, etc., Hypothesis testing: t-tests, chi-square tests,
ANOVA, etc.

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data
visualization methods.
Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.

The Foremost Goals of EDA

1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as record imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles are commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help in identifying patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of different variables and their transformations to create new features or derive meaningful insights. Feature engineering can include scaling, normalization, binning, encoding categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or characteristics. This segmentation helps gain insights into specific subgroups within the data and can lead to more focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the preliminary exploration of the data. It forms the foundation for further analysis and model building.
8. Data Quality Assessment: EDA enables assessment of the quality and reliability of the data. It involves checking for data integrity, consistency, and accuracy to make sure the data is suitable for analysis.

Data Visualization Techniques

Histograms

Histograms are one of the most popular visualizations to analyze the distribution of
data. They show the distribution of a numerical variable with bars. The hist function in Matplotlib is used to create a histogram (an example appears after this description).
To build a histogram, the numerical data is first divided into several ranges or bins,
and the frequency of occurrence of each range is counted. The horizontal axis shows
the range, while the vertical axis represents the frequency or percentage of
occurrences of a range.
Histograms immediately showcase how a variable's distribution is skewed or where
it peaks.
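Here is the histogram example mentioned above, drawn with matplotlib on a synthetic normal sample.

```python
# A minimal matplotlib histogram, assuming a synthetic numeric sample.
import matplotlib.pyplot as plt
import numpy as np

values = np.random.normal(loc=50, scale=10, size=1000)

plt.hist(values, bins=20, edgecolor="black")
plt.xlabel("Value range (bins)")
plt.ylabel("Frequency")
plt.title("Histogram of a numeric variable")
plt.show()
```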


Box and whisker plots.

Another great plot to summarize the distribution of a variable is boxplots. Boxplots


provide an intuitive and compelling way to spot the following elements:

• Median. The middle value of a dataset where 50% of the data is less than the
median and 50% of the data is higher than the median.
• The upper quartile. The 75th percentile of a dataset where 75% of the data
is less than the upper quartile, and 25% of the data is higher than the upper
quartile.
• The lower quartile. The 25th percentile of a dataset where 25% of the data
is less than the lower quartile and 75% is higher than the lower quartile.
• The interquartile range. The upper quartile minus the lower quartile
• The upper adjacent value. Or colloquially, the “maximum.” It represents the
upper quartile plus 1.5 times the interquartile range.
• The lower adjacent value. Or colloquially, the “minimum." It represents the
lower quartile minus 1.5 times the interquartile range.
• Outliers. Any values above the “maximum” or below the “minimum.”


Scatter plots.

Scatter plots are used to visualize the relationship between two continuous variables.
Each point in the plot represents a single data point, and the position of the point on
the x and y-axis represents the values of the two variables. It is often used in data
exploration to understand the data and quickly surface potential correlations.


Heat maps.
A heatmap is a common and beautiful matrix plot that can be used to graphically
summarize the relationship between two variables. The degree of correlation
between two variables is represented by a color code.
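A small sketch of a correlation heatmap with seaborn, on a synthetic data frame in which one column is deliberately correlated with another:

```python
# A small correlation heatmap sketch using seaborn on a synthetic data frame.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 * df["x"] + rng.normal(size=100)       # correlated with x
df["z"] = rng.normal(size=100)                     # independent noise

sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.show()
```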


Mean, Median and Mode


Mean, Median, and Mode are measures of the central tendency. These values are
used to define the various parameters of the given data set. The measure of central
tendency (Mean, Median, and Mode) gives useful insights about the data studied.
These are used to study any type of data such as the average salary of employees in
an organization, the median age of any class, the number of people who play cricket
in a sports club, etc.

Measure of central tendency is the representation of various values of the given data
set. There are various measures of central tendency and the most important three
measures of central tendency are,
• Mean (x̅ or μ)
• Median(M)
• Mode(Z)

Mean is the sum of all the values in the data set divided by the number of values in
the data set. It is also called the Arithmetic Average. Mean is denoted as x̅ and is
read as x bar.

Mean Formula
The formula to calculate the mean is:
Mean (x̅) = Sum of Values / Number of Values
If x1, x2, x3,……, xn are the values of a data set then the mean is calculated as:
x̅ = (x1 + x2 + x3 + …… + xn) / n

Median:
A median is the middle value of sorted data. The sorting can be done either in ascending or descending order. The median divides the data into two equal halves.

The formula for the median is,


If the number of values (n value) in the data set is odd then the formula to calculate
the median is,
Median = [(n + 1)/2]th term
If the number of values (n value) in the data set is even then the formula to calculate
the median is:
Median = [(n/2)th term + {(n/2) + 1}th term] / 2

Mode:
A mode is the most frequent value or item of the data set. A data set can generally
have one or more than one mode value. If the data set has one mode, then it is called
“Uni-modal”. Similarly, If the data set contains 2 modes, then it is called “Bimodal”
and if the data set contains 3 modes, then it is known as “Trimodal”. If the data set
consists of more than one mode, then it is known as “multi-modal” (can be bimodal
or trimodal). There is no mode for a data set if every number appears only once.

Mode Formula
Mode = Highest Frequency Term

Standard Deviation
Standard Deviation is a measure which shows how much variation (spread or dispersion) from the mean exists. The standard deviation indicates a "typical" deviation from the mean. It is a popular measure of variability because it is expressed in the original units of measure of the data set. Like the variance, if the data points are close to the mean, the variation is small, whereas if the data points are spread far from the mean, the variation is high. Standard deviation calculates the extent to which the values differ from the average. Standard Deviation,
the most widely used measure of dispersion, is based on all values. Therefore, a
change in even one value affects the value of standard deviation. It is independent
of origin but not of scale. It is also useful in certain advanced statistical problems.
Standard Deviation Formula
The population standard deviation formula is given as:
σ = √( Σ (Xi − μ)² / N )
Here,


σ = Population standard deviation


N = Number of observations in population
Xi = ith observation in the population
μ = Population mean
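As a quick numerical check of these definitions, the snippet below applies Python's statistics module to a made-up list of six values.

```python
# A quick check of the formulas above on a small made-up data set.
import statistics

data = [4, 8, 6, 5, 3, 8]

print("Mean:", statistics.mean(data))        # sum / count = 34 / 6
print("Median:", statistics.median(data))    # average of the two middle values, 5 and 6
print("Mode:", statistics.mode(data))        # 8 occurs most often
print("Population std dev:", statistics.pstdev(data))
```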

Hypothesis Testing
Hypothesis testing is used to assess the plausibility of a hypothesis by using sample
data. The test provides evidence concerning the plausibility of the hypothesis, given
the data. Statistical analysts test a hypothesis by measuring and examining a random
sample of the population being analyzed.
An analyst performs hypothesis testing on a statistical sample to present evidence of
the plausibility of the null hypothesis. Measurements and analyses are conducted on
a random sample of the population to test a theory. Analysts use a random population
sample to test two hypotheses: the null and alternative hypotheses.
The null hypothesis is typically an equality hypothesis between population parameters; for example, a null hypothesis may claim that the population mean return equals zero. The alternate hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, they are mutually exclusive, and only one can be correct. One of the two possibilities, however, will always be correct.

Hypothesis Testing Formula

Z = ( x̅ – μ0 ) / (σ /√n)

• Here, x̅ is the sample mean,


• μ0 is the population mean,
• σ is the standard deviation,
• n is the sample size.

Let's consider a hypothesis test for the average height of women in the United States. Suppose our null hypothesis is that the average height is 5'4". We gather a sample of 100 women and determine that their average height is 5'5". The population standard deviation is 2 inches.
To calculate the z-score, we would use the following formula:
z = ( x̅ – μ0 ) / (σ /√n)
z = (5'5" - 5'4") / (2" / √100)
z = 1 / 0.2
z = 5
We will reject the null hypothesis, as a z-score of 5 is very large, and conclude that there is evidence to suggest that the average height of women in the US is greater than 5'4".

Null Hypothesis and Alternate Hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.
H0 is the symbol for it, and it is pronounced H-naught.
The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null hypothesis.
H1 is the symbol for it.

Types of Hypothesis Testing

Z Test
To determine whether a discovery or relationship is statistically significant,
hypothesis testing uses a z-test. It usually checks to see if the two means are the same
(the null hypothesis). Only when the population standard deviation is known and the
sample size is 30 data points or more, can a z-test be applied.

T Test
A statistical test called a t-test is employed to compare the means of two groups. To
determine whether two groups differ or if a procedure or treatment affects the
population of interest, it is frequently used in hypothesis testing.


Chi-Square
The Chi-square test analyzes the differences between categorical variables from a
random sample. The test's fundamental premise is that the observed values in our
data should be compared to the predicted values that would be present if the null
hypothesis were true. In other words, the chi square test is a hypothesis testing
method that is used to check whether the variables in a population are independent
or not.

One -tailed Hypothesis testing


A one tailed hypothesis, also known as a directional hypothesis, points to what
direction the effect will appear in, for example if we were studying whether student's
attendance affects their grades, the one tailed hypothesis would be that students with
higher attendance will have significantly higher grades than students with low
attendance. This type of testing is further classified into the right tailed test and left
tailed test.
i) Right tailed test: A right tailed test (sometimes called an upper test) is where our
hypothesis statement contains a greater than (>) symbol. In other words, the
inequality points to the right. For example, we might be comparing the life of
batteries before and after a manufacturing change. If we want to know if the battery
life is greater than the original (let’s say 90 hours), our hypothesis statements might
be:
Null hypothesis: No change or less than (H0 ≤ 90).
Alternate hypothesis: Battery life has increased (H1) > 90.
ii) Left tailed hypothesis testing: A left-tailed test is applied to a data point or
parameter that is less than the reference value. This test looks only at the data that
falls in the left extreme or if the value of the actual population is less than the value
that is hypothesized. For example, we might be comparing the life of batteries before
and after a manufacturing change. If we want to know if the battery life is less than
the original (let’s say 90 hours), our hypothesis statements might be:
Null hypothesis: No change, or battery life is at least 90 hours (H0 ≥ 90).
Alternate hypothesis: Battery life has decreased (H1 < 90).


Two-tailed Hypothesis testing

A two-tailed hypothesis, also known as non-directional, will still predict that there
will be an effect, but will not say what direction it will appear in. For example, in
the same study a two-tailed hypothesis might look like, there will be a significant
difference in the grades of students with high attendance and students with low
attendance.
A two-tailed test is designed to determine whether a claim is true or not given a
population parameter. It examines both sides of a specified data range as designated
by the probability distribution involved. As such, the probability distribution should
represent the likelihood of a specified outcome based on predetermined standards.
The hypothesis can be set up as follows:
Null hypothesis: The population parameter = some value
Alternative hypothesis: the population parameter ≠ some value

Analysis of Variance (ANOVA)

The analysis of variance (ANOVA) is a test of hypothesis that is appropriate to


compare means of a continuous variable in two or more independent comparison
groups. For example, in some clinical trials there are more than two comparison
groups.
ANOVA checks the impact of one or more factors by comparing the means of
different samples.


i) One-way ANOVA: One-Way ANOVA ("analysis of variance") compares the


means of two or more independent groups to determine whether there is statistical
evidence that the associated population means are significantly different.
The variables used in this test are known as:

• Dependent variable
• Independent variable (also known as the grouping variable, or factor)
o This variable divides cases into two or more mutually exclusive levels,
or groups

ii) Two-way ANOVA: A two-way ANOVA is used to estimate how the mean of a
quantitative variable changes according to the levels of two categorical variables.
Use a two-way ANOVA when we want to know how two independent variables, in
combination, affect a dependent variable.
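The sketch below shows how these tests are typically invoked with scipy.stats; the sample arrays and the contingency table are invented solely to demonstrate the calls.

```python
# A hedged sketch of the tests discussed above using scipy.stats; the sample
# arrays are invented and only illustrate the calls.
import numpy as np
from scipy import stats

group_a = np.array([5.1, 5.5, 5.3, 5.8, 5.4])
group_b = np.array([5.9, 6.1, 5.7, 6.3, 6.0])
group_c = np.array([5.0, 5.2, 4.9, 5.1, 5.3])

# Two-sample t-test: do the means of two groups differ?
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print("t-test:", t_stat, p_val)

# Chi-square test of independence on a 2x2 contingency table.
table = np.array([[30, 10], [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print("chi-square:", chi2, p)

# One-way ANOVA: compare the means of three or more groups.
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print("ANOVA:", f_stat, p_anova)
```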

Note: Refer A.I. notes for Machine learning


Introduction to Machine Learning: Supervised learning: classification and
regression, Unsupervised learning: clustering and dimensionality reduction, Bias-
variance tradeoff, underfitting, and overfitting Regression Analysis: Simple linear
regression, Multiple linear regression, Stepwise regression, Logistic regression for
classification Model Evaluation and Selection: Techniques for evaluating model
performance: accuracy, precision, recall, F1-score, Confusion matrix and ROC curve
analysis, Cross-validation: k-fold cross-validation, stratified cross-validation,
Hyperparameter tuning and model selection Machine Learning Algorithms:
Decision Trees and Random Forests, Support Vector Machines (SVM), Artificial
Neural Networks (ANN), Ensemble Learning: Boosting and Bagging, K-Nearest
Neighbors (K-NN), Gradient Descent for optimization

_______________________________________________________________

Chapter Ends…


Unit-3
Model Evaluation, Data Visualization, and Management

Chapter-11
Model Evaluation Metrics: Accuracy, precision, recall, F1-score, Area Under the
Curve (AUC), Evaluating models for imbalanced datasets.

Model evaluation metrics.

Evaluation metrics are quantitative measures used to assess the performance and
effectiveness of a statistical or machine learning model. These metrics provide
insights into how well the model is performing and help in comparing different
models or algorithms.
When evaluating a machine learning model, it is crucial to assess its predictive
ability, generalization capability, and overall quality. Evaluation metrics provide
objective criteria to measure these aspects. The choice of evaluation metrics depends
on the specific problem domain, the type of data, and the desired outcome.
Some commonly used evaluation metrics in machine learning:

1. Accuracy
2. Precision
3. Recall
4. F1 Score
5. Area under the Receiver Operating Characteristic (ROC-AUC)
6. Confusion Matrix

1. Accuracy

The accuracy metric is one of the simplest classification metrics to implement, and it can be determined as the ratio of the number of correct predictions to the total number of predictions.

It can be formulated as:
Accuracy = Number of correct predictions / Total number of predictions


To implement an accuracy metric, we can compare ground truth and predicted values
in a loop, or we can also use the scikit-learn module for this. Although it is simple
to use and implement, it is suitable only for cases where an equal number of samples
belong to each class.
It is good to use the Accuracy metric when the target variable classes in the data are approximately balanced. For example, if 60% of the images in a fruit dataset are of Apple and 40% are of Mango, the classes are roughly balanced, so accuracy is a meaningful measure of how well the model distinguishes Apple from Mango.
It is recommended not to use the Accuracy measure when the target variable majorly belongs to one class. For example, suppose there is a model for disease prediction in which, out of 100 people, only five people have the disease and 95 people don't. In this case, if our model predicts "no disease" for every person (which is a useless prediction), the Accuracy measure will still be 95%, which is misleading.

2. Precision

The precision metric is used to overcome the limitation of Accuracy. Precision determines the proportion of positive predictions that were actually correct. It is calculated as the number of true positives divided by the total number of positive predictions (true positives plus false positives):
Precision = TP / (TP + FP)

3. Recall or Sensitivity

Recall is similar to the Precision metric; however, it aims to calculate the proportion of actual positives that were identified correctly. It is calculated as the number of true positives divided by the total number of actual positives, i.e., cases either correctly predicted as positive or incorrectly predicted as negative (true positives plus false negatives).


The formula for calculating Recall is given below:
Recall = TP / (TP + FN)

From the above definitions of Precision and Recall, we can say that recall determines
the performance of a classifier with respect to a false negative, whereas precision
gives information about the performance of a classifier with respect to a false
positive.
In simple words, if we maximize precision, it will minimize the FP errors, and if we
maximize recall, it will minimize the FN error.

4. F1-score

F-score or F1 Score is a metric to evaluate a binary classification model on the basis


of predictions that are made for the positive class. It is calculated with the help of
Precision and Recall. It is a type of single score that represents both Precision and
Recall. So, the F1 Score can be calculated as the harmonic mean of both precision
and Recall, assigning equal weight to each of them.
The formula for calculating the F1 score is given below:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

As F-score makes use of both precision and recall, it should be used if both of them
are important for evaluation, but one (precision or recall) is slightly more important
to consider than the other. For example, when False negatives are comparatively
more important than false positives, or vice versa.

5. AUC-ROC

Sometimes we need to visualize the performance of the classification model on


charts; then, we can use the AUC-ROC curve. It is one of the popular and important
metrics for evaluating the performance of the classification model.


Firstly, let's understand the ROC (Receiver Operating Characteristic curve) curve.
ROC represents a graph to show the performance of a classification model at
different threshold levels. The curve is plotted between two parameters, which are:

o True Positive Rate


o False Positive Rate

TPR or True Positive Rate is a synonym for Recall, hence it can be calculated as:
TPR = TP / (TP + FN)
FPR or False Positive Rate can be calculated as:
FPR = FP / (FP + TN)
To calculate value at any point in a ROC curve, we can evaluate a logistic regression
model multiple times with different classification thresholds, but this would not be
much efficient. So, for this, one efficient method is used, which is known as AUC.
AUC: Area Under the ROC curve
AUC stands for Area Under the ROC Curve. As its name suggests, AUC calculates the two-dimensional area under the entire ROC curve.

AUC calculates the performance across all the thresholds and provides an aggregate measure. The value of AUC ranges from 0 to 1: a model whose predictions are 100% wrong will have an AUC of 0.0, whereas a model whose predictions are 100% correct will have an AUC of 1.0.
When to Use AUC
AUC should be used to measure how well the predictions are ranked rather than their
absolute values. Moreover, it measures the quality of predictions of the model
without considering the classification threshold.
When not to use AUC
AUC is scale-invariant, which is not always desirable; when we need well-calibrated probability outputs, AUC is not the preferred metric.
Further, AUC is not a useful metric when there are wide disparities in the cost of
false negatives vs. false positives, and it is difficult to minimize one type of
classification error.

6 Confusion Matrix
A confusion matrix is a tabular representation of prediction outcomes of any binary
classifier, which is used to describe the performance of the classification model on
a set of test data when true values are known.
The confusion matrix is simple to implement, but the terminologies used in this
matrix might be confusing for beginners.
A typical confusion matrix for a binary classifier is a 2×2 table, with the actual classes as rows and the predicted classes as columns (it can be extended to classifiers with more than two classes).

Consider an example confusion matrix for a disease-prediction model with 165 total predictions. We can determine the following from such a matrix:


o In the matrix, columns are for the prediction values, and rows specify the
Actual values. Here Actual and prediction give two possible classes, Yes or
No. So, if we are predicting the presence of a disease in a patient, the
Prediction column with Yes means, Patient has the disease, and for NO, the
Patient doesn't have the disease.
o In this example, the total number of predictions is 165: the model predicted Yes 110 times and No 55 times.
o In reality, there are 60 cases in which patients don't have the disease and 105 cases in which patients do have the disease.

In general, the table is divided into four terminologies, which are as follows:

1. True Positive (TP): The model predicted the positive class, and the actual class is also positive.
2. True Negative (TN): The model predicted the negative class, and the actual class is also negative.
3. False Positive (FP): The model predicted the positive class, but the actual class is negative.
4. False Negative (FN): The model predicted the negative class, but the actual class is positive.
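A compact, hedged sketch computing all of the metrics above with scikit-learn, using small made-up label vectors (1 is treated as the positive class):

```python
# A compact sketch computing the metrics above with scikit-learn, using small
# made-up label vectors (1 = positive class).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]          # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # probabilities

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```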

Evaluating models for imbalanced datasets.

Metrics for imbalanced data


Classification accuracy is a metric that summarizes the performance of a
classification model as the number of correct predictions divided by the total number
of predictions.
Accuracy = Correct Predictions / Total Predictions

Achieving 90 percent classification accuracy, or even 99 percent classification


accuracy, may be trivial on an imbalanced classification problem. Consider the case
of an imbalanced dataset with a 1:100 class imbalance. Blind guesses will give us a
99% accuracy score (by betting on the majority class).

The rule of thumb is accuracy never helps in an imbalanced dataset.


The most common metrics to use for imbalanced datasets are:

• F1 score
• Precision
• Recall
• AUC score (AUC ROC)

It is good practice to track multiple metrics when developing a machine learning model, as each highlights different aspects of model performance.

Chapter ends…


Chapter-12

Data Visualization and Communication: Principles of effective data visualization,


Types of visualizations: bar charts, line charts, scatter plots, etc. Visualization
tools: matplotlib, seaborn, Tableau, etc. Data storytelling: communicating insights
through visualizations.

Data visualization: Data visualization is a crucial aspect of machine learning that


enables analysts to understand and make sense of data patterns, relationships, and
trends. Through data visualization, insights and patterns in data can be easily
interpreted and communicated to a wider audience, making it a critical component
of machine learning.
Data visualization is an essential step in data preparation and analysis as it helps to
identify outliers, trends, and patterns in the data that may be missed by other forms
of analysis.
With the increasing availability of big data, it has become more important than ever
to use data visualization techniques to explore and understand the data. Machine
learning algorithms work best when they have high-quality and clean data, and data
visualization can help to identify and remove any inconsistencies or anomalies in the
data.

Principles of effective data visualization

1. Determine your audience. What questions will they need answered?


2. Choose the right kind of chart (or other visualization) to depict the type of
information you have.
3. Form follows function. Focus on how your audience needs to use the data and
let that determine the presentation style.
4. Provide the necessary context for data to be interpreted and acted upon
appropriately.
5. Keep it simple. Remove any non-essential information.
6. Choose colors carefully to draw attention while also considering accessibility
issues such as contrast.


7. Seek balance in your visual elements, including texture, color, shape, and
negative space.
8. Use patterns (of chart types, colors, or other design elements) to identify
similar types of information.
9. Use proportion carefully so that differences in design size fairly represent
differences in value.
10. Be skeptical. Ask yourself questions about what data is not represented and
what insights might therefore be misinterpreted or missing.

Types of Data Visualization

1. Line Charts: In a line chart, each data point is represented by a point on the
graph, and these points are connected by a line. We may find patterns and
trends in the data across time by using line charts. Time-series data is
frequently displayed using line charts.

Advantages/Use

• A line graph is a graph that is used to display change over time as a series of
data points connected by straight line segments on two axes.
• A line graph is also called a line chart. It helps to determine the relationship
between two sets of values, with one data set always being dependent on the
other data set.
• They are helpful to demonstrate information on factors and patterns. Line
diagrams can make expectations about the consequences of information not
yet recorded.


various parts of a line graph.


• Title: The title of the graph tells us what the graph is all about, i.e., what
information is depicted by the graph.
• Labels: The horizontal axis across the bottom and the vertical label along
the side tell us what kinds of data are being shown.
• Scales: The horizontal scale across the bottom and the vertical scale along
the side tell us how much or how many.
• Points: The points or dots on the graph represent the (x,y) coordinates or
ordered pairs. More than one data line can be present in a line graph. Here,
data on the horizontal axis is the independent variable, and data on the y-
axis is the dependent variable.
• Lines: Straight lines connecting the points give estimated values between
the points.

2. Scatter Plots: A quick and efficient method of displaying the relationship


between two variables is to use scatter plots. With one variable plotted on the x-axis
and the other variable drawn on the y-axis, each data point in a scatter plot is
represented by a point on the graph. We may use scatter plots to visualize data to
find patterns, clusters, and outliers.


The link between variables in scatter diagrams is indicated by the direction of the
correlation on the graph. A correlation in a scatter diagram occurs when two
variables are determined to have a connection.

Positive correlation
If variables have a positive correlation, this signifies that when the independent
variable's value rises, the dependent variable's value rises as well.
As the weight of human adults increases, the risk of diabetes also increases. The
pattern of observation in this example would slant from the chart's bottom left to the
upper right.

Negative correlation
In the negative correlation, when the value of one variable grows, the value of the
other variable falls. The dependent variable's value drops as the independent
variable's value rises.
Here’s an example: When summer temperatures rise, sales of winter clothing
decline. The pattern of observation in this example would slant from the top left to
the bottom right of the graph.


No correlation
The "no correlation" type is used when there's no potential link between the
variables. It's also known as zero correlation. The two variables plotted aren't
connected in any way.
The area of land and air quality index, for example, have no relationship. As an area
grows, there is no effect on the air quality. These two variables have no association,
and the observations will be dispersed all over the graph.

Advantages/ Uses

1. Patterns are easy to spot in scatter diagrams.


2. A scatter diagram is easy to plot with two variables.
3. Scatter diagrams are an effective way to demonstrate non-linear patterns.
4. Scatter diagrams make it possible to determine data flow range, such as the
maximum and minimum values.
5. Plotting scatter diagrams helps with better project decisions.
6. Scatter diagrams help uncover the underlying root causes of issues.
7. They can objectively assess if a given cause and effect are connected.

Disadvantages of scatter diagrams include:

1. Reading scatter diagrams incorrectly may lead to false conclusions that one
variable caused the other, when both may have been influenced by a third.
2. A relationship in a scatter diagram may not be apparent because the data does
not cover a wide enough range.
3. Associations between more than two variables are not shown in scatter plots.
4. Scatter diagrams cannot provide the precise extent of association.
5. A scatter plot does not indicate the quantitative measure of the relationship
between the two variables.

3. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar
chart, each category is represented by a bar, with the height of the bar indicating the
frequency or proportion of that category in the data. Bar graphs are useful for
comparing several categories and seeing patterns over time.


Advantages
• show each data category in a frequency distribution.
• display relative numbers or proportions of multiple categories.
• summarize a large dataset in visual form.
• clarify trends better than do tables.
• estimate key values immediately.
• permit a visual check of the accuracy and reasonableness of calculations.
• be easily understood due to widespread use in business and the media.
Disadvantages
• Require additional explanation.
• Be easily manipulated to yield false impressions.
• Fail to reveal key assumptions, causes, effects, or patterns.

4. Box plots

Box plots are a graphical representation of the distribution of a set of data. In a box
plot, the median is shown by a line inside the box, while the center box depicts the
range of the data. The whiskers extend from the box to the highest and lowest values
in the data, excluding outliers. Box plots can help us to identify the spread and
skewness of the data.


The box plot is suitable for comparing range and distribution for groups of numerical
data.

Advantages:
The box plot organizes large amounts of data and visualizes outlier values.
Disadvantages:
The box plot is not relevant for detailed analysis of the data as it deals with a
summary of the data distribution.

5. Histogram
The histogram is suitable for visualizing distribution of numerical data over a
continuous interval, or a certain time. The data is divided into bins, and each bar in
a histogram represents the tabulated frequency at each bin.



Advantages
The histogram organizes large amounts of data, and produces a visualization
quickly, using a single dimension.
Disadvantages
The histogram is not relevant for detailed analysis of the data as it deals with a
summary of the data distribution.

6. Heat Maps
Heat maps are a type of graphical representation that displays data in a matrix format.
The value of the data point that each matrix cell represents determines its hue.
Heatmaps are often used to visualize the correlation between variables or to identify
patterns in time-series data.


Advantages of Heat maps


1. Rich Insights: Heat map analysis gives rich insights in terms of usability; if used correctly, it can reveal not only how users behave but also why they behave that way.
2. Optimization direction: Heat map analysis can help evaluate the user experience of prototypes or models and indicate possible design directions that the designs should take.
3. Easy to interpret: Heat maps are easy to interpret so they can be easily read and
fairly understood by most people.

Disadvantages of Heat maps


1. Incorrect use of heat maps: Depending on whether an eye tracking or mouse
tracking device is used or whether a scroll or confetti type heat map is used to
represent the data gathered, the findings that a researcher can draw from the heat
maps will vary.
2. Use of additional methods: Standalone heat maps are not as effective as using it
in conjunction with usage analytics and testing methodologies.
3. Data interpretation: For a researcher who doesn’t have any background
knowledge in analytics, heat map data analysis may not be as in-depth as required.


7. Tree Maps: Tree maps are used to display hierarchical data in a compact format
and are useful in showing the relationship between different levels of a hierarchy.

Advantages:

The biggest advantages of tree map charts include:


1. The ability to identify patterns and discern relationships between categories or
elements in a hierarchical data structure; sub-structures and sub-elements are
represented in the same way.
2. Efficient use of space when rendering tens of thousands of data points, with the
ability to drill down as needed.
3. Accurate display of multiple elements at once, including "part to whole" ratios,
which makes the data easy to visualize.
4. Use of size and color keys to visualize different attributes. Categories and
subcategories can be color-coded to match their parent categories; for instance,
electronics sales in different branches could be shades of blue, while furniture
sales could be shades of yellow.


Limitations:

1. A tree map chart does not handle data sets whose values differ greatly in
magnitude; very small values become rectangles too small to read.
2. All values of the quantitative variable that determines rectangle size must be
positive; negative values cannot be represented.
3. Because the data points are drawn as rectangles with few sorting options, tree
maps take up a lot of space, and wide, dense layouts are harder to read than long,
linear plots. This also makes the tree map difficult to print.
4. Some tree maps take a lot of effort to generate, even with specialized programs.
5. Tree maps sometimes do not display hierarchical levels as sharply as other charts
used to visualize hierarchical data, such as a sunburst diagram or a tree diagram.
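
As an illustration, a minimal tree map sketch in Python using the plotly library;
the department/product hierarchy and the sales figures are made-up assumptions.

import pandas as pd
import plotly.express as px

# Hypothetical hierarchical sales data: department -> product
df = pd.DataFrame({
    "department": ["Electronics", "Electronics", "Furniture", "Furniture"],
    "product": ["Phones", "Laptops", "Chairs", "Tables"],
    "sales": [120, 80, 40, 60],  # rectangle sizes; must be positive
})

# Each rectangle's area is proportional to sales, nested by the hierarchy,
# and color-coded by parent category.
fig = px.treemap(df, path=["department", "product"], values="sales", color="department")
fig.show()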

Visualization Tools
There are many data visualization tools available.
1. matplotlib
matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Created by John D. Hunter in 2003, matplotlib provides the
building blocks to create rich visualizations of many kinds of datasets. All kinds of
visualizations, such as line plots, histograms, bar plots, and scatter plots, can be
easily created with matplotlib in a few lines of code.
You can customize every aspect of a plot with matplotlib. This makes the tool
extremely flexible, but it can also be challenging and time-consuming to get the
perfect plot. A short example appears after the pros and cons below.
Key features:

• It is the standard Python library for creating data visualizations.


• Export visualizations in multiple file formats, such as .pdf, .png, and .svg.
• Data professionals can also use matplotlib’s APIs to embed plots in Graphical
User Interface (GUI) applications.

Pros:

• High versatility.


• Allow full customization of plots.


• Universal Python data visualization tool, backed by a huge community.

Cons:

• Cumbersome documentation, with a steep learning curve.


• Users need to know Python to use it.
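
A short sketch of matplotlib's typical workflow, showing a couple of plot types, some
customization, and export to multiple file formats; the data and output file names are
assumptions for the example.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)", linewidth=2)          # line plot
ax.scatter(x[::10], np.cos(x[::10]), label="cos samples")   # scatter plot
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Line and scatter plots with matplotlib")
ax.legend()
ax.grid(True)

# Export the same figure to several file formats
fig.savefig("example.png", dpi=150)
fig.savefig("example.pdf")
plt.show()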

2. Seaborn
Any kind of visualization is possible with matplotlib. However, its wide flexibility
can be difficult to master, and you may spend hours on a plot whose design seemed
straightforward at the outset. Seaborn was designed to address these pitfalls.
It is a Python library that allows us to generate elegant graphs easily: Seaborn is
built on top of matplotlib and provides a high-level interface for drawing attractive
and informative statistical graphics. A short example appears after the pros and cons
below.
Key features:

• Powerful high-level interface to build plots in a few lines of code.


• Focus on statistical data visualization.
• Build on top of matplotlib.

Pros:

• Quickly creates simple visualizations.


• Visualizations are by default aesthetically appealing.
• Large collection of powerful graphs.
• Well-defined documentation, with numerous examples.

Cons:

• Customization options are limited.


• Doesn’t provide interactive graphs.
• Users may need to use matplotlib to optimize visualizations.
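
A minimal seaborn sketch showing how a single high-level call produces an attractive
statistical plot with sensible defaults; the groups and scores are randomly generated
assumptions.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: scores for two groups
rng = np.random.default_rng(seed=3)
df = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "score": np.concatenate([rng.normal(60, 8, 100), rng.normal(70, 12, 100)]),
})

# One high-level call draws an informative statistical plot with good defaults
sns.violinplot(data=df, x="group", y="score")
plt.title("Score distribution by group (seaborn)")
plt.show()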


3. Tableau
Tableau is a powerful and popular data visualization tool that allows you to analyze
data from multiple sources simultaneously. Founded in 2003 as a Stanford University
spin-off, Tableau was acquired by Salesforce in 2019.
Tableau is used by top companies to extract insights from large volumes of raw data.
Thanks to its intuitive and powerful platform, you can build everything from simple
charts to complex interactive dashboards. However, if you are only interested in
building simple charts, less robust and more affordable options may be a better fit.
Key features:

• Best-in-class Business Intelligence platform.


• Conceived for data-driven organizations.
• Drag-and-drop interface makes it easy to use.

Pros:

• Includes a wide range of chart templates.


• Can handle large volumes of data.
• We can import data from a wide range of sources.
• Quickly creates interactive visualizations and dashboards.

Cons:

• Steep learning curve


• Especially for big organizations, Tableau is a relatively expensive product.
• Limited data preprocessing features.

4. Power BI
Power BI is a cloud-based business analytics solution that allows you to bring
together different data sources, analyze them, and present data analysis through
visualizations, reports, and dashboards.
Microsoft’s PowerBI is the leader in BI solutions in the industry. Power BI makes it
easy to access data on almost any device inside and outside the organization.


Key features:

• Best-in-class Business Intelligence platform.


• Fully customizable dashboards.
• Real-time views.
• Conceived to be used simultaneously across users and departments within a
company.

Pros:

• Includes many preset charts and report templates.


• In recent years, it has included machine learning capabilities.
• Available on desktop and mobile devices.
• More affordable option than competitors.

Cons:

• Limited sharing of data.


• Bulky user interface.

5. ggplot2

Arguably R's most powerful package, ggplot2 is a plotting package that provides
helpful commands to create complex plots from data in a data frame. Since its launch
by Hadley Wickham in 2007, ggplot2 has become the go-to tool for flexible and
professional plots in R. ggplot2 is inspired by the data visualization methodology
called the "Grammar of Graphics," whose idea is to specify the components of the
graph independently and then combine them.

Key features:

• Most popular library for data visualization in R.


• Based on the “grammar of graphics” philosophy.


Pros:

• Simple and intuitive syntax.


• Plots are visually appealing by default.
• Provide full customization.

Cons:

• Inconsistent syntax compared to other R packages.


• ggplot2 is often computationally slower than other R packages.
• Limited flexibility to create certain visualizations.

Data Storytelling
Data storytelling is the process of using data to communicate a story or message in
a clear and effective way. It involves selecting and organizing data in a way that
helps the audience understand and remember the key points and presenting the data
in a visually appealing and engaging manner.

Key components of Data Storytelling:

1. Audience: It is important to consider who the audience is and what their needs
and interests are when selecting and presenting data. This will help ensure that
the data is relevant and engaging to the audience.

2. Data selection: Choose the data that is most relevant to the message you want
to convey and that supports your argument or point of view. This will help
make the data more meaningful and impactful.

3. Data organization: Arrange the data in a logical and easy-to-follow way,


using techniques such as visualization, narration, and analysis. This will help
the audience understand and remember the key points.


4. Visualization: Use visual aids such as graphs, charts, and maps to help the
audience understand and remember the key points. Choose the most
appropriate type of visualization for the data and the message you want to
convey.

5. Narration: Use clear and concise language to explain the data and its
implications. This will help the audience understand the context and
significance of the data.

6. Analysis: Analyze the data to identify trends, patterns, and insights that can
help the audience understand the implications of the data.

7. Story structure: Use a clear and logical story structure to organize the data
and present it in a way that is easy for the audience to follow. This might
include an introduction, main body, and conclusion.

Benefits of Data Storytelling:

1. Improved understanding and retention: Data storytelling can help the


audience understand and remember complex information more easily. By
organizing the data in a logical and visually appealing way and explaining it
with clear and concise language, data storytelling can help the audience grasp
the key points and retain the information over time.
2. Greater impact: Data storytelling can make data-driven insights and findings
more impactful and persuasive. By presenting the data in an engaging and
visually appealing manner, data storytelling can help the audience see the
significance of the data and understand its implications.
3. Increased accessibility: Data storytelling can help make data-driven insights
and findings more accessible to a wider audience. By presenting the data in a
way that is easy to understand and remember, data storytelling can help people
who are not familiar with data analysis, or statistics grasp the key points and
understand the implications of the data.
4. Enhanced collaboration: Data storytelling can help facilitate collaboration
and decision-making by providing a clear and understandable way to
communicate data-driven insights and findings. By presenting the data in a


visually appealing and easy-to-follow manner, data storytelling can help team
members understand and discuss the data, leading to more informed and
effective decision-making.

Data storytelling is a powerful tool for improving understanding, retention, impact,
accessibility, and collaboration when communicating data-driven insights and
findings. By using data storytelling well, organizations and individuals can
communicate complex information more effectively and drive action based on data-
driven insights.

_________________________________________________________________

Chapter ends…


Chapter- 13
Data Management

Data Management: Introduction to data management activities, Data pipelines:


data extraction, transformation, and loading (ETL), Data governance and data
quality assurance, Data privacy and security considerations.

Data Management:
Data management is the practice of collecting, organizing, protecting, and storing an
organization's data so it can be analyzed for business decisions. As organizations
create and consume data at unprecedented rates, data management solutions become
essential for making sense of the vast quantities of data.
“Data management comprises all disciplines related to handling data as a valuable
resource, it is the practice of managing an organization’s data so it can be
analyzed for decision making.”

Data Management Activities

1. Data collection: Collecting data through surveys, sensors, logs, user


interaction or other sources.
2. Data storage: Choose appropriate data storage solutions (databases, data
warehouses, file systems) based on the volume of data.
3. Data Quality Assurance:
• Ensure the accuracy, consistency, and completeness of data through
validation and verification processes.
• Implement data cleaning and deduplication techniques to maintain high
quality data.
4. Data Security:
• Establish access controls and encryption mechanisms to protect
sensitive data.
• Regularly audit and monitor access to prevent unauthorized use.
• Develop and implement a robust data backup and recovery plan.
5. Data Organizing: Create a data catalog that documents metadata, including
data definitions, formats, and relationships.


6. Data Integration: Integrate data from various sources to create a unified and
comprehensive view. Implement ETL process for data integration.
7. Data Governance: Develop and enforce policies and procedures for data
management.
8. Data Privacy and Compliance: Ensures compliance with data protection
laws and regulations.
9. Data Retrieval and Analysis:
• Develop tools and systems for querying and retrieving data efficiently.
• Perform data analysis and reporting to derive insights and support
decision-making.
10. Data Documentation: Maintain documentation to facilitate collaboration and
knowledge transfer.
11. Data Auditing and Monitoring:
• Regularly audit data to ensure compliance with quality standards and
policies.
• Implement a monitoring system to detect anomalies or unauthorized
activities.

Data Pipelines

A data pipeline is a method in which raw data is ingested from various data sources,
transformed, and then ported to a data store, such as a data lake or data warehouse,
for analysis. Before data flows into a data repository, it usually undergoes some
data processing. This includes data transformations, such as filtering, masking, and
aggregation, which ensure appropriate data integration and standardization.

Data pipeline architecture

1. Data ingestion: Data is collected from various sources—including software-as-


a-service (SaaS) platforms, internet-of-things (IoT) devices and mobile devices—
and various data structures, both structured and unstructured data. Within streaming
data, these raw data sources are typically known as producers, publishers, or senders.
While businesses can choose to extract data only when ready to process it, it’s a
better practice to land the raw data within a cloud data warehouse provider first. This


way, the business can update any historical data if they need to make adjustments to
data processing jobs. During this data ingestion process, various validations and
checks can be performed to ensure the consistency and accuracy of data.
2. Data transformation: During this step, a series of jobs are executed to process
data into the format required by the destination data repository. These jobs embed
automation and governance for repetitive workstreams, such as business reporting,
ensuring that data is cleansed and transformed consistently. For example, a data
stream may come in a nested JSON format, and the data transformation stage will
aim to unroll that JSON to extract the key fields for analysis.
3. Data storage: The transformed data is then stored within a data repository, where
it can be exposed to various stakeholders. Within streaming data, the endpoints that
read this transformed data are typically known as consumers, subscribers, or
recipients.

ETL (Extract, Transform, Load)

ETL stands for Extract, Transform, Load. It is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading
into a data warehouse, and then load it into the warehouse. The ETL process can be
broken down into the following three stages:

1. Extract: The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This step
involves reading data from the source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and
validating the data, converting data types, combining data from multiple sources,
and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This
step involves creating the physical data structures and loading the data into the
warehouse.

The ETL process is an iterative process that is repeated as new data is added to the
warehouse. It is important because it ensures that the data in the data warehouse is
accurate, complete, and up to date. It also helps to ensure that the data is in the
format required for data mining and reporting.
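
A minimal, illustrative sketch of the three ETL stages in Python using pandas and
SQLite; the file name, column names, and cleaning rules are assumptions, not part of
any particular warehouse product.

import sqlite3
import pandas as pd

# Extract: read raw data from a source file (hypothetical path and columns)
raw = pd.read_csv("sales_raw.csv")  # assumed columns: order_id, amount, country

# Transform: clean and validate the data, convert types, derive new fields
clean = (
    raw.dropna(subset=["order_id", "amount"])   # drop incomplete records
       .drop_duplicates(subset="order_id")      # remove duplicate orders
       .assign(
           amount=lambda d: d["amount"].astype(float),
           country=lambda d: d["country"].str.upper(),
       )
)

# Load: write the transformed data into the warehouse (SQLite as a stand-in)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="append", index=False)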

Data Privacy and Security

Data privacy focuses on issues related to collecting, storing and retaining


data, as well as data transfers within applicable regulations and laws, such as
GDPR and HIPAA.
Data privacy is essential for protecting personal information, establishing
trust, complying with regulations, maintaining ethical practices, driving
innovation, and preserving individual autonomy.

Data security is the protection of data against unauthorized access, loss or


corruption throughout the data lifecycle.
Importance of data security:
The core elements of data security include availability, confidentiality, and
integrity.

• The main purpose of data security is to protect organizational data, which


contains trade information and customer data. The data can be accessed by
cybercriminals for malicious reasons, compromising customer privacy.
• Compliance with industry and government regulations; it is critical to adhere
to regulations for the business to carry on operating legally. The regulations
exist to protect consumer privacy.
• Data security is also important because if a data breach occurs, an organization
can be exposed to litigation, fines, and reputational damage.


• Due to a lack of adequate data security practices, data breaches can occur and
expose organizations to financial loss, a decrease in consumer confidence, and
brand erosion. If consumers lose trust in an organization, they will likely move
their business elsewhere and devalue the brand.
• Breaches that result in the loss of trade secrets and intellectual property can
affect an organization’s ability to innovate and remain profitable in the future.

Types of Data Security Technologies

There are various types of data security technologies in use today that protect against
various external and internal threats. Organizations should be using many of them
to secure all potential threat access points and safeguard their data. Below are some
of the techniques:

Data encryption
Data encryption uses an algorithm to scramble every data character, converting
information into an unreadable format. Only authorized users who hold the encryption
key can decrypt the data back into readable form.
Encryption technology acts as the last line of defense in the event of a breach when
confidential and sensitive data is concerned. It is crucial to ensure that the encryption
keys are stored in a secure place where access is restricted. Data encryption can also
include capabilities for security key management.
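
A minimal sketch of symmetric data encryption in Python using the third-party
cryptography package's Fernet recipe; key storage and key management are omitted here
and would need a secure, access-restricted key store in practice.

from cryptography.fernet import Fernet

# Generate a key; in practice the key must be stored securely with restricted access
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt: the plaintext is scrambled into an unreadable token
token = cipher.encrypt(b"customer_email=alice@example.com")

# Decrypt: only a holder of the key can recover the original data
plaintext = cipher.decrypt(token)
print(plaintext.decode())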
Authentication
Authentication is a process of confirming or validating user login credentials to make
sure they match the information stored in the database. User credentials include
usernames, passwords, PINS, security tokens, swipe cards, biometrics, etc.
Authentication is a frontline defense against unauthorized access to confidential and
sensitive information, making it an important process. Authentication technologies,
such as single sign-on, multi-factor authentication, and breached password detection
make it simpler to secure the authentication process while maintaining user
convenience.
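
A minimal sketch of credential verification with salted password hashing, using only
Python's standard library; a production system would normally rely on a vetted
authentication framework (single sign-on, multi-factor authentication) rather than
hand-rolled code.

import hashlib
import hmac
import os
from typing import Optional, Tuple

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Derive a salted hash for storage; the plaintext password is never stored."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Re-derive the hash and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

salt, stored = hash_password("s3cret-PIN")
print(verify_password("s3cret-PIN", salt, stored))   # True
print(verify_password("wrong-guess", salt, stored))  # False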
Data masking
Masking whole data or specific data areas can help protect it from exposure to
unauthorized or malicious sources externally or internally. Masking can be applied
to personally identifiable information (PII), such as a phone number or email


address, by obscuring parts of the PII, e.g., the first eight digits or letters within a
database.
Proxy characters are used to mask the data characters. The data masking software
changes the data back to its original form only when the data is received by an
authorized user. Data masking allows the development of applications using actual
data.
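
A minimal sketch of masking PII in Python; the specific masking rules (keep the last
two digits of a phone number, keep the email domain) are illustrative assumptions
rather than any standard.

def mask_phone(phone: str) -> str:
    """Replace all but the last two digits with a proxy character."""
    return "*" * (len(phone) - 2) + phone[-2:]

def mask_email(email: str) -> str:
    """Obscure the local part of an email address while keeping the domain."""
    local, _, domain = email.partition("@")
    return local[0] + "*" * (len(local) - 1) + "@" + domain

print(mask_phone("9876543210"))         # ********10
print(mask_email("alice@example.com"))  # a****@example.com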
Tokenization
Tokenization is similar to data encryption but differs in that it replaces data with
random characters, whereas encryption scrambles data with an algorithm. The "token,"
which maps back to the original data, is stored separately in a database lookup table,
where it is protected from unauthorized access.
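
A minimal sketch of tokenization in Python: each sensitive value is replaced by a
random token, and the token-to-value mapping lives in a separate lookup table; in
practice that table would be kept in a separately secured database.

import secrets

# Lookup table mapping tokens back to original values (stored and protected separately)
token_vault = {}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random token unrelated to the original."""
    token = secrets.token_hex(8)
    token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; only possible with access to the vault."""
    return token_vault[token]

card_token = tokenize("4111-1111-1111-1111")
print(card_token)              # e.g. 'f3a9...', reveals nothing about the card number
print(detokenize(card_token))  # 4111-1111-1111-1111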
Data erasure
Data erasure occurs when data is no longer needed or active in the system. The
erasure process uses software to delete data on a hardware storage device. The data
is permanently deleted from the system and is irretrievable.
Data resilience
Data resilience is determined by the ability of an organization to recover from
incidences of a data breach, corruption, power failure, failure of hardware systems,
or loss of data. Data centers with backup copies of data can easily get back on their
feet after a disruptive event.
Physical access controls
Unlike digital access control, which can be managed through authentication,
physical access control is managed through control of access to physical areas or
premises where data is physically stored, i.e., server rooms and data center locations.
Physical access control uses security personnel, key cards, retina scans, thumbprint
recognition, and other biometric authentication measures.

An organization can take several steps in addition to the data security technologies
above to ensure robust data security management.

1. External and internal firewalls: Using external and internal firewalls


ensures effective data protection against malware and other cyberattacks.
2. Data security policy: An organization should adopt a clear and
comprehensive data security policy, which should be known by all staff.


3. Data backup: Practicing backup of all data ensures the business will continue
uninterrupted in the event of a data breach, software or hardware failure, or
any type of data loss. Backup copies of critical data should be robustly tested
to ensure adequate insurance against data loss. Furthermore, backup files
should be subjected to equal security control protocols that manage access to
core primary systems.
4. Data security risk assessment: It is prudent to carry out regular assessments
of data security systems to detect vulnerabilities and potential losses in the
event of a breach. The assessment can also detect out-of-date software and
any misconfigurations needing redress.
5. Quarantine sensitive files: Data security software should be able to
frequently categorize sensitive files and transfer them to a secure location.
6. Data file activity monitoring: Data security software should be able to
analyze data usage patterns for all users. It will enable the early identification
of any anomalies and possible risks. Users may be given access to more data
than they need for their role in the organization. The practice is called over-
permission, and data security software should be able to profile user behavior
to match permissions with their behavior.
7. Application security and patching: Relates to the practice of updating
software to the latest version promptly as patches or new updates are released.
8. Training: Employees should continually be trained on the best practices in
data security. They can include training on password use, threat detection, and
social engineering attacks. Employees who are knowledgeable about data
security can enhance the organization’s role in safeguarding data.

__________________________________________________________________

Chapter ends…
