Tycs Data Science Sem6
SEM-VI
DATA SCIENCE
Chapter-1
What is Data Science? Definition and scope of Data Science, Applications
and domains of Data Science, Comparison with other fields like Business
Intelligence (BI), Artificial Intelligence (AI), Machine Learning (ML), and
Data Warehousing/Data Mining (DW-DM)
Data Science:
Data science is a deep study of the massive amount of data, which involves
extracting meaningful insights from raw, structured, and unstructured data that is
processed using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data so
that you can find something new and meaningful.
Applications of Data Science:

o Transport:
Transport industries are also using data science technology to create self-
driving cars. With self-driving cars, it will be easy to reduce the number of
road accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science
is being used for tumor detection, drug discovery, medical image analysis,
virtual medical bots, etc.
o Recommendation systems:
Most companies, such as Amazon, Netflix, and Google Play, use data science technology to create a better user experience with personalized recommendations. For example, when you search for something on Amazon and then start seeing suggestions for similar products, that is data science technology at work.
o Risk detection:
Finance industries have always faced the problem of fraud and the risk of losses, but with the help of data science this risk can be reduced.
Most finance companies are looking for data scientists to help them avoid risk and losses while increasing customer satisfaction.
BI stands for business intelligence, which is also used for data analysis of business
information:
Data Source:
• Business Intelligence deals with structured data, e.g., a data warehouse.
• Data Science deals with structured and unstructured data, e.g., weblogs, feedback, etc.
• Data Science: It is a broad term that includes various steps to create a model for a given problem and deploy the model.
  Machine Learning: It is used in the data modeling step of data science as a complete process.
• Data Science: It can work with raw, structured, and unstructured data.
  Machine Learning: It mostly requires structured data to work on.
• Data Science: Data scientists spend lots of time handling the data, cleansing the data, and understanding its patterns.
  Machine Learning: ML engineers spend a lot of time managing the complexities that occur during the implementation of algorithms and the mathematical concepts behind them.
Goals:
• Data Science: Identifying the patterns that are concealed in the data is the main objective of data science.
• Artificial Intelligence: Automation of the process and granting autonomy to the data model are the main goals of artificial intelligence.
Data Warehousing
A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to
a particular group of users.
It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
Characteristics:
Subject-Oriented
A data warehouse target on the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the global
organization's ongoing operations. This is done by excluding data that is not useful
concerning the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat
files, and online transaction records. It requires performing data cleaning and
integration during data warehousing to ensure consistency in naming conventions,
attribute types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from
the source operational RDBMS. The operational updates of data do not occur in the
data warehouse, i.e., update, insert, and delete operations are not performed. It
usually requires only two procedures in data accessing: Initial loading of data and
access to data. Therefore, the DW does not require transaction processing, recovery,
and concurrency capabilities, which allows for substantial speedup of data retrieval.
Non-volatile means that once data has entered the warehouse, it should not change.
Goals of Data Warehousing
2. Database: The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space.
   Data Warehouse: The tables and joins are accessible since they are denormalized. This is done to minimize the response time for analytical queries.
7. Database: The database is the place where the data is taken as a base and managed to provide fast and efficient access.
   Data Warehouse: The data warehouse is the place where application data is handled for analysis and reporting objectives.
Extraction
Cleansing
The cleansing stage is crucial in a data warehouse technique because it is supposed
to improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms.
Loading
1. Refresh: Data Warehouse data is completely rewritten. This means that older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the
Data Warehouse. An update is typically carried out without deleting or
modifying pre-existing data. This method is used in combination with
incremental extraction to update data warehouses regularly.
Data Mining:
The process of extracting information to identify patterns, trends, and useful data
that would allow the business to take the data-driven decision from huge sets of data
is called Data Mining.
In other words, data mining is the process of investigating hidden patterns in information from various perspectives and categorizing it into useful data. This data is collected and assembled in areas such as data warehouses, where efficient analysis and data mining algorithms support decision-making and other information needs, ultimately cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events.
o There is a possibility that organizations may sell useful customer data to other organizations for money. For example, American Express has reportedly sold details of its customers' credit card purchases to other organizations.
o Many data mining analytics software packages are difficult to operate and require advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data
mining tools is a very challenging task.
o Data mining techniques are not always precise, so they may lead to severe consequences in certain conditions.
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can
incorporate statistical models, machine learning techniques, and mathematical
algorithms, such as neural networks or decision trees. Thus, data mining incorporates
analysis and prediction.
Drawing on methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to understanding how to process and draw conclusions from huge amounts of data. But what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
Chapter Ends…
Chapter-2
Data Types and Sources
Data Types and Sources: Different types of data: structured, unstructured,
semi-structured, Data sources: databases, files, APIs, web scraping, sensors,
social media
1. Structured –
Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, structured data is the most processed form of data and the simplest to manage. Example: relational data.
2. Semi-Structured –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but the semi-structured form eases storage and handling. Example: XML data.
3. Unstructured data – Unstructured data is data which is not organized in
a predefined manner or does not have a predefined data model; thus, it
is not a good fit for a mainstream relational database. So, for
Unstructured data, there are alternative platforms for storing and
managing. It is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications. Example: Word, PDF, Text, Media logs.
Comparison of structured, semi-structured, and unstructured data:

Technology
• Structured data: based on a relational database table.
• Semi-structured data: based on XML/RDF (Resource Description Framework).
• Unstructured data: based on character and binary data.

Transaction management
• Structured data: matured transactions and various concurrency techniques.
• Semi-structured data: transactions adapted from the DBMS, not matured.
• Unstructured data: no transaction management and no concurrency.

Scalability
• Structured data: it is very difficult to scale the database schema.
• Semi-structured data: scaling is simpler than for structured data.
• Unstructured data: it is more scalable.
Definition
• Primary data refers to first-hand data collected by the team itself; it is collected based on the researcher's needs.
• Secondary data has been collected by other teams in the past; it does not necessarily align with the researcher's requirements.
Types of Data:
In statistics, there are four main types of data: nominal, ordinal, interval, and ratio.
These types of data are used to describe the nature of the data being collected or
analyzed, and they help determine the appropriate statistical tests to use.
Nominal Data
Nominal data is a type of data that consists of categories or names that cannot be
ordered or ranked. Nominal data is often used to categorize observations into groups,
and the groups are not comparable. In other words, nominal data has no inherent
order or ranking. Examples of nominal data include gender (Male or female), race
(White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood
type (A, B, AB, O).
Nominal data can be represented using frequency tables and bar charts, which
display the number or proportion of observations in each category. For example, a
frequency table for gender might show the number of males and females in a sample
of people.
Nominal data is analyzed using non-parametric tests, which do not make any assumptions about the underlying distribution of the data. Common non-parametric tests for nominal data include the chi-squared test and Fisher's exact test; these tests compare the observed category frequencies with those expected under the null hypothesis.
Discrete Data
Discrete data is a type of data in statistics that takes only distinct, separate values. These values can easily be counted as whole numbers. Examples of discrete data are:
• The number of students present in a class
• The marks of students in a class test
• The number of members in a family, etc.
Continuous Data
Continuous data is the type of quantitative data that represents values over a continuous range; a variable in the data set can take any value between the limits of that range. Examples of continuous data are:
• The height and weight of students
• A temperature range
• The salary range of workers in a factory, etc.
• Discrete data has clear spaces between values; continuous data falls in a continuous series.
• Discrete data contains distinct or separate values; continuous data includes every value within a range.
Data Sources:
A data source is the location where the data being used originates. A data source may be the initial location where data is born or where physical information is first digitized; however, even the most refined data may serve as a source, as long as another process accesses and utilizes it.
Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by a
database management system (DBMS).
Types:
Relational Database
NoSQL Database
Files:
Data stored in files, which can be in various formats such as text files, CSV, Excel
Spreadsheets, and more.
Web Scraping
Web scraping is the process of using bots to extract content and data from a website.
Unlike screen scraping, which only copies pixels displayed on screen, web scraping
extracts underlying HTML code, and, with it, data stored in a database. The scraper
can then replicate entire website content elsewhere.
Usage: Extracting news articles, product information, reviews, and more from
websites.
Sensors
A sensor is a device that detects and responds to some type of input from the physical
environment. The input can be light, heat, motion, moisture, pressure, or any number
of other environmental phenomena. Sensors collect data from the environment or
devices, providing valuable information for various applications and IoT projects.
In the context of data science, sensor data is valuable for IoT applications, environmental monitoring, healthcare, manufacturing, and more.
Social Media
Social Media platforms generate vast amounts of data daily including text messages,
videos, and user engagement metrics.
Usage: Analyzing trends, sentiments, user behavior, and engagement patterns.
__________________________________________________________________
Chapter Ends…
Chapter-3
Data Preprocessing
Data Preprocessing: Data cleaning: handling missing values, outliers, duplicates; Data transformation: scaling, normalization, encoding categorical variables; Feature selection: selecting relevant features/columns; Data merging: combining multiple datasets.
Data cleaning: Data cleaning is one of the important parts of machine learning. It
plays a significant part in building a model.
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML
model. Professional data scientists usually invest a very large portion of their time
in this step because of the belief that “Better data beats fancier algorithms”.
Data cleaning is essential because raw data is often noisy, incomplete, and
inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.
Data cleaning involves the systematic identification and correction of errors,
inconsistencies, and inaccuracies within a dataset, encompassing tasks such as
handling missing values, removing duplicates, and addressing outliers. This
meticulous process is essential for enhancing the integrity of analyses, promoting
more accurate modeling, and ultimately facilitating informed decision-making based
on trustworthy and high-quality data.
• Missing at random (MAR): Here the missingness can be explained by other columns in the data; for example, values may be missing only for records that fall below a cutoff defined by another variable. The reason for these missing values can be described by data in another column.
• Missing not at random (MNAR): Sometimes, the missing value is related to
the value itself. For example, higher income people may not disclose their
incomes. Here, there is a correlation between the missing values and the actual
income. The missing values are not dependent on other variables in the
dataset.
The first common strategy for dealing with missing data is to delete the rows with
missing values. Typically, any row which has a missing value in any cell gets
deleted. However, this often means many rows will get removed, leading to loss of
information and data. Therefore, this method is typically not used when there are
few data samples.
We can also impute the missing data. This can be based solely on information in the
column that has missing values, or it can be based on other columns present in the
dataset.
Finally, we can use classification or regression models to predict missing values.
• Replace it with a constant value. This can be a good approach when used in
discussion with the domain expert for the data we are dealing with.
• Replace it with the mean or median. This is a decent approach when the data
size is small—but it does add bias.
• Replace it with values by using information from other columns.
Missing values that remain can still affect the predictive model. A simple way to manage this is to choose only the features that do not have missing values, or to keep only the rows that have no missing values in any cell.
2. Handling Duplicates:
The simplest and most straightforward way to handle duplicate data is to delete it.
This can reduce the noise and redundancy in our data, as well as improve the
efficiency and accuracy of our models. However, we need to be careful and make sure that we are not losing any valuable or relevant information by removing duplicate data. We also need to consider the criteria and logic for choosing which
duplicates to keep or discard. For example, we can use the df.drop_duplicates()
method in pandas to remove duplicate rows or columns, specifying the subset, keep,
and inplace arguments.
Removing duplicates:
In python using Pandas: df.drop_duplicates()
In SQL: Use DISTINCT keyword in SELECT statement.
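As a brief, hedged illustration of the subset and keep arguments mentioned above (the customer data is hypothetical):

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "city": ["Mumbai", "Mumbai", "Pune", "Delhi", "Nagpur"],
})

# Drop rows that are exact duplicates across all columns (keeps the first occurrence)
exact = df.drop_duplicates()

# Drop duplicates based only on selected columns, keeping the last occurrence
by_key = df.drop_duplicates(subset=["customer_id"], keep="last")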
A data scientist can use several techniques to identify outliers and decide if they are
errors or novelties.
Numeric outlier
This is the simplest nonparametric technique for data in a one-dimensional space. The data is divided into quartiles, and the interquartile range (IQR) between the first and third quartiles is computed. The range limits are then set as the upper and lower whiskers of a box plot (typically Q1 − 1.5·IQR and Q3 + 1.5·IQR), and data that falls outside those limits can be removed.
Z-score
This parametric technique indicates how many standard deviations a certain point of
data is from the sample’s mean. This assumes a Gaussian distribution (a normal, bell-shaped curve). However, if the data is not normally distributed, it can be transformed by scaling it to give it a more normal appearance. The z-score of
data points is then calculated, placed on the bell curve, and then using heuristics (rule
of thumb) a cut-off point for thresholds of standard deviation can be decided. Then,
the data points that lie beyond that standard deviation can be classified as outliers
and removed from the equation. The Z-score is a simple, powerful way to remove
outliers, but it is only useful with medium to small data sets. It can’t be used for
nonparametric data.
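A minimal sketch of the numeric-outlier (IQR) and z-score techniques described above; the data and the z-score cutoff of 3 are illustrative assumptions.

import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # hypothetical data with one extreme value

# IQR (numeric outlier) method: flag points beyond the box-plot whiskers
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score method: flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]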
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It groups related points into clusters based on the density of the data: core points form the main data groups, border points have enough density to be considered part of a group, and outliers belong to no cluster at all and can be disregarded.
Isolation forest
This method is effective for finding novelties and outliers. It uses binary decision trees constructed from randomly selected features and random split values. The trees form a forest whose results are averaged, and an outlier score between 0 and 1 is computed for each data point, with 0 being normal and 1 being more of an outlier.
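A hedged scikit-learn sketch of the two detectors just described; eps, min_samples, and contamination are assumed values that would need tuning for real data.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # a dense cluster of normal points
               [[8, 8], [9, -7]]])                # two obvious outliers

# DBSCAN: points labelled -1 belong to no cluster and are treated as outliers
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
dbscan_outliers = X[db_labels == -1]

# Isolation Forest: fit_predict returns -1 for outliers and 1 for inliers
iso_labels = IsolationForest(contamination=0.02, random_state=42).fit_predict(X)
forest_outliers = X[iso_labels == -1]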
We can use the box plot, or the box and whisker plot, to explore the dataset and
visualize the presence of outliers. The points that lie beyond the whiskers are
detected as outliers.
Handling Outliers
a. Removing Outliers
i) Listwise deletion: Remove rows with outliers.
ii) Trimming: Remove extreme values while keeping a certain percentage (1%
or 5%) of data.
b. Transforming Outliers
i) Winsorization: Cap or replace outliers with values at a specified percentile.
ii) Log Transformation: Apply a log transformation to reduce the impact of
extreme values.
c. Imputation
Impute outliers with a value derived from statistical measures (mean, median) or more advanced imputation methods.
d. Treating as Anomaly: Treat outliers as anomalies and analyze them
separately. This is common in fraud detection or network security.
Data Transformation:
Min-Max Scaling:
Min-Max scaling rescales the values of a column to a fixed range, usually [0, 1] or [−1, 1], using X' = (X − Xmin) / (Xmax − Xmin). A drawback of bounding the data to a small, fixed range is that we end up with smaller standard deviations, which suppresses the weight of outliers in our data.
Standardization (Z-score scaling):
This transforms the data so the resulting distribution has a mean of 0 and a standard deviation of 1. This method is useful (in comparison to normalization) when we have important outliers in our data and we don't want to remove them and lose their impact.
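A minimal scikit-learn sketch of the two scaling methods above (the single-feature data is hypothetical):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])   # one hypothetical feature

# Min-Max scaling: rescales the values into the [0, 1] range
minmax_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: the resulting column has mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)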
2. Normalization
There are several different normalization techniques that can be used in data mining,
including:
Note: The main difference between normalizing and scaling is that in normalization
you are changing the shape of the distribution and in scaling you are changing the
range of your data. Normalizing is a useful method when you know the distribution
is not Gaussian. Normalization adjusts the values of your numeric data to a common
scale without changing the range whereas scaling shrinks or stretches the data to fit
within a specific range.
3. Encoding Categorical Variables
Label encoding example: the categorical column Height (Tall, Medium, Short) is mapped to integer labels:
Tall → 0, Medium → 1, Short → 2
Advantages:
• It allows the use of categorical variables in models that require numerical
input.
• It can improve model performance by providing more information to the
model about the categorical variable.
▪ It can help to avoid the problem of ordinality, which can occur when a
categorical variable has a natural ordering (e.g. “small”, “medium”, “large”).
One-hot encoding example. Input data (Fruit, label, Price):
apple, 1, 5
mango, 2, 10
apple, 1, 15
orange, 3, 20
The output after applying one-hot encoding on the data is given as follows (columns: apple, mango, orange, Price):
1, 0, 0, 5
0, 1, 0, 10
1, 0, 0, 15
0, 0, 1, 20
3. Binary Encoding:
Binary encoding combines elements of label encoding and one-hot encoding. It first assigns unique integer labels to each category and then represents these labels in binary form. It is especially useful when we have many categories, since it reduces the dimensionality compared to one-hot encoding.
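A hedged sketch of label, one-hot, and binary encoding on the fruit example; the binary-encoding step assumes the optional category_encoders package is installed, so it is shown commented out.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"fruit": ["apple", "mango", "apple", "orange"],
                   "price": [5, 10, 15, 20]})

# Label encoding: each category is replaced by an integer label
df["fruit_label"] = LabelEncoder().fit_transform(df["fruit"])

# One-hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(df, columns=["fruit"])

# Binary encoding (assumes the optional category_encoders package is available)
# from category_encoders import BinaryEncoder
# binary = BinaryEncoder(cols=["fruit"]).fit_transform(df)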
Feature Selection
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection.
Each machine learning process depends on feature engineering, which mainly
contains two processes, which are Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective,
both are completely different from each other. The main difference between them is
that feature selection is about selecting the subset of the original feature set, whereas
feature extraction creates new features.
Feature selection is a way of reducing the input variable for the model by using only
relevant data to reduce overfitting in the model.
1. Wrapper Methods
In the wrapper methodology, the selection of features is treated as a search problem in which different combinations are made, evaluated, and compared with other combinations. The algorithm is trained iteratively using subsets of features. Based on the output of the model, features are added or removed, and the model is trained again with the new feature set.
2. Filter Methods
In the filter method, features are selected based on statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out the irrelevant features and redundant columns from the model by ranking them with different metrics.
The advantage of filter methods is that they need little computational time and do not overfit the data. Some common filter techniques are listed below; a short sketch using the chi-square score follows the list.
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
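A minimal sketch of a filter method using the chi-square score from the list above; the Iris dataset is used only because its features are non-negative, as the chi-square test requires.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # 4 non-negative numeric features

# Keep the 2 features with the highest chi-square scores against the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())              # boolean mask of the selected columns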
3. Embedded Methods
Embedded methods are also iterative: every iteration of the model-training process is evaluated, and the features that contribute most to the training in that iteration are selected. Common embedded techniques include regularization (e.g., LASSO/L1 penalties) and tree-based feature importance.
Data Merging
Combining multiple datasets is typically done with joins; the main join types are listed below, followed by a short pandas sketch.
• Inner Join: Uses a comparison operator to match rows from two tables based on the values in common columns from each table.
• Left join/left outer join.
Returns all the rows from the left table that are specified in the left outer join
clause, not just the rows in which the columns match.
• Right join/right outer join
Returns all the rows from the right table that are specified in the right outer
join clause, not just the rows in which the columns match.
• Full outer join
Returns all the rows in both the left and right tables.
• Cross joins (cartesian join)
Returns all possible combinations of rows from two tables.
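A hedged pandas sketch of the join types listed above; the two small tables are hypothetical.

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
orders = pd.DataFrame({"cust_id": [2, 3, 4], "amount": [250, 400, 150]})

inner = customers.merge(orders, on="cust_id", how="inner")   # only matching rows
left = customers.merge(orders, on="cust_id", how="left")     # all customers
right = customers.merge(orders, on="cust_id", how="right")   # all orders
full = customers.merge(orders, on="cust_id", how="outer")    # union of both tables
cross = customers.merge(orders, how="cross")                 # cartesian product (pandas >= 1.2)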
____________________________________________________________
Chapter Ends…
Chapter-4
Data Wrangling and Feature Engineering
Data Wrangling and Feature Engineering: Data wrangling techniques: reshaping, pivoting, aggregating; Feature engineering: creating new features, handling time-series data; Dummification: converting categorical variables into binary indicators; Feature scaling: standardization, normalization
Data Wrangling:
A data wrangling process, also known as a data munging process, consists of
reorganizing, transforming, and mapping data from one "raw" form into
another to make it more usable and valuable for a variety of downstream uses
including analytics.
Data wrangling can be defined as the process of cleaning, organizing, and
transforming raw data into the desired format for analysts to use for prompt
decision-making. Also known as data cleaning or data munging, data
wrangling enables businesses to tackle more complex data in less time,
produce more accurate results, and make better decisions.
Reshaping Data:
Reshaping data involves changing the structure of the dataset. The shape of a data
set refers to the way in which a data set is arranged into rows and columns, and
reshaping data is the rearrangement of the data without altering the content of the
data set. Reshaping data sets is a very frequent and cumbersome task in the process
of data manipulation and analysis.
Pivoting
Data pivoting enables us to rearrange the columns and rows in a report so we
can view data from different perspectives. Common pivoting techniques
include:
• Pivot Tables: A PivotTable is an interactive way to quickly summarize
large amounts of data. You can use a PivotTable to analyze numerical
data in detail and answer unanticipated questions about your data.
PivotTable is especially designed for: Querying large amounts of data
in many user-friendly ways.
• Crosstabs (Contingency Tables): A contingency table (also known as
a cross tabulation or crosstab) is a type of table in a matrix format that
displays the multivariate frequency distribution of the variables. They
are heavily used in survey research, business intelligence, engineering,
and scientific research.
• Transpose: This simple operation flips rows and columns, making the
data easier to work with in some cases.
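A minimal pandas sketch of the pivoting techniques above, using a hypothetical sales table:

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 150, 120, 90],
})

# Pivot table: summarize revenue by region (rows) and product (columns)
pivot = pd.pivot_table(sales, values="revenue", index="region",
                       columns="product", aggfunc="sum")

# Crosstab (contingency table): frequency counts of region vs. product
crosstab = pd.crosstab(sales["region"], sales["product"])

# Transpose: flip rows and columns
transposed = pivot.T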
Data Aggregation
Data aggregation is the process of compiling (typically large) amounts of information from a given database and organizing it into a more consumable and comprehensive form. A common statistical aggregation is reducing a distribution of values to a mean and standard deviation. Another example of data reduction is a frequency table.
A histogram is an example of aggregation for exploration. Histograms count
(aggregate) the number of observations that fall into bins. While some data is
lost in this aggregation, it also provides a very useful visualization of the
distribution of a set of values.
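A short, hedged aggregation sketch with pandas; the employee data is hypothetical.

import pandas as pd

employees = pd.DataFrame({
    "department": ["IT", "IT", "HR", "HR", "Sales"],
    "salary": [55000, 60000, 45000, 47000, 52000],
})

# Reduce the salary distribution per department to summary statistics
summary = employees.groupby("department")["salary"].agg(["mean", "std", "count"])

# A frequency table is another simple form of aggregation
freq = employees["department"].value_counts()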
Feature engineering involves processes such as feature creation, transformation, extraction, and selection.
2. Feature Transformation: This step transforms features so the model can handle the variety of data; it ensures that all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and ensures that all the features are within an acceptable range to avoid computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering
process that generates new variables by extracting them from the raw data.
The main aim of this step is to reduce the volume of data so that it can be
easily used and managed for data modeling. Feature extraction methods
include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
4. Feature Selection: While developing the machine learning model, only a few
variables in the dataset are useful for building the model, and the rest features
are either redundant or irrelevant. If we input the dataset with all these
redundant and irrelevant features, it may negatively impact and reduce the
overall performance and accuracy of the model. Hence it is very important to
identify and select the most appropriate features from the data and remove the
irrelevant or less important features, which is done with the help of feature
selection in machine learning. "Feature selection is a way of selecting the
subset of the most relevant features from the original features set by removing
the redundant, irrelevant, or noisy features."
1. Imputation
2. Handling Outliers
Outliers are the deviated values or data points that are observed too away from other
data points in such a way that they badly affect the performance of the model.
Outliers can be handled with this feature engineering technique. This technique first
identifies the outliers and then removes them.
Standard deviation can be used to identify outliers. For example, each value in a dataset has a definite distance from the average; if that distance is greater than a certain threshold, the value can be considered an outlier. The Z-score can also be used to detect outliers.
3. Log transform
Logarithm transformation or log transform is one of the commonly used
mathematical techniques in machine learning. Log transform helps in handling the
skewed data, and it makes the distribution more approximate to normal after
transformation. It also reduces the effects of outliers on the data, as because of the
normalization of magnitude differences, a model becomes much more robust.
4. Binning
In machine learning, overfitting is one of the main issues that degrades the
performance of the model, and which occurs due to a greater number of parameters
and noisy data. However, one of the popular techniques of feature engineering,
"binning", can be used to normalize the noisy data. This process involves segmenting
different features into bins.
5. Feature Split
As the name suggests, feature split is the process of splitting a feature into two or more parts to make new features. This technique helps the algorithms better understand and learn the patterns in the dataset.
The feature splitting process enables the new features to be clustered and binned,
which results in extracting useful information and improving the performance of the
data models.
• Lagged variables: A lag variable is a variable based on the past values of the
time series. By incorporating previous time series values as features, patterns
such as seasonality and trends can be captured. For example, if we want to
predict today's sales, using lagged variables like yesterday’s sales can provide
valuable information about the ongoing trend.
• Moving window statistics: Moving statistics can also be called moving
window statistics, rolling statistics, or running statistics. A predefined window
around each dimension value is used to calculate various statistics before
moving to the next.
• Time-based features: features such as the day of the week, the month of the year, holiday indicators, seasonality, and other time-related patterns can be valuable for prediction. For instance, if certain products tend to have higher average sales on weekends, incorporating the day of the week as a feature can improve the accuracy of the forecasting model.
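A hedged pandas sketch of the three kinds of time-series features just described; the daily sales figures are hypothetical.

import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "units": [12, 15, 14, 20, 22, 19, 25, 24, 30, 28],
})

# Lagged variable: yesterday's sales as a feature for today
sales["units_lag_1"] = sales["units"].shift(1)

# Moving-window (rolling) statistic: 3-day rolling mean
sales["units_roll_mean_3"] = sales["units"].rolling(window=3).mean()

# Time-based features: day of week and a weekend indicator
sales["day_of_week"] = sales["date"].dt.dayofweek
sales["is_weekend"] = sales["day_of_week"].isin([5, 6]).astype(int)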
Feature Scaling
Min-Max normalization rescales a feature X to X' = (X − Xmin) / (Xmax − Xmin), where Xmax and Xmin are the maximum and the minimum values of the feature, respectively.
• When the value of X is the minimum value in the column, the numerator will
be 0, and hence X’ is 0
• On the other hand, when the value of X is the maximum value in the column,
the numerator is equal to the denominator, and thus the value of X’ is 1
• If the value of X is between the minimum and the maximum value, then the
value of X’ is between 0 and 1
Standardization rescales a feature X to X' = (X − μ) / σ, where μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that, in this case, the values are not restricted to a particular range.
Normalization:
• Rescales values to a range between 0 and 1.
• Useful when the distribution of the data is unknown or not Gaussian.
• Retains the shape of the original distribution.
• May not preserve the relationships between the data points.

Standardization:
• Centers data around the mean and scales it to a standard deviation of 1.
• Useful when the distribution of the data is Gaussian or unknown.
• Changes the shape of the original distribution.
• Preserves the relationships between the data points.
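The two formulas can be verified directly with NumPy; this is a minimal demonstration on hypothetical values (the scikit-learn scalers shown in Chapter 3 do the same work on whole datasets).

import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])

# Min-Max normalization: X' = (X - Xmin) / (Xmax - Xmin)  ->  values in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: X' = (X - mean) / std  ->  mean 0, standard deviation 1
x_standard = (x - x.mean()) / x.std()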
__________________________________________________________________
Chapter Ends…
Chapter-5
Tools and Libraries
Tools and Libraries: Introduction to popular libraries and technologies used in Data Science, like Pandas, NumPy, Scikit-learn, etc.
3. Pandas: Pandas is built on top of two core Python libraries: matplotlib for data visualization and NumPy for mathematical operations. Pandas acts as a wrapper over these libraries, allowing you to access many of matplotlib's and NumPy's methods with less code.
5. SciPy: SciPy is an open-source Python library that is used in almost every field of science and engineering, providing routines for optimization, statistics, and signal processing. Like NumPy, SciPy is open source, so we can use it freely. SciPy was created by NumPy's creator, Travis Oliphant.
6. Scrapy: Scrapy is a comprehensive open-source framework and is among the
most powerful libraries used for web data extraction. Scrapy natively
integrates functions for extracting data from HTML or XML sources using
CSS and XPath expressions.
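A small, hedged sketch showing some of these libraries working together; the height values are hypothetical.

import numpy as np
import pandas as pd
from scipy import stats

# NumPy: fast numerical arrays
heights = np.array([150.0, 160.0, 165.0, 170.0, 180.0])

# Pandas: labelled, tabular data built on top of NumPy arrays
df = pd.DataFrame({"height_cm": heights})
print(df.describe())                                   # summary statistics

# SciPy: scientific routines, e.g. a one-sample t-test against a reference mean
t_stat, p_value = stats.ttest_1samp(heights, popmean=165.0)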
__________________________________________________________
Chapter Ends…
Chapter-6
Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data
visualization methods.
Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures like the mean, median, mode, standard deviation, range, and percentiles are commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help identify patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of different variables and their transformations to create new features or derive meaningful insights. Feature engineering can include scaling, normalization, binning, encoding categorical variables, and creating interaction or derived variables.
Histograms
Histograms are one of the most popular visualizations to analyze the distribution of
data. They show the distribution of a numerical variable with bars. The hist function in Matplotlib is used to create a histogram.
To build a histogram, the numerical data is first divided into several ranges or bins,
and the frequency of occurrence of each range is counted. The horizontal axis shows
the range, while the vertical axis represents the frequency or percentage of
occurrences of a range.
Histograms immediately showcase how a variable's distribution is skewed or where
it peaks.
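A minimal Matplotlib sketch of a histogram; the data is randomly generated for illustration.

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(loc=50, scale=10, size=1000)   # hypothetical values

plt.hist(data, bins=20, edgecolor="black")   # divide the data into 20 bins and count occurrences
plt.xlabel("Value range")
plt.ylabel("Frequency")
plt.title("Histogram of a numerical variable")
plt.show()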
Box plots
A box plot (box-and-whisker plot) summarizes a distribution using the following components:
• Median. The middle value of a dataset, where 50% of the data is less than the median and 50% of the data is higher than the median.
• The upper quartile. The 75th percentile of a dataset where 75% of the data
is less than the upper quartile, and 25% of the data is higher than the upper
quartile.
• The lower quartile. The 25th percentile of a dataset where 25% of the data
is less than the lower quartile and 75% is higher than the lower quartile.
• The interquartile range. The upper quartile minus the lower quartile
• The upper adjacent value. Or colloquially, the “maximum.” It represents the
upper quartile plus 1.5 times the interquartile range.
• The lower adjacent value. Or colloquially, the “minimum." It represents the
lower quartile minus 1.5 times the interquartile range.
• Outliers. Any values above the “maximum” or below the “minimum.”
Scatter plots.
Scatter plots are used to visualize the relationship between two continuous variables.
Each point in the plot represents a single data point, and the position of the point on
the x and y-axis represents the values of the two variables. It is often used in data
exploration to understand the data and quickly surface potential correlations.
Heat maps.
A heatmap is a common and beautiful matrix plot that can be used to graphically
summarize the relationship between two variables. The degree of correlation
between two variables is represented by a color code.
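A hedged sketch of the box plot, scatter plot, and heatmap described above; it assumes seaborn is available for the heatmap, and the data is randomly generated.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=100)   # correlated with x
df["z"] = rng.normal(size=100)                            # unrelated noise

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(df["x"])                         # box plot: quartiles, whiskers, outliers
axes[1].scatter(df["x"], df["y"])                # scatter plot: relationship between x and y
sns.heatmap(df.corr(), annot=True, ax=axes[2])   # heatmap of the correlation matrix
plt.tight_layout()
plt.show()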
Measure of central tendency is the representation of various values of the given data
set. There are various measures of central tendency and the most important three
measures of central tendency are,
• Mean (x̅ or μ)
• Median(M)
• Mode(Z)
Mean is the sum of all the values in the data set divided by the number of values in
the data set. It is also called the Arithmetic Average. Mean is denoted as x̅ and is
read as x bar.
Mean Formula
The formula to calculate the mean is,
Mean (x̅) = Sum of Values / Number of Values
If x1, x2, x3,……, xn are the values of a data set then the mean is calculated as:
x̅ = (x1 + x2 + x3 + …… + xn) / n
Median:
A median is the middle value of sorted data. The sorting of the data can be done either in ascending or descending order. The median divides the data into two equal halves.
If the number of values (n value) in the data set is odd then the formula to calculate
the median is,
Median = [(n + 1)/2]th term
If the number of values (n value) in the data set is even then the formula to calculate
the median is:
Median = [(n/2)th term + {(n/2) + 1}th term] / 2
Mode:
A mode is the most frequent value or item of the data set. A data set can generally
have one or more than one mode value. If the data set has one mode, then it is called
“Uni-modal”. Similarly, if the data set contains 2 modes, then it is called “Bimodal”,
and if the data set contains 3 modes, then it is known as “Trimodal”. If the data set
consists of more than one mode, then it is known as “multi-modal” (can be bimodal
or trimodal). There is no mode for a data set if every number appears only once.
Mode Formula
Mode = Highest Frequency Term
Standard Deviation
Standard deviation is a measure that shows how much variation (spread or dispersion) from the mean exists. The standard deviation indicates a “typical” deviation from the mean. It is a popular measure of variability because it is expressed in the original units of measure of the data set. Like the variance, if the data points are close to the mean there is little variation, whereas if the data points are highly spread out from the mean the variation is high. Standard deviation measures the extent to which the values differ from the average. Standard deviation, the most widely used measure of dispersion, is based on all values. Therefore, a change in even one value affects the value of the standard deviation. It is independent of origin but not of scale. It is also useful in certain advanced statistical problems.
Standard Deviation Formula
The population standard deviation is given as:
σ = √( Σ (xi − μ)² / N )
where μ is the population mean, xi are the individual values, and N is the number of values in the population.
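A short Python check of these measures on a small worked example:

import statistics
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = np.mean(data)            # (2+4+4+4+5+5+7+9) / 8 = 5
median = np.median(data)        # average of the two middle sorted values = 4.5
mode = statistics.mode(data)    # most frequent value = 4
population_std = np.std(data)   # sqrt(sum((x - mean)^2) / N) = 2.0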
Hypothesis Testing
Hypothesis testing is used to assess the plausibility of a hypothesis by using sample
data. The test provides evidence concerning the plausibility of the hypothesis, given
the data. Statistical analysts test a hypothesis by measuring and examining a random
sample of the population being analyzed.
An analyst performs hypothesis testing on a statistical sample to present evidence of
the plausibility of the null hypothesis. Measurements and analyses are conducted on
a random sample of the population to test a theory. Analysts use a random population
sample to test two hypotheses: the null and alternative hypotheses.
The null hypothesis is typically an equality hypothesis about population parameters; for example, a null hypothesis may claim that the population mean return equals zero. The alternative hypothesis is essentially the opposite of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, they are mutually exclusive, and only one can be correct. One of the two possibilities, however, will always be correct.
Z = ( x̅ – μ0 ) / (σ /√n)
Let's consider a hypothesis test for the average height of women in the United States.
Suppose our null hypothesis is that the average height is 5'4". We gather a sample of
100 women and determine that their average height is 5'5". The population standard deviation is 2 inches.
To calculate the z-score, we would use the following formula:
z = ( x̅ – μ0 ) / (σ /√n)
z = (65 − 64) / (2 / √100)
z = 1 / 0.2
z = 5
We reject the null hypothesis, because a z-score of 5 is far beyond the usual critical values (for example, 1.96 at the 5% significance level), and conclude that there is evidence to suggest that the average height of women in the US is greater than 5'4".
The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.
H0 is the symbol for it, and it is pronounced H-naught.
The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null hypothesis.
H1 is the symbol for it.
Z Test
To determine whether a discovery or relationship is statistically significant,
hypothesis testing uses a z-test. It usually checks to see if the two means are the same
(the null hypothesis). Only when the population standard deviation is known and the
sample size is 30 data points or more, can a z-test be applied.
T Test
A statistical test called a t-test is employed to compare the means of two groups. To
determine whether two groups differ or if a procedure or treatment affects the
population of interest, it is frequently used in hypothesis testing.
Chi-Square
The Chi-square test analyzes the differences between categorical variables from a
random sample. The test's fundamental premise is that the observed values in our
data should be compared to the predicted values that would be present if the null
hypothesis were true. In other words, the chi-square test is a hypothesis-testing method used to check whether the variables in a population are independent or not.
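A minimal SciPy sketch of a chi-square test of independence; the contingency table is hypothetical.

from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = product preference
observed = [[30, 10],
            [20, 40]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
# A small p-value (e.g. < 0.05) suggests the two variables are not independent
print(chi2_stat, p_value)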
A two-tailed hypothesis, also known as non-directional, will still predict that there
will be an effect, but will not say what direction it will appear in. For example, in
the same study a two-tailed hypothesis might look like, there will be a significant
difference in the grades of students with high attendance and students with low
attendance.
A two-tailed test is designed to determine whether a claim is true or not given a
population parameter. It examines both sides of a specified data range as designated
by the probability distribution involved. As such, the probability distribution should
represent the likelihood of a specified outcome based on predetermined standards.
The hypothesis can be set up as follows:
Null hypothesis: the population parameter = some value
Alternative hypothesis: the population parameter ≠ some value
i) One-way ANOVA: A one-way ANOVA is used to compare the mean of a quantitative dependent variable across the levels of a single categorical independent variable. It involves:
• A dependent variable
• An independent variable (also known as the grouping variable, or factor)
o This variable divides cases into two or more mutually exclusive levels, or groups
ii) Two-way ANOVA: A two-way ANOVA is used to estimate how the mean of a
quantitative variable changes according to the levels of two categorical variables.
Use a two-way ANOVA when we want to know how two independent variables, in
combination, affect a dependent variable.
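A minimal one-way ANOVA sketch with SciPy; the scores for the three groups are hypothetical.

from scipy.stats import f_oneway

# Hypothetical test scores for three teaching methods (the grouping factor)
method_a = [85, 90, 88, 92]
method_b = [78, 82, 80, 79]
method_c = [90, 95, 93, 91]

f_stat, p_value = f_oneway(method_a, method_b, method_c)
# A small p-value suggests that at least one group mean differs from the others
print(f_stat, p_value)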
_______________________________________________________________
Chapter Ends…
Unit-3
Model Evaluation, Data Visualization, and Management
Chapter-11
Model Evaluation Metrics: Accuracy, precision, recall, F1-score, Area Under the
Curve (AUC), Evaluating models for imbalanced datasets.
Evaluation metrics are quantitative measures used to assess the performance and
effectiveness of a statistical or machine learning model. These metrics provide
insights into how well the model is performing and help in comparing different
models or algorithms.
When evaluating a machine learning model, it is crucial to assess its predictive
ability, generalization capability, and overall quality. Evaluation metrics provide
objective criteria to measure these aspects. The choice of evaluation metrics depends
on the specific problem domain, the type of data, and the desired outcome.
Some commonly used evaluation metrics in machine learning:
1. Accuracy
2. Precision
3. Recall
4. F1 Score
5. Area under the Receiver Operating Characteristic (ROC-AUC)
6. Confusion Matrix
1. Accuracy
Accuracy measures the proportion of correct predictions out of all predictions made: Accuracy = (number of correct predictions) / (total number of predictions). To implement an accuracy metric, we can compare ground truth and predicted values in a loop, or we can use the scikit-learn module. Although it is simple to use and implement, it is suitable only for cases where a roughly equal number of samples belong to each class.
It is good to use the accuracy metric when the target variable classes in the data are approximately balanced, for example, if 60% of the images in a fruit dataset are of apples and 40% are of mangoes. In this case, asking the model to predict whether an image shows an apple or a mango and measuring accuracy gives a fair picture of its performance.
It is recommended not to use the Accuracy measure when the target variable majorly
belongs to one class. For example, suppose there is a model for a disease prediction
in which, out of 100 people, only five people have a disease, and 95 people don't
have one. In this case, if our model predicts every person with no disease (which
means a bad prediction), the Accuracy measure will be 95%, which is not correct.
2. Precision
The precision metric is used to overcome the limitation of Accuracy. The precision
determines the proportion of positive predictions that was correct. It can be
calculated as the True Positive or predictions that are true to the total positive
predictions (True Positive and False Positive).
3. Recall or Sensitivity
Recall is similar to the precision metric; however, it aims to calculate the proportion of actual positives that were identified correctly. It is calculated as the number of true positives divided by the total number of actual positives, whether correctly predicted as positive or incorrectly predicted as negative (true positives plus false negatives): Recall = TP / (TP + FN).
From the above definitions of Precision and Recall, we can say that recall determines
the performance of a classifier with respect to a false negative, whereas precision
gives information about the performance of a classifier with respect to a false
positive.
In simple words, if we maximize precision, it will minimize the FP errors, and if we
maximize recall, it will minimize the FN error.
4. F1-score
The F1-score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). Because the F-score makes use of both precision and recall, it should be used when both are important for evaluation but one of them (precision or recall) is slightly more important to consider than the other, for example, when false negatives are comparatively more important than false positives, or vice versa.
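These metrics can be computed with scikit-learn; a minimal sketch on hypothetical labels:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground truth (1 = positive class)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

print(accuracy_score(y_true, y_pred))     # correct predictions / all predictions
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]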
5. AUC-ROC
Firstly, let's understand the ROC (Receiver Operating Characteristic curve) curve.
ROC represents a graph to show the performance of a classification model at
different threshold levels. The curve is plotted between two parameters, which are:
TPR (True Positive Rate) is a synonym for recall and is calculated as TP / (TP + FN); FPR (False Positive Rate) is calculated as FP / (FP + TN).
To calculate the value at any point on a ROC curve, we could evaluate a logistic regression model multiple times with different classification thresholds, but this would not be very efficient. Instead, an efficient summary measure is used, which is known as AUC.
AUC: Area Under the ROC curve
AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional area under the entire ROC curve.
AUC calculates the performance across all the thresholds and provides an aggregate
measure. The value of AUC ranges from 0 to 1. It means a model with 100% wrong
prediction will have an AUC of 0.0, whereas models with 100% correct predictions
will have an AUC of 1.0.
When to Use AUC
AUC should be used to measure how well the predictions are ranked rather than their
absolute values. Moreover, it measures the quality of predictions of the model
without considering the classification threshold.
When not to use AUC
AUC is scale-invariant, which is not always desirable; when we need well-calibrated probability outputs, AUC is not the preferred metric.
Further, AUC is not a useful metric when there are wide disparities in the cost of false negatives versus false positives and we need to minimize one particular type of classification error.
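A minimal scikit-learn sketch of AUC and the ROC curve; the predicted probabilities are hypothetical.

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]   # predicted probability of class 1

auc = roc_auc_score(y_true, y_scores)                  # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)     # points that make up the ROC curve
print(auc)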
6 Confusion Matrix
A confusion matrix is a tabular representation of prediction outcomes of any binary
classifier, which is used to describe the performance of the classification model on
a set of test data when true values are known.
The confusion matrix is simple to implement, but the terminologies used in this
matrix might be confusing for beginners.
A typical confusion matrix for a binary classifier is laid out as follows (it can be extended to classifiers with more than two classes):
o In the matrix, columns are for the prediction values, and rows specify the
Actual values. Here Actual and prediction give two possible classes, Yes or
No. So, if we are predicting the presence of a disease in a patient, the
Prediction column with Yes means, Patient has the disease, and for NO, the
Patient doesn't have the disease.
o In this example, the total number of predictions is 165, out of which the model predicted Yes 110 times and No 55 times.
o In actuality, there are 60 cases in which patients don't have the disease and 105 cases in which patients do have the disease.
In general, the table is divided into four terminologies, which are as follows:
1. True Positive (TP): In this case, the predicted outcome is true, and it is true
in reality, also.
2. True Negative (TN): in this case, the predicted outcome is false, and it is
false, also.
3. False Positive (FP): In this case, predicted outcomes are true, but they are
false in actuality.
4. False Negative (FN): In this case, predictions are false, and they are true in
actuality.
For imbalanced datasets, accuracy alone is misleading; the following metrics are preferred:
• F1 score
• Precision
• Recall
• AUC score (AUC ROC)
Chapter ends…
Chapter-12
Data Visualization
7. Seek balance in your visual elements, including texture, color, shape, and
negative space.
8. Use patterns (of chart types, colors, or other design elements) to identify
similar types of information.
9. Use proportion carefully so that differences in design size fairly represent
differences in value.
10. Be skeptical. Ask yourself questions about what data is not represented and
what insights might therefore be misinterpreted or missing.
1. Line Charts: In a line chart, each data point is represented by a point on the
graph, and these points are connected by a line. We may find patterns and
trends in the data across time by using line charts. Time-series data is
frequently displayed using line charts.
Advantages/Use
• A line graph is a graph that is used to display change over time as a series of
data points connected by straight line segments on two axes.
• A line graph is also called a line chart. It helps to determine the relationship
between two sets of values, with one data set always being dependent on the
other data set.
• They are helpful for demonstrating trends and patterns in the data, and they
allow us to extrapolate and make predictions about values not yet recorded.
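A minimal matplotlib sketch of a line chart; the monthly sales figures below are hypothetical:

# Minimal sketch (assumes matplotlib): a line chart of hypothetical monthly sales.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 170, 190]     # hypothetical values

plt.plot(months, sales, marker="o")        # points connected by straight line segments
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.show()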
2. Scatter Plots: A scatter plot (also called a scatter diagram) displays the values of
two numerical variables as points on a graph. The link between the variables is
indicated by the direction of the correlation on the graph. A correlation in a scatter
diagram occurs when the two variables are found to have a connection.
Positive correlation
If variables have a positive correlation, this signifies that when the independent
variable's value rises, the dependent variable's value rises as well.
As the weight of human adults increases, the risk of diabetes also increases. The
pattern of observation in this example would slant from the chart's bottom left to the
upper right.
Negative correlation
In the negative correlation, when the value of one variable grows, the value of the
other variable falls. The dependent variable's value drops as the independent
variable's value rises.
Here’s an example: When summer temperatures rise, sales of winter clothing
decline. The pattern of observation in this example would slant from the top left to
the bottom right of the graph.
No correlation
The "no correlation" type is used when there's no potential link between the
variables. It's also known as zero correlation. The two variables plotted aren't
connected in any way.
The area of land and air quality index, for example, have no relationship. As an area
grows, there is no effect on the air quality. These two variables have no association,
and the observations will be dispersed all over the graph.
Disadvantages/Limitations
1. Reading scatter diagrams incorrectly may lead to false conclusions that one
variable caused the other, when both may have been influenced by a third.
2. A relationship in a scatter diagram may not be apparent because the data does
not cover a wide enough range.
3. Associations between more than two variables are not shown in scatter plots.
4. Scatter diagrams cannot provide the precise extent of association.
5. A scatter plot does not indicate the quantitative measure of the relationship
between the two variables.
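A minimal matplotlib sketch of a scatter plot showing a positive correlation; the weight and risk-score values are hypothetical:

# Minimal sketch (assumes matplotlib): scatter plot of two hypothetical variables.
import matplotlib.pyplot as plt

weight_kg = [58, 63, 70, 74, 80, 85, 92, 98]            # hypothetical values
risk_score = [2.1, 2.4, 3.0, 3.2, 3.9, 4.3, 5.0, 5.6]   # hypothetical values

plt.scatter(weight_kg, risk_score)
plt.title("Weight vs. Diabetes Risk (illustrative data)")
plt.xlabel("Weight (kg)")
plt.ylabel("Risk score")
plt.show()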
3. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar
chart, each category is represented by a bar, with the height of the bar indicating the
frequency or proportion of that category in the data. Bar graphs are useful for
comparing several categories and seeing patterns over time.
Advantages: A bar chart can:
• show each data category in a frequency distribution.
• display relative numbers or proportions of multiple categories.
• summarize a large dataset in visual form.
• clarify trends better than do tables.
• estimate key values immediately.
• permit a visual check of the accuracy and reasonableness of calculations.
• be easily understood due to widespread use in business and the media.
Disadvantages: A bar chart may:
• require additional explanation.
• be easily manipulated to yield false impressions.
• fail to reveal key assumptions, causes, effects, or patterns.
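A minimal matplotlib sketch of a bar chart; the categories and counts are hypothetical:

# Minimal sketch (assumes matplotlib): bar chart of hypothetical category counts.
import matplotlib.pyplot as plt

categories = ["Electronics", "Clothing", "Groceries", "Books"]
counts = [40, 25, 60, 15]   # hypothetical frequencies

plt.bar(categories, counts)
plt.title("Orders per Category")
plt.xlabel("Category")
plt.ylabel("Number of orders")
plt.show()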
4. Box plots
Box plots are a graphical representation of the distribution of a set of data. In a box
plot, the median is shown by a line inside the box, while the box itself spans the
interquartile range (the middle 50% of the values). The whiskers extend from the box
to the highest and lowest values in the data, excluding outliers. Box plots can help us
identify the spread and skewness of the data.
The box plot is suitable for comparing range and distribution for groups of numerical
data.
Advantages:
The box plot organizes large amounts of data and visualizes outlier values.
Disadvantages:
The box plot is not relevant for detailed analysis of the data as it deals with a
summary of the data distribution.
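A minimal matplotlib sketch comparing two hypothetical groups with box plots:

# Minimal sketch (assumes matplotlib): box plots for two hypothetical groups.
import matplotlib.pyplot as plt

group_a = [55, 60, 62, 65, 66, 68, 70, 72, 75, 95]   # hypothetical values; 95 is an outlier
group_b = [50, 52, 58, 59, 61, 63, 64, 67, 70, 73]

plt.boxplot([group_a, group_b], labels=["Group A", "Group B"])
plt.title("Score Distribution by Group")
plt.ylabel("Score")
plt.show()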
5. Histogram
The histogram is suitable for visualizing the distribution of numerical data over a
continuous interval or a certain time period. The data is divided into bins, and each bar
in a histogram represents the tabulated frequency within each bin.
Advantages
The histogram organizes large amounts of data, and produces a visualization
quickly, using a single dimension.
Disadvantages
The histogram is not relevant for detailed analysis of the data as it deals with a
summary of the data distribution.
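A minimal matplotlib sketch of a histogram; the exam scores are hypothetical:

# Minimal sketch (assumes matplotlib): histogram of hypothetical exam scores.
import matplotlib.pyplot as plt

scores = [45, 52, 55, 58, 60, 61, 63, 65, 66, 68, 70, 72, 73, 75, 78, 80, 84, 88, 91, 95]

plt.hist(scores, bins=5)        # data is grouped into 5 equal-width bins
plt.title("Distribution of Exam Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()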
6. Heat Maps
Heat maps are a type of graphical representation that displays data in a matrix format.
Each cell's color is determined by the value of the data point it represents.
Heat maps are often used to visualize the correlation between variables or to identify
patterns in time-series data.
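A minimal sketch, assuming pandas and seaborn are installed, of a correlation heat map over a hypothetical dataset:

# Minimal sketch (assumes pandas, seaborn, matplotlib): correlation heat map.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset with three numerical variables
df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10],
    "exam_score":    [50, 60, 72, 80, 92],
    "hours_slept":   [9, 8, 7, 6, 5],
})

# Each cell's color reflects the correlation value between a pair of variables
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heat Map")
plt.show()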
7. Tree Maps: Tree maps are used to display hierarchical data in a compact format
and are useful in showing the relationship between different levels of a hierarchy.
Advantages:
Limitations:
1. A tree map chart does not accommodate data sets that vary in magnitude.
2. All values of the quantitative variable that represent the size of the rectangle must
be positive values. Negative values are not acceptable.
3. Since the data points are depicted as rectangles packed to fill the available space,
with limited sorting options, readability can suffer: long, linear plots are easier to
read than wide, dense ones. This also makes tree maps difficult to print.
4. Some tree maps take a lot of effort to generate, even with specialized programs.
5. Sometimes tree maps do not display hierarchical levels as sharply as other charts
used to visualize hierarchical data, such as a sunburst diagram or a tree diagram.
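A minimal sketch, assuming the third-party squarify package (and matplotlib) is installed; the regions and sales figures are hypothetical:

# Minimal sketch (assumes squarify and matplotlib): tree map of hypothetical sales by region.
import matplotlib.pyplot as plt
import squarify   # assumption: installed via "pip install squarify"

sizes = [500, 300, 150, 50]                 # sizes must all be positive values
labels = ["North", "South", "East", "West"]

squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.title("Sales by Region (illustrative)")
plt.show()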
Visualization Tools
There are many data visualization tools available.
1. matplotlib
matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Created by John D. Hunter in 2003, matplotlib provides the
building blocks to create rich visualizations of many kinds of datasets. All kinds of
visualizations, such as line plots, histograms, bar plots, and scatter plots, can be
easily created with matplotlib in a few lines of code.
You can customize almost every aspect of a plot with matplotlib. This makes the tool
extremely flexible, but it can also be challenging and time-consuming to get the
perfect plot.
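A minimal sketch of the kind of fine-grained customization matplotlib allows; the data and annotation text are hypothetical:

# Minimal sketch (assumes matplotlib): customizing figure size, style, grid, and annotations.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # hypothetical values

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, color="green", linestyle="--", marker="s", linewidth=2)
ax.set_title("Customized Plot", fontsize=14)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.grid(True, alpha=0.3)
ax.annotate("fast growth", xy=(4, 16), xytext=(2, 20),
            arrowprops=dict(arrowstyle="->"))
plt.show()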
Key features:
Pros:
• High versatility.
Cons:
2. Seaborn
Any kind of visualization is possible with matplotlib. However, its wide flexibility can
be difficult to master, and you may spend hours on a plot whose design seemed
straightforward at the outset. Seaborn was designed to address these pitfalls.
It’s a Python library that allows us to generate elegant graphs easily. Seaborn is based
on matplotlib and provides a high-level interface for drawing attractive and
informative statistical graphics.
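A minimal sketch, assuming seaborn is installed, showing how a statistical plot that would take several matplotlib calls can be produced with one seaborn call:

# Minimal sketch (assumes seaborn and matplotlib): a styled statistical scatter plot.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # small example dataset (fetched and cached on first use)

# Scatter plot of bill vs. tip, colored by smoker status, with sensible default styling
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="smoker")
plt.title("Tips vs. Total Bill")
plt.show()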
Key features:
Pros:
Cons:
3. Tableau
Tableau is a powerful and popular data visualization tool that allows you to analyze
data from multiple sources simultaneously. Tableau was founded in 2003 as a spin-off
from Stanford University, and Salesforce acquired the platform in 2019.
Tableau is used by top companies to extract insights from tons of raw data. Thanks
to its intuitive and powerful platform, you can do anything with Tableau. However,
if you are just interested in building simple charts, you should go for less robust and
more affordable options.
Key features:
Pros:
Cons:
4. Power BI
Power BI is a cloud-based business analytics solution that allows you to bring
together different data sources, analyze them, and present data analysis through
visualizations, reports, and dashboards.
Microsoft’s Power BI is one of the leading BI solutions in the industry. Power BI makes
it easy to access data on almost any device, inside and outside the organization.
Key features:
Pros:
Cons:
6. ggplot2
Arguably R’s most powerful package, ggplot2 is a plotting package that provides
helpful commands to create complex plots from data in a data frame. Since its release
by Hadley Wickham in 2007, ggplot2 has become the go-to tool for flexible and
professional plots in R. ggplot2 is inspired by the data visualization methodology
called the Grammar of Graphics, whose idea is to specify the components of a graph
independently and then combine them.
Key features:
Pros:
Cons:
Data Storytelling
Data storytelling is the process of using data to communicate a story or message in
a clear and effective way. It involves selecting and organizing data in a way that
helps the audience understand and remember the key points and presenting the data
in a visually appealing and engaging manner.
1. Audience: It is important to consider who the audience is and what their needs
and interests are when selecting and presenting data. This will help ensure that
the data is relevant and engaging to the audience.
2. Data selection: Choose the data that is most relevant to the message you want
to convey and that supports your argument or point of view. This will help
make the data more meaningful and impactful.
4. Visualization: Use visual aids such as graphs, charts, and maps to help the
audience understand and remember the key points. Choose the most
appropriate type of visualization for the data and the message you want to
convey.
5. Narration: Use clear and concise language to explain the data and its
implications. This will help the audience understand the context and
significance of the data.
6. Analysis: Analyze the data to identify trends, patterns, and insights that can
help the audience understand the implications of the data.
7. Story structure: Use a clear and logical story structure to organize the data
and present it in a way that is easy for the audience to follow. This might
include an introduction, main body, and conclusion.
By presenting data in a visually appealing and easy-to-follow manner, data storytelling
can help team members understand and discuss the data, leading to more informed
and effective decision-making.
_________________________________________________________________
Chapter ends…
Chapter- 13
Data Management
Data Management:
Data management is the practice of collecting, organizing, protecting, and storing an
organization's data so it can be analyzed for business decisions. As organizations
create and consume data at unprecedented rates, data management solutions become
essential for making sense of the vast quantities of data.
“Data management comprises all disciplines related to handling data as a valuable
resource, it is the practice of managing an organization’s data so it can be
analyzed for decision making.”
6. Data Integration: Integrate data from various sources to create a unified and
comprehensive view. Implement ETL process for data integration.
7. Data Governance: Develop and enforce policies and procedures for data
management.
8. Data Privacy and Compliance: Ensure compliance with data protection
laws and regulations.
9. Data Retrieval and Analysis:
• Develop tools and systems for querying and retrieving data efficiently.
• Perform data analysis and reporting to derive insights and support
decision-making.
10. Data Documentation: Maintain documentation to facilitate collaboration and
knowledge transfer.
11. Data Auditing and Monitoring:
• Regularly audit data to ensure compliance with quality standards and
policies.
• Implement a monitoring system to detect anomalies or unauthorized
activities.
Data Pipelines
A data pipeline is a method by which raw data is ingested from various data sources,
transformed, and then ported to a data store, such as a data lake or data warehouse,
for analysis. Before data flows into a data repository, it usually undergoes some
processing. This includes data transformations, such as filtering, masking, and
aggregations, which ensure appropriate data integration and standardization.
1. Data ingestion: Raw data is collected from the various sources and is typically
stored before it is processed further. This way, the business can update any historical
data if it needs to make adjustments to data processing jobs. During this data
ingestion process, various validations and checks can be performed to ensure the
consistency and accuracy of the data.
2. Data transformation: During this step, a series of jobs are executed to process
data into the format required by the destination data repository. These jobs embed
automation and governance for repetitive workstreams, such as business reporting,
ensuring that data is cleansed and transformed consistently. For example, a data
stream may come in a nested JSON format, and the data transformation stage will
aim to unroll that JSON to extract the key fields for analysis.
3. Data storage: The transformed data is then stored within a data repository, where
it can be exposed to various stakeholders. In streaming data pipelines, the end users
who receive this transformed data are typically known as consumers, subscribers, or
recipients.
ETL stands for Extract, Transform, Load. It is a process used in data
warehousing to extract data from various sources, transform it into a
format suitable for loading into a data warehouse, and then load it into the
warehouse. The ETL process can be broken down into the following
three stages:
1. Extract: The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This step
involves reading data from the source systems and storing it in a staging
area.
2. Transform: In this stage, the extracted data is transformed into a format
that is suitable for loading into the data warehouse. This may involve
cleaning and validating the data, converting data types, combining data
from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse.
This step involves creating the physical data structures and loading the data
into the warehouse.
The ETL process is an iterative process that is repeated as new data is
added to the warehouse. The process is important because it ensures that
the data in the data warehouse is accurate, complete, and up to date. It also
helps to ensure that the data is in the format required for data mining and
reporting.
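A minimal sketch, assuming pandas is installed, of an Extract-Transform-Load job; the file name, column names, and target table are hypothetical, and SQLite stands in for a real warehouse:

# Minimal sketch (assumes pandas): a tiny ETL job with hypothetical names.
import sqlite3
import pandas as pd

# Extract: read raw data from a source file (staging area)
raw = pd.read_csv("sales_raw.csv")             # hypothetical source file

# Transform: clean, convert types, and derive new fields
raw = raw.dropna(subset=["order_id"])          # drop invalid rows
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["total"] = raw["quantity"] * raw["unit_price"]

# Load: write the transformed data into a warehouse table
conn = sqlite3.connect("warehouse.db")
raw.to_sql("fact_sales", conn, if_exists="append", index=False)
conn.close()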
Data Security
• Due to a lack of adequate data security practices, data breaches can occur and
expose organizations to financial loss, a decrease in consumer confidence, and
brand erosion. If consumers lose trust in an organization, they will likely move
their business elsewhere and devalue the brand.
• Breaches that result in the loss of trade secrets and intellectual property can
affect an organization’s ability to innovate and remain profitable in the future.
There are various types of data security technologies in use today that protect against
various external and internal threats. Organizations should be using many of them
to secure all potential threat access points and safeguard their data. Below are some
of the techniques:
Data encryption
Data encryption uses an algorithm to scramble every data character, converting
information into an unreadable format. Only authorized users who hold the encryption
keys can decrypt the data and read the files.
Encryption technology acts as the last line of defense in the event of a breach when
confidential and sensitive data is concerned. It is crucial to ensure that the encryption
keys are stored in a secure place where access is restricted. Data encryption can also
include capabilities for security key management.
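A minimal sketch, assuming the third-party cryptography package is installed, of symmetric encryption with Fernet; in practice the key would be stored in a restricted key-management system:

# Minimal sketch (assumes the "cryptography" package): symmetric encryption with Fernet.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, store this key in a secure, restricted location
cipher = Fernet(key)

token = cipher.encrypt(b"customer card number: 4111-XXXX")   # unreadable ciphertext
print(token)

plaintext = cipher.decrypt(token)    # only possible with the correct key
print(plaintext)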
Authentication
Authentication is the process of confirming or validating user login credentials to make
sure they match the information stored in the database. User credentials include
usernames, passwords, PINs, security tokens, swipe cards, biometrics, etc.
Authentication is a frontline defense against unauthorized access to confidential and
sensitive information, making it an important process. Authentication technologies,
such as single sign-on, multi-factor authentication, and breached password detection
make it simpler to secure the authentication process while maintaining user
convenience.
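A minimal sketch, using only the Python standard library, of verifying a password against a salted hash instead of storing the password itself; the password strings are hypothetical:

# Minimal sketch (standard library only): salted password hashing and verification.
import hashlib
import hmac
import secrets

def hash_password(password, salt=None):
    """Return (salt, digest) for the given password."""
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Recompute the hash and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

salt, stored = hash_password("S3cret!pass")            # stored in the user database
print(verify_password("S3cret!pass", salt, stored))    # True
print(verify_password("wrong-guess", salt, stored))    # False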
Data masking
Masking whole data or specific data areas can help protect it from exposure to
unauthorized or malicious sources externally or internally. Masking can be applied
to personally identifiable information (PII), such as a phone number or email
address, by obscuring parts of the PII, e.g., the first eight digits or letters, within a
database.
Proxy characters are used to mask the data characters. The data masking software
changes the data back to its original form only when the data is received by an
authorized user. Data masking allows the development of applications using actual
data.
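A minimal sketch of masking parts of PII with proxy characters; the field formats and masking rules here are hypothetical:

# Minimal sketch: masking PII fields with proxy characters (hypothetical rules).
def mask_phone(phone):
    """Replace all but the last two digits with '*'."""
    return "*" * (len(phone) - 2) + phone[-2:]

def mask_email(email):
    """Keep the first character and the domain, mask the rest of the local part."""
    local, domain = email.split("@", 1)
    return local[0] + "*" * (len(local) - 1) + "@" + domain

print(mask_phone("9876543210"))          # ********10
print(mask_email("megha@example.com"))   # m****@example.com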
Tokenization
Tokenization is similar to data encryption but differs in that it replaces data with random
characters, whereas encryption scrambles data with an algorithm. The “token,” which
maps back to the original data, is stored separately in a database lookup table, where
it is protected from unauthorized access.
Data erasure
Data erasure occurs when data is no longer needed or active in the system. The
erasure process uses software to delete data on a hardware storage device. The data
is permanently deleted from the system and is irretrievable.
Data resilience
Data resilience is determined by an organization’s ability to recover from incidents
such as a data breach, data corruption, power failure, hardware failure, or loss of
data. Data centers with backup copies of data can recover quickly after a disruptive
event.
Physical access controls
Unlike digital access control, which can be managed through authentication,
physical access control is managed through control of access to physical areas or
premises where data is physically stored, i.e., server rooms and data center locations.
Physical access control uses security personnel, key cards, retina scans, thumbprint
recognition, and other biometric authentication measures.
An organization can take several steps in addition to the data security technologies
above to ensure robust data security management.
3. Data backup: Practicing backup of all data ensures the business will continue
uninterrupted in the event of a data breach, software or hardware failure, or
any type of data loss. Backup copies of critical data should be robustly tested
to ensure adequate insurance against data loss. Furthermore, backup files
should be subjected to equal security control protocols that manage access to
core primary systems.
4. Data security risk assessment: it is prudent to carry out regular assessments
of data security systems to detect vulnerabilities and potential losses in the
event of a breach. The assessment can also detect out-of-date software and
any misconfigurations needing redress.
5. Quarantine sensitive files: Data security software should be able to
frequently categorize sensitive files and transfer them to a secure location.
6. Data file activity monitoring: Data security software should be able to
analyze data usage patterns for all users. It will enable the early identification
of any anomalies and possible risks. Users may be given access to more data
than they need for their role in the organization. The practice is called over-
permission, and data security software should be able to profile user behavior
to match permissions with their behavior.
7. Application security and patching: Relates to the practice of updating
software to the latest version promptly as patches or new updates are released.
8. Training: Employees should continually be trained on the best practices in
data security. They can include training on password use, threat detection, and
social engineering attacks. Employees who are knowledgeable about data
security can enhance the organization’s role in safeguarding data.
__________________________________________________________________
Chapter ends…