

Advanced Data Analytics

Term: 2024-25
Unit-1

Text Books:
1. Data Mining: Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers

Unit-1 Syllabus
Overview of Data Analytics: Introduction and Importance, Types of Data Analytics, Applications. Data Management: Design Data Architecture and manage the data for analysis, understand various sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Preprocessing & Processing.

Introduction
Data analytics is the process of examining data sets to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software.

Data analytics helps organizations make informed decisions, improve operational efficiency, and gain competitive advantages by uncovering patterns, correlations, and insights from raw data.
 Big data analytics has enabled companies to gain deeper insights into their customers, operations, and markets. It also laid the foundation for advanced analytics techniques such as predictive analytics, machine learning, and artificial intelligence.
 Key Milestones: development of relational databases, introduction of big data technologies, rise of AI and machine learning
The Evolving Role of Data Analytics
 From early statistical methods to advanced machine learning
algorithms, data analytics has evolved significantly.
 Data analytics has become a critical component in modern
decision-making processes.
 Its role has evolved from basic data reporting to advanced
predictive and prescriptive analytics.
Past:
Descriptive Analytics (Reporting, Historical Data)
Present:
Diagnostic Analytics (Root Cause Analysis, Understanding Why)
Future:
● Predictive Analytics (Forecasting, Predicting Future Outcomes)
● Prescriptive Analytics (Recommending Actions, Optimization)
Importance of Data Analytics
● Enhancing Decision-Making: Data-driven decisions reduce risks and
increase the likelihood of successful outcomes.
● Driving Business Value: Identifying new revenue opportunities,
optimizing operations, and enhancing customer experiences.
● Improving Efficiency: Streamlining processes, reducing waste, and
improving resource utilization.

Types of Data Analytics
Data analytics is the practice of examining raw data to identify trends, draw conclusions, and extract meaningful information. It involves various techniques and tools to process and transform data into valuable insights that can be used for decision-making.

There are four major types of data analytics:

● Descriptive (business intelligence and data mining)
● Predictive (forecasting)
● Prescriptive (optimization and simulation)
● Diagnostic analytics
Descriptive (business intelligence and data mining)
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It examines past performance by mining historical data to understand the causes of success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.

Descriptive analytics is the first step in data analysis. Its goal is to find out what happened:
 What was the average revenue for the month of January?
 How many children between the ages of two and ten attend school?
 It is the first layer of information that you can get from the data.
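As a minimal illustration in R (the sales data frame below is invented for this example), the January question reduces to a simple aggregation over historical records:

# Hypothetical monthly sales records
sales <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Feb"),
  revenue = c(12000, 15000, 9000, 11000)
)

# Descriptive question: what was the average revenue for January?
mean(sales$revenue[sales$month == "Jan"])  # 13500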
Predictive
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events.

Techniques used for predictive analytics are:

● Linear Regression
● Time Series Analysis and Forecasting
● Data Mining
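A minimal sketch of the first technique, linear regression, in R; the advertising data below is invented for illustration:

# Hypothetical historical data: advertising spend vs. sales
ads <- data.frame(
  spend = c(10, 20, 30, 40, 50),
  sales = c(25, 41, 58, 79, 96)
)

# Fit a simple linear regression model on the historical facts
model <- lm(sales ~ spend, data = ads)

# Predict the probable outcome for a new spend level
predict(model, newdata = data.frame(spend = 60))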
Prescriptive
 Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests decision options to take advantage of that prediction.
 Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option.
 Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, it can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of each option.
 Prescriptive analytics can also benefit areas such as healthcare strategic planning.
Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or solve a problem, looking for dependencies and patterns in the historical data of the particular problem.

For example, companies favor this analysis because it gives great insight into a problem, and because they already keep detailed information at their disposal; otherwise, data would have to be collected individually for every problem, which would be very time-consuming. Common techniques used for diagnostic analytics are:
● Data discovery
● Data mining
● Correlations
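As a small sketch of the last technique in R (the data frame is hypothetical), a correlation across historical series can act as a quick diagnostic signal:

# Hypothetical history: did marketing spend move with sales?
history <- data.frame(
  marketing = c(5, 8, 6, 9, 12, 11),
  sales     = c(50, 70, 62, 85, 110, 98)
)

# Correlation between the two historical series
cor(history$marketing, history$sales)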
Applications of Data Analytics

Data Architecture Design and Data Management
 Most data is generated from social media sites like Facebook, Instagram, Twitter, etc., and the other sources can be e-business and e-commerce transactions, hospital, school, and bank data, etc. This data is impossible to manage with traditional data-storage techniques, so Big Data came into existence for handling data which is big and impure.

 Big Data is the field in which enterprises collect large data sets from various sources like social media, GPS, sensors, etc., analyze them systematically, and extract useful patterns using appropriate tools and techniques. Before analyzing the data, the data architecture must be designed by the architect.
 Data architecture design is a set of policies, rules, models, and standards that governs what type of data is collected, from where it is collected, how the collected data is arranged, and how that data is stored, utilized, and secured in the systems and data warehouses for further analysis.

 Data is one of the essential pillars of enterprise architecture, through which an enterprise succeeds in the execution of its business strategy.

 Data architecture design is important for creating a vision of the interactions occurring between data systems. For example, if a data architect wants to implement data integration, it will need interaction between two systems, and by using data architecture, a visionary model of the data interaction during the process can be achieved.
Data architecture also describes the type of data structures applied to manage data, and it provides an easy way for data preprocessing. The data architecture is formed by dividing it into three essential models, which are then combined:
● Conceptual model –
It is a business model which uses the Entity Relationship (ER) model to describe the relations between entities and their attributes.
● Logical model –
It is a model where problems are represented in the form of logic, such as rows and columns of data, classes, XML tags, and other DBMS techniques.
● Physical model –
The physical model holds the database design, such as which type of database technology will be suitable for the architecture.
A data architect is responsible for the design, creation, management, and deployment of the data architecture and defines how data is to be stored and retrieved; other decisions are made by internal bodies.
Factors that influence Data Architecture:
A few influences that can have an effect on data architecture are business policies, business requirements, the technology in use, business economics, and data processing needs.

● Business requirements
● Business policies
● Technology in use
● Business economics
● Data processing needs – these include factors such as mining of the data, large continuous transactions, database management, and other data-processing operations.
Data Management
● Data management is the process of managing tasks like extracting data, storing data, transferring data, processing data, and then securing data with low-cost consumption.

● The main motive of data management is to manage and safeguard people's and organizations' data in an optimal way, so that they can easily create, access, delete, and update the data.

● Data management is an essential process in the growth of each and every enterprise; without it, policies and decisions can't be made for business advancement. The better the data management, the better the productivity in business.
● Large volumes of data, like big data, are harder to manage traditionally, so there must be the utilization of optimal technologies and tools for data management, such as Hadoop, Scala, Tableau, AWS, etc., which can further be used for big data analysis.

● Data management can be achieved through the necessary training of employees and through maintenance by DBAs, data analysts, and data architects.
Data Preprocessing Techniques
● Data Integration − Data integration involves combining multiple
datasets with similar variables or structures. R provides functions
like merge() and rbind() to merge datasets based on common
identifiers or variables. Proper data integration ensures a unified
dataset for analysis.
● Data Transformation − Data transformation involves converting
raw data into a suitable format for analysis. R provides functions
like scale(), log() or sqrt() to normalize or transform skewed data
distributions. These transformations help meet the assumptions of
statistical models and improve interpretability.
● Feature Selection − Feature selection aims to identify the most relevant variables for analysis. R offers techniques like correlation analysis, stepwise regression, or regularization methods (e.g., Lasso regression) to select important features.
● Encoding Categorical Variables − Categorical variables often
require encoding to numerical representations for analysis. R offers
functions like factor() or dummyVars() to convert categorical
variables into binary or numerical representations. This process
enables the inclusion of categorical variables in statistical models.
● Handling Imbalanced Data − Imbalanced datasets, where one class dominates the others, can lead to biased predictions or poor model performance. R provides techniques such as oversampling (e.g., SMOTE) or undersampling to balance the dataset and improve model training.
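A short sketch combining three of the steps above with the base R functions just named (the data frames and column names are hypothetical):

# Hypothetical datasets sharing a common identifier
orders    <- data.frame(id = 1:4, amount = c(250, 40, 900, 120))
customers <- data.frame(id = 1:4, segment = c("retail", "retail", "corporate", "retail"))

# Data integration: merge on the common identifier
combined <- merge(orders, customers, by = "id")

# Data transformation: log-transform the skewed amount column
combined$log_amount <- log(combined$amount)

# Encoding: convert the categorical segment column to a factor
combined$segment <- factor(combined$segment)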

R Packages for Data Cleaning and Preprocessing
● tidyverse − The tidyverse is a collection of R packages, including dplyr, tidyr, and stringr, that provide powerful tools for data manipulation, cleaning, and tidying. These packages offer a consistent and intuitive syntax for transforming and cleaning data.
● caret − The caret package (Classification and Regression Training) in R provides functions for data preprocessing, feature selection, and resampling techniques. It offers a comprehensive set of tools for preparing data for machine learning algorithms.
● dataPreparation − The dataPreparation package in R provides a wide range of functions for data cleaning, transformation, and preprocessing. It offers functionalities like missing value imputation, outlier detection, feature scaling, and more.
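As one example of the caret workflow (a sketch; the data frame here is hypothetical), preProcess() learns transformation parameters that predict() then applies:

library(caret)

# Hypothetical numeric data on two very different scales
df <- data.frame(age    = c(23, 45, 31, 52),
                 income = c(25000, 64000, 41000, 90000))

# Learn centering and scaling parameters from the data
pp <- preProcess(df, method = c("center", "scale"))

# Apply the learned transformation
scaled_df <- predict(pp, df)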
Data Preprocessing
The data preprocessing process is divided into the following steps:

● Importing the dataset.
● Completing missing data.
● Encoding categorical data.
● Splitting the dataset.
● Feature scaling.

The command read.csv('filename') receives different optional parameters; you will have to use some of them depending on how your dataset is arranged in the .csv file. You can set the sep parameter to indicate the separator used in your file.
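For instance, a minimal sketch (the file name Data.csv is a placeholder):

# Comma-separated file (the default separator)
dataset <- read.csv('Data.csv')

# Semicolon-separated file: set sep explicitly
dataset <- read.csv('Data.csv', sep = ';')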

Completing Missing Data
● Completing missing data is optional; if your dataset is complete, you will not have to do this part. But sometimes you will find datasets with some missing cells, in which case you can do one of two things:
● Remove the complete row (not recommended, as you could delete crucial information).
● Complete the missing information with the mean of the column.
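A minimal sketch of the second option, assuming a small hypothetical dataset with Age and Income columns:

# Hypothetical dataset with missing cells
dataset <- data.frame(Age    = c(24, NA, 31, 45),
                      Income = c(32000, 48000, NA, 60000))

# Replace each missing value with the mean of its column
dataset$Age <- ifelse(is.na(dataset$Age),
                      mean(dataset$Age, na.rm = TRUE),
                      dataset$Age)
dataset$Income <- ifelse(is.na(dataset$Income),
                         mean(dataset$Income, na.rm = TRUE),
                         dataset$Income)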

Encoding Categorical Data
● This step is also optional. Depending on your dataset, you might have, from the beginning, a dataset with already-encoded categorical data. In that case you won't need to do this.
● In our case, we have the Graduate column, which has two possible values, either yes or no. In order to work with this data, we have to encode it, that is, change the labels to numbers.
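A sketch of one way to do this, assuming the Graduate column holds the strings 'No' and 'Yes':

# Map the two labels to the numbers 0 and 1
dataset$Graduate <- factor(dataset$Graduate,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))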

Splitting the Dataset
● This part is mandatory and one of the most important when working with machine learning models.
● Splitting the dataset means dividing the whole dataset into two parts: the training set and the test set. When you want to train a model to solve or predict a specific thing, you first have to train your model and then test whether the model is making correct predictions.
● Normally the proportion is 80% training set and 20% test set, but it can vary depending on your model. We will split the dataset with that proportion.
● You first have to install a package called caTools by doing the following:

install.packages('caTools')

Once installed, you have to tell R that you will use that library:
library(caTools)
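With the library loaded, sample.split() can divide the data; a sketch assuming Graduate is the target column from the earlier examples:

set.seed(123)  # make the random split reproducible

# 80% of rows go to the training set, 20% to the test set
split <- sample.split(dataset$Graduate, SplitRatio = 0.8)

training_set <- subset(dataset, split == TRUE)
test_set     <- subset(dataset, split == FALSE)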
Feature Scaling
● This last step is also not always necessary. In the dataset there are some values that are not on the same scale; for example, Age and Income have very different scales.
● Most machine learning models work using the Euclidean distance between two points, but since the scales are different, the distance between two points could be enormous and could cause problems in your model.
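A one-line sketch using base R's scale(), which centers each column and divides it by its standard deviation (column names assumed from the earlier examples):

# Standardize Age and Income so they share a scale
dataset[ , c("Age", "Income")] <- scale(dataset[ , c("Age", "Income")])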

Different Sources of Data for Data Analysis
● Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form like text, video, audio, XML files, records, or other image files, used in later stages of data analysis. In the process of big data analysis, data collection is the initial step before starting to analyze the patterns or useful information in the data. The data which is to be analyzed must be collected from different valid sources.

● The data which is collected is known as raw data, which is not useful by itself; after cleaning out the impurities and utilizing it for further analysis, it forms information, and the insight obtained from that information is known as 'knowledge'.
 Data collection starts with asking some questions, such as what type of data is to be collected and what the source of collection is.

 Most of the data collected is of two types: 'qualitative data', which is a group of non-numerical data such as words and sentences that mostly focus on the behavior and actions of a group, and 'quantitative data', which is in numerical form and can be calculated using different scientific tools and sampling data.

The actual data is then further divided mainly into two types, known as:
● Primary data
● Secondary data
Primary data
 Primary data refers to first-hand data gathered by the researchers themselves. Sources of primary data include surveys, observations, and experimental methods.

Secondary data
 Secondary data refers to data that was collected earlier by someone other than the researcher, for example from published reports, government records, websites, books, and journal articles.
Thank You
