SRU ADA Unit-1
Term: 2024-25
Unit-1
Unit-1 Syllabus
Overview of Data Analytics: Introduction and Importance, Types
of Data Analytics, Applications. Data Management: Design Data
Architecture and manage the data for analysis, understand
various sources of Data like Sensors/Signals/GPS etc. Data
Management, Data Quality (noise, outliers, missing values,
duplicate data), and Data Processing & Preprocessing.
Introduction
Data Analytics is defined as the process of examining data
sets to draw conclusions about the information they contain,
increasingly with the aid of specialized systems and software.
The Evolving role of Data Analytics
From early statistical methods to advanced machine learning
algorithms, data analytics has evolved significantly.
Data analytics has become a critical component in modern
decision-making processes.
Its role has evolved from basic data reporting to advanced
predictive and prescriptive analytics.
Past:
● Descriptive Analytics (Reporting, Historical Data)
Present:
● Diagnostic Analytics (Root Cause Analysis, Understanding Why)
Future:
● Predictive Analytics (Forecasting, Predicting Future Outcomes)
● Prescriptive Analytics (Recommending Actions, Optimization)
Importance of Data Analytics
● Enhancing Decision-Making: Data-driven decisions reduce risks and
increase the likelihood of successful outcomes.
● Driving Business Value: Identifying new revenue opportunities,
optimizing operations, and enhancing customer experiences.
● Improving Efficiency: Streamlining processes, reducing waste, and
improving resource utilization.
Types of Data Analytics
Data analytics is the practice of examining raw data to identify
trends, draw conclusions, and extract meaningful information. It
involves various techniques and tools to process and transform data
into valuable insights that can be used for decision-making. The main
types are descriptive, diagnostic, predictive, and prescriptive
analytics.
Data Architecture Design and Data
Management
Most data today is generated by social media sites like Facebook,
Instagram, Twitter, etc.; other sources include e-business and
e-commerce transactions and hospital, school, and bank records. This
data is impossible to manage with traditional data-storage
techniques, so Big Data came into existence for handling data that
is large and impure.
Big Data is the field in which enterprises collect large data sets
from various sources like social media, GPS, sensors, etc., analyze
them systematically, and extract useful patterns using appropriate
tools and techniques. Before the data is analyzed, the data
architecture must be designed by the architect.
Data architecture design is a set of policies, rules, models, and
standards that governs what type of data is collected, where it is
collected from, how the collected data is arranged, and how that data
is stored, utilized, and secured in the systems and data warehouses
for further analysis.
● Business requirements
● Business policies
● Technology in use
● Business economics
● Data processing needs
Data Preprocessing Techniques
● Data Integration − Data integration involves combining multiple
datasets with similar variables or structures. R provides functions
like merge() and rbind() to merge datasets based on common
identifiers or variables. Proper data integration ensures a unified
dataset for analysis.
● Data Transformation − Data transformation involves converting
raw data into a suitable format for analysis. R provides functions
like scale(), log() or sqrt() to normalize or transform skewed data
distributions. These transformations help meet the assumptions of
statistical models and improve interpretability.
● Feature Selection − Feature selection aims to identify the most
relevant variables for analysis. R offers techniques like correlation
analysis, stepwise regression, or regularization methods (e.g., Lasso).
● Encoding Categorical Variables − Categorical variables often
require encoding to numerical representations for analysis. R offers
functions like factor() or dummyVars() to convert categorical
variables into binary or numerical representations. This process
enables the inclusion of categorical variables in statistical models.
● Handling Imbalanced Data − Imbalanced datasets, where one
class dominates over others, can lead to biased predictions or
model performance. R provides techniques such as oversampling
(e.g., SMOTE) or undersampling to balance the dataset and
improve model training.
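A minimal base-R sketch of three of the techniques above (integration with merge(), transformation with scale() and log(), and a factor() encoding); the data frames and column names here are invented for illustration:

```r
# Integration: combine two toy data frames on a shared "id" column
customers <- data.frame(id = 1:4, age = c(23, 35, 41, 29))
orders    <- data.frame(id = c(1, 2, 4), amount = c(250, 90, 400))
combined  <- merge(customers, orders, by = "id")   # inner join on id

# Transformation: z-score the age column, log-transform the amounts
combined$age_scaled <- as.numeric(scale(combined$age))
combined$log_amount <- log(combined$amount)

# Encoding: a categorical variable as a factor with numeric codes
segment <- factor(c("yes", "no", "yes"), levels = c("no", "yes"))
as.numeric(segment) - 1   # "yes"/"no" mapped to 1/0
```

merge() keeps only the ids present in both data frames, which is usually what is wanted when joining on a common identifier.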
R Packages for Data Cleaning and
Preprocessing
● Tidyverse − Tidyverse is a collection of R packages, including dplyr,
tidyr, and stringr, that provide powerful tools for data manipulation,
cleaning, and tidying. These packages offer a consistent and intuitive
syntax for transforming and cleaning data.
● Caret − The caret package (Classification and Regression Training) in R
provides functions for data preprocessing, feature selection, and
resampling techniques. It offers a comprehensive set of tools for
preparing data for machine learning algorithms.
● dataPreparation − The dataPreparation package in R provides a wide
range of functions for data cleaning, transformation, and
preprocessing. It offers functionalities like missing value imputation,
outlier detection, feature scaling, and more.
Data Preprocessing
The data preprocessing process is divided into the following steps:
importing the dataset, completing missing data, encoding categorical
data, splitting the dataset into training and test sets, and feature
scaling.
The command read.csv('filename') receives different optional
parameters; you may have to use some of them depending on how your
dataset is arranged in the .csv file. You can set the sep parameter
to indicate the separator used in your file. For instance, a file
that uses semicolons instead of commas needs sep = ";".
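A small self-contained sketch of such a call (the file name and columns are invented; the script writes the file first so it can run anywhere):

```r
# Write a small semicolon-separated file, then read it back
writeLines(c("name;age;income",
             "Ana;23;52000",
             "Raj;35;61000"), "sample.csv")

# sep = ";" tells read.csv which separator the file uses
df <- read.csv("sample.csv", sep = ";")
str(df)   # 2 observations of 3 variables
```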
Completing Missing Data
● Completing missing data is optional. If your dataset is complete,
you obviously will not have to do this part. But sometimes you will
find datasets with some missing cells; in that case, you can do one
of two things:
● Remove a complete row (not recommended, you could delete
crucial information).
● Complete that missing information with the mean of the column.
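The second option, filling a missing cell with the mean of its column, might look like this in base R (the data is invented):

```r
# Toy data with a missing Income value
df <- data.frame(Age = c(23, 35, 41), Income = c(52000, NA, 61000))

# Replace NA entries with the mean of the non-missing values
df$Income <- ifelse(is.na(df$Income),
                    mean(df$Income, na.rm = TRUE),
                    df$Income)
df$Income   # the missing value is now 56500, the mean of 52000 and 61000
```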
Encoding Categorical Data
● This step is also optional. Depending on your dataset, you might
have from the beginning a dataset with already-encoded categorical
data. In that case you won't need to do this.
● In our case, we have the Graduate column, which has 2 possible
values, either yes or no. In order to be able to work with this
data, we have to encode it; that means changing the labels to
numbers.
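For a yes/no column like Graduate, the encoding can be done with factor() as the earlier slide suggests (the sample values here are invented):

```r
# Toy Graduate column with two possible labels
graduate <- c("yes", "no", "no", "yes")

# factor() assigns an integer code to each label ("no" = 1, "yes" = 2);
# subtracting 1 gives the usual 0/1 encoding
graduate_f   <- factor(graduate, levels = c("no", "yes"))
graduate_num <- as.numeric(graduate_f) - 1
graduate_num   # 1 0 0 1
```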
Splitting the Dataset
● This part is mandatory and one of the most important parts when
working with Machine Learning models.
● Splitting the dataset means dividing the whole dataset into two
parts, the training set and the test set. When you want to train a
model to solve or predict a specific thing, you first have to train
your model and then test whether the model is making correct
predictions.
● Normally the proportion is 80% training set and 20% test set, but
it can vary depending on your model. We will split the dataset with
that proportion.
● You first have to install a package called caTools by doing the
following:
install.packages('caTools')
Once installed, you have to tell R that you will use that library:
library(caTools)
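The split itself can be sketched as follows; this version uses base R's sample() in place of caTools::sample.split so it runs without the extra package, and the data frame is invented. The 80/20 proportion matches the slide above.

```r
# An 80/20 train/test split; base R's sample() stands in here for
# caTools::sample.split so the sketch needs no extra packages
set.seed(123)                                # reproducible split
df <- data.frame(x = 1:100, y = rnorm(100))  # toy dataset

train_idx    <- sample(seq_len(nrow(df)), size = 0.8 * nrow(df))
training_set <- df[train_idx, ]
test_set     <- df[-train_idx, ]

nrow(training_set)   # 80
nrow(test_set)       # 20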
Feature Scaling
● This last step is also not always necessary. In the dataset there
are some values that are not on the same scale; for example, Age
and Income have very different scales.
● Most Machine Learning models work using the Euclidean distance
between two points, but since the scales are different, the distance
between two points could be enormous and could cause problems
in your model.
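Standardizing such columns with scale(), as mentioned in the preprocessing techniques above, brings each one to mean 0 and standard deviation 1 (the Age and Income values here are invented):

```r
# Toy columns on very different scales
df <- data.frame(Age    = c(23, 35, 41, 29),
                 Income = c(52000, 61000, 87000, 44000))

# scale() centers each column to mean 0 and rescales it to sd 1,
# so Age and Income contribute comparably to distance computations
df_scaled <- as.data.frame(scale(df))

round(colMeans(df_scaled), 10)   # both column means are 0
apply(df_scaled, 2, sd)          # both column sds are 1
```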
Different Sources of Data for Data
Analysis
● Data collection is the process of acquiring, collecting, extracting,
and storing voluminous amounts of data, which may be in structured
or unstructured form like text, video, audio, XML files, records, or
other image files, for use in later stages of data analysis. In the
process of big data analysis, "data collection" is the initial step,
carried out before starting to analyze the data for patterns or
useful information. The data to be analyzed must be collected from
valid sources.
● The data collected is known as raw data, which is not useful yet;
cleaning out the impurities and utilizing that data for further
analysis turns it into information, and the insight obtained from
that information is known as "knowledge".
Data collection starts with asking some questions, such as what type
of data is to be collected and what the source of collection is.
Thank You