UNIT II: INTRODUCTION TO ANALYTICS
Introduction to Analytics
• Data Analytics has a key role in improving your business. Here are four main factors which signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals, who take further actions to grow the business.
• Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Analysis of data allows businesses to improve their understanding of customer requirements and experience.
• Data Analytics refers to the techniques used to analyze data to enhance productivity and business gain. Data is extracted from various sources and is cleaned and categorized to analyze different behavioral patterns. The techniques and tools used vary according to the organization or individual.
2.1 Introduction to Tools and Environment
• R programming – This tool is the leading analytics tool used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac OS. It also provides tools to automatically install all packages as per user requirements.
• Python – Python is an open-source, object-oriented programming language which is easy to read, write and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras etc. It can also connect to data on platforms such as a SQL server, a MongoDB database or JSON.
• Tableau Public – This is free software that connects to any data source, such as Excel, a corporate Data Warehouse etc. It then creates visualizations, maps, dashboards etc. with real-time updates on the web.
• QlikView – This tool offers in-memory data processing with the
results delivered to the end-users quickly. It also offers data
association and data visualization with data being compressed to
almost 10% of its original size.
• SAS – A programming language and environment for data
manipulation and analytics, this tool is easily accessible and can
analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly used for clients' internal data, this tool analyzes tasks and summarizes the data with a preview of pivot tables.
• RapidMiner – A powerful, integrated platform that can integrate with many data source types such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase etc. This tool is mostly used for predictive analytics, such as data mining, text analytics and machine learning.
2.2 Introduction to Tools and Environment
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform which allows you to analyze and model data. With the benefit of visual programming, KNIME provides a platform for reporting and integration through its modular data pipeline concept.
• OpenRefine – Also known as Google Refine, this data cleaning software helps you clean up data for analysis. It is used for cleaning messy data, transforming data and parsing data from websites.
• Apache Spark – A large-scale data processing engine, this tool executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development.
2.3 Application of Modelling in Business & Need for Business Modelling
• Using big data as a fundamental factor in decision making requires new capabilities; most firms are still far from being able to access all of their data resources. Companies in various sectors have acquired crucial insights from the structured data collected from different enterprise systems and analyzed by commercial database management systems.
• Eg:
• 1.) Facebook and Twitter are used to gauge the instantaneous influence of campaigns and to examine consumer opinion about products.
• 2.) Some companies, like Amazon, eBay and Google, considered early leaders, examine the factors that control performance to determine what raises sales revenue and user interactivity.
2.3.1 Utilizing Hadoop in Big Data Analytics
Hadoop is an open-source software platform that enables the processing of large data sets in a distributed computing environment. Studies of Hadoop discuss concepts of big data and the rules for building, organizing and analyzing huge data sets in a business environment; they propose three architecture layers and indicate graphical tools to explore and represent unstructured data, and they describe how famous companies could improve their business. Eg: Google, Twitter and Facebook show their interest in processing big data within cloud environments.
• The Map() step: Each worker node applies the Map() function to the local data and writes the output to temporary storage. The Map() code is run exactly once for each K1 key value, generating output that is organized by key values K2. A master node arranges it so that only one of any redundant copies of the input data is processed.
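To make the Map()/shuffle/Reduce flow concrete, here is a minimal word-count sketch in plain Python; the function names and sample input are illustrative, not part of Hadoop's API:

    from collections import defaultdict

    def map_fn(k1, line):
        # Map(): run once per (K1, record); emit intermediate (K2, value) pairs
        for word in line.split():
            yield word, 1

    def shuffle(pairs):
        # The framework groups all intermediate values by their K2 key
        groups = defaultdict(list)
        for k2, value in pairs:
            groups[k2].append(value)
        return groups.items()

    def reduce_fn(k2, values):
        # Reduce(): combine all values that share one K2 key
        return k2, sum(values)

    lines = ["big data needs big tools", "big data"]
    pairs = [p for k1, line in enumerate(lines) for p in map_fn(k1, line)]
    print(dict(reduce_fn(k2, vs) for k2, vs in shuffle(pairs)))
    # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1}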
Types of Data and Variables
• In Nominal Data there is no natural ordering of the values of the attribute. Eg: colour, gender, nouns (name, place, animal, thing). These categories cannot be ranked or ordered.
• Quantitative data (discrete or continuous data) can be further divided into two types: discrete attributes and continuous attributes.
• A Discrete Attribute takes only a finite number of numerical values (integers). Eg: number of buttons, number of days for product delivery etc. These data can be represented at every specific interval in the case of time series data mining, or even in ratio-based entries.
• A Continuous Attribute takes an infinite number of fractional values. Eg: price, discount, height, weight, length, temperature, speed etc. These data can be represented at every specific interval in the case of time series data mining, or even in ratio-based entries.
Fig 2.5 Types of Data Variables
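A minimal pandas sketch of the three kinds of attributes (the column names and values are illustrative):

    import pandas as pd

    df = pd.DataFrame({
        "colour": ["red", "blue", "green"],   # nominal: no natural ordering
        "buttons": [2, 4, 3],                 # discrete: finite integer counts
        "price": [19.99, 45.50, 32.00],       # continuous: fractional values
    })
    df["colour"] = df["colour"].astype("category")  # mark nominal data explicitly
    print(df.dtypes)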
2.5 Data Modelling Techniques
• Data modelling is the process through which data is stored structurally, in a defined format, in a database. Data modelling is important because it enables organizations to make data-driven decisions and meet varied business goals.
• The entire process of data modelling is not as easy as it seems, though. You are required to have a deep understanding of the structure of an organization and then propose a solution that aligns with its end goals and helps it achieve the desired objectives.
Types of Data Models
• Hierarchical Model
• Relational Model (see the sketch after this list)
• Network Model
• Object Oriented Model
• Entity-Relationship Model
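As a concrete illustration of the relational model, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are invented for the example): data lives in tables, and relationships between tables are expressed through shared keys.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES customer(id),
                             amount REAL);
        INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
        INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.0), (12, 2, 40.0);
    """)
    # Join the two relations through the shared key to answer a question
    for row in con.execute("""SELECT c.name, SUM(o.amount)
                              FROM customer c JOIN orders o
                              ON o.customer_id = c.id
                              GROUP BY c.name"""):
        print(row)  # e.g. ('Asha', 349.0)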
Why is data modeling important?
• A clear representation of data makes it easier to analyze the
data properly. It provides a quick overview of the data which
can then be used by the developers in varied applications.
• Data modeling represents the data properly in a model. It
rules out any chances of data redundancy and omission. This
helps in clear analysis and processing.
• Data modeling improves data quality and enables the
concerned stakeholders to make data-driven decisions.
Best Data Modeling Practices to Drive
Your Key Business Decisions
• Have a clear understanding of your end-goals and results
• Have a clear understanding of your organization’s requirements and organize your
data properly.
• Keep it sweet and simple and scale as you grow
• Keep your data models simple. The best data modeling practice here is to use a tool
which can start small and scale up as needed.
• Organize your data based on facts, dimensions, filters, and order
• It is highly recommended to organize your data properly using individual tables for
facts and dimensions to enable quick analysis.
• Have a clear opinion on how many datasets you want to keep. Maintaining more than what is actually required wastes your data modeling effort and leads to performance issues.
Imputation using mean / median values – missing numerical values are replaced with the mean (or median) of the corresponding column, as shown in the tables below.
Disadvantages:
• Does not work with categorical attributes
• Does not account for the correlations between columns
• Not very accurate
• Does not account for any uncertainty in the data
Before imputation:

SNo   Column 1   Column 2   Column 3
1     3          6          NAN
2     5          10         12
3     6          11         15
4     NAN        12         14
5     6          NAN        NAN
6     10         13         16

After mean imputation:

SNo   Column 1   Column 2   Column 3
1     3          6          9.5
2     5          10         12
3     6          11         15
4     5          12         14
5     6          8.66       9.5
6     10         13         16
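A minimal sketch of mean imputation with pandas on the same numbers (note that pandas computes each column mean over the non-missing entries only, so the filled values can differ from a hand calculation that divides by the total row count):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Column 1": [3, 5, 6, np.nan, 6, 10],
        "Column 2": [6, 10, 11, 12, np.nan, 13],
        "Column 3": [np.nan, 12, 15, 14, np.nan, 16],
    })
    # Replace every NaN with the mean of its own column
    print(df.fillna(df.mean()))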
• Imputation using (most frequent) or (zero / constant) values – this can be used for categorical attributes.
• Disadvantages:
• Does not account for the correlations between columns
• Creates bias in the data
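A hedged sketch of most-frequent imputation using scikit-learn's SimpleImputer (the colour values are illustrative):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)
    # Replace the missing entry with the most frequent value in the column
    imputer = SimpleImputer(strategy="most_frequent")
    print(imputer.fit_transform(X))  # the NaN becomes "red"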
• Imputation using KNN
• It creates a basic mean impute, then uses the resulting complete list to construct a KDTree. It then uses this KDTree to compute the nearest neighbours (NN). After it finds the k NNs, it takes their weighted average.
• k nearest neighbours is an algorithm used for simple classification. The algorithm uses 'feature similarity' to predict the values of any new data points. This means that a new point is assigned a value based on how closely it resembles the points in the training set. This is very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the missing entries based on the non-missing values in the neighbourhood.
• Advantage: This method is more accurate than mean, median or mode imputation.
• Disadvantage: Sensitive to outliers.
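The description above matches KDTree-based implementations; as a sketch of the same idea, scikit-learn's KNNImputer fills each missing entry from the nearest complete neighbours (n_neighbors and weights are example choices, not values fixed by the text):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[3, 6, np.nan],
                  [5, 10, 12],
                  [6, 11, 15],
                  [np.nan, 12, 14],
                  [6, np.nan, np.nan],
                  [10, 13, 16]])
    # Each NaN is replaced by the distance-weighted average of that feature
    # over the k nearest rows (distances use the features both rows share)
    imputer = KNNImputer(n_neighbors=2, weights="distance")
    print(imputer.fit_transform(X))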