
INTRODUCTION TO ANALYTICS
UNIT - 2
Introduction to Analytics
• Data Analytics has a key role in improving your business. Here are four main factors which signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights are gathered from data and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and passed on to the respective teams and individuals, who act on them to grow the business.
• Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Data analysis helps the business improve its understanding of customer requirements and the customer experience.
2.1 Introduction to Tools and Environment
• Data Analytics refers to the techniques used to analyze data to enhance productivity and business gain. Data is extracted from various sources and is cleaned and categorized to analyze different behavioral patterns. The techniques and tools used vary according to the organization or individual.
• R programming – The leading analytics tool, used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac OS. It also provides tools to automatically install all packages as per user requirements.
• Python – An open-source, object-oriented programming language which is easy to read, write and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas and Keras. It can also work with data from platforms such as a SQL Server database, a MongoDB database or JSON sources.
• Tableau Public – Free software that connects to any data source, such as Excel or a corporate data warehouse. It then creates visualizations, maps and dashboards with real-time updates on the web.
• QlikView – This tool offers in-memory data processing, with results delivered to end users quickly. It also offers data association and data visualization, with data compressed to almost 10% of its original size.
• SAS – A programming language and environment for data manipulation and analytics. This tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – One of the most widely used tools for data analytics. Mostly used for clients' internal data, it analyzes and summarizes data, with a preview via pivot tables.
2.2 Introduction to Tools and Environment
• RapidMiner – A powerful, integrated platform that can integrate with any data source type, such as Access, Excel, Microsoft SQL, Teradata, Oracle and Sybase. This tool is mostly used for predictive analytics: data mining, text analytics and machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform which allows you to analyze and model data. With the benefit of visual programming, KNIME provides a platform for reporting and integration through its modular data pipeline concept.
• OpenRefine – Also known as GoogleRefine, this data cleaning software helps you clean up data for analysis. It is used for cleaning messy data, transforming data and parsing data from websites.
• Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development.
2.3 Application of Modelling in Business & Need for Business Modelling
• Using big data as a fundamental factor in decision making requires new capabilities, and most firms are far from being able to access all of their data resources. Companies in various sectors have acquired crucial insight from the structured data collected from different enterprise systems and analyzed by commercial database management systems.
• Eg:
• 1.) Facebook and Twitter are used to gauge the instantaneous influence of campaigns and to examine consumer opinion about products.
• 2.) Some companies, like Amazon, eBay and Google, considered early leaders, examine the factors that control performance to determine what raises sales revenue and user interactivity.
2.3.1 Utilizing Hadoop in Big Data Analytics
• Hadoop is an open-source software platform that enables the processing of large data sets in a distributed computing environment. The literature discusses several concepts related to big data and the rules for building, organizing and analyzing huge data sets in a business environment; the authors offered a three-layer architecture and indicated some graphical tools to explore and represent unstructured data, and they specified how famous companies could improve their business. Eg: Google, Twitter and Facebook show their interest in processing big data within cloud environments.
Working of MapReduce
• The Map() step: Each worker node applies the Map() function to the local data and writes the output to a temporary storage space. The Map() code is run exactly once for each K1 key value, generating output that is organized by key values K2. A master node arranges it so that for redundant copies of input data, only one is processed.
• The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key value that each processor should work on, and provide that processor with all of the map-generated data associated with that key value, such that all data belonging to one key are located on the same worker node.
• The Reduce() step: Worker nodes process each group of output data (per key) in parallel, executing the user-provided Reduce() code; each function is run exactly once for each K2 key value produced by the map step.
• Produce the final output: The MapReduce system collects all of the reduce outputs and sorts them by K2 to produce the final outcome.
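To make the four phases concrete, here is a minimal single-process sketch in plain Python that imitates MapReduce for a word-count job. The tiny corpus and the function names (map_phase, shuffle_phase, reduce_phase) are illustrative assumptions for teaching, not Hadoop's actual distributed API.

```python
from collections import defaultdict

# Toy corpus standing in for the input splits handed to worker nodes.
documents = ["big data needs big tools", "hadoop processes big data"]

# Map() step: emit (K2, value) pairs - here (word, 1) - for each input record.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle() step: group every mapped value by its K2 key, so all values
# for one key end up on the same (simulated) worker node.
def shuffle_phase(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reduce() step: run the user-provided reduce code once per K2 key.
def reduce_phase(key, values):
    return key, sum(values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
# Final output: collect the reduce outputs and sort them by key (K2).
result = sorted(reduce_phase(k, v) for k, v in grouped.items())
print(result)  # [('big', 3), ('data', 2), ('hadoop', 1), ...]
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network; the dataflow, however, is exactly the one sketched here.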
2.3.2 The Employment of Big Data Analytics
• IBM is one of the prominent representatives: it offers many big data options that enable users to store, manage and analyze data through various resources, and it performs well in the business intelligence and healthcare areas. Compared with IBM, Microsoft has also shown powerful work in the area of cloud computing activities and techniques. Another example is Facebook and Twitter, which collect various data from users' profiles and use it to increase their revenue.
2.3.3 The Performance of Data-Driven Companies
• Big data analytics and business intelligence are closely related fields which have become widely significant in the business and academic areas. Companies are permanently trying to derive insight from the expanding three V's (variety, volume and velocity) to support decision making.
2.4 Databases
• A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS).
• Databases can be divided into various categories such as text databases, desktop database programs, relational database management systems (RDBMS), and NoSQL and object-oriented databases.
• A text database is a system that maintains a (usually large) text collection and provides fast and accurate access to it. Eg: text books, magazines, journals, manuals, etc.
• A desktop database is a database system that is made to run on a single computer or PC. These simpler data storage solutions are much more limited and constrained than larger data center or data warehouse systems, where primitive database software is replaced by sophisticated hardware and networking setups. Eg: Microsoft Excel, Microsoft Access, etc.
• A relational database (RDB) is a collective set of multiple data sets organized by tables, records and columns. RDBs establish a well-defined relationship between database tables. Tables communicate and share information, which facilitates data searchability, organization and reporting. Eg: SQL Server, Oracle, Db2, DBaaS, etc.
• NoSQL databases are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model; the main types are document, key-value, wide-column and graph. Eg: MongoDB, CouchDB and other stores of JSON documents.
• Object-oriented databases (OODB) are databases that represent data in the form of objects and classes. In object-oriented terminology, an object is a real-world entity, and a class is a collection of objects. Object-oriented databases follow the fundamental principles of object-oriented programming (OOP), as used in languages such as C++, Java, C#, Smalltalk and LISP.
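To make the relational/document contrast concrete, here is a small Python sketch showing one record in both shapes. The table name, field names and values are invented for illustration only.

```python
import json

# Relational view: fixed columns, one row per record.
# Hypothetical schema: customers(id, name, city)
relational_row = (101, "Asha", "Hyderabad")

# Document view: the same record as a self-describing JSON document,
# the general shape a store like MongoDB or CouchDB holds. Nested data
# (here, the customer's orders) lives inside the document, no join needed.
document = {
    "_id": 101,
    "name": "Asha",
    "city": "Hyderabad",
    "orders": [{"item": "laptop", "qty": 1}],
}

print(relational_row)
print(json.dumps(document, indent=2))
```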
2.5 Types of Data and Variables
• In a relational database management system we normally use rows to represent data and columns to represent attributes.
• In terms of big data we represent the columns from an RDBMS as attributes or variables. A variable can be divided into two types: categorical (qualitative) data, and continuous or discrete (quantitative) data.
• Qualitative data, or categorical data, is normally represented as a variable that holds characters. It is divided into two types: nominal data and ordinal data.
• In nominal data there is no natural ordering of the values of the attribute. Eg: color, gender, nouns (name, place, animal, thing). These categories cannot be placed in a meaningful order.
• Quantitative data (discrete or continuous data) can be further divided into two types: discrete attributes and continuous attributes.
• A discrete attribute takes only a finite number of numerical values (integers). Eg: number of buttons, number of days for product delivery, etc. These data can be represented at every specific interval in time series data mining, or even in ratio-based entries.
• A continuous attribute takes fractional (real-numbered) values. Eg: price, discount, height, weight, length, temperature, speed, etc. These data can be represented at every specific interval in time series data mining, or even in ratio-based entries.
Fig 2.5 Types of Data Variables
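The four variable types map directly onto column types in Pandas. Below is a minimal sketch with an invented toy dataset; the column names and values are assumptions for illustration.

```python
import pandas as pd

# One column per variable type discussed above.
df = pd.DataFrame({
    "color":   ["red", "blue", "red"],     # nominal: no natural ordering
    "size":    ["S", "M", "L"],            # ordinal: ordered categories
    "buttons": [2, 4, 3],                  # discrete: finite integer counts
    "price":   [199.99, 349.50, 249.00],   # continuous: fractional values
})

# Mark the categorical columns explicitly; give the ordinal one an order.
df["color"] = df["color"].astype("category")
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L"], ordered=True)

print(df.dtypes)         # category, category, int64, float64
print(df["size"].min())  # ordering makes comparisons meaningful -> 'S'
```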
2.5 Data Modelling Techniques
• Data modelling is the process through which data is stored structurally, in a defined format, in a database. Data modelling is important because it enables organizations to make data-driven decisions and meet varied business goals.
• The entire process of data modelling is not as easy as it seems, though. You are required to have a deep understanding of the structure of an organization and then propose a solution that aligns with its end goals and helps it achieve the desired objectives.
Types of Data Models
• Hierarchical Model
• Relational Model
• Network Model
• Object-Oriented Model
• Entity-Relationship Model
(Diagrams: Hierarchical Model, Relational Model, Network Model, Object-Oriented Model, Entity-Relationship Model.)
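As a stand-in for the missing diagrams, here is a brief Python sketch contrasting two of the models: the hierarchical model (a tree, each child has one parent) and the relational model (flat tables linked by keys). The department/employee data is invented for illustration.

```python
# Hierarchical model: records form a parent-child tree.
hierarchical = {
    "department": "Sales",
    "employees": [
        {"name": "Ravi", "projects": [{"title": "Q3 campaign"}]},
    ],
}

# Relational model: the same facts as flat tables linked by keys.
departments = [(1, "Sales")]              # (dept_id, name)
employees   = [(10, "Ravi", 1)]           # (emp_id, name, dept_id)
projects    = [(100, "Q3 campaign", 10)]  # (proj_id, title, emp_id)

# A relational query walks key relationships instead of the tree.
for emp_id, name, dept_id in employees:
    dept = next(d for d in departments if d[0] == dept_id)
    print(name, "works in", dept[1])
```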
Why is data modeling important?
• A clear representation of data makes it easier to analyze the data properly. It provides a quick overview of the data, which can then be used by developers in varied applications.
• Data modeling represents the data properly in a model. It reduces the chances of data redundancy and omission, which helps in clear analysis and processing.
• Data modeling improves data quality and enables the concerned stakeholders to make data-driven decisions.
Best Data Modeling Practices to Drive Your Key Business Decisions
• Have a clear understanding of your end goals and results: have a clear understanding of your organization's requirements and organize your data properly.
• Keep it sweet and simple, and scale as you grow: keep your data models simple. The best practice here is to use a tool which can start small and scale up as needed.
• Organize your data based on facts, dimensions, filters, and order: it is highly recommended to organize your data properly, using individual tables for facts and dimensions to enable quick analysis.
• Have a clear opinion on how many datasets you want to keep. Maintaining more than what is actually required wastes your data modeling effort and leads to performance issues.
• Keep cross-checking before continuing: it is best practice to maintain one-to-one or one-to-many relationships. A many-to-many relationship only introduces complexity into the system.
• Let them evolve: data models become outdated quicker than you expect. It is necessary that you keep them updated from time to time.
• The wrap-up: data modeling plays a crucial role in the growth of businesses, especially when it enables organizations to base their decisions on facts and figures. To achieve the varied business intelligence insights and goals, it is recommended to model your data correctly and use appropriate tools to ensure the simplicity of the system.
2.6 Missing Imputations
• In statistics, imputation is the process of replacing missing data with substituted values.
• Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with list-wise deletion of cases that have missing values.
• The simplest strategies are to do nothing to the missing data, or to fill the missing values in the dataset using the column mean or median.
• Advantages:
• Works well with numerical datasets.
• Very fast and reliable.
• Disadvantages:
• Does not work with categorical attributes.
• Does not account for correlations between columns.
• Not very accurate.
• Does not account for any uncertainty in the data.
Before imputation:

SNo | Column 1 | Column 2 | Column 3
1   | 3        | 6        | NaN
2   | 5        | 10       | 12
3   | 6        | 11       | 15
4   | NaN      | 12       | 14
5   | 6        | NaN      | NaN
6   | 10       | 13       | 16

After mean imputation (each NaN replaced by the mean of the observed values in its column):

SNo | Column 1 | Column 2 | Column 3
1   | 3        | 6        | 14.25
2   | 5        | 10       | 12
3   | 6        | 11       | 15
4   | 6        | 12       | 14
5   | 6        | 10.4     | 14.25
6   | 10       | 13       | 16
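The table above can be reproduced with a short Pandas sketch; note that Pandas' mean() skips missing values by default, which is what mean imputation assumes.

```python
import numpy as np
import pandas as pd

# The toy dataset from the table above, with NaN marking missing cells.
df = pd.DataFrame({
    "Column 1": [3, 5, 6, np.nan, 6, 10],
    "Column 2": [6, 10, 11, 12, np.nan, 13],
    "Column 3": [np.nan, 12, 15, 14, np.nan, 16],
})

# Mean imputation: fill each column's missing entries with the mean of
# that column's observed (non-missing) values.
imputed = df.fillna(df.mean())
print(imputed)
# Fills: Column 1 -> 6.0, Column 2 -> 10.4, Column 3 -> 14.25
```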
• Imputation using most-frequent or zero/constant values: this can be used for categorical attributes (see the sketch after this list).
• Disadvantages:
• Does not account for correlations between columns.
• Creates bias in the data.
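A minimal sketch of both strategies, using scikit-learn's SimpleImputer on an invented single-column categorical array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A categorical column with one missing entry (np.nan marks it).
colors = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)

# strategy="most_frequent" replaces missing entries with the column mode.
mode_imputer = SimpleImputer(strategy="most_frequent")
print(mode_imputer.fit_transform(colors).ravel())  # last entry becomes 'red'

# strategy="constant" replaces them with a fixed fill value instead.
const_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
print(const_imputer.fit_transform(colors).ravel())
```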
• Imputation using KNN: this creates a basic mean impute, then uses the resulting complete list to construct a KDTree. It then uses the KDTree to compute the nearest neighbours (NN). After it finds the k-NNs, it takes their weighted average.
• k nearest neighbours is an algorithm used for simple classification. The algorithm uses 'feature similarity' to predict the values of any new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute based on the non-missing values in that neighbourhood.
• Advantage: this method is much more accurate than mean, median and mode imputation.
• Disadvantage: sensitive to outliers.
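One common implementation of this idea is scikit-learn's KNNImputer, shown below on the same toy table used earlier; the choice of k=2 and distance weighting is an illustrative assumption, not a recommendation.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Same toy dataset as in the mean-imputation example, as a numeric array.
X = np.array([
    [3.0,    6.0,    np.nan],
    [5.0,    10.0,   12.0],
    [6.0,    11.0,   15.0],
    [np.nan, 12.0,   14.0],
    [6.0,    np.nan, np.nan],
    [10.0,   13.0,   16.0],
])

# Each missing cell is filled with the (here distance-weighted) average of
# that feature over the k nearest rows, measured on the observed features.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```

Unlike the column-mean fill, the values imputed here depend on which rows resemble the incomplete row, which is why KNN imputation tends to be more accurate but also more sensitive to outliers.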
