UNIT II: INTRODUCTION TO ANALYTICS
Introduction to Analytics
• Data Analytics has a key role in improving your business. Here are four main factors which signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals, who take further actions to grow the business.
• Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Analysis of data allows businesses to improve their understanding of customer requirements and experience.
• Data Analytics refers to the techniques used to analyze data to enhance productivity and business gain. Data is extracted from various sources and is cleaned and categorized to analyze different behavioral patterns. The techniques and tools used vary according to the organization or individual.
2.1 Introduction to Tools and Environment
• R programming – This tool is the leading analytics tool used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac OS. It also provides tools to automatically install all packages as per user requirements.
• Python – Python is an open-source, object-oriented programming language which is easy to read, write and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras etc. It can also connect to data on platforms such as a SQL server, a MongoDB database or JSON.
• Tableau Public – This is free software that connects to any data source, such as Excel, a corporate Data Warehouse etc. It then creates visualizations, maps, dashboards etc. with real-time updates on the web.
• QlikView – This tool offers in-memory data processing with the
results delivered to the end-users quickly. It also offers data
association and data visualization with data being compressed to
almost 10% of its original size.
• SAS – A programming language and environment for data
manipulation and analytics, this tool is easily accessible and can
analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly used for clients' internal data, this tool analyzes tasks and summarizes the data with a preview of pivot tables.
• RapidMiner – A powerful, integrated platform that can integrate with many data source types such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase etc. This tool is mostly used for predictive analytics, such as data mining, text analytics and machine learning.
2.2 Introduction to Tools and Environment
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform which allows you to analyze and model data. With the benefit of visual programming, KNIME provides a platform for reporting and integration through its modular data pipeline concept.
• OpenRefine – Also known as Google Refine, this data cleaning software helps you clean up data for analysis. It is used for cleaning messy data, transforming data and parsing data from websites.
• Apache Spark – A large-scale data processing engine, this tool executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development.
2.3 Application of Modelling in Business & Need for Business Modelling
• Using big data as a fundamental factor in decision making requires new capabilities; most firms are still far from being able to access all of their data resources. Companies in various sectors have acquired crucial insights from the structured data collected from different enterprise systems and analyzed by commercial database management systems.
• Eg:
• 1.) Facebook and Twitter are used to gauge the instantaneous influence of campaigns and to examine consumer opinion about products.
• 2.) Some companies, like Amazon, eBay and Google, considered early leaders, examine the factors that control performance to determine what raises sales revenue and user interactivity.
2.3.1 Utilizing Hadoop in Big Data Analytics
Hadoop is an open-source software platform that enables the processing of large data sets in a distributed computing environment. Studies of Hadoop discuss concepts of big data and the rules for building, organizing and analyzing huge data sets in a business environment; they propose three architecture layers and indicate graphical tools to explore and represent unstructured data, and they describe how famous companies could improve their business. Eg: Google, Twitter and Facebook show their interest in processing big data within cloud environments.
• The Map() step: Each worker node applies the Map() function to the local data and writes the output to temporary storage. The Map() code is run exactly once for each K1 key value, generating output that is organized by key values K2. A master node arranges it so that only one of any redundant copies of the input data is processed.
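To make the Map()/shuffle/Reduce flow concrete, here is a minimal word-count sketch in plain Python; the function names and sample input are illustrative, not part of Hadoop's API:

    from collections import defaultdict

    def map_fn(k1, line):
        # Map(): run once per (K1, record); emit intermediate (K2, value) pairs
        for word in line.split():
            yield word, 1

    def shuffle(pairs):
        # The framework groups all intermediate values by their K2 key
        groups = defaultdict(list)
        for k2, value in pairs:
            groups[k2].append(value)
        return groups.items()

    def reduce_fn(k2, values):
        # Reduce(): combine all values that share one K2 key
        return k2, sum(values)

    lines = ["big data needs big tools", "big data"]
    pairs = [p for k1, line in enumerate(lines) for p in map_fn(k1, line)]
    print(dict(reduce_fn(k2, vs) for k2, vs in shuffle(pairs)))
    # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1}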
Types of Data and Variables
• In Nominal Data there is no natural ordering of the values of the attribute. Eg: colour, gender, nouns (name, place, animal, thing). These categories cannot be ranked or ordered.
• Quantitative data (discrete or continuous data) can be further divided into two types: discrete attributes and continuous attributes.
• A Discrete Attribute takes only a finite number of numerical values (integers). Eg: number of buttons, number of days for product delivery etc. These data can be represented at every specific interval in the case of time series data mining, or even in ratio-based entries.
• A Continuous Attribute takes an infinite number of fractional values. Eg: price, discount, height, weight, length, temperature, speed etc. These data can be represented at every specific interval in the case of time series data mining, or even in ratio-based entries.
Fig 2.5 Types of Data Variables
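A minimal pandas sketch of the three kinds of attributes (the column names and values are illustrative):

    import pandas as pd

    df = pd.DataFrame({
        "colour": ["red", "blue", "green"],   # nominal: no natural ordering
        "buttons": [2, 4, 3],                 # discrete: finite integer counts
        "price": [19.99, 45.50, 32.00],       # continuous: fractional values
    })
    df["colour"] = df["colour"].astype("category")  # mark nominal data explicitly
    print(df.dtypes)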
2.5 Data Modelling Techniques
• Data modelling is the process through which data is stored structurally, in a defined format, in a database. Data modelling is important because it enables organizations to make data-driven decisions and meet varied business goals.
• The entire process of data modelling is not as easy as it seems, though. You are required to have a deep understanding of the structure of an organization and then propose a solution that aligns with its end goals and helps it achieve the desired objectives.
Types of Data Models
• Hierarchical Model
• Relational Model (see the sketch after this list)
• Network Model
• Object Oriented Model
• Entity-Relationship Model
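As a concrete illustration of the relational model, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are invented for the example): data lives in tables, and relationships between tables are expressed through shared keys.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES customer(id),
                             amount REAL);
        INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
        INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.0), (12, 2, 40.0);
    """)
    # Join the two relations through the shared key to answer a question
    for row in con.execute("""SELECT c.name, SUM(o.amount)
                              FROM customer c JOIN orders o
                              ON o.customer_id = c.id
                              GROUP BY c.name"""):
        print(row)  # e.g. ('Asha', 349.0)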
Why is data modeling important?
• A clear representation of data makes it easier to analyze the
data properly. It provides a quick overview of the data which
can then be used by the developers in varied applications.
• Data modeling represents the data properly in a model. It
rules out any chances of data redundancy and omission. This
helps in clear analysis and processing.
• Data modeling improves data quality and enables the
concerned stakeholders to make data-driven decisions.
Best Data Modeling Practices to Drive
Your Key Business Decisions
• Have a clear understanding of your end-goals and results
• Have a clear understanding of your organization’s requirements and organize your
data properly.
• Keep it sweet and simple and scale as you grow
• Keep your data models simple. The best data modeling practice here is to use a tool
which can start small and scale up as needed.
• Organize your data based on facts, dimensions, filters, and order
• It is highly recommended to organize your data properly using individual tables for
facts and dimensions to enable quick analysis.
• Have a clear opinion on how many datasets you want to keep. Maintaining more than what is actually required wastes your data modeling effort and leads to performance issues.
Imputation using mean / median values – missing numerical values are replaced with the mean (or median) of the corresponding column, as shown in the tables below.
Disadvantages:
• Does not work with categorical attributes
• Does not account for the correlations between columns
• Not very accurate
• Does not account for any uncertainty in the data
Before imputation:

SNo   Column 1   Column 2   Column 3
1     3          6          NAN
2     5          10         12
3     6          11         15
4     NAN        12         14
5     6          NAN        NAN
6     10         13         16

After mean imputation:

SNo   Column 1   Column 2   Column 3
1     3          6          9.5
2     5          10         12
3     6          11         15
4     5          12         14
5     6          8.66       9.5
6     10         13         16
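A minimal sketch of mean imputation with pandas on the same numbers (note that pandas computes each column mean over the non-missing entries only, so the filled values can differ from a hand calculation that divides by the total row count):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Column 1": [3, 5, 6, np.nan, 6, 10],
        "Column 2": [6, 10, 11, 12, np.nan, 13],
        "Column 3": [np.nan, 12, 15, 14, np.nan, 16],
    })
    # Replace every NaN with the mean of its own column
    print(df.fillna(df.mean()))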
• Imputation using (most frequent) or (zero / constant) values – this can be used for categorical attributes.
• Disadvantages:
• Does not account for the correlations between columns
• Creates bias in the data
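A hedged sketch of most-frequent imputation using scikit-learn's SimpleImputer (the colour values are illustrative):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)
    # Replace the missing entry with the most frequent value in the column
    imputer = SimpleImputer(strategy="most_frequent")
    print(imputer.fit_transform(X))  # the NaN becomes "red"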
• Imputation using KNN
• It creates a basic mean impute, then uses the resulting complete list to construct a KDTree. It then uses this KDTree to compute the nearest neighbours (NN). After it finds the k NNs, it takes their weighted average.
• k nearest neighbours is an algorithm used for simple classification. The algorithm uses 'feature similarity' to predict the values of any new data points. This means that a new point is assigned a value based on how closely it resembles the points in the training set. This is very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the missing entries based on the non-missing values in the neighbourhood.
• Advantage: This method is more accurate than mean, median or mode imputation.
• Disadvantage: Sensitive to outliers.
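The description above matches KDTree-based implementations; as a sketch of the same idea, scikit-learn's KNNImputer fills each missing entry from the nearest complete neighbours (n_neighbors and weights are example choices, not values fixed by the text):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[3, 6, np.nan],
                  [5, 10, 12],
                  [6, 11, 15],
                  [np.nan, 12, 14],
                  [6, np.nan, np.nan],
                  [10, 13, 16]])
    # Each NaN is replaced by the distance-weighted average of that feature
    # over the k nearest rows (distances use the features both rows share)
    imputer = KNNImputer(n_neighbors=2, weights="distance")
    print(imputer.fit_transform(X))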