Unit 1 Notes Final Part C

What is Big Data Analytics?

Big Data analytics is a process used to extract meaningful insights, such as hidden
patterns, unknown correlations, market trends, and customer preferences. Big Data
analytics offers several advantages, such as better decision making and the prevention of
fraudulent activities, among other things.
Uses and Examples of Big Data Analytics
There are many different ways that Big Data analytics can be used in order to improve
businesses and organizations. Here are some examples:

• Using analytics to understand customer behaviour in order to optimize the
  customer experience
• Predicting future trends in order to make better business decisions
• Improving marketing campaigns by understanding what works and what doesn't
• Increasing operational efficiency by understanding where bottlenecks are and how to fix them
• Detecting fraud and other forms of misuse sooner

The Lifecycle Phases of Big Data Analytics


• Phase I Business Problem Definition

In this stage, the team learns about the business domain, which presents the
motivation and goals for carrying out the analysis. The problem is identified, and
assumptions are made about how much potential gain the company can expect after
carrying out the analysis. Important activities in this step include framing the
business problem as an analytics challenge that can be addressed in subsequent
phases. It helps the decision-makers understand the business resources that will be
required, and thereby the underlying budget needed to carry out the project.
Moreover, it can be determined whether the problem identified is a Big Data problem
or not, based on the business requirements in the business case. To qualify as a Big
Data problem, the business case should be directly related to one (or more) of the
characteristics of volume, velocity, or variety.

• Phase II Data Definition

Once the business case is identified, it is time to find appropriate datasets to work
with. In this stage, analysis is also done to see what other companies have done for
similar cases. Depending on the business case and the scope of analysis of the
project being addressed, the sources of datasets can be either internal or external
to the company. Internal datasets include data collected from internal sources,
such as feedback forms or existing software; external datasets include those
obtained from third-party providers.

• Phase III Data Acquisition and Filtration

Once the sources of data are identified, it is time to gather the data from them.
This data is mostly unstructured. It is then subjected to filtration, such as removal
of corrupt or irrelevant data that is of no use to the analysis objective. Here,
corrupt data means data with missing records or incompatible data types. After
filtration, a copy of the filtered data is stored and compressed, as it may be of use
in the future for some other analysis.

• Phase IV Data Extraction

Now the data is filtered, but some of the entries might still be incompatible with
the analysis. To rectify this issue, a separate phase, known as the data extraction
phase, is carried out. In this phase, the data that does not match the underlying
scope of the analysis is extracted and transformed into a compatible form.

• Phase V Data Munging

As mentioned in Phase III, the data is collected from various sources, which results
in the data being unstructured. There is a possibility that the data has constraints
that are unsuitable, which can lead to false results. Hence there is a need to clean
and validate the data. This includes removing invalid data and establishing complex
validation rules. There are many ways to validate and clean the data. For example, a
dataset might contain a few rows with null entries. If a similar dataset is available,
those entries are copied from it; otherwise, those rows are dropped.
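A minimal sketch of this null-handling rule, assuming pandas is available; the roll numbers, the marks column, and the "similar" reference dataset are hypothetical.

import pandas as pd

# Original dataset with some null (missing) marks.
primary = pd.DataFrame(
    {"roll_number": [101, 102, 103, 104], "marks": [78, None, 91, None]}
).set_index("roll_number")

# A similar dataset that happens to contain some of the missing entries.
similar = pd.DataFrame(
    {"roll_number": [102, 105], "marks": [85, 70]}
).set_index("roll_number")

# Copy missing entries from the similar dataset where a matching record exists,
# keeping only the rows that belonged to the original dataset ...
patched = primary.combine_first(similar).loc[primary.index]

# ... and drop the rows that are still incomplete.
cleaned = patched.dropna(subset=["marks"])
print(cleaned)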

• Phase VI Data Aggregation & Representation

The data is cleansed and validated against certain rules set by the enterprise. But
the data might be spread across multiple datasets, and it is not advisable to work
with multiple datasets. Hence, the datasets are joined together. For example, if
there are two datasets, namely a Student Academic dataset and a Student Personal
Details dataset, they can be joined via a common field, i.e. roll number. This phase
calls for intensive operations since the amount of data can be very large. Automation
can be brought in so that these steps are executed without human intervention.
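A minimal sketch of this joining step, assuming pandas; the Student Academic and Student Personal Details datasets and their columns are hypothetical.

import pandas as pd

academic = pd.DataFrame({"roll_number": [1, 2, 3], "grade": ["A", "B", "A"]})
personal = pd.DataFrame({"roll_number": [1, 2, 3], "city": ["Chennai", "Pune", "Delhi"]})

# Join the two datasets into a single one via the common roll_number field.
combined = academic.merge(personal, on="roll_number", how="inner")
print(combined)

In practice, when the datasets are too large for a single machine, the same join would run on a distributed engine, which is where the automation mentioned above comes in.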

• Phase VII Exploratory Data Analysis

Here comes the actual step: the analysis task. Depending on the nature of the Big
Data problem, analysis is carried out. Data analysis can be classified as
confirmatory analysis or exploratory analysis. In confirmatory analysis, an assumed
cause of a phenomenon is stated in advance; this assumption is called the
hypothesis. The data is then analyzed to confirm or refute the hypothesis.
This kind of analysis provides definitive answers to specific questions and
confirms whether an assumption was true or not. In exploratory analysis, the
data is explored to obtain information about why a phenomenon occurred. This type of
analysis answers "why" a phenomenon occurred. It does not provide definitive
answers; instead, it supports the discovery of patterns.
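A minimal sketch of confirmatory analysis, assuming SciPy; the hypothesis and the two samples are invented purely for illustration.

from scipy import stats

# Hypothesis stated in advance: the new checkout page yields higher order values.
old_page = [42.0, 38.5, 45.1, 40.2, 39.8, 41.7]
new_page = [47.3, 44.0, 49.5, 46.2, 45.8, 48.1]

# A two-sample t-test either supports or fails to support the hypothesis.
t_stat, p_value = stats.ttest_ind(new_page, old_page)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value supports the hypothesis; a large one fails to confirm it.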

• Phase VIII Data Visualization


Now we have the answers to some questions, using the information from the data
in the datasets. But these answers are still in a form that cannot be presented to
business users. Some form of representation is required to obtain value or
conclusions from the analysis. Hence, various tools are used to visualize the data in
graphic form, which can easily be interpreted by business users.
Visualization is said to influence the interpretation of the results. Moreover, it
allows users to discover answers to questions that are yet to be formulated.
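A minimal sketch of this step, assuming matplotlib; the regions and revenue figures are invented.

import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
revenue = [120, 95, 143, 110]  # hypothetical results of the analysis phase

# Turn the analysis output into a chart a business user can read at a glance.
plt.bar(regions, revenue)
plt.title("Quarterly revenue by region")
plt.ylabel("Revenue")
plt.savefig("revenue_by_region.png")  # shareable with business users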

• Phase IX Utilization of analysis results

The analysis is done and the results are visualized; now it is time for the business
users to make decisions utilizing the results. The results can be used for
optimization, to refine the business process. They can also be used as input to
systems to enhance performance.
Different Types of Big Data Analytics
Here are the four types of Big Data analytics:
1. Descriptive Analytics: What is happening now based on incoming data.

This summarizes past data into a form that people can easily read. This helps in
creating reports, like a company’s revenue, profit, sales, and so on. Also, it helps in
the tabulation of social media metrics.

Use Case: The Dow Chemical Company analyzed its past data to increase facility
utilization across its office and lab space. Using descriptive analytics, Dow was able
to identify underutilized space. This space consolidation helped the company save
nearly US $4 million annually.

2. Predictive Analytics: What might happen in the future.

This type of analytics looks into the historical and present data to make predictions of
the future. Predictive analytics uses data mining, AI, and machine learning to analyze
current data and make predictions about the future. It works on predicting customer
trends, market trends, and so on.

Use Case: PayPal determines what kind of precautions they have to take to protect
their clients against fraudulent transactions. Using predictive analytics, the company
uses all the historical payment data and user behavior data and builds an algorithm
that predicts fraudulent activities.
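A hedged sketch in the spirit of such fraud prediction (not PayPal's actual system), assuming scikit-learn; the features, transactions, and labels are invented.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical transactions: [amount, transactions_in_last_hour, is_new_device]
X = np.array([
    [20.0, 1, 0],
    [15.0, 2, 0],
    [950.0, 8, 1],
    [12.5, 1, 0],
    [700.0, 6, 1],
    [30.0, 2, 0],
])
y = np.array([0, 0, 1, 0, 1, 0])  # 1 = known fraudulent

# Learn from historical payment and behaviour data ...
model = LogisticRegression(max_iter=1000).fit(X, y)

# ... and score a new, unseen transaction.
new_tx = np.array([[800.0, 7, 1]])
print("fraud probability:", model.predict_proba(new_tx)[0][1])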
3. Prescriptive Analytics: What action should be taken

This type of analytics prescribes the solution to a particular problem. Prescriptive
analytics works with both descriptive and predictive analytics. Most of the time, it
relies on AI and machine learning.

Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type
of analytics is used to build an algorithm that will automatically adjust the flight fares
based on numerous factors, including customer demand, weather, destination, holiday
seasons, and oil prices.

4. Diagnostic Analytics: Why did it happen

This is done to understand what caused a problem in the first place. Techniques like
drill-down, data mining, and data recovery are all examples. Organizations use
diagnostic analytics because they provide an in-depth insight into a particular
problem.

Use Case: An e-commerce company’s report shows that their sales have gone down,
although customers are adding products to their carts. This can be due to various
reasons like the form didn’t load correctly, the shipping fee is too high, or there are
not enough payment options available. This is where you can use diagnostic analytics
to find the reason.
Why is big data analytics important?
In today’s world, Big Data analytics is fueling everything we do online—in every
industry.

Take the music streaming platform Spotify for example. The company has nearly 96
million users that generate a tremendous amount of data every day. Through this
information, the cloud-based platform automatically generates suggested songs—
through a smart recommendation engine—based on likes, shares, search history, and
more. What enables this is the techniques, tools, and frameworks that are a result of
Big Data analytics.

If you are a Spotify user, you must have come across the top recommendations
section, which is based on your likes, past history, and other factors. It works by
utilizing a recommendation engine that leverages data filtering tools to collect data
and then filter it using algorithms. This is what Spotify does.

History of Big Data Analytics


The history of Big Data analytics can be traced back to the early days of computing,
when organizations first began using computers to store and analyze large amounts of
data. However, it was not until the late 1990s and early 2000s that Big Data analytics
really began to take off, as organizations increasingly turned to computers to help
them make sense of the rapidly growing volumes of data being generated by their
businesses.

Today, Big Data analytics has become an essential tool for organizations of all sizes
across a wide range of industries. By harnessing the power of Big Data, organizations
are able to gain insights into their customers, their businesses, and the world around
them that were simply not possible before.

As the field of Big Data analytics continues to evolve, we can expect to see even more
amazing and transformative applications of this technology in the years to come.

Benefits and Advantages of Big Data Analytics


1. Risk Management

Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to
identify fraudulent activities and discrepancies. The organization leverages it to
narrow down a list of suspects or root causes of problems.
2. Product Development and Innovations

Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and
armed forces across the globe, uses Big Data analytics to analyze how efficient the
engine designs are and if there is any need for improvements.
3. Quicker and Better Decision Making Within Organizations

Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example,
the company leverages it to decide if a particular location would be suitable for a new
outlet or not. They will analyze several different factors, such as population,
demographics, accessibility of the location, and more.
4. Improve Customer Experience

Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences.
They monitor tweets to find out their customers’ experience regarding their journeys,
delays, and so on. The airline identifies negative tweets and does what’s necessary to
remedy the situation. By publicly addressing these issues and offering solutions, it
helps the airline build good customer relations.

Big Data Analytics Tools


Here are some of the key Big Data analytics tools:
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle chunks of data
• Spark - used for real-time processing and analyzing large amounts of data (see the sketch after this list)
• STORM - an open-source real-time computational system
• Kafka - a distributed streaming platform that is used for fault-tolerant storage
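As a small illustration of how one of these tools is typically used, here is a hedged PySpark sketch that aggregates a large sales dataset. PySpark is assumed to be installed, and the HDFS path and the region/amount columns are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Read a (potentially very large) CSV and compute revenue per region in parallel.
sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)
summary = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()

The same code runs unchanged on a laptop or on a cluster; Spark decides how to distribute the work.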
1.2. CHALLENGES OF CONVENTIONAL SYSTEMS

1.2.1 Introduction to Conventional Systems

What is Conventional System?

Conventional Systems:
 A conventional system consists of one or more zones, each having either manually operated call
points or automatic detection devices, or a combination of both.
 Big data is a huge amount of data that is beyond the processing capacity of
conventional database systems to manage and analyze within a specific time
interval.

Difference between conventional computing and intelligent computing

 Conventional computing functions logically with a set of rules and calculations,
while neural computing can function via images, pictures, and concepts.
 Conventional computing is often unable to manage the variability of data obtained in the
real world.
 On the other hand, neural computing, like our own brains, is well suited to situations that
have no clear algorithmic solution and is able to manage noisy, imprecise data. This
allows it to excel in those areas that conventional computing often finds difficult.

1.2.2 Comparison of Big Data with Conventional Data

Big Data                                          Conventional Data

Huge data sets.                                   Data set size in control.
Unstructured data such as text, video,            Normally structured data such as numbers and
and audio.                                        categories, but it can take other forms as well.
Hard-to-perform queries and analysis.             Relatively easy-to-perform queries and analysis.
Needs a new methodology for analysis.             Data analysis can be achieved by using
                                                  conventional methods.
Needs tools such as Hadoop, Hive, HBase,          Tools such as SQL, SAS, R, and Excel alone may
Pig, Sqoop, and so on.                            be sufficient.
Aggregated or sampled or filtered data.           Raw transactional data.
Used for reporting, basic analysis, and text      Used for reporting, advanced analysis, and
mining; advanced analytics is only at a           predictive modeling.
starting stage in big data.
Needs both programming skills (such as Java)      Analytical skills are sufficient; advanced
and analytical skills to perform analysis.        analysis tools don't require expert
                                                  programming skills.
Petabytes/exabytes of data.                       Megabytes/gigabytes of data.
Millions/billions of accounts.                    Thousands/millions of accounts.
Billions/trillions of transactions.               Millions of transactions.
Generated by big financial institutions,          Generated by small enterprises and small banks.
Facebook, Google, Amazon, eBay, Walmart,
and so on.

1.2.3 List of Challenges of Conventional Systems

The following challenges dominate in the case of conventional systems in real-time scenarios:

1) Uncertainty of Data Management Landscape
2) The Big Data Talent Gap
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big Data analytics

1) Uncertainty of Data Management Landscape:

 Because big data is continuously expanding, there are new companies and technologies
that are being developed every day.
 A big challenge for companies is to find out which technology works best for them
without the introduction of new risks and problems.

2) The Big Data Talent Gap:

 While Big Data is a growing field, there are very few experts available in this field.
 This is because Big Data is a complex field, and people who understand the complexity
and intricate nature of this field are few and far between.

3) Getting data into the big data platform:
 Data is increasing every single day. This means that companies have to tackle a limitless
amount of data on a regular basis.
 The scale and variety of data that is available today can overwhelm any data practitioner,
and that is why it is important to make data accessibility simple and convenient for
brand managers and owners.

4) Need for synchronization across data sources:


 As data sets become more diverse, there is a need to incorporate them into an analytical
platform.
 If this is ignored, it can create gaps and lead to wrong insights and messages.

5) Getting important insights through the use of Big data analytics:


 It is important that companies gain proper insights from big data analytics and it is
important that the correct department has access to this information.
 A major challenge in the big data analytics is bridging this gap in an effective fashion.

Other three challenges of conventional systems

Three challenges that big data faces:

1. Data
2. Process
3. Management

1. Data Challenges

Volume

1. The volume of data, especially machine-generated data, is exploding.
2. Data is growing rapidly every year, with new sources of data emerging.
3. For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this
was expected to reach 35 zettabytes (ZB) by 2020 (according to IBM).

Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day; Facebook,
10 TB.
• Mobile devices play a key role as well: there were an estimated 6 billion mobile phones in
2011.
• The challenge is how to deal with the size of Big Data.

Variety, Combining Multiple Data Sets

• More than 80% of today's information is unstructured and it is typically too big to manage
effectively.
• Today, companies are looking to leverage a lot more data from a wider variety of sources, both
inside and outside the organization.
• Things like documents, contracts, machine data, sensor data, social media, health records,
emails, etc. The list is endless, really.
• A lot of this data is unstructured, or has a complex structure that's hard to represent in
rows and columns.

2. Processing

 More than 80% of today's information is unstructured and it is typically too big to
manage effectively.

 Today, companies are looking to leverage a lot more data from a wider variety of
sources, both inside and outside the organization.

 Things like documents, contracts, machine data, sensor data, social media, health
records, emails, etc. The list is endless, really.

3. Management

 A lot of this data is unstructured, or has a complex structure that's hard to represent in
rows and columns.

Big Data Challenges

– The challenges include capture, curation, storage, search, sharing, transfer,
analysis, and visualization.

• Big Data is a trend toward larger data sets,
• due to the additional information derivable from analysis of a single large set of related
data,
– as compared to separate smaller sets with the same total amount of data, allowing
correlations to be found to
• "spot business trends, determine quality of research, prevent diseases, link
legal citations, combat crime, and determine real-time roadway traffic
conditions."

Challenges of Big Data


The following are the five most important challenges of Big Data:

a) Meeting the need for speed

In today's hypercompetitive business environment, companies not only have to find and
analyze the relevant data they need, they must find it quickly.

Visualization helps organizations perform analyses and make decisions much more
rapidly, but the challenge is going through the sheer volumes of data and accessing the
level of detail needed, all at a high speed.

The challenge only grows as the degree of granularity increases. One possible solution
is hardware. Some vendors are using increased memory and powerful parallel
processing to crunch large volumes of data extremely quickly.
b) Understanding the data

 It takes a lot of understanding to get data in the right shape so that you can use
visualization as part of data analysis.

c) Addressing data quality

 Even if you can find and analyze data quickly and put it in the proper context for the
audience that will be consuming the information, the value of data for decision-making
purposes will be jeopardized if the data is not accurate or timely.
This is a challenge with any data analysis.

d) Displaying meaningful results

 Plotting points on a graph for analysis becomes difficult when dealing with extremely
large amounts of information or a variety of categories of information.
 For example, imagine you have 10 billion rows of retail SKU data that you're trying to
compare. A user trying to view 10 billion plots on the screen will have a hard time
seeing so many data points.
 By grouping the data together, or "binning," you can more effectively visualize the
data, as sketched below.
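A minimal sketch of the binning idea, assuming pandas and NumPy; the one million prices are randomly generated stand-ins for real SKU-level figures.

import numpy as np
import pandas as pd

prices = pd.Series(np.random.lognormal(mean=3.0, sigma=1.0, size=1_000_000))

# Ten price bands instead of a million individual points on a chart.
bands = pd.cut(prices, bins=10)
print(bands.value_counts().sort_index())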

e) Dealing with outliers

 The graphical representations of data made possible by visualization can communicate
trends and outliers much faster than tables containing numbers and text.
 Users can easily spot issues that need attention simply by glancing at a chart. Outliers
typically represent about 1 to 5 percent of the data, but when you're working with massive
amounts of data, viewing 1 to 5 percent of the data is rather difficult.
 We can also bin the results to both view the distribution of the data and see the outliers.
 While outliers may not be representative of the data, they may also reveal previously
unseen and potentially valuable insights.

 Visual analytics enables organizations to take raw data and present it in a meaningful
way that generates the most value. However, when used with big data, visualization is
bound to lead to some challenges.
**************
1.3. INTELLIGENT DATA ANALYSIS

1.3.1 INTRODUCTION TO INTELLIGENT DATA ANALYSIS (IDA)

Intelligent Data Analysis (IDA) is one of the hot issues in the field of
artificial intelligence and information.

What is Intelligent Data Analysis (IDA)?

IDA is

… an interdisciplinary study concerned with the effective analysis of data;

… used for extracting useful information from large quantities of online data; extracting
desirable knowledge or interesting patterns from existing databases;

 the distillation of information that has been collected, classified, organized, integrated,
abstracted and value-added;

 at a level of abstraction higher than the data and the information on which it is based, and
can be used to deduce new information and new knowledge;

 usually in the context of human expertise used in solving problems.

Goal:

The goal of intelligent data analysis is to extract useful knowledge; the process demands a
combination of extraction, analysis, conversion, classification, organization, reasoning, and so
on.
1.3.2 Uses / Benefits of IDA

Intelligent Data Analysis provides a forum for the examination of issues related to the research
and applications of Artificial Intelligence techniques in data analysis across a variety of
disciplines. The techniques and benefit areas include (but are not limited to):

 Data Visualization
 Data pre-processing (fusion, editing, transformation, filtering, sampling)
 Data Engineering
 Database mining techniques, tools and applications
 Use of domain knowledge in data analysis
 Big Data applications
 Evolutionary algorithms
 Machine Learning(ML)
 Neural nets
 Fuzzy logic
 Statistical pattern recognition
 Knowledge Filtering and
 Post-processing

Intelligent Data Analysis (IDA)

Why IDA?

 Decision making requires information and knowledge.

 Data processing can provide them.

 The multidimensionality of problems calls for methods of adequate and deep data
processing and analysis.


1.3.3 Intelligent Data Analysis

Knowledge Acquisition
 The process of eliciting, analyzing, transforming, classifying, organizing and integrating
knowledge and representing that knowledge in a form that can be used in a computer
system.

Knowledge in a domain can be expressed as a number of rules

A Rule:

A formal way of specifying a recommendation, directive, or strategy, expressed as "IF
premise THEN conclusion" or "IF condition THEN action".

How can rules hidden in the data be discovered? One possibility is sketched below.
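Since See5/C5.0 itself is a commercial tool, the following hedged sketch uses a scikit-learn decision tree as a stand-in to show how IF-THEN rules can be discovered from data; the tiny encoded dataset (gender, smoking, SBP -> class) is invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded attributes: gender (0 = F, 1 = M), smoking (0 = No, 1 = Yes), SBP
X = [
    [1, 1, 150], [1, 0, 118], [0, 1, 142], [0, 0, 110],
    [1, 1, 160], [0, 0, 105], [1, 0, 125], [0, 1, 155],
]
y = [2, 1, 2, 1, 2, 1, 1, 2]  # 1 = healthy, 2 = ill

# Fit a small tree and print its IF-THEN structure (the discovered rules).
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["gender", "smoking", "SBP"]))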

1.3.4 Intelligent Data Examples:

Example of IDA

 Epidemiological study (1970-1990)

 Sample of examinees died from cardiovascular diseases during the period

Question: Did they know they were ill?

1 – they were healthy

2 – they were ill (drug treatment, positive clinical and laboratory findings)
Illustration of IDA by using See5

 application.names - lists the classes to which cases may belong and the attributes
used to describe each case.

 Attributes are of two types: discrete attributes have a value drawn from a set of
possibilities, and continuous attributes have numeric values.

 application.data - provides information on the training cases from which See5 will
extract patterns.

 The entry for each case consists of one or more lines that give the values for all
attributes.

 application.test - provides information on the test cases (used for evaluation of results).

 The entry for each case consists of one or more lines that give the values for all attributes.

Goal 1.1:
application.names – example
gender:M,F
activity:1,2,3
age: continuous
smoking: No, Yes

Goal 1.2:
application.data – example
M,1,59,Yes,0,0,0,0,119,73,103,86,247,87,15979,?,?,?,1,73,2.5
M,1,66,Yes,0,0,0,0,132,81,183,239,?,783,14403,27221,19153,23187,1,73,2.6
M,1,61,No,0,0,0,0,130,79,148,86,209,115,21719,12324,10593,11458,1,74,2.5
… …

Result:
Results – example

Rule 1: (cover 26)


gender = M
SBP > 111
oil_fat > 2.9
-> class 1 [0.929]

Rule 4: (cover 14)


smoking = Yes
SBP > 131
glucose > 93
glucose <= 118
oil_fat <= 2.9
-> class 2 [0.938]

Rule 15: (cover 2)


SBP <= 111
oil_fat > 2.9
-> class 2 [0.750]

Evaluation on training data (199 cases):

   (a)   (b)   <- classified as
  ----  ----
   107     3   (a): class 1
    17    72   (b): class 2

Results on (training set):

Sensitivity=0.97
Specificity=0.81

Sensitivity=0.98
Specificity=0.90

Evaluation of IDA results

 Absolute & relative accuracy
 Sensitivity & specificity
 False positive & false negative
 Error rate
 Reliability of rules
 Etc.
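As a quick check of the figures above, the reported training-set sensitivity and specificity can be recomputed from the See5 confusion matrix, treating class 1 as the positive class.

tp, fn = 107, 3    # class 1 cases classified as (a) and (b)
fp, tn = 17, 72    # class 2 cases classified as (a) and (b)

sensitivity = tp / (tp + fn)   # 107 / 110, approximately 0.97
specificity = tn / (tn + fp)   # 72 / 89, approximately 0.81
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")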
******************
1. 4 NATURE OF DATA

1.4.1 INTRODUCTION
Data

 Data is a set of values of qualitative or quantitative variables; restated, pieces of data
are individual pieces of information.
 Data is measured, collected, reported, and analyzed, whereupon it can be visualized
using graphs or images.

Properties of Data
For examining the properties of data, we refer to the various definitions of data.

Reference to these definitions reveals that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement

a) Amenability of use: From the dictionary meaning of data it is learnt that data are facts
used in deciding something. In short, data are meant to be used as a base for arriving at
definitive conclusions.

b) Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.

c) Accuracy: Data should be real, complete and accurate. Accuracy is thus, an essential
property of data.

d) Essence: Large quantities of data are collected and have to be compressed and
refined. Data so refined can present the essence, or derived qualitative value, of the
matter.
e) Aggregation: Aggregation is cumulating or adding up.

f) Compression: Large amounts of data are always compressed to make them more
meaningful. Compress data to a manageable size. Graphs and charts are some examples
of compressed data.

g) Refinement: Data require processing or refinement. When refined, they are capable of
leading to conclusions or even generalizations. Conclusions can be drawn only when
data are processed or refined.

1.4.2 TYPES OF DATA

 In order to understand the nature of data it is necessary to categorize them into various
types.
 Different categorizations of data are possible.
 The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
 Within each of these fields, there may be several ways in which data can be categorized into
types.

There are four types of data:

 Nominal
 Ordinal
 Interval
 Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
The distinction between the four types of scales centers on three different characteristics:

1. The order of responses – whether it matters or not
2. The distance between observations – whether it matters or is interpretable
3. The presence or inclusion of a true zero
1.4.2.1 Nominal Scales
Nominal scales measure categories and have the following characteristics:

 Order: The order of the responses or observations does not matter.


 Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not
the same as a 2 and 3.
 True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Appropriate statistics for nominal scales: mode, count, frequencies
Displays: histograms or bar charts

1.4.2.2 Ordinal Scales


At the risk of providing a tautological definition, ordinal scales measure, well, order. So, our
characteristics for ordinal scales are:

 Order: The order of the responses or observations matters.


 Distance: Ordinal scales do not hold distance. The distance between first and second is
unknown, as is the distance between first and third, and so on for all observations.
 True Zero: There is no true or real zero. An item, observation, or category cannot finish
in zeroth place.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts

1.4.2.3 Interval Scales


Interval scales provide insight into the variability of the observations or data.
Classic interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
In an interval scale, users could respond to “I enjoy opening links to the website from a company
email” with a response ranging on a scale of values.
The characteristics of interval scales are:

 Order: The order of the responses or observations does matter.

 Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the
same as the distance from 4 to 5, so differences can be compared and averaged. However,
because there is no true zero, ratios are not meaningful on an interval scale (six is not
"twice as much as" three).
 True Zero: There is no true zero with interval scales. However, data can be rescaled in a
manner that contains zero. An interval scale measured from 1 to 9 remains the same as one
measured from 11 to 19, because we added 10 to all values. Similarly, a 1 to 9 interval scale
is the same as a -4 to 4 scale, because we subtracted 5 from all values. Although the new
scale contains zero, zero remains uninterpretable because it only appears in the scale from
the transformation.
Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.

1.4.2.4 Ratio Scales


Ratio scales appear as interval scales with a true zero.
They have the following characteristics:

 Order: The order of the responses or observations matters.

 Distance: Ratio scales do have an interpretable distance.
 True Zero: There is a true zero.
Income is a classic example of a ratio scale:

 Order is established. We would all prefer $100 to $1!


 Zero dollars means we have no income (or, in accounting terms, our revenue exactly
equals our expenses!)
 Distance is interpretable, in that $20 appears as twice $10 and $50 is half of a $100.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
The table below summarizes the characteristics of all four types of scales.

                            Nominal   Ordinal   Interval   Ratio

Order Matters                  No       Yes       Yes       Yes

Distance Is Interpretable      No       No        Yes       Yes

Zero Exists                    No       No        No        Yes
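To make the "appropriate statistics" listed above concrete, here is a minimal sketch using pandas (assumed) on a small invented survey dataset; the column names are hypothetical.

import pandas as pd

survey = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],            # nominal
    "satisfaction_rank": [1, 3, 2, 1, 2],           # ordinal
    "agreement_1_to_9": [7, 5, 9, 6, 8],            # interval (Likert-style)
    "income": [42000, 55000, 38000, 61000, 47000],  # ratio
})

print(survey["gender"].mode())                      # nominal: mode / frequencies
print(survey["satisfaction_rank"].value_counts())   # ordinal: counts and mode
print(survey["agreement_1_to_9"].mean(), survey["agreement_1_to_9"].std())  # interval
print(survey["income"].mean(), survey["income"].median())                   # ratio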

1.4.3 DATA CONVERSION

 We can convert or transform our data from ratio to interval to ordinal to nominal.
However, we cannot convert or transform our data from nominal to ordinal to interval
to ratio.

 Scaled data can be measured in exact amounts.

For example: 60 degrees, 12.5 feet, 80 miles per hour

 Scaled data can be measured with equal intervals.

For example: between 0 and 1 is 1 inch; between 13 and 14 is also 1 inch

 Ordinal or ranked data provides comparative amounts.

Example:
1st Place 2nd Place 3rd Place

 Not equal intervals


1st Place 2nd Place 3rd Place

19.6 feet 18.2 feet 12.4 feet

1.4.4 DATA SELECTION

Another example that handles the question:

What is the average driving speed of teenagers on the freeway?

a) Scaled
b) Ordinal
Scaled – Speed: Speed can be measured in exact amounts with equal intervals.

Example :
60 degrees 12.5 feet 80 Miles per hour

 Ordinal or ranked data provides comparative amounts.

For example, 1st Place 2nd Place 3rd Place

 Percentiles provide comparative amounts.

In this case, 93% of all hospitals have lower patient satisfaction scores than Eastridge Hospital,
and 31% have lower satisfaction scores than Westridge Hospital.

Thus the nature of data and its value have great influence on data insight in it.

***********************
1.5 ANALYTIC PROCESS AND TOOLS

• There are six steps in the analytic process:


1. Deployment

2. Business Understanding

3. Data Exploration

4. Data Preparation

5. Data Modeling

6. Data Evaluation
Step 1: Deployment

• Here we need to:

– plan the deployment and monitoring and maintenance,

– we need to produce a final report and review the project.

– In this phase,

• we deploy the results of the analysis.

• This is also known as reviewing the project.

Step 2: Business Understanding

• Business Understanding

– The very first step consists of business understanding.

– Whenever any requirement occurs, firstly we need to determine the business


objective,

– assess the situation,

– determine data mining goals and then

– produce the project plan as per the requirement.

• Business objectives are defined in this phase.

Step 3: Data Exploration

• This step consists of data understanding.

– For the further process, we need to gather initial data, describe and explore the
data and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its application
and the need for the project in this phase.

– This is also known as data exploration.

• This is necessary to verify the quality of data collected.

Step 4: Data Preparation

• From the data collected in the last step,

– we need to select data as per the need, clean it, construct it to get useful
information and

– then integrate it all.

• Finally, we need to format the data to get the appropriate data.

• Data is selected, cleaned, and integrated into the format finalized for the analysis in this
phase.

Step 5: Data Modeling

• we need to
– select a modeling technique, generate test design, build a model and assess the
model built.
• The data model is built to
– analyze relationships between various selected objects in the data;
– test cases are built for assessing the model, and the model is tested and implemented
on the data in this phase.

• Where processing is hosted?


– Distributed Servers / Cloud (e.g. Amazon EC2)
• Where data is stored?
– Distributed Storage (e.g. Amazon S3)
• What is the programming model?
– Distributed Processing (e.g. MapReduce)
• How data is stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
– Analytic / Semantic Processing

• Big data tools for HPC and supercomputing


– MPI
• Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
• Other BDA tools
– SAS
– R
– Hadoop

Thus, BDA tools are used throughout BDA application development.
******************

1.6 ANALYSIS AND REPORTING

1.6.1 INTRODUCTION TO ANALYSIS AND REPORTING

What is Analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.

• What is Reporting ?
• Reporting is
– “the process of organizing data
– into informational summaries
– in order to monitor how different areas of a business are performing.”

1.6.2 COMPARING ANALYSIS WITH REPORTING

• Reporting is “the process of organizing data into informational summaries in order to


monitor how different areas of a business are performing.”
• Measuring core metrics and presenting them — whether in an email, a slidedeck,
or online dashboard — falls under this category.
• Analytics is “the process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.”
• Reporting helps companies to monitor their online business and be alerted to when data
falls outside of expected ranges.
• Good reporting
• should raise questions about the business from its end users.
• The goal of analysis is
• to answer questions by interpreting the data at a deeper level and providing
actionable recommendations.

• A firm may be focused on the general area of analytics (strategy, implementation,


reporting, etc.)
– but not necessarily on the specific aspect of analysis.
• It’s almost like some organizations run out of gas after the initial set-up-related
activities and don’t make it to the analysis stage

A good reporting activity naturally prompts an analysis activity.


1.6.3 CONTRAST BETWEEN ANALYSIS AND REPORTING

The basic differences between Analysis and Reporting are as follows:

Analysis                      Reporting
Provides what is needed       Provides what is asked for
Is typically customized       Is typically standardized
Involves a person             Does not involve a person
Is extremely flexible         Is fairly inflexible

• Reporting translates raw data into information.


• Analysis transforms data and information into insights.
• reporting shows you what is happening
• while analysis focuses on explaining why it is happening and what you can do about it.

 Reports are like robots: they monitor and alert you. Analysis is like a parent: it can
figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear
infection, etc.).
 Reporting and analysis can go hand-in-hand:
 Reporting provides limited context about what is happening in the data. Context is
critical to good analysis.
 Reporting translates raw data into information.
 Reporting usually raises a question – What is happening?
 Analysis transforms the data into insights – Why is it happening? What can you do
about it?

Thus, analysis and reporting complement each other, and each is used according to the need and
context at hand.

*****************
1.7 MODERN ANALYTIC TOOLS

1.7.1 Introduction to Modern Analytic Tools

• Modern Analytic Tools: Current Analytic tools concentrate on three classes:


a) Batch processing tools
b) Stream Processing tools and
c) Interactive Analysis tools.

a) Big Data Tools Based on Batch Processing:


Batch processing system :-
• Batch Processing System involves
– collecting a series of processing jobs and carrying them out periodically as
a group (or batch) of jobs.
• It allows a large volume of jobs to be processed at the same time.
• An organization can schedule batch processing for a time when there is little
activity on their computer systems, for example overnight or at weekends.
• One of the most famous and powerful batch process-based Big Data tools is
Apache Hadoop.
 It provides infrastructures and platforms for other specific Big Data
applications.
b) Stream Processing tools
• Stream processing – envisioning (predicting) the life in data as and when it
transpires.
• The key strength of stream processing is that it can provide insights faster, often
within milliseconds to seconds.
– It helps in understanding the hidden patterns in millions of data records in
real time.
– It translates into processing of data from single or multiple sources
– in real or near-real time, applying the desired business logic and emitting the
processed information to the sink.
• Stream processing serves multiple purposes in today's business arena; a minimal,
framework-free sketch of the idea follows.
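Below is a minimal, framework-free sketch of this idea (it does not use Storm, Flink, or Kinesis): events are consumed one at a time from a simulated source, and a running aggregate is emitted immediately instead of waiting for a complete batch.

import random
import time

def event_stream(n):
    # Simulate a source emitting one sensor reading at a time.
    for _ in range(n):
        yield {"sensor": random.choice(["A", "B"]), "value": random.uniform(0, 100)}
        time.sleep(0.01)  # stand-in for real arrival delays

running_totals = {}
for event in event_stream(50):
    s = event["sensor"]
    running_totals[s] = running_totals.get(s, 0.0) + event["value"]
    # Emit the updated aggregate to the "sink" as soon as the event is processed.
    print(f"{s}: running total = {running_totals[s]:.1f}")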
Real time data streaming tools are:
a) Storm
• Storm is a stream processing engine without batch support,
• a true real-time processing framework,
• taking in a stream as an entire ‘event’ instead of series of small batches.
• Apache Storm is a distributed real-time computation system.
• Its applications are designed as directed acyclic graphs.

b) Apache Flink
• Apache Flink is
– an open-source platform,
– a streaming dataflow engine that provides communication, fault tolerance, and
– data distribution for computations over data streams.
– Flink is a top-level Apache project; it is a scalable data analytics framework
that is fully compatible with Hadoop.
– Flink can execute both stream processing and batch processing easily.
– Flink was designed as an alternative to MapReduce.

c) Kinesis
– Kinesis is an out-of-the-box streaming data tool.
– Kinesis comprises shards (which Kafka calls partitions).
– For organizations that take advantage of real-time or near real-time access to
large stores of data,
– Amazon Kinesis is great.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data which is followed by
loading the aggregate data into a data warehouse.
– Data is put into Kinesis streams.
– This ensures durability and elasticity.

c) Interactive Analysis -Big Data Tools


• The interactive analysis presents
– the data in an interactive environment,
– allowing users to undertake their own analysis of information.
• Users are directly connected to
– the computer and hence can interact with it in real time.
• The data can be :
– reviewed,
– compared and
– analyzed
• in tabular or graphic format or both at the same time.

IA -Big Data Tools -

a) Google’s Dremel: Google proposed this interactive analysis system in 2010 and named it
Dremel,
– which is scalable for processing nested data.
– Dremel provides
• a very fast SQL-like interface to the data by using a different technique
than MapReduce.
• Dremel has a very different architecture:
– compared with the well-known Apache Hadoop, and
– acts as a successful complement to Map/Reduce-based computations.

• Dremel has capability to:


– run aggregation queries over trillion-row tables in seconds
– by means of:
• combining multi-level execution trees and
• columnar data layout.

b) Apache drill
• Apache drill is:
– Drill is an Apache open-source SQL query engine for Big Data exploration.
– It is similar to Google’s Dremel.
• For Drill, there is:
– more flexibility to support
• a various different query languages,
• data formats and
• data sources.
• Drill is designed from the ground up to:
– support high-performance analysis on the semi-structured and
– rapidly evolving data coming from modern Big Data applications.
• Drill provides plug-and-play integration with existing Apache Hive and Apache
HBase deployments.

1.7.2 Categories of Modern Analytic Tools

a) Big Data Tools for HPC and Supercomputing

• MPI (Message Passing Interface, 1992)
– Provides standardized function interfaces for communication between
parallel processes.
• Collective communication operations
– Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter.
• Popular implementations
– MPICH (2001)
– OpenMPI (2004)

b) Big Data Tools on Clouds
i. MapReduce model
ii. Iterative MapReduce model
iii. DAG model
iv. Graph model
v. Collective model

a) MapReduce Model
Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters.
OSDI 2004.
b) Apache Hadoop (2005)
Apache Hadoop YARN: Yet Another Resource Negotiator, SOCC 2013.
Key Features of MapReduce Model
Designed for clouds
Large clusters of commodity machines
Designed for big data
Support from local disks based distributed file system (GFS / HDFS)
Disk based intermediate data transfer in Shuffling
MapReduce programming model
Computation pattern: Map tasks and Reduce tasks
Data abstraction: KeyValue pairs
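The following is a minimal, single-machine sketch of the MapReduce programming model just described: map tasks emit key-value pairs, a shuffle groups them by key, and reduce tasks aggregate each group. It is illustrative only; real Hadoop distributes these steps across a cluster and stores intermediate data on disk.

from collections import defaultdict

documents = ["big data analytics", "big data tools", "data analytics tools"]

# Map: each document produces (word, 1) key-value pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)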

Iterative MapReduce Model


• Twister:
A runtime for iterative MapReduce.
Has simple collectives: broadcasting and aggregation.

• HaLoop
• Efficient iterative data processing on large clusters.
• Features:
– Loop-aware task scheduling
– Caching and indexing for loop-invariant data on local disk.

Resilient Distributed Datasets(RDD):


• A Fault-Tolerant Abstraction for In-Memory Cluster Computing
– RDD operations
• MapReduce-like parallel operations
– DAG of execution stages and pipelined transformations
– Simple collectives: broadcasting and aggregation
DAG (Directed Acyclic Graph) Model
– Distributed data-parallel programs from sequential building blocks
• Apache Spark
– Cluster Computing with Working Sets

Graph Model
• Graph Processing with BSP model
• Pregel (2010)
– A System for Large-Scale Graph Processing. SIGMOD 2010.
– Apache Hama (2010)
• Apache Giraph (2012)
– Scaling Apache Giraph to a trillion edges

Pregel & Apache Giraph


• Computation Model
– Superstep as iteration
– Vertex state machine:
Active and Inactive, vote to halt
– Message passing between vertices
– Combiners
– Aggregators
– Topology mutation
• Master/worker model
• Graph partition: hashing
• Fault tolerance: checkpointing and confined recovery

GraphLab (2010)
• GraphLab: A New Parallel Framework for Machine Learning. UAI 2010.
• Distributed GraphLab: A Framework for Machine Learning and Data Mining in the
Cloud.
• Data graph
• Update functions and the scope
• PowerGraph (2012)
– PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.
– Gather, Apply, Scatter (GAS) model
• GraphX (2013)
– A Resilient Distributed Graph System on Spark. GRADES

Collective Model
• Harp (2013)
– A Hadoop Plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
– Hierarchical data abstraction on arrays, key-values and graphs for easy
programming expressiveness.
– Collective communication model to support various communication operations
on the data abstractions.
– Caching with buffer management for memory allocation required from
computation and communication
– BSP style parallelism
– Fault tolerance with check-pointing.

Other major Tools


a) AWS
b) BigData
c) Cassandra
d) Data Warehousing
e) DevOps
f) HBase
g) Hive
h) MongoDB
i) NiFi
j) Tableau
k) Talend
l) ZooKeeper

Thus the modern analytical tools play an important role in the modern data world.

**********
