Unit 1 Notes Final Part C
Big Data analytics is a process used to extract meaningful insights, such as hidden
patterns, unknown correlations, market trends, and customer preferences. It provides
various advantages: it can be used for better decision making, preventing fraudulent
activities, and much more.
Uses and Examples of Big Data Analytics
There are many different ways that Big Data analytics can be used to improve
businesses and organizations; several examples appear in the use cases later in this
section. The following paragraphs first walk through the stages of a typical Big Data
analytics lifecycle.
In this stage, the team learns about the business domain, which presents the
motivation and goals for carrying out the analysis. The problem is identified, and
assumptions are made about how much potential gain the company will realize after
carrying out the analysis. Important activities in this step include framing the
business problem as an analytics challenge that can be addressed in subsequent
phases. It helps the decision-makers understand the business resources that will need
to be utilized, thereby determining the underlying budget required to carry out the
project. Moreover, it can be determined whether the problem identified is a Big Data
problem or not, based on the business requirements in the business case. To qualify
as a Big Data problem, the business case should be directly related to one (or more)
of the characteristics of volume, velocity, or variety.
Once the business case is identified, it is time to find appropriate datasets to work
with. In this stage, analysis is done to see what other companies have done for a
similar case. Depending on the business case and the scope of analysis of the project
being addressed, the sources of datasets can be either internal or external to the
company. Internal datasets include data collected from internal sources, such as
feedback forms and existing software. External datasets include datasets from
third-party providers.
Now the data is filtered, but some of the entries might still be incompatible. To
rectify this issue, a separate phase, known as the data extraction phase, is created.
In this phase, data that does not match the underlying scope of the analysis is
extracted and transformed into a compatible form.
As mentioned in the earlier phases, the data is collected from various sources, which
results in the data being unstructured. The data might also contain values that are
unsuitable for analysis and can lead to false results, so there is a need to clean and
validate the data. This includes removing invalid data and establishing complex
validation rules. There are many ways to validate and clean the data. For example, a
dataset might contain a few rows with null entries. If a similar dataset is present,
those entries are copied from that dataset; otherwise, those rows are dropped.
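A minimal sketch of this null-handling rule, using pandas, is shown below; the column names (roll_number, marks) and the "similar" reference dataset are invented purely for illustration and are not part of these notes.

import pandas as pd

# Primary dataset with a few null entries.
primary = pd.DataFrame({
    "roll_number": [101, 102, 103, 104],
    "marks": [78.0, None, 64.0, None],
})
# A similar dataset that can supply the missing values.
similar = pd.DataFrame({
    "roll_number": [101, 102, 103],
    "marks": [78.0, 81.0, 64.0],
})

# Copy missing values from the similar dataset where a matching record exists.
filled = primary.merge(similar, on="roll_number", how="left", suffixes=("", "_ref"))
filled["marks"] = filled["marks"].fillna(filled["marks_ref"])
filled = filled.drop(columns=["marks_ref"])

# Rows that are still null after the lookup are dropped.
cleaned = filled.dropna(subset=["marks"])
print(cleaned)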
The data is now cleansed and validated against rules set by the enterprise. But the
data might be spread across multiple datasets, and it is not advisable to work with
multiple datasets. Hence, the datasets are joined together. For example, if there are
two datasets, one for the Student Academic section and one for the Student Personal
Details section, both can be joined via a common field, i.e., the roll number. This
phase calls for intensive operations, since the amount of data can be very large.
Automation can be brought in so that these steps are executed without any human
intervention.
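As a small illustration of this join, here is a sketch using pandas; the specific column names and values are assumed for the example.

import pandas as pd

academic = pd.DataFrame({
    "roll_number": [1, 2, 3],
    "grade": ["A", "B", "A"],
})
personal = pd.DataFrame({
    "roll_number": [1, 2, 3],
    "city": ["Pune", "Chennai", "Delhi"],
})

# Join the Student Academic and Student Personal Details datasets
# on the common roll_number field.
students = academic.merge(personal, on="roll_number", how="inner")
print(students)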
Here comes the actual step: the analysis task. Depending on the nature of the Big
Data problem, analysis is carried out. Data analysis can be classified as confirmatory
analysis or exploratory analysis. In confirmatory analysis, an assumption about the
cause of a phenomenon is made in advance; this assumption is called the hypothesis.
The data is then analyzed to prove or disprove the hypothesis. This kind of analysis
provides definitive answers to specific questions and confirms whether an assumption
was true or not. In exploratory analysis, the data is explored to obtain information
about why a phenomenon occurred. This type of analysis answers "why" a phenomenon
occurred; it does not provide definitive answers, but instead supports the discovery
of patterns.
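The contrast between the two kinds of analysis can be illustrated with a small sketch; the dataset, the column names, and the choice of a t-test and a correlation matrix are assumptions made purely for illustration (pandas and scipy are assumed to be available).

import pandas as pd
from scipy import stats

data = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north", "south"],
    "sales": [120.0, 135.0, 90.0, 95.0, 128.0, 88.0],
    "ad_spend": [10.0, 12.0, 6.0, 7.0, 11.0, 5.0],
})

# Confirmatory analysis: test a pre-stated hypothesis, e.g.
# "sales in the north differ from sales in the south".
north = data.loc[data["region"] == "north", "sales"]
south = data.loc[data["region"] == "south", "sales"]
t_stat, p_value = stats.ttest_ind(north, south)
print("t =", round(t_stat, 2), "p =", round(p_value, 4))

# Exploratory analysis: look for patterns without a fixed hypothesis,
# e.g. which numeric columns move together.
print(data[["sales", "ad_spend"]].corr())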
Once the analysis is done and the results are visualized, it is time for the business
users to make decisions and utilize the results. The results can be used for
optimization, to refine the business process. They can also be used as input to other
systems to enhance performance.
Different Types of Big Data Analytics
Here are the four types of Big Data analytics:
1. Descriptive Analytics: What is happening now based on incoming data.
This summarizes past data into a form that people can easily read. This helps in
creating reports, like a company’s revenue, profit, sales, and so on. Also, it helps in
the tabulation of social media metrics.
Use Case: The Dow Chemical Company analyzed its past data to increase facility
utilization across its office and lab space. Using descriptive analytics, Dow was able
to identify underutilized space. This space consolidation helped the company save
nearly US $4 million annually.
2. Predictive Analytics: What might happen in the future
This type of analytics looks into historical and present data to make predictions
about the future. Predictive analytics uses data mining, AI, and machine learning to
analyze current data and make predictions about the future. It works on predicting
customer trends, market trends, and so on.
Use Case: PayPal determines what kind of precautions they have to take to protect
their clients against fraudulent transactions. Using predictive analytics, the company
uses all the historical payment data and user behavior data and builds an algorithm
that predicts fraudulent activities.
3. Prescriptive Analytics: What action should be taken
This type of analytics prescribes the action to take in a given situation, building on
the outputs of descriptive and predictive analytics.
Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type
of analytics is used to build an algorithm that will automatically adjust the flight fares
based on numerous factors, including customer demand, weather, destination, holiday
seasons, and oil prices.
4. Diagnostic Analytics: Why did it happen
This is done to understand what caused a problem in the first place. Techniques like
drill-down, data mining, and data recovery are all examples. Organizations use
diagnostic analytics because it provides in-depth insight into a particular problem.
Use Case: An e-commerce company’s report shows that their sales have gone down,
although customers are adding products to their carts. This can be due to various
reasons like the form didn’t load correctly, the shipping fee is too high, or there are
not enough payment options available. This is where you can use diagnostic analytics
to find the reason.
Why is big data analytics important?
In today’s world, Big Data analytics is fueling everything we do online—in every
industry.
Take the music streaming platform Spotify for example. The company has nearly 96
million users that generate a tremendous amount of data every day. Through this
information, the cloud-based platform automatically generates suggested songs—
through a smart recommendation engine—based on likes, shares, search history, and
more. What enables this is the techniques, tools, and frameworks that are a result of
Big Data analytics.
If you are a Spotify user, then you must have come across the top recommendations
section, which is based on your likes, past listening history, and other factors. This
works by utilizing a recommendation engine that leverages data filtering tools to
collect data and then filter it through algorithms (a small sketch of this idea
follows below). This is what Spotify does.
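A minimal sketch of this collect-then-filter idea is given below; the users, songs, and the simple Jaccard-similarity approach are invented for illustration and are not Spotify's actual algorithm.

likes = {
    "user_a": {"song1", "song2", "song3"},
    "user_b": {"song2", "song3", "song4"},
    "user_c": {"song7"},
}

def jaccard(a, b):
    """Similarity between two users' liked-song sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target, likes, top_n=3):
    """Score songs liked by similar users that the target has not heard yet."""
    scores = {}
    for other, songs in likes.items():
        if other == target:
            continue
        sim = jaccard(likes[target], songs)
        if sim == 0:
            continue
        for song in songs - likes[target]:
            scores[song] = scores.get(song, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("user_a", likes))  # ['song4']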
Today, Big Data analytics has become an essential tool for organizations of all sizes
across a wide range of industries. By harnessing the power of Big Data, organizations
are able to gain insights into their customers, their businesses, and the world around
them that were simply not possible before.
As the field of Big Data analytics continues to evolve, we can expect to see even more
amazing and transformative applications of this technology in the years to come.
1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to
identify fraudulent activities and discrepancies. The organization leverages it to
narrow down a list of suspects or root causes of problems.
2. Product Development and Innovations
Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and
armed forces across the globe, uses Big Data analytics to analyze how efficient the
engine designs are and if there is any need for improvements.
3. Quicker and Better Decision Making Within Organizations
Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example,
the company leverages it to decide if a particular location would be suitable for a new
outlet or not. They will analyze several different factors, such as population,
demographics, accessibility of the location, and more.
4. Improve Customer Experience
Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences.
They monitor tweets to find out their customers’ experience regarding their journeys,
delays, and so on. The airline identifies negative tweets and does what’s necessary to
remedy the situation. By publicly addressing these issues and offering solutions, it
helps the airline build good customer relations.
Conventional Systems.
Big data is a huge amount of data that is beyond the processing capacity of
conventional database systems to manage and analyze within a specific time interval.
Conventional computing functions logically with a set of rules and calculations, while
neural computing can function via images, pictures, and concepts.
Conventional computing is often unable to manage the variability of data obtained in
the real world.
On the other hand, neural computing, like our own brains, is well suited to situations
that have no clear algorithmic solution and can manage noisy, imprecise data. This
allows it to excel in those areas that conventional computing often finds difficult.
Big Data analysis vs. conventional data analysis:
• Big Data analysis is used for reporting, basic analysis, and text mining; advanced
analytics on big data is only in a starting stage. Conventional data analysis is used
for reporting, advanced analysis, and predictive modeling.
• Big Data analysis needs both programming skills (such as Java) and analytical skills
to perform analysis. For conventional data, analytical skills are sufficient, since
advanced analysis tools do not require expert programming skills.
The following challenges have been dominant for conventional systems in real-time
scenarios:
1) Because big data is continuously expanding, new companies and technologies are
being developed every day. A big challenge for companies is to find out which
technology works best for them without introducing new risks and problems.
2) The talent gap that exists in the industry: While Big Data is a growing field,
there are very few experts available in this field. This is because Big Data is a
complex field, and people who understand its complexity and intricate nature are few
and far between.
3) Getting data into the big data platform: Data is increasing every single day. This
means that companies have to tackle a limitless amount of data on a regular basis. The
scale and variety of data available today can overwhelm any data practitioner, which
is why it is important to make data accessibility simple and convenient for brand
managers and owners.
Broadly, big data challenges fall into three categories:
1. Data
2. Process
3. Management
1. Data Challenges
Volume
• Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every
day; Facebook, 10 TB.
• Mobile devices play a key role as well: there were an estimated 6 billion mobile
phones in 2011.
• The challenge is how to deal with the size of Big Data.
Variety
• A lot of this data is unstructured, or has a complex structure that is hard to
represent in rows and columns.
2. Process Challenges
• More than 80% of today's information is unstructured, and it is typically too big to
manage effectively.
• Today, companies are looking to leverage a lot more data from a wider variety of
sources, both inside and outside the organization.
• Things like documents, contracts, machine data, sensor data, social media, health
records, emails, etc. The list is really endless.
3. Management Challenges
a) A lot of this data is unstructured, or has a complex structure that is hard to
represent in rows and columns.
b) Visualization helps organizations perform analyses and make decisions much more
rapidly, but the challenge is going through the sheer volumes of data and accessing
the level of detail needed, all at a high speed.
c) The challenge only grows as the degree of granularity increases. One possible
solution is hardware: some vendors are using increased memory and powerful parallel
processing to crunch large volumes of data extremely quickly.
d) Understanding the data: It takes a lot of understanding to get data in the RIGHT
SHAPE so that you can use visualization as part of data analysis.
Visual analytics enables organizations to take raw data and present it in a meaningful
way that generates the most value. However, when used with big data, visualization is
bound to lead to some challenges.
**************
1.3. INTELLIGENT DATA ANALYSIS
Intelligent Data Analysis (IDA) is one of the hot issues in the field of
artificial intelligence and information.
IDA is:
• used for extracting useful information from large quantities of online data, and
extracting desirable knowledge or interesting patterns from existing databases;
• the distillation of information that has been collected, classified, organized,
integrated, abstracted and value-added;
• at a level of abstraction higher than the data and information on which it is based,
and can be used to deduce new information and new knowledge.
Goal:
The goal of intelligent data analysis is to extract useful knowledge; the process
demands a combination of extraction, analysis, conversion, classification,
organization, reasoning, and so on.
1.3.2 Uses / Benefits of IDA
Intelligent Data Analysis provides a forum for the examination of issues related to the research
and applications of Artificial Intelligence techniques in data analysis across a variety of
disciplines and the techniques include (but are not limited to):
Data Visualization
Data pre-processing (fusion, editing, transformation, filtering, sampling)
Data Engineering
Database mining techniques, tools and applications
Use of domain knowledge in data analysis
Big Data applications
Evolutionary algorithms
Machine Learning (ML)
Neural nets
Fuzzy logic
Statistical pattern recognition
Knowledge Filtering and Post-processing
Why IDA?
The multidimensionality of problems calls for methods capable of adequate and deep
data processing and analysis.
Knowledge Acquisition
The process of eliciting, analyzing, transforming, classifying, organizing and integrating
knowledge and representing that knowledge in a form that can be used in a computer
system.
A Rule :
Example of IDA
The cases in this example are labelled by class; for example, class 2 – the patients
were ill (drug treatment, positive clinical and laboratory findings).
Illustration of IDA using See5:
application.names - lists the classes to which cases may belong and the attributes
used to describe each case.
Attributes are of two types: discrete attributes have a value drawn from a set of
possibilities, and continuous attributes have numeric values.
application.data - provides information on the training cases from which See5 will
extract patterns.
The entry for each case consists of one or more lines that give the values for all
attributes.
Goal 1.1 :
application.names – example
gender:M,F
activity:1,2,3
age: continuous
smoking: No, Yes
…
Goal 1.2 :
application.data – example
M,1,59,Yes,0,0,0,0,119,73,103,86,247,87,15979,?,?,?,1,73,2.5
M,1,66,Yes,0,0,0,0,132,81,183,239,?,783,14403,27221,19153,23187,1,73,2.6
M,1,61,No,0,0,0,0,130,79,148,86,209,115,21719,12324,10593,11458,1,74,2.5
… …
Results – example:
Sensitivity = 0.97, Specificity = 0.81
Sensitivity = 0.97, Specificity = 0.81
Sensitivity = 0.98, Specificity = 0.90
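For reference, here is a small sketch of how sensitivity and specificity such as those above are computed from a confusion matrix; the example counts are invented for illustration.

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Example: 97 ill patients correctly flagged, 3 missed;
# 81 healthy patients correctly cleared, 19 falsely flagged.
sens, spec = sensitivity_specificity(tp=97, fn=3, tn=81, fp=19)
print(f"Sensitivity={sens:.2f}, Specificity={spec:.2f}")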
1.4.1 INTRODUCTION
Data
Properties of Data
To examine the properties of data, we can refer to the various definitions of data.
These definitions reveal that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
a) Amenability of use: From the dictionary meaning of data it is learnt that data are facts
used in deciding something. In short, data are meant to be used as a base for arriving at
definitive conclusions.
b) Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.
c) Accuracy: Data should be real, complete and accurate. Accuracy is thus, an essential
property of data.
d) Essence: Large quantities of data are collected, and they have to be compressed and
refined. Data so refined can present the essence, or derived qualitative value, of the
matter.
e) Aggregation: Aggregation is cumulating or adding up.
f) Compression: Large amounts of data are always compressed to make them more
meaningful and to bring them to a manageable size. Graphs and charts are some examples
of compressed data.
g) Refinement: Data require processing or refinement. When refined, they are capable of
leading to conclusions or even generalizations. Conclusions can be drawn only when
data are processed or refined.
In order to understand the nature of data it is necessary to categorize them into various
types.
Different categorizations of data are possible.
The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
Within each of these fields, there may be several ways in which data can be categorized into
types.
A common categorization, based on levels of measurement, identifies four types of data:
Nominal
Ordinal
Interval
Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
The distinction between the four types of scales centers on three characteristics: the
order of values, the distance (interval) between values, and the presence of a true
zero point.
We can convert or transform our data from ratio to interval to ordinal to nominal.
However, we cannot convert or transform our data from nominal to ordinal to interval
to ratio (a short sketch of such a downward conversion appears after the examples
below).
Example (ordinal): 1st Place, 2nd Place, 3rd Place
Example (interval and ratio): 60 degrees, 12.5 feet, 80 miles per hour
In this case, 93% of all hospitals have lower patient satisfaction scores than
Eastridge Hospital, and 31% have lower satisfaction scores than Westridge Hospital.
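A minimal sketch of the permissible downward conversion mentioned above (ratio to ordinal), assuming pandas is available; the speed values and the category boundaries are invented.

import pandas as pd

speeds_mph = pd.Series([12.0, 35.0, 58.0, 80.0])  # ratio data (true zero)

# Bin the ratio data into ordered categories (ordinal data). The reverse
# conversion is impossible: the labels cannot recover the original values.
speed_levels = pd.cut(
    speeds_mph,
    bins=[0, 30, 60, 120],
    labels=["slow", "moderate", "fast"],
    ordered=True,
)
print(speed_levels.tolist())  # ['slow', 'moderate', 'moderate', 'fast']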
Thus, the nature of data and its value have a great influence on the insights that can
be drawn from it.
***********************
1.5 ANALYTIC PROCESS AND TOOLS
The analytic process typically proceeds through the following steps:
1. Business Understanding
2. Data Exploration
3. Data Preparation
4. Data Modeling
5. Data Evaluation
6. Deployment
• Business Understanding
– In this phase, the business problem is understood and framed, along with the goals
and requirements of the analysis.
• Data Exploration
– For the further process, we need to gather initial data, describe and explore the
data, and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its application and
the need for the project in this phase.
• Data Preparation
– We need to select data as per the need, clean it, and construct it to get useful
information.
– Data is selected, cleaned, and integrated into the format finalized for the analysis
in this phase.
• Data Modeling and Evaluation (a minimal sketch of this step follows below)
– We need to select a modeling technique, generate a test design, build a model, and
assess the model built.
– The data model is built to analyze relationships between the various selected
objects in the data; test cases are built for assessing the model, and the model is
tested and implemented on the data in this phase.
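A minimal sketch of the modeling and evaluation step, using scikit-learn; the synthetic dataset and the choice of logistic regression are illustrative assumptions, not something prescribed by these notes.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Prepared (cleaned, integrated) data stands in for the output of earlier phases.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Generate the test design: hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Build the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Assess the model built.
predictions = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, predictions), 3))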
Thus, the BDA tools are used throughout the development of BDA applications.
******************
What is Analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
• What is Reporting ?
• Reporting is
– “the process of organizing data
– into informational summaries
– in order to monitor how different areas of a business are performing.”
Analysis vs. Reporting:
• Analysis provides what is needed; reporting provides what is asked for.
• Analysis is typically customized; reporting is typically standardized.
• Analysis involves a person; reporting does not involve a person.
• Analysis is extremely flexible; reporting is fairly inflexible.
Reports are like robots: they monitor and alert you. Analysis is like a parent: it can
figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear
infection, etc.).
Reporting and analysis can go hand in hand:
Reporting provides little or no context about what is happening in the data. Context
is critical to good analysis.
Reporting translates raw data into information.
Reporting usually raises the question – What is happening?
Analysis transforms the data into insights – Why is it happening? What can you do
about it?
Thus, analysis and reporting complement each other, and each is needed and useful in
its own context.
*****************
1.7 MODERN ANALYTIC TOOLS
b) Apache Flink
• Apache Flink is
– an open-source platform,
– a streaming dataflow engine that provides data distribution, communication, and
fault tolerance for computations over data streams.
– Flink is a top-level Apache project and a scalable data analytics framework that is
fully compatible with Hadoop.
– Flink can execute both stream processing and batch processing easily.
– Flink was designed as an alternative to MapReduce.
c) Kinesis
– Kinesis is an out-of-the-box streaming data tool.
– Kinesis comprises shards, which Kafka calls partitions.
– For organizations that take advantage of real-time or near-real-time access to large
stores of data, Amazon Kinesis is great.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data, followed by loading the
aggregate data into a data warehouse (a simple aggregation of this kind is sketched
below).
– Data is put into Kinesis streams, which ensures durability and elasticity.
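To illustrate the kind of real-time aggregation described above without relying on the Kinesis or Kafka client APIs, here is a library-free sketch in which a simulated stream of click events is aggregated per fixed time window before it would be loaded into a warehouse; the events and field names are invented.

from collections import defaultdict

WINDOW_SECONDS = 60

# Simulated stream records: (epoch_seconds, page, clicks)
events = [
    (1000, "/home", 1),
    (1010, "/cart", 2),
    (1065, "/home", 1),
    (1070, "/home", 3),
]

aggregates = defaultdict(int)
for timestamp, page, clicks in events:
    # Assign each event to the start of its tumbling window.
    window_start = timestamp - (timestamp % WINDOW_SECONDS)
    aggregates[(window_start, page)] += clicks

# In a real pipeline, each completed window would be written to the warehouse.
for (window_start, page), total in sorted(aggregates.items()):
    print(f"window={window_start} page={page} clicks={total}")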
a) Google’s Dremel
• Google proposed an interactive analysis system in 2010 and named it Dremel,
– which is scalable for processing nested data.
• Dremel provides
– a very fast SQL-like interface to the data, using a different technique than
MapReduce.
• Dremel has a very different architecture compared with the well-known Apache Hadoop,
and acts as a successful complement to MapReduce-based computations.
b) Apache Drill
• Apache Drill is an open-source SQL query engine for Big Data exploration.
– It is similar to Google’s Dremel.
• Drill offers more flexibility to support
– various query languages,
– data formats, and
– data sources.
• Drill is designed from the ground up to
– support high-performance analysis on the semi-structured and
– rapidly evolving data coming from modern Big Data applications.
• Drill provides plug-and-play integration with existing Apache Hive and Apache
HBase deployments.
a) MapReduce Model
Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters.
OSDI 2004.
b) Apache Hadoop (2005)
Apache Hadoop YARN: Yet Another Resource Negotiator, SOCC 2013.
Key Features of the MapReduce Model
• Designed for clouds
– Large clusters of commodity machines
• Designed for big data
– Support from a local-disk-based distributed file system (GFS / HDFS)
– Disk-based intermediate data transfer in shuffling
• MapReduce programming model
– Computation pattern: Map tasks and Reduce tasks
– Data abstraction: key-value pairs (a minimal word-count sketch follows this list)
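Below is a minimal, single-process sketch of the MapReduce computation pattern (Map tasks, a shuffle of key-value pairs, Reduce tasks), using word count as the classic example; real frameworks such as Hadoop distribute these phases across a cluster, so this only illustrates the model.

from collections import defaultdict

documents = ["big data needs big tools", "map and reduce tasks process data"]

# Map phase: each document is turned into (key, value) pairs.
def map_task(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group intermediate values by key.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_task(doc):
        grouped[key].append(value)

# Reduce phase: combine the values for each key.
def reduce_task(key, values):
    return key, sum(values)

word_counts = dict(reduce_task(k, v) for k, v in grouped.items())
print(word_counts)  # {'big': 2, 'data': 2, ...}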
• HaLoop
• Efficient iterative data processing on large clusters
• Features:
– Loop-Aware Task Scheduling
– Caching and Indexing for Loop-Invariant Data on local disk.
Graph Model
• Graph processing with the BSP (Bulk Synchronous Parallel) model (a minimal
vertex-centric sketch follows this list)
• Pregel (2010)
– A System for Large-Scale Graph Processing. SIGMOD 2010.
– Apache Hama (2010)
• Apache Giraph (2012)
– Scaling Apache Giraph to a trillion edges
GraphLab (2010)
• GraphLab: A New Parallel Framework for Machine Learning. UAI 2010.
• Distributed GraphLab: A Framework for Machine Learning and Data Mining in the
Cloud.
• Data graph
• Update functions and the scope
• PowerGraph (2012)
– PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.
– Gather, Apply, Scatter (GAS) model
• GraphX (2013)
– A Resilient Distributed Graph System on Spark. GRADES 2013.
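A minimal, single-machine sketch of the vertex-centric BSP style used by Pregel-like systems is shown below: in each superstep, every vertex processes incoming messages, updates its value, and sends messages to its neighbours. The example graph and the max-value propagation task are invented for illustration.

graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}   # undirected adjacency lists
values = {1: 3, 2: 6, 3: 2, 4: 1}                 # initial vertex values

# Superstep 0: every vertex sends its value to its neighbours.
messages = {v: [] for v in graph}
for vertex, neighbours in graph.items():
    for neighbour in neighbours:
        messages[neighbour].append(values[vertex])

active = True
while active:
    active = False
    outbox = {v: [] for v in graph}
    for vertex in graph:
        new_value = max([values[vertex]] + messages[vertex])
        if new_value != values[vertex]:
            # Only vertices whose value changed stay active and send messages.
            values[vertex] = new_value
            for neighbour in graph[vertex]:
                outbox[neighbour].append(new_value)
            active = True
    messages = outbox  # synchronization barrier between supersteps

print(values)  # every vertex converges to the maximum value, 6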
Collective Model
• Harp (2013)
– A Hadoop Plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
– Hierarchical data abstraction on arrays, key-values and graphs for easy
programming expressiveness.
– Collective communication model to support various communication operations
on the data abstractions.
– Caching with buffer management for memory allocation required from
computation and communication
– BSP style parallelism
– Fault tolerance with check-pointing.
Thus, modern analytical tools play an important role in the modern data world.
**********