VIGNESH K NOTES
DS4015 BIG DATA ANALYTICS
UNIT I INTRODUCTION TO BIG DATA 9
Introduction to Big Data Platform – Challenges of Conventional Systems – Intelligent Data Analysis – Nature of Data – Analytic Processes and Tools – Analysis vs Reporting – Modern Data Analytic Tools – Statistical Concepts: Sampling Distributions – Re-Sampling – Statistical Inference – Prediction Error.
What is Big Data?
Big data describes the large volumes of structured and unstructured data that systems and businesses generate every day. However, the quantity of data is not what matters most; what matters is what a firm or organization can do with that data.
Analysis can be performed on big data to extract insights and predictions, which lead to better decisions and more reliable business strategies.
Volume: Organizations gather data from many different sources, including business transactions, social media, login data, sensor readings, and machine-to-machine data. In the past, storing this data would have been a problem, but new technologies such as Apache Hadoop and Apache Spark have greatly reduced the burden of handling such enormous volumes.
Velocity: Data now streams in at exceptional speed and must be dealt with in a timely manner. Sensors, smart metering, user activity, and RFID tags are driving the need to handle floods of data in near real time.
Variety: Data arrives from various systems in diverse types and formats, ranging from structured, numeric data in traditional databases to unstructured content such as text documents, emails, audio and video, stock ticker data, login data, encrypted blockchain data, and financial transactions.
What matters about big data is not how much data there is, but how it can be used. Data can be taken from various sources and analyzed to find answers that enable:
Cost reduction.
Time reduction.
New product development with optimized offerings.
Smarter decision making.
When big data is combined with high-powered analytics, it becomes possible to accomplish business tasks such as:
Calculating risk portfolios in minutes for risk management.
Detecting fraudulent behavior before it has an impact.
Things That Come Under Big Data (Examples of Big Data)
Big data is the clustered management of different forms of data generated by various devices (Android, iOS, etc.), applications (music apps, web apps, game apps, etc.), or actions (searching through a search engine, navigating through similar types of web pages, etc.). Here is a list of some commonly found fields of data that come under the umbrella of big data:
Black Box Data: Data collected from the black boxes of private and government helicopters, airplanes, and jets, including flight crew voice recordings and other flight information.
Transport Data: Vehicle models, capacity, distance from source to destination, and the availability of different vehicles.
Search Engine Data: The wide variety of unprocessed information stored in search engine databases.
Many other types of data are generated in bulk by applications and organizations. Data generated in bulk and at high velocity can be categorized as operational or analytical big data, as described in the next section.
Big Data Technologies
Big data technology is significant for producing more precise analysis, which leads business analysts to more accurate decisions and greater operational efficiency by reducing costs and trade risks. Implementing such analytics over such a wide variety of data requires an infrastructure that can manage and process huge data volumes in real time. Accordingly, big data is classified into two subcategories:
Operational Big Data: systems such as MongoDB, Apache Cassandra, or CouchDB, which offer real-time, operational capabilities for large data workloads.
Analytical Big Data: systems such as MapReduce, BigQuery, Apache Spark, or Massively Parallel Processing (MPP) databases, which offer the analytical capability to run complex analyses on large datasets.
Challenges of Big Data
Rapid Data Growth: Data growing at such a high velocity makes it hard to extract insights from it, and there is no 100% efficient way to filter out the relevant data.
Storage: Generating such massive amounts of data requires storage space, and organizations struggle to handle such extensive data without suitable tools and technologies.
Unreliable Data: There is no guarantee that the big data collected and analyzed is 100% accurate; redundant, contradictory, or incomplete data remains a challenge.
Data Security: Firms and organizations storing massive amounts of user data are targets for cybercriminals, so there is a risk of data theft. Encrypting such colossal volumes of data is itself a challenge.
Challenges of Conventional Systems
Big data has revolutionized the way businesses operate, but it has also presented a number of challenges for conventional systems. Many organizations have already started using big data analytics tools because they realize how much potential there is in utilizing these systems effectively. However, while there are many benefits associated with such systems, including faster processing times and increased accuracy, there are also challenges involved in implementing them correctly. The main challenges faced by conventional systems are:
Scalability
Speed
Storage
Data Integration
Security
Scalability
A common problem with conventional systems is that they cannot scale. As the amount of data increases, so does the time it takes to process and store it. This causes bottlenecks and system crashes, which is far from ideal for businesses that need to make quick decisions based on their data.
Conventional systems also lack flexibility in how they handle new types of information. For example, adding another column (columns are like fields) or row (rows are like records) can require rewriting your code from scratch.
Speed
Speed is a critical component of any data processing system. Speed matters because it allows you to process and analyze your data faster, which means you can make better-informed decisions more quickly.
Storage
The amount of data being created and stored is growing exponentially, with estimates that it would reach 44 zettabytes by 2020. That is a lot of storage space!
The problem with conventional systems is that they do not scale well as more data is added. This leads to large amounts of wasted storage space and information lost to corruption or security breaches.
Data Integration
The challenges conventional systems face with big data are numerous. Data integration is one of the biggest, because combining different sources into a single database requires a lot of time and effort. This is especially true when integrating data from multiple sources with different schemas and formats.
Another challenge is errors and inaccuracies in analysis due to a lack of understanding of what exactly happened during an event or transaction. For example, if there was an error while transferring money from one bank account to another, there would be no way for us to know what actually happened unless someone told us about it later (which may not happen).
Security
Security is a major concern for organizations as they process and store their data. Traditional databases are designed to be accessed by trusted users within an organization, but this makes it difficult to ensure that only authorized people have access to sensitive information.
Security measures such as firewalls, passwords, and encryption help protect against unauthorized access and against attacks by hackers who want to steal data or disrupt operations. But these measures have limitations: they are expensive; they require constant monitoring and maintenance; they can slow down performance if implemented too extensively; and they often do not prevent breaches altogether, because there is always some way around them (such as phishing emails).
Conventional systems are not equipped for big data. They were designed for a different era, when the volume of information was much smaller and more manageable. Now that we are dealing with huge amounts of data, conventional systems are struggling to keep up. Conventional systems are also expensive and time-consuming to maintain, at a time when data is being generated faster than ever before.
What is Intelligent Data Analysis?
Intelligent data analysis refers to the use of analysis, classification, conversion, extraction, organization, and reasoning methods to extract useful knowledge from data. The process generally consists of a data preparation stage, a data mining stage, and a result validation and explanation stage. Data preparation involves integrating the required data into a dataset that will be used for data mining; data mining involves examining large databases in order to generate new information; result validation involves verifying the patterns produced by the data mining algorithms; and result explanation involves communicating the results intuitively.
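A minimal sketch of these three stages in Python, assuming a hypothetical CSV file sales.csv with numeric feature columns and a target column churn; the file name, column names, and choice of model are illustrative assumptions, not part of the original notes.
Python3
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 1. Data preparation: integrate and clean the required data into one dataset
df = pd.read_csv("sales.csv").dropna()              # hypothetical file
X, y = df.drop(columns=["churn"]), df["churn"]      # hypothetical target column; features assumed numeric

# 2. Data mining: fit a model that extracts patterns from the prepared data
model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# 3. Result validation and explanation: verify the mined patterns and report them
scores = cross_val_score(model, X, y, cv=5)         # validate the patterns on held-out folds
print("Cross-validated accuracy:", scores.mean())
print("Feature importances:", dict(zip(X.columns, model.feature_importances_)))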
The Nature of Data
That's a pretty broad title, but, really, what we're talking about here are some fundamentally different ways to treat data as we work with it. This topic can seem academic, but it is relevant for web analysts specifically and for researchers broadly, and it turns out to be pretty darn important when it comes time to apply statistical methods to the data.
So, we have to start with the basics: the nature of data. There are four types of data:
Nominal
Ordinal
Interval
Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be performed. The distinction between the four types of scales centers on three different characteristics: order, distance, and the existence of a true zero.
Nominal Scales
Nominal scales have the following characteristics:
Order: The order of the responses or observations does not matter.
Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the same as between a 2 and a 3.
True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Consider traffic source (or last touch channel) as an example, in which visitors reach our site through a mutually exclusive set of channels:
1. Paid Search
2. Organic Search
3. Email
4. Display
(This list looks artificially short, but the logic and interpretation would remain the same for nine channels or for 99 channels.)
If we want to know only that each channel is somehow different, then we could count the number of visits from each channel:
Channel          Count of Visits
Paid Search      2,143
Organic Search   3,124
Email            1,254
Display          2,077
With nominal data, the order of the four channels would not change or alter the interpretation. Suppose we, instead, listed the channels in a different order:
Channel          Count of Visits
Display          2,077
Paid Search      2,143
Email            1,254
Organic Search   3,124
And the distance between the categories is not relevant; the category labels carry no arithmetic meaning, so we cannot say one channel is a multiple or a fraction of another based on its position in the list. While there is an arithmetic relationship between the visit counts themselves, that is only relevant if we treat them as ratio scales (see the Ratio Scales section below).
Finally, zero holds no meaning. We could not interpret a zero because it does not occur in a nominal scale.
Ordinal Scales
At the risk of providing a tautological definition, ordinal scales measure, well, order. So, our characteristics for ordinal scales are:
Order: The order of the responses or observations matters.
Distance: Ordinal scales do not hold distance. The distance between first and second is unknown, as is the distance between first and third, and so on for all observations.
True Zero: There is no true or real zero. An item, observation, or category cannot finish in zeroth place.
Let's work through our traffic source example and rank the channels based on the number of visits to our site, with "1" being the highest number of visits:
Channel          Rank
Organic Search   1
Paid Search      2
Display          3
Email            4
Again, for this example we are limiting ourselves to four channels, but the logic would remain the same for nine or 99 channels.
By ranking the channels from most to least visits in terms of last point of contact, we've established an order.
However, the distance between the rankings is unknown. Organic Search could have one more visit than Paid Search or one hundred more visits; the distance between the two items is unknown.
Finally, zero holds no meaning. We could not interpret a zero because it does not occur in an ordinal scale; an item cannot finish in "zeroth" place.
Interval Scales
Interval scales provide insight into the variability of the observations or data. Classic interval scales are Likert scales (e.g., 1 - strongly agree to 9 - strongly disagree) and semantic differential scales (e.g., 1 - dark to 9 - light). With an interval scale, users could respond to a statement such as "I enjoy opening links to the website from a company email" on a 1-to-9 scale.
The characteristics of interval scales are:
Order: The order of the responses or observations does matter.
Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the same as the distance from 4 to 5, so we can perform arithmetic operations on the data.
True Zero: There is no true zero with interval scales. However, the data can be rescaled in a manner that contains zero. An interval scale measured from 1 to 9 remains the same as 11 to 19, because we added 10 to all values. Similarly, a 1-to-9 interval scale is the same as a -4-to-4 scale, because we subtracted 5 from all values. Although the new scale contains zero, zero remains uninterpretable because it only reflects how the scale was shifted.
Unless a web analyst is working with survey data, it is doubtful he or she will encounter data from an interval scale. More likely, a web analyst will deal with ratio scales (next section).
Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard deviation (and variance).
An argument exists about temperature: is it an interval scale or an ordinal scale? Many researchers argue for temperature as an interval scale. It offers order (212°F is hotter than 32°F), distance (40°F to 44°F is the same interval as 100°F to 104°F), and lacks a true zero (0°F is not the same as 0°C). However, other researchers argue for temperature as an ordinal scale because of the issue related to distance: 200°F is not twice as hot as 100°F; the human brain registers both temperatures simply as very hot. Likewise, we would not say that 80°F is twice as warm as 40°F, or that 30°F is a third as cold as 90°F.
Ratio Scales
Ratio scales are interval scales with a true zero. They have the following characteristics:
Order: The order of the responses or observations matters.
Distance: Ratio scales do have an interpretable distance.
True Zero: There is a true, meaningful zero.
Income is a classic example of a ratio scale:
Order is established. We would all prefer $100 to $1!
Zero dollars means we have no income (or, in accounting terms, our revenue exactly equals our expenses!).
Distance is interpretable, in that $20 is twice $10 and $50 is half of $100.
In web analytics, the number of visits and the number of goal completions serve as examples of ratio scales. A thousand visits is a third of 3,000 visits, while 400 goal completions are twice as many as 200 goal completions. Zero visits or zero goal completions should be interpreted as just that: no visits or no completed goals.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard deviation (and variance).
An Important Note: Don't let the term "ratio" trip you up. Laypeople (aka "non-statisticians") are taught that ratios represent a relationship between two numbers; for instance, conversion rate is the "ratio" of orders to visits. But, as illustrated above, that is an overly narrow definition when it comes to statistics.
Summary Cheat Sheet
The table below summarizes the characteristics of all four types of scales.
Characteristic              Nominal   Ordinal   Interval   Ratio
Order Matters               No        Yes       Yes        Yes
Distance Is Interpretable   No        No        Yes        Yes
Zero Exists                 No        No        No         Yes
Transformation
Did you notice that we used channel for three of our four examples? And, for all three, the underlying metric was "visits." What that means is that any given variable isn't inherently a single type of data (type of scale); it depends on how the data is being used.
It also means that some types of scales can be transformed into other types of scales. We can convert or transform our data from ratio to interval to ordinal to nominal. However, we cannot convert or transform our data in the opposite direction.
Put another way, take a look at the cheat sheet above. If you have data using one scale, you can change a "Yes" to a "No" (and, thus, change the type of scale), but you cannot change a "No" to a "Yes."
Pause here to take an aspirin as needed, should your head be starting to hurt.
As an example, let's say our website receives 10,000 visits in a month. That figure, 10,000 visits, is a ratio scale. I could convert it to the number of visits in each week of that month (let's pick February 2015 as our month, as the first of the month fell on a Sunday and there were exactly 4 weeks in the month). For example:
Week 4 had 4,000 visits
We could treat these numbers as interval; specifically, as equal-width intervals. However, there is little reason, conceptually or managerially, to treat these numbers as interval, so let's move on.
We could instead rank the weeks based on the number of visits, which would transform the data to an ordinal scale:
Week 4
Week 2
Week 1
Week 3
Finally, we could group week 2 and week 4 into "heavy traffic" weeks and group week 1 and week 3 into "light traffic" weeks, and we would have created a nominal scale. The order heavy-light or light-heavy would not matter, provided we remember the coding.
We started with a ratio scale that we ultimately transformed into a nominal scale. As we did so, we lost a lot of information. But, by transforming this data, we can use different analytical tools to answer different types of questions.
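A quick sketch of these transformations in pandas. Week 4's 4,000 visits comes from the example above; the Week 1-3 figures are made-up assumptions that simply sum to the stated 10,000 monthly visits.
Python3
import pandas as pd

# ratio scale: weekly visit counts (Week 1-3 figures are illustrative)
visits = pd.Series({"Week 1": 2200, "Week 2": 2600, "Week 3": 1200, "Week 4": 4000})

# ordinal scale: rank the weeks by visits (1 = most visits)
ranks = visits.rank(ascending=False).astype(int)

# nominal scale: group the top two weeks as "heavy" traffic and the rest as "light"
labels = ranks.map(lambda r: "heavy" if r <= 2 else "light")

print(pd.DataFrame({"visits": visits, "rank": ranks, "traffic": labels}))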
Modern Data Analytic Tools
As we grow with the pace of technology, the demand to track data is increasing rapidly. Today, almost 2.5 quintillion bytes of data are generated globally, and that data is useless until it is organized into a proper structure. It has become crucial for businesses to maintain consistency by collecting meaningful data from the market, and all it takes is the right data analytics tool and a professional data analyst to organize the huge amount of raw data so that a company can take the right approach.
There are hundreds of data analytics tools in the market today, but selecting the right tool depends upon your business need, goals, and the variety of data, so as to take the business in the right direction. Now, let's check out the top ten big data analytics tools.
Big data is the storage and analysis of large data sets. These are complex data sets that can be either structured or unstructured. They are so large that it is not possible to work on them with traditional analytical tools. These days, organizations are realising the value they get out of big data analytics, and hence they are deploying big data tools and processes to bring more efficiency into their work environment. They are willing to hire good big data analytics professionals at good salaries. To become a big data analyst, you should first get acquainted with big data and get certified by enrolling in online analytics courses.
1. APACHE Hadoop
Hadoop is a Java-based open-source platform used to store and process big data. It is built on a cluster system that allows data to be stored and processed efficiently in parallel across nodes. It can process both structured and unstructured data and scale from one server to multiple computers. Hadoop also offers cross-platform support for its users. Today it is one of the most widely used big data analytics tools, popular with many tech giants such as Amazon, Microsoft, and IBM.
Features of Apache Hadoop:
Offers quick access via HDFS (Hadoop Distributed File System).
Free to use and offers an efficient storage solution for businesses.
Highly flexible and can be easily integrated with MySQL and JSON.
Highly scalable, as it can distribute a large amount of data in small segments.
Works on commodity hardware such as JBOD (just a bunch of disks).
2. Cassandra
APACHE Cassandra is an open-source NoSQL distributed database used to handle large amounts of data. It is one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance. It can deliver thousands of operations every second and handle petabytes of data with almost zero downtime. It was created by Facebook in 2008 and released publicly.
Features of APACHE Cassandra:
Data Distribution: Data is easy to distribute by replicating it across multiple data centers.
Fast Processing: Cassandra is designed to run on efficient commodity hardware and offers fast storage and data processing.
Fault Tolerance: If any node fails, it is replaced without delay.
3. Qubole
Qubole is an open-source big data tool that helps fetch data along the value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers end-to-end service, reducing the time and effort required to move data pipelines. It can be configured across multi-cloud services such as AWS, Azure, and Google Cloud, and it also helps lower cloud computing costs by up to 50%.
Features of Qubole:
Supports the ETL process: it allows companies to migrate data from multiple sources into one place.
Real-time insight: it monitors users' systems and lets them view insights in real time.
Predictive analysis: Qubole offers predictive analysis so that companies can take action to target more acquisitions.
Advanced security system: to protect users' data in the cloud, Qubole uses an advanced security system, works to prevent future breaches, and allows cloud data to be encrypted against potential threats.
4. Xplenty
Xplenty is a data analytics tool for building data pipelines with minimal code. It offers a wide range of solutions for sales, marketing, and support. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, phone, and virtual meetings. Xplenty is a platform for processing data for analytics over the cloud and brings all the data together.
Features of Xplenty:
REST API: a user can do almost anything by using the REST API.
Flexibility: data can be sent and pulled to and from databases, warehouses, and Salesforce.
Data security: it offers SSL/TLS encryption, and the platform is capable of verifying algorithms.
5. Spark
APACHE Spark is another framework used to process data and perform numerous tasks on a large scale. It is also used to process data across multiple computers with the help of distributed computing tools. It is widely used among data analysts as it offers easy-to-use APIs with simple data-pulling methods, and it is capable of handling multiple petabytes of data. Spark famously set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record of 71 minutes. This is why big tech giants are moving towards Spark, and why it is highly suitable for ML and AI today.
Features of APACHE Spark:
Ease of use: it allows users to work in their preferred language (Java, Python, etc.).
Real-time processing: Spark can handle real-time streaming via Spark Streaming.
Flexible: it can run on Mesos, Kubernetes, or the cloud.
6. MongoDB
MongoDB, which came into the limelight in 2010, is a free, open-source, document-oriented (NoSQL) database used to store high volumes of data. It uses collections and documents for storage, and its documents consist of key-value pairs, which are the basic unit of MongoDB. It is popular among developers because of its support for multiple programming languages such as Python, JavaScript, and Ruby.
Features of MongoDB:
Written in C++: it is a schema-less database and can hold a variety of documents.
Simplifies the stack: with MongoDB, a user can easily store files without disturbing the stack.
Master-slave replication: it can write/read data from the master, which can be used for backup.
7. APACHE Storm
Storm is a robust, user-friendly tool used for data analytics, especially in small companies. The best part about Storm is that it has no programming-language barrier and can support any of them. It was designed to handle pools of large data with fault tolerance and horizontal scalability. When we talk about real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, which is why many tech giants use APACHE Storm in their systems. Some of its features are:
Data processing: Storm processes data even if a node gets disconnected.
Highly scalable: it maintains performance even as the load increases.
Fast: the speed of APACHE Storm is impeccable; it can process up to one million messages of 100 bytes each on a single node.
8. SAS
Today SAS is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. SAS (Statistical Analysis System) allows a user to access data in any format (SAS tables or Excel worksheets). It also offers a cloud platform for business analytics called SAS Viya, and to build a strong grip on AI and ML, SAS has introduced new tools and products.
Features of SAS:
Flexible programming language: it offers easy-to-learn syntax and vast libraries, which make it suitable for non-programmers.
Vast data format support: it provides support for many programming languages, including SQL, and can read data from any format.
Encryption: it provides end-to-end security with a feature called SAS/SECURE.
9. Datapine
Datapine is an analytical tool used for business intelligence (BI) and was founded in 2012 in Berlin, Germany. In a short period of time it has gained popularity in a number of countries, and it is mainly used for data extraction, particularly by small and medium companies fetching data for close monitoring. With the help of its enhanced UI design, anyone can check the data as per their requirements. It is offered in four different price brackets, starting from $249 per month, and it offers dashboards by function, industry, and platform.
Features of Datapine:
Automation: to cut down on manual work, Datapine offers a wide array of AI-assistant and BI tools.
Predictive tool: Datapine provides forecasting/predictive analytics, deriving future outcomes from historical and current data.
Add-ons: it also offers intuitive widgets, visual analytics and discovery, ad hoc reporting, etc.
10. Rapid Miner
Rapid Miner is a fully automated visual workflow design tool used for data analytics. It is a no-code platform, so users are not required to write code to organize data. Today it is heavily used in many industries such as ed-tech, training, and research. Although it is an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With Rapid Miner, one can easily deploy ML models to the web or mobile (once the user interface is ready to collect real-time figures).
Features of Rapid Miner:
Data validation: Rapid Miner enables the visual display of multiple results in history for better evaluation.
Conclusion
Big data has been in the limelight for the past few years and will continue to dominate the market in almost every sector and for every market size. The demand for big data is booming at an enormous rate, and ample tools are available in the market today; all you need is the right approach and to choose the best data analytics tool as per the project's requirements.
Analysis vs Reporting
The terms reporting and analytics are often used interchangeably. This is not surprising, since both take in data as "input," which is then processed and presented in the form of charts, graphs, or dashboards.
Reports and analytics both help businesses improve operational efficiency and productivity, but in different ways. While reports explain what is happening, analytics helps identify why it is happening. Reporting summarizes and organizes data in easily digestible ways, while analytics enables questioning and exploring that data further. Analytics provides invaluable insights into trends and helps create strategies to improve operations, customer satisfaction, growth, and other business metrics.
Reporting and analysis are both important for an organization to make informed decisions by presenting data in a format that is easy to understand. In reporting, data is brought together from different sources and presented in an easy-to-consume format. Modern reporting applications typically offer next-generation dashboards with high-level data visualization capabilities. Companies generate several types of reports, including financial reports, accounting reports, operational reports, market reports, and more, which help them understand how each function is performing.
Analytics enables business users to cull insights from data, spot trends, and make better decisions. Next-generation analytics takes advantage of emerging technologies like AI, NLP, and machine learning to offer predictive insights based on historical and real-time data.
For instance, consider a manufacturing company that uses Oracle ERP to manage various functions including accounting, financial management, project management, procurement, and supply chain. For business users, it is critical to have a finger on the pulse of all key data. Additionally, specific teams need to periodically generate reports and present data to senior management and other stakeholders. In addition to reporting, it is also essential to analyze data from various sources and gather insights. The problem today is that people use "reporting" and "analytics" interchangeably, so when the time comes to replace an end-of-life operational reporting tool, they choose solutions that are designed for analytics, which is a waste of time and resources.
It is critical that operational reporting is done using a tool built for that purpose. Ideally, it will be a self-service tool so business users do not have to rely on IT to generate reports. It must have the ability to drill down into several layers of data when needed. Additionally, if you are using Oracle ERP, you need an operational reporting tool like Orbit that seamlessly integrates data from various business systems, both on-premise and cloud. In this blog, we look at the nuances of both operational reporting and analytics and why it is critical to have the right tools for the right tasks.
Steps Involved in Building a Report and Preparing Data for Analytics
To build a report, the steps involved broadly include:
Collecting and gathering relevant data
Understanding the data context
Enabling real-time reporting
Using tools for data visualization, trend analysis, deep dives, etc.
One of the key differences between reporting and analytics is that, while a report involves organizing data into summaries, analysis involves inspecting, cleaning, transforming, and modeling these reports to gain insights for a specific purpose.
Knowing the difference between the two is essential to fully benefit from the potential of both without missing out on the key features of either one. Some of the key differences include:
1. Purpose: Reporting involves extracting data from different sources within an organization and monitoring it to understand the performance of the various functions. By linking data from across functions, it helps create a cross-channel view that facilitates comparison and makes the data easy to understand. Analysis is being able to interpret data at a deeper level and provide recommendations for action.
2. The Specifics: Reporting involves activities such as building, consolidating, organizing, configuring, formatting, and summarizing. It requires clean, raw data, and reports may be generated periodically, such as daily, weekly, monthly, quarterly, or yearly. Analytics includes asking questions, examining, comparing, interpreting, and confirming. Enriching the data with big data can also help predict future trends.
3. The Final Output: In the case of reporting, outputs such as canned reports, dashboards, and alerts push information to users. Through analysis, analysts extract answers to business queries and present them in the form of ad hoc responses, insights, recommended actions, or forecasts. Understanding this key difference can help businesses leverage analytics better.
4. People: Reporting consists of repetitive tasks that can be automated, and it is often used by functional business heads who monitor specific business metrics. Analytics requires customization and therefore depends on data analysts and scientists, and it is used by business leaders to make data-driven decisions.
5. Value Proposition: This is like comparing apples to oranges. Reporting and analytics each serve a different purpose; by understanding those purposes and using them correctly, businesses can derive immense value from both.
Orbit for both Reporting and Analytics
Orbit Reporting and Analytics is a single tool that can be used both for generating different reports and for running analytics to meet business objectives. It can work in multi-cloud environments, extracting data from cloud and on-premise systems and presenting it in whatever ways the user requires. It enables self-service, allowing business users to generate their own reports in real time without depending on the IT team. It complies with security and privacy requirements by allowing access only to authorized users, and it also allows users to generate reports in Excel in real time.
It also facilitates analytics, enabling businesses to draw insights and convert them into actions to predict future trends, identify areas of improvement across functions, and meet the organizational goal of growth.
Other Popular Data Analytics Tools
1. Apache Hadoop:
1. Apache Hadoop is a big data analytics tool and a Java-based free software framework.
2. It helps in the effective storage of huge amounts of data in a storage unit known as a cluster.
3. It runs in parallel on a cluster and has the ability to process huge data across all nodes in it.
4. Hadoop has a storage system popularly known as the Hadoop Distributed File System (HDFS), which splits large volumes of data and distributes them across the many nodes present in a cluster.
2. KNIME:
1. The KNIME analytics platform is one of the leading open solutions for data-driven innovation.
2. This tool helps discover the potential hidden in huge volumes of data, mine for fresh insights, and predict new futures.
3. OpenRefine:
1. OpenRefine is one of the most efficient tools for working with messy, large volumes of data.
2. It includes cleansing data and transforming it from one format to another.
3. It helps explore large data sets easily.
4. Orange:
1. Orange is famous for open-source data visualization and helps with data analysis for beginners as well as experts.
2. This tool provides interactive workflows with a large toolbox option, which helps in the analysis and visualization of data.
5. RapidMiner:
1. The RapidMiner tool operates using visual programming and is capable of manipulating, analyzing, and modelling data.
2. RapidMiner makes data science teams more productive by providing an open-source platform for all their jobs, such as machine learning, data preparation, and model deployment.
6. R programming:
1. R is a free, open-source programming language and software environment for statistical computing and graphics.
2. It is used by data miners for developing statistical software and performing data analysis.
3. It has become a highly popular tool for big data in recent years.
7. Datawrapper:
1. Datawrapper is an online data visualization tool for making interactive charts.
8. Tableau:
1. Tableau is another popular big data tool. It is simple and very intuitive to use.
2. It communicates the insights of the data through data visualization.
3. Through Tableau, an analyst can check a hypothesis and explore the data before starting to work on it extensively.
Sampling Distributions
As the data is based on one population at a time, the information gathered is easy to manage and is more reliable as far as obtaining accurate results is concerned. Therefore, the sampling distribution is an effective tool in helping researchers, academicians, financial analysts, market strategists, and others make well-informed and wise decisions.
Sampling distribution refers to studying randomly chosen samples to understand the variation in the outcomes expected to be derived.
Many researchers, academicians, market strategists, etc., go ahead with sampling instead of studying the entire population.
The sampling distribution of the mean, the sampling distribution of proportion, and the T-distribution are three major types of finite-sample distribution.
The central limit theorem states how the distribution remains normal and almost accurate as the sample size increases.
How Does Sampling Distribution Work?
A sampling distribution in statistics represents the probability of varied outcomes when a study is conducted. It is also known as a finite-sample distribution. In the process, users collect samples randomly, but from one chosen population. A population, in statistical terms, is a group of individuals sharing the same attribute from which random samples are collected.
However, the data collected is based not on the whole population but on samples collected from the specific population being studied. Thus, a sample becomes a subset of the chosen population. With a sampling distribution, the samples are studied to determine the probability of various outcomes occurring with respect to certain events. For example, deriving data to understand which adverts attract teenagers would require selecting a population of those aged between 13 and 19 only.
Using the finite-sample distribution, users can calculate the mean, range, standard deviation, mean absolute deviation, variance, and unbiased estimate of the variance of the sample. No matter what purpose users wish to use the collected data for, it helps strategists, statisticians, academicians, and financial analysts make necessary preparations and take relevant actions with respect to the expected outcome.
As soon as users decide to utilize the data for further calculation, the next step is to develop a frequency distribution of the individual sample statistics as calculated through the mean, variance, and other methods. Next, they plot the frequency distribution for each statistic on a graph to represent the variation in the outcome. This representation is shown on the distribution graph.
Influencing Factors
The accuracy of the distribution depends on various factors that influence the results, such as the size of the samples and how they are selected.
Types
The finite-sample distribution can be expressed in various forms. Here is a list of some of its types:
#1 – Sampling Distribution of the Mean
This is the probabilistic spread of all the means of samples of fixed size chosen randomly from a particular population. When the individual means are plotted on a graph, they indicate a normal distribution, and the center of the graph is the mean of the finite-sample distribution, which is also the mean of the population.
#2 – Sampling Distribution of Proportion
This type of finite-sample distribution identifies the proportions of the population. Users select samples and calculate the sample proportion of each, then plot the resulting figures on a graph. The mean of the sample proportions gathered from each sample group signifies the mean proportion of the population as a whole. For example, a vlogger might collect data from a sample group to find out the proportion of it interested in watching upcoming videos.
#3 – T-Distribution
People use this type of distribution when they are not well acquainted with the chosen population or when the sample size is very small. This symmetrical form of distribution fulfills the condition of a standard normal variate. As the sample size increases, the T-distribution comes very close to the normal distribution. Users use it to find the mean of the population, statistical differences, etc.
Significance
This type of distribution plays a vital role in ensuring that the outcome derived accurately represents the entire population. Observing each individual in a population is difficult, so selecting samples from the population at random is an attempt to make sure the study helps us understand the reactions, responses, grievances, or aspirations of the chosen population as effectively as possible.
The method simplifies the path to statistical inference. Moreover, it allows analytical considerations to focus on a single, static distribution rather than the mixed probabilistic spread of each chosen sample unit; this distribution removes the variability present in the individual statistics.
It tells us which outcomes are most likely to happen. In addition, it plays a key role in inferential statistics, making almost accurate inferences through chosen samples that represent the population.
Examples
Example #1
Sarah wants to analyze the number of teens aged 13-18 riding a bicycle in two regions.
Instead of considering every individual aged 13-18 in the two regions, she randomly selects 200 samples from each area.
Here:
The average count of bicycle usage is the sample mean.
Each chosen sample has its own mean, and the distribution of these sample means is the sampling distribution.
She plots the data gathered from the samples on a graph to get a clear view of the finite-sample distribution.
Example #2
Researcher Samuel conducts a study to determine the average weight of 12-year-olds in five different regions. He decides to collect 20 samples from each region. First, he collects 20 samples from region A and finds the mean of those samples. He then repeats the same for regions B, C, D, and E to get a separate representation for each sample population.
The researcher computes the mean of the finite-sample distribution after finding the respective average weight of the 12-year-olds, and he also calculates the standard deviation and variance of the sampling distribution.
The discussion of sampling distributions is incomplete without mention of the central limit theorem, which states that the shape of the distribution depends on the size of the sample.
According to this theorem, increasing the sample size reduces the standard error, keeping the distribution close to normal. When users plot the data on a graph, the shape will be close to a bell curve. In short, the larger and more numerous the samples one studies, the closer the result is to a normal distribution.
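A small simulation of the central limit theorem in Python; the exponential (deliberately non-normal) population and the specific sample sizes are illustrative assumptions.
Python3
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # a skewed, non-normal population (mean ~10)

for n in (5, 30, 200):                                 # increasing sample sizes
    # draw 2,000 random samples of size n and record each sample mean
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    print(f"n={n:3d}  mean of sample means={np.mean(sample_means):.2f}  std error={np.std(sample_means):.2f}")

# The mean of the sample means stays near the population mean, while the spread
# (standard error) shrinks and the histogram of sample means approaches a bell curve as n grows.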
Frequently Asked Questions (FAQs)
What is a sampling distribution?
Also known as the finite-sample distribution, it is the statistical study in which samples are randomly chosen from a population with specific attributes to determine the probability of varied outcomes. The result obtained helps academicians, financial analysts, market strategists, and researchers conclude a study, take relevant actions, and make wiser decisions.
How do you find the mean of the sampling distribution?
The mean of the sampling distribution of the mean is the same as the mean of the population from which the samples are drawn.
Why is the sampling distribution important?
It is important to obtain a graphical representation to understand to what extent the outcome related to an event could vary. In addition, it helps users understand the population they are dealing with. For example, a businessman can figure out how fruitful selling his products or services is likely to be, while financial analysts can compare investment vehicles and determine which has more potential to generate profit.
Re-Sampling
Resampling involves selecting randomized cases, with replacement, from the original data sample in such a manner that each sample drawn has a number of cases similar to the original data sample. Because of the replacement, the samples drawn by resampling can contain repeated cases.
While reading about machine learning and data science we often come across the term imbalanced class distribution, which generally arises when the observations in one of the classes are much higher or lower than in the other classes. Because machine learning algorithms tend to increase accuracy by reducing error, they do not consider the class distribution. This problem is prevalent in examples such as fraud detection, anomaly detection, and facial recognition.
Two common methods of resampling are:
1. Cross-Validation
2. Bootstrapping
Cross-Validation
Cross-validation is used to estimate the test error associated with a model in order to evaluate its performance.
Validation set approach:
This is the most basic approach. It simply involves randomly dividing the dataset into two parts: a training set and a validation set (or hold-out set). The model is fit on the training set, and the fitted model is used to make predictions on the validation set.
Leave-one-out cross-validation (LOOCV):
LOOCV is a better option than the validation set approach. Instead of splitting the dataset into two halves, only one observation is used for validation and the rest are used to fit the model; this is repeated for each observation.
k-fold cross-validation:
This approach involves randomly dividing the set of observations into k folds of nearly equal size. The first fold is treated as a validation set and the model is fit on the remaining folds. The procedure is then repeated k times, with a different fold treated as the validation set each time.
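A brief sketch of these three approaches with scikit-learn; the synthetic classification dataset and the logistic regression model are illustrative assumptions, not part of the original notes.
Python3
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=0)   # synthetic data
model = LogisticRegression(max_iter=1000)

# validation set approach: one random train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print("Hold-out accuracy:", model.fit(X_tr, y_tr).score(X_val, y_val))

# leave-one-out cross-validation: n splits, one observation held out each time
print("LOOCV accuracy   :", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())

# k-fold cross-validation with k = 5
print("5-fold accuracy  :", cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())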
Bootstrapping
The bootstrap is a powerful statistical tool used to quantify the uncertainty of a given model. The real power of the bootstrap is that it can be applied to a wide range of models where the variability is otherwise hard to obtain directly: it works by repeatedly drawing samples, with replacement, from the original data and recomputing the statistic of interest on each resample.
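A minimal bootstrap sketch in NumPy, estimating the standard error and a 95% interval for a sample mean; the data values are made up purely for illustration.
Python3
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=12, size=80)          # illustrative original sample

# draw 5,000 bootstrap resamples (same size as the data, with replacement)
boot_means = [rng.choice(data, size=data.size, replace=True).mean() for _ in range(5000)]

print("Sample mean                 :", round(float(data.mean()), 2))
print("Bootstrap std. error of mean:", round(float(np.std(boot_means)), 3))
print("95% bootstrap interval      :", np.percentile(boot_means, [2.5, 97.5]).round(2))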
Machine learning algorithms tend to produce unsatisfactory classifiers when faced with unbalanced datasets. For example:
Total Observations : 100
Positive Dataset : 90
Negative Dataset : 10
Event rate : 10%
Because such algorithms are driven by overall accuracy, they are biased towards the majority class, which leads to a higher misclassification of the minority class in comparison with the majority class. The performance of a classification algorithm is evaluated with a confusion matrix.
A way to evaluate the results is the confusion matrix, which shows the correct and incorrect predictions for each class. In the first row, the first column indicates how many instances of class "True" were predicted correctly, and the second column how many instances of class "True" were predicted as "False"; the second row shows the same for instances whose actual class is "False".
Therefore, the higher the diagonal values of the confusion matrix, the better the predictions.
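A short illustration with scikit-learn, using small made-up label vectors to show how the diagonal of the confusion matrix captures the correct predictions; the values are assumptions for illustration only.
Python3
from sklearn.metrics import confusion_matrix

# illustrative actual vs. predicted labels for a binary classifier
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[3 1]   row 1: actual "True"  -> 3 predicted True, 1 predicted False
#  [1 5]]  row 2: actual "False" -> 1 predicted True, 5 predicted False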
Handling Approaches:
Random Over-Sampling:
It aims to balance the class distribution by randomly increasing the number of minority class examples by replicating them.
For example:
Positive Dataset : 90
Negative Dataset : 10
Event Rate : 10%
SMOTE:
SMOTE (Synthetic Minority Oversampling Technique) synthesises new minority instances between existing minority instances. It randomly picks a minority class point and calculates the k-nearest neighbours of that particular point; the synthetic points are then added between the chosen point and its neighbours.
Random Under-Sampling:
It aims to balance the class distribution by randomly eliminating majority class examples.
For example:
Total Observations : 100
Positive Dataset : 90
Negative Dataset : 10
Event rate : 10%
We take a 10% sample of the Positive (majority) dataset and combine it with the Negative dataset.
Positive dataset after random under-sampling : 10% of 90 = 9
Total observations after combining with the Negative dataset : 10 + 9 = 19
Tomek Links (Under-Sampling):
When instances of two different classes are very close to each other, we remove the instances of the majority class to increase the space between the two classes. This helps in the classification process.
Cluster-Based Over-Sampling:
The K-means clustering algorithm is applied independently to the instances of each class so as to identify clusters in the dataset. All clusters are then oversampled so that clusters of the same class have an equal number of observations. For example:
Positive Dataset : 90
Negative Dataset : 10
Event Rate : 10%
Majority Class Clusters:
Cluster 2: 30 Observations
Cluster 3: 12 Observations
Cluster 4: 18 Observations
Cluster 5: 10 Observations
Minority Class Cluster:
Cluster 1: 8 Observations
Cluster 2: 12 Observations
After oversampling, all clusters of the same class have the same number of observations.
Majority Class Clusters:
Cluster 1: 20 Observations
Cluster 2: 20 Observations
Cluster 3: 20 Observations
Cluster 4: 20 Observations
Cluster 5: 20 Observations
Minority Class Clusters:
Cluster 1: 15 Observations
Cluster 2: 15 Observations
Below is an implementation of some of these resampling techniques, using the credit card fraud dataset (creditcard.csv).
Python3
# importing libraries
import pandas as pd
import numpy as np

Python3
# load the dataset (the path below is the author's local path)
dataset = pd.read_csv(r'C:\Users\Abhishek\Desktop\creditcard.csv')

# feature matrix and target used by the resampling examples below
# (the 'Class' column marks fraud: 0 = not fraud, 1 = fraud)
X_data = dataset.drop('Class', axis=1)
Y_data = dataset['Class']
# show the class imbalance in the original data
print('Class 0 (No Fraud) :', round(dataset['Class'].value_counts()[0] / len(dataset) * 100, 2), '% of the dataset')
print('Class 1 (Fraud)    :', round(dataset['Class'].value_counts()[1] / len(dataset) * 100, 2), '% of the dataset')
Python3
# random under-sampling of the majority class with imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After Under Sampling Of Major Class Total Samples are :", len(Y_res))
Python3
# under-sampling with Tomek links: removes the majority-class member of each
# Tomek link (a pair of nearest neighbours from opposite classes)
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_res, y_res = tl.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After TomekLinks Under Sampling Of Major Class Total Samples are :", len(Y_res))
Python3
# random over-sampling of the minority class
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After Over Sampling Of Minor Class Total Samples are :", len(Y_res))
Python3
# over-sampling the minority class with SMOTE (synthetic examples)
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 42)
X_res, y_res = sm.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After SMOTE Over Sampling Of Minor Class Total Samples are :", len(Y_res))
print('Class 1 (Fraud) :', round(Y_res[0].value_counts()[1] / len(Y_res) * 100, 2), '% of the dataset')
Statistical Inference
Statistical inference is the process of analysing results and drawing conclusions from data that is subject to random variation. It is also called inferential statistics. Hypothesis testing and confidence intervals are applications of statistical inference. Statistical inference is a method of making decisions about the parameters of a population based on random sampling. It helps to assess the relationship between dependent and independent variables. The purpose of statistical inference is to estimate the uncertainty, or sample-to-sample variation, and it allows us to provide a probable range of values for the true value of something in the population. The components used for making statistical inference are:
Sample size
Size of the observed differences
There are different types of statistical inference that are extensively used for drawing conclusions:
Confidence interval
Pearson correlation
Bi-variate regression
Multi-variate regression
ANOVA or T-test
The statistical inference procedure includes steps such as:
Begin with a theory.
Conduct statistical tests to see whether the collected sample properties are sufficiently different from what would be expected under the null hypothesis to be able to reject the null hypothesis.
Statistical inference solutions make efficient use of statistical data relating to groups of individuals or trials. They deal with the entire process, including the collection, investigation, and analysis of data, and the organization of the collected data. With statistical inference solutions, people can acquire knowledge after starting their work in diverse fields. Some facts about statistical inference solutions are:
It is common to assume that the observed sample consists of independent observations from a population type such as Poisson or normal.
A statistical inference solution is used to evaluate the parameter(s) of the assumed model, such as the normal mean or binomial proportion.
Importance of Statistical Inference
Inferential statistics is important for examining data properly. To draw an accurate conclusion, proper data analysis is needed to interpret the research results. It is mainly used for making future predictions from various observations in different fields, and it helps us make inferences about the data. Statistical inference has a wide range of applications in different fields, such as:
Business analysis
Artificial intelligence
Financial analysis
Fraud detection
Machine learning
Share market
Pharmaceutical sector
Example: A card is drawn at random from a pack of cards 400 times, and the suit drawn is recorded each time. Based on the recorded frequencies, find the probability of drawing:
1. Diamond cards
2. Black cards
3. Any card except a spade
Solution:
Total number of trials = 90 + 100 + 120 + 90 = 400
(1) Number of trials in which a diamond card was drawn = 90, so P(diamond) = 90/400 = 0.225
(3) Number of trials in which a card other than a spade was drawn = 310, therefore P(except spade) = 310/400 = 0.775
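To tie this back to statistical inference itself, below is a minimal sketch of one of the inference types listed above, a confidence interval for a population mean, computed with SciPy; the sample values are made up for illustration.
Python3
import numpy as np
from scipy import stats

# illustrative sample drawn from some population of interest
sample = np.array([12.1, 11.4, 13.0, 12.8, 11.9, 12.5, 13.3, 12.2, 11.7, 12.9])

mean = sample.mean()
sem = stats.sem(sample)                                   # standard error of the mean
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print("Sample mean            :", round(float(mean), 2))
print("95% confidence interval:", tuple(round(float(x), 2) for x in ci))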
Prediction Error
Prediction error refers to the difference between the values predicted by a model and the actual values.
Prediction error is often used in two settings:
1. Linear regression: Used to predict the value of some continuous response variable.
We typically measure the prediction error of a linear regression model with a metric known as RMSE, which stands for root mean squared error. It is calculated as:
RMSE = √( Σ(ŷi – yi)² / n )
where:
ŷi is the predicted value of the response variable,
yi is the actual value of the response variable, and
n is the number of observations.
2. Logistic Regression: Used to predict the value of some binary response variable.
One common way to measure the prediction error of a logistic regression model is with a metric known as the total misclassification rate. It is calculated as:
Misclassification rate = (number of incorrect predictions) / (total number of predictions)
The lower the misclassification rate, the better the model is at predicting the outcomes of the response variable.
The following examples show how to calculate prediction error for both a linear regression model and a logistic regression model in practice.
Example 1: Prediction Error in Linear Regression
Suppose we use a regression model to predict the number of points that 10 basketball players will score in a game. The following table shows the predicted points from the model vs. the actual points the players scored:
K
H
S
E
N
IG
V
RMSE = √( Σ(ŷi – yi)² / n )
RMSE = √( ((14−12)² + (15−15)² + (18−20)² + (19−16)² + (25−20)² + (18−19)² + (12−16)² + (12−20)² + (15−16)² + (22−16)²) / 10 )
RMSE = 4
The root mean squared error is 4. This tells us that the average deviation between the predicted points scored
and the actual points scored is 4.
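The same RMSE value can be reproduced with a few lines of Python (a minimal sketch using NumPy; the predicted and actual values are taken from the table above):
import numpy as np

# predicted and actual points from the table above
predicted = np.array([14, 15, 18, 19, 25, 18, 12, 12, 15, 22])
actual = np.array([12, 15, 20, 16, 20, 19, 16, 20, 16, 16])

# RMSE = sqrt( sum((predicted - actual)^2) / n )
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)  # 4.0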
Suppose we use a logistic regression model to predict whether or not 10 college basketball players will get drafted into the NBA.
The following table shows the predicted outcome for each player vs. the actual outcome (1 = Drafted, 0 = Not Drafted):
[Table: predicted vs. actual draft outcome for the 10 players]
The resulting misclassification rate is quite high, which indicates that the model doesn't do a very good job of predicting whether or not a player will get drafted.
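The misclassification rate itself can be computed in the same way; a minimal sketch in Python (the predicted and actual arrays below are made up for illustration, standing in for the table of draft outcomes):
import numpy as np

# hypothetical predicted and actual outcomes (1 = Drafted, 0 = Not Drafted)
predicted = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
actual    = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0])

# misclassification rate = number of incorrect predictions / total predictions
misclassification_rate = np.mean(predicted != actual)
print(misclassification_rate)  # 0.4 for these hypothetical values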
UNIT II SEARCH METHODS AND VISUALIZATION
Search by Simulated Annealing
Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It is often used when the search space is discrete (for example the traveling salesman problem, the boolean satisfiability problem, protein structure prediction, and job-shop scheduling). For problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time, simulated annealing may be preferable to exact algorithms such as gradient descent or branch and bound.
The name of the algorithm comes from annealing in metallurgy, a technique involving heating and controlled cooling of a material to alter its physical properties. Heating and cooling the material affects both the temperature and the thermodynamic free energy or Gibbs energy, and the relevant physical properties of the material depend on this free energy. Simulated annealing can be used for very hard computational optimization problems where exact algorithms fail; even though it usually achieves an approximate solution to the global minimum, this can be enough for many practical problems.
The problems solved by SA are currently formulated by an objective function of many variables, subject to several constraints. In practice, the constraint can be penalized as part of the objective function.
Similar techniques have been independently introduced on several occasions, including Pincus (1970),[1] Khachaturyan et al. (1979,[2] 1981[3]), Kirkpatrick, Gelatt and Vecchi (1983), and Cerny (1985).[4] In 1983, this approach was used by Kirkpatrick, Gelatt Jr. and Vecchi[5] for a solution of the traveling salesman problem. They also proposed its current name, simulated annealing.
This notion of slow cooling implemented in the simulated annealing algorithm is interpreted as a slow decrease in the probability of accepting worse solutions as the solution space is explored. Accepting worse solutions allows for a more extensive search for the global optimal solution. In general, simulated annealing algorithms work as follows. The temperature progressively decreases from an initial positive value to zero. At each time step, the algorithm randomly selects a solution close to the current one, measures its quality, and moves to it according to the temperature-dependent probabilities of selecting better or worse solutions, which during the search respectively remain at 1 (or positive) and decrease toward zero.
The simulation can be performed either by a solution of kinetic equations for density functions[6][7] or by using the stochastic sampling method.[5][8] The method is an adaptation of the Metropolis–Hastings algorithm, a Monte Carlo method to generate sample states of a thermodynamic system, published by N. Metropolis et al. in 1953.
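The temperature-dependent acceptance rule described above is often written as the Metropolis criterion; a minimal sketch for a minimization problem looks like this:
import math
import random

def accept(curr_obj: float, new_obj: float, temperature: float) -> bool:
    """Metropolis acceptance rule for a minimization problem."""
    if new_obj < curr_obj:
        return True                  # better solutions are always accepted
    if temperature <= 0:
        return False                 # no worse moves once the system is 'frozen'
    # worse solutions are accepted with a probability that shrinks as the temperature decreases
    return random.random() < math.exp(-(new_obj - curr_obj) / temperature)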
Local Search with Simulated Annealing from Scratch
A short refresher: local search is a heuristic that tries to improve a given solution by looking at neighbors. If the objective value of a neighbor is better than the current objective value, the neighbor solution is accepted and the search continues. Simulated annealing also allows worse solutions to be accepted; this makes it possible to escape local minima.
Simulated Annealing Generic Code
The code works as follows: we are going to create four code files. The most important one is sasolver.py; this file contains the generic code for simulated annealing. The problems directory contains three example problems (traveling salesman, knapsack, and the Rastrigin function).
For solving a problem with simulated annealing, we start by creating a class that is quite generic:
import copy
import logging
import math
import random
import time

import numpy as np

# The example problem classes are assumed to be importable from the problems directory.
from problems import TravelingSalesman, Knapsack, Rastrigin


class SimulatedAnnealing():
    def __init__(self, problem):
        self.problem = problem

    def run_sa(self, max_iterations: int = 100000, update_iterations: int = 25000, how: str = 'lin'):
        start = time.time()
        best_solution = self.problem.baseline_solution()
        best_obj = self.problem.score_solution(best_solution)
        logging.info(f"First solution. Objective: {round(best_obj, 2)} Solution: {best_solution}")
        initial_temp = best_obj
        prev_solution = copy.deepcopy(best_solution)
        prev_obj = best_obj
        iteration = 0
        last_update = 0
        while True:
            temperature = self._calculate_temperature(initial_temp, iteration, max_iterations, how)
            if temperature == -1:
                # maximum number of iterations reached
                break
            last_update += 1
            accept = False
            # select and score a neighbor of the previous solution
            curr_solution = self.problem.select_neighbor(copy.deepcopy(prev_solution))
            curr_obj = self.problem.score_solution(curr_solution)
            if curr_obj < best_obj:
                # new best solution found
                best_solution = copy.deepcopy(curr_solution)
                best_obj = curr_obj
                prev_solution = copy.deepcopy(curr_solution)
                prev_obj = curr_obj
                last_update = 0
                logging.info(f"Better solution found. Objective: {round(best_obj, 2)} Solution: {curr_solution}")
            else:
                # possibly accept a worse solution, depending on the acceptance criterion
                acc = self._acceptance_criterion(curr_obj, prev_obj, temperature)
                accept = random.random() < acc
                if accept:
                    prev_obj = curr_obj
                    prev_solution = copy.deepcopy(curr_solution)
                    last_update = 0
            iteration += 1
            # stop if nothing was accepted or improved for a long time
            if last_update >= update_iterations:
                break
        logging.info(f"Finished in {round(time.time() - start, 2)} seconds after {iteration} iterations.")
        return best_solution

    @staticmethod
    def _acceptance_criterion(curr_obj: float, prev_obj: float, temperature: float) -> float:
        """
        Determine the acceptance criterion (threshold for accepting a solution that is worse than the current one)
        """
        if temperature <= 0:
            acc = -1
        else:
            # Metropolis acceptance criterion
            acc = math.exp(-(curr_obj - prev_obj) / temperature)
        return acc

    @staticmethod
    def _calculate_temperature(initial_temp: int, iteration: int, max_iterations: int, how: str = None) -> float:
        """
        Decrease the temperature to zero based on total number of iterations.
        """
        if iteration >= max_iterations:
            return -1
        if how == "exp":
            cooling_rate = 0.95
            return initial_temp * (cooling_rate**iteration)
        elif how == "quadratic":
            cooling_rate = 0.01
            return initial_temp / (1 + cooling_rate * iteration**2)
        elif how == "log":
            cooling_rate = 1.44
            return initial_temp / (1 + cooling_rate * np.log(1 + iteration))
        elif how == "lin mult":
            cooling_rate = 0.1
            return initial_temp / (1 + cooling_rate * iteration)
        else:
            return initial_temp * (1 - iteration / max_iterations)


if __name__ == '__main__':
    problem = 'rastrigin'  # choose one of knapsack, tsp, rastrigin
    logging.basicConfig(filename=f'{problem}.log', encoding='utf-8', level=logging.INFO)
    if problem == 'tsp':
        problem = TravelingSalesman(n_locations=10, height=100, width=100)
        sa = SimulatedAnnealing(problem)
        final_solution = sa.run_sa()
        problem._plot_solution(final_solution, title='final')
    elif problem == 'knapsack':
        problem = Knapsack(knapsack_capacity=50, n_items=10)
        sa = SimulatedAnnealing(problem)
        final_solution = sa.run_sa()
    elif problem == 'rastrigin':
        problem = Rastrigin(n_dims=2)
        sa = SimulatedAnnealing(problem)
        final_solution = sa.run_sa()
This file is sasolver.py. It takes a problem as input, and then you can solve the problem with simulated annealing by calling run_sa(). There are different ways to handle cooling, implemented in _calculate_temperature. The acceptance value is calculated based on the Metropolis acceptance criterion.
By modifying the problem = ... line (below if __name__ == '__main__':), it's possible to select another problem (tsp, knapsack, or rastrigin).
We need to have three methods in the example problems to make this code work:
1. baseline_solution()
This method creates the first solution (starting point) for a problem.
2. score_solution(solution)
The score_solution method calculates the objective value.
3. select_neighbor(solution)
We need to apply local moves to the solutions and select a neighbor; this is implemented in this method.
We are going to implement these three methods for three problems: traveling salesman, knapsack and the Rastrigin function.
Example 1. Traveling Salesman
The first problem we are going to look at is the traveling salesman problem. In this problem, there are locations that need to be visited. The goal is to minimize the distance traveled. Below you can see an example:
[Figure] Example: 10 locations we want to visit and minimize the distance. Image by author.
import random
from typing import List

import matplotlib.pyplot as plt
import numpy as np


class TravelingSalesman():
    def __init__(self, n_locations: int = 10, locations: List[tuple] = None, height: int = 100, width: int = 100, starting_point: int = 0):
        self.name = 'tsp'
        self.starting_point = starting_point
        if locations is None:
            # random (x, y) coordinates on a width x height grid
            locations = [(random.uniform(0, width), random.uniform(0, height)) for _ in range(n_locations)]
        self.locations = locations
        self.n_locations = len(locations)
        self.distances = self._create_distances()

    def baseline_solution(self) -> list:
        # first solution: visit the locations in the order 0, 1, 2, ...
        baseline = list(range(self.n_locations))
        self._plot_solution(baseline, title='baseline')
        self._plot_solution(baseline, title='dots', only_dots=True)
        return baseline

    def score_solution(self, solution: list) -> float:
        # objective: total distance traveled along the route
        return sum([self.distances[node, solution[i+1]] for i, node in enumerate(solution[:-1])])

    def select_neighbor(self, solution: list) -> list:
        # local move: swap two randomly chosen locations
        idx1, idx2 = random.sample(range(self.n_locations), 2)
        value1 = solution[idx1]
        solution[idx1] = solution[idx2]
        solution[idx2] = value1
        return solution

    def _plot_solution(self, solution: list, title: str = 'tsp', only_dots: bool = False):
        plt.clf()
        plt.rcParams["figure.figsize"] = [5, 5]
        plt.rcParams["figure.autolayout"] = True
        x_values = [self.locations[i][0] for i in solution]
        y_values = [self.locations[i][1] for i in solution]
        if not only_dots:
            plt.plot(x_values, y_values, 'b-')
        plt.plot(x_values, y_values, 'bo')
        for location_id1, x, y in zip(solution, x_values, y_values):
            plt.text(x - 2, y + 2, str(location_id1))
        plt.savefig(f'{title}')

    def _create_distances(self) -> np.ndarray:
        # pairwise euclidean distances between all locations
        distances = np.zeros((self.n_locations, self.n_locations))
        for ni, i in enumerate(self.locations):
            for nj, j in enumerate(self.locations):
                distances[ni, nj] = self._distance(i[0], i[1], j[0], j[1])
        return distances

    @staticmethod
    def _distance(x1: float, y1: float, x2: float, y2: float) -> float:
        return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
In this problem, the baseline solution is created by visiting the locations in sequence (0 to 9). For the example, it gives us this route:
[Figure] Baseline solution. Image by author.
This doesn't look optimal, and it isn't. A local move is defined by swapping two locations. The score of the solution is the distance we need to travel. After running simulated annealing, this is the final solution:
[Figure] Final solution found by simulated annealing.
For small problems, this works okay (still not recommended). For larger ones, there are better solutions and
algorithms available, for example the Lin-Kernighan heuristic. What also helps is a better starting solution, e.g. a
greedy algorithm.
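To illustrate that last point, a greedy nearest-neighbour construction could be used as the starting solution instead of the 0-to-9 sequence. This is only a sketch, written as a standalone function over a distance matrix like the one built by _create_distances():
import numpy as np

def greedy_baseline_solution(distances: np.ndarray) -> list:
    # start at location 0 and repeatedly travel to the nearest unvisited location
    n_locations = distances.shape[0]
    route = [0]
    unvisited = set(range(1, n_locations))
    while unvisited:
        current = route[-1]
        nearest = min(unvisited, key=lambda j: distances[current, j])
        route.append(nearest)
        unvisited.remove(nearest)
    return route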
Example 2. Knapsack
The knapsack problem is a classic one, but for those who don’t know it, here follows an explanation.
Imagine you are in a cave full of beautiful treasures. Due to some unforeseen circumstances the cave is collapsing. You have time to fill your knapsack with treasures and then you need to run away to safety. Of course, you want to take the items with you that together bring the most value. What items should you take?
[Figure] The knapsack problem. The knapsack has a capacity of 50. What items should you select to maximize the value? Image by author.
The data you need to have for solving this problem is the capacity of the knapsack, the capacity needed for the items and the value of the items.
import copy
import random

import numpy as np
from typing import List


class Knapsack():
    def __init__(self, knapsack_capacity: int, n_items: int = 20, item_values: list = None, item_capacities: list = None):
        self.name = 'knapsack'
        self.knapsack_capacity = knapsack_capacity
        if item_values is None and item_capacities is None:
            item_values, item_capacities = self._create_sample_data(n_items)
        self.item_values = item_values
        self.item_capacities = item_capacities
        self.n_items = len(item_values)

    @staticmethod
    def _create_sample_data(n_items: int) -> tuple:
        # random item values and capacities for a sample problem
        item_values = [random.randint(1, 50) for _ in range(n_items)]
        item_capacities = [random.randint(1, 25) for _ in range(n_items)]
        return item_values, item_capacities

    def baseline_solution(self) -> list:
        # add random items until the knapsack is full
        solution = []
        capacity = 0
        while True:
            candidates = [i for i in range(self.n_items) if i not in solution]
            if not candidates:
                break
            selected = random.choice(candidates)
            if capacity + self.item_capacities[selected] > self.knapsack_capacity:
                break
            else:
                solution.append(selected)
                capacity += self.item_capacities[selected]
        return solution

    def score_solution(self, solution: list) -> float:
        # total value of the selected items, multiplied by -1 because the solver minimizes
        return -1 * sum([self.item_values[i] for i in solution])

    def _is_feasible(self, solution: list) -> bool:
        return sum([self.item_capacities[i] for i in solution]) <= self.knapsack_capacity

    def select_neighbor(self, solution: list) -> list:
        # local moves: add an item, remove an item, or swap two items
        if len(solution) == 0:
            move = 'add'
        elif len(solution) == self.n_items:
            move = 'remove'
        else:
            move = random.choice(['add', 'remove', 'swap'])

        if move == 'add':
            new_solution = copy.deepcopy(solution)
            possible_to_add = [i for i in range(self.n_items) if i not in solution]
            new_item = random.choice(possible_to_add)
            new_solution.append(new_item)
            if self._is_feasible(new_solution):
                return new_solution
            move = 'swap' if len(solution) > 0 else 'remove'

        if move == 'swap':
            n = 0
            while n < 50:
                new_solution = copy.deepcopy(solution)
                in_item = random.choice([i for i in range(self.n_items) if i not in solution])
                out_item = random.choice(range(len(solution)))
                new_solution.pop(out_item)
                new_solution.append(in_item)
                n += 1
                if self._is_feasible(new_solution):
                    return new_solution
            move = 'remove'

        # remove: drop a random item (or return the solution unchanged if it is empty)
        new_solution = copy.deepcopy(solution)
        if new_solution:
            new_solution.pop(random.choice(range(len(new_solution))))
        return new_solution
The baseline solution selects an item at random until the knapsack is full. The solution score is the sum of the values of the items in the knapsack, multiplied by -1. This is necessary because the SA solver minimizes a given objective. In this situation, there are three local moves possible: adding an item, removing an item or swapping two items. This makes it possible to reach every possible solution in the solution space. If we swap an item, we need to check if the new solution is feasible.
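A minimal usage sketch (assuming the Knapsack class and the SimulatedAnnealing solver shown earlier are available in the same script):
# build a small sample problem and solve it with simulated annealing
problem = Knapsack(knapsack_capacity=50, n_items=10)
sa = SimulatedAnnealing(problem)
final_solution = sa.run_sa()

total_value = sum(problem.item_values[i] for i in final_solution)
total_capacity = sum(problem.item_capacities[i] for i in final_solution)
print(f"Selected items: {sorted(final_solution)}")
print(f"Total value: {total_value}, used capacity: {total_capacity}/{problem.knapsack_capacity}")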
In the next image you can see a sample run log file. There are 10 items we need to choose from. On top are the item values, below them the capacities the items take, and on the third line the value densities (item value divided by item capacity). Then the solution process starts. The solution contains the index number(s) of the selected items. In the final solution, items 4, 5 and 8 are selected (counting starts at 0):
[Figure] Sample run log for the knapsack problem.
Example 3. Rastrigin Function
A function that is used often to test optimization algorithms is the Rastrigin function. In 3D it looks like this:
[Figure] The Rastrigin function in 3D.
It has many local optima. The goal is to find the global minimum, which is at coordinate (0, 0). It is easier to see in a contour plot:
Contour plot of the Rastrigin function. Image by author.
The landscape consists of many local optima with the highest ones in the four corners and the lowest ones in the center.
We can try to find the global minimum with simulated annealing. This problem is continuous instead of discrete, and we want to find the values for x and y that minimize the Rastrigin function.
Let's try to find the optimum for the function with three dimensions (x, y, and z). The domain is defined by x and y, so the problem is exactly as the plots above.
import copy
import math
import random


class Rastrigin():
    def __init__(self, n_dims: int = 2):
        self.name = 'rastrigin'
        self.n_dims = n_dims

    def baseline_solution(self) -> list:
        # random starting point inside the domain [-5.12, 5.12] in every dimension
        solution = [random.uniform(-5.12, 5.12) for _ in range(self.n_dims)]
        return solution

    def score_solution(self, solution: list) -> float:
        # Rastrigin function value (the quantity to minimize)
        return 10 * self.n_dims + sum(x**2 - 10 * math.cos(2 * math.pi * x) for x in solution)

    def select_neighbor(self, solution: list) -> list:
        # step of size 0.1 in a random direction, staying inside the domain
        neighbor = copy.deepcopy(solution)
        idx = random.randrange(self.n_dims)
        neighbor[idx] = min(5.12, max(-5.12, neighbor[idx] + random.choice([-0.1, 0.1])))
        return neighbor
For the baseline solution, we select a random float for x and y between -5.12 and 5.12. The score of the solution
is equal to z (the outcome of the Rastrigin function). A neighbor is selected by taking a step into a random
direction with a step size set to 0.1. The feasibility check is done to make sure we stay in the domain.
A log of a run:
[Figure] Log of a simulated annealing run on the Rastrigin function.
The final solution comes really close to the optimum.
But watch out: if you run the algorithm with more dimensions, it's not guaranteed that you find the optimum:
[Figure] Log of a run with more dimensions, ending in a local optimum.
As you can see, the final solution is a local optimum instead of the global one. It finds good coordinates for the first two variables, but the third one is equal to 0.985, which is far away from 0. It's important to verify the results you get. This specific example will work well after fine-tuning the SA parameters, but for more dimensions you may need a different approach.
Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, provided with historical data to direct the search into the region of better performance in solution space. They are commonly used to generate high-quality solutions for optimization problems and search problems.
Genetic algorithms simulate the process of natural selection, which means those species that can adapt to changes in their environment are able to survive, reproduce and go to the next generation. In simple words, they simulate "survival of the fittest" among individuals of consecutive generations for solving a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to the chromosome.
Foundation of Genetic Algorithms
Genetic algorithms are based on an analogy with the genetic structure and behaviour of chromosomes of the population. Following is the foundation of GAs based on this analogy:
1. Individuals in the population compete for resources and mate
2. Those individuals who are successful (fittest) then mate to create more offspring than others
3. Genes from the "fittest" parents propagate throughout the generation; that is, sometimes parents create offspring which is better than either parent
4. Thus each successive generation is more suited for its environment
Search space
The population of individuals is maintained within the search space. Each individual represents a solution in the search space for the given problem. Each individual is coded as a finite-length vector (analogous to a chromosome) of components. These variable components are analogous to genes. Thus a chromosome (individual) is composed of several genes (variable components).
Fitness Score
A fitness score is given to each individual which shows the ability of an individual to "compete". Individuals having an optimal (or near optimal) fitness score are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. The individuals having better fitness scores are given more chance to reproduce than others. The individuals with better fitness scores are selected to mate and produce better offspring by combining the chromosomes of the parents. The population size is static, so room has to be created for new arrivals. So, some individuals die and get replaced by new arrivals, eventually creating a new generation when all the mating opportunities of the old population are exhausted. It is hoped that over successive generations better solutions will arrive while the least fit die.
Each new generation has on average more "better genes" than the individuals (solutions) of previous generations. Thus each new generation has better "partial solutions" than previous generations. Once the offspring produced have no significant difference from the offspring produced by previous populations, the population is converged. The algorithm is then said to be converged to a set of solutions for the problem.
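The overall flow can be sketched in a few lines of Python (a generic sketch rather than the exact implementation shown later; create_individual, fitness, crossover and mutate are assumed to be problem-specific functions):
import random

def genetic_algorithm(create_individual, fitness, crossover, mutate,
                      population_size=100, generations=1000):
    # initial generation of random individuals
    population = [create_individual() for _ in range(population_size)]
    for _ in range(generations):
        # rank individuals by fitness (lower is better here, as in the string example below)
        population.sort(key=fitness)
        if fitness(population[0]) == 0:
            break
        # selection: keep the fittest half as the mating pool
        mating_pool = population[:population_size // 2]
        # crossover and mutation create the next generation
        children = []
        while len(children) < population_size:
            p1, p2 = random.sample(mating_pool, 2)
            children.append(mutate(crossover(p1, p2)))
        population = children
    return min(population, key=fitness)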
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves the generation using the following operators:
1) Selection Operator: The idea is to give preference to the individuals with good fitness scores and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are selected using the selection operator and crossover sites are chosen randomly. Then the genes at these crossover sites are exchanged, thus creating a completely new individual (offspring).
3) Mutation Operator: The key idea is to insert random genes in the offspring to maintain the diversity in the population and avoid premature convergence.
The whole algorithm can be summarized as:
1) Randomly initialize population p
2) Determine fitness of population
3) Until convergence, repeat:
a) Select parents from population
b) Crossover and generate new population
c) Perform mutation on new population
d) Calculate fitness for new population
Example problem: given a target string, the goal is to produce the target string starting from a random string of the same length. In the following implementation:
Characters A-Z, a-z, 0-9, and other special symbols are considered as genes
A string generated by these characters is considered as a chromosome/solution/individual
The fitness score is the number of characters which differ from the characters in the target string at a particular index. So an individual having a lower fitness value is given more preference.
C++ implementation:
// Create a target string, starting from a random string, using a Genetic Algorithm
#include <bits/stdc++.h>
using namespace std;

// Number of individuals in each generation
#define POPULATION_SIZE 100

// Valid Genes
const string GENES = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"
                     "QRSTUVWXYZ 1234567890, .-;:_!\"#%&/()=?@${[]}";

// Target string to be generated
const string TARGET = "I love GeeksforGeeks";

// Generate random numbers in a given range
int random_num(int start, int end)
{
    int range = (end - start) + 1;
    int random_int = start + (rand() % range);
    return random_int;
}

// Create random genes for mutation
char mutated_genes()
{
    int len = GENES.size();
    int r = random_num(0, len - 1);
    return GENES[r];
}

// Create a chromosome, i.e. a string of genes
string create_gnome()
{
    int len = TARGET.size();
    string gnome = "";
    for(int i = 0;i<len;i++)
        gnome += mutated_genes();
    return gnome;
}

// Class representing an individual in the population
class Individual
{
public:
    string chromosome;
    int fitness;
    Individual(string chromosome);
    Individual mate(Individual parent2);
    int cal_fitness();
};

Individual::Individual(string chromosome)
{
    this->chromosome = chromosome;
    fitness = cal_fitness();
};

// Perform mating and produce new offspring
Individual Individual::mate(Individual par2)
{
    // chromosome for offspring
    string child_chromosome = "";
    int len = chromosome.size();
    for(int i = 0;i<len;i++)
    {
        // random probability
        float p = random_num(0, 100) / 100.0;

        // if prob is less than 0.45, insert gene
        // from parent 1
        if(p < 0.45)
            child_chromosome += chromosome[i];

        // if prob is between 0.45 and 0.90, insert
        // gene from parent 2
        else if(p < 0.90)
            child_chromosome += par2.chromosome[i];

        // otherwise insert a random gene (mutation)
        else
            child_chromosome += mutated_genes();
    }
    // create a new Individual (offspring) using the generated chromosome
    return Individual(child_chromosome);
};

// Calculate fitness score: the number of characters that differ from the target
// string.
int Individual::cal_fitness()
{
    int len = TARGET.size();
    int fitness = 0;
    for(int i = 0;i<len;i++)
    {
        if(chromosome[i] != TARGET[i])
            fitness++;
    }
    return fitness;
};

// Overload the < operator so the population can be sorted by fitness
bool operator<(const Individual &ind1, const Individual &ind2)
{
    return ind1.fitness < ind2.fitness;
}

// Driver code
int main()
{
    srand((unsigned)(time(0)));

    // current generation
    int generation = 0;

    vector<Individual> population;
    bool found = false;

    // create the initial population
    for(int i = 0;i<POPULATION_SIZE;i++)
    {
        string gnome = create_gnome();
        population.push_back(Individual(gnome));
    }

    while(! found)
    {
        // sort the population in increasing order of fitness score
        sort(population.begin(), population.end());

        // if the fittest individual has fitness 0, the target string has been reached
        if(population[0].fitness <= 0)
        {
            found = true;
            break;
        }

        // otherwise generate new offspring for a new generation
        vector<Individual> new_generation;

        // Perform Elitism, that mean 10% of fittest population
        // goes to the next generation
        int s = (10*POPULATION_SIZE)/100;
        for(int i = 0;i<s;i++)
            new_generation.push_back(population[i]);

        // From 50% of fittest population, Individuals
        // will mate to produce offspring
        s = (90*POPULATION_SIZE)/100;
        for(int i = 0;i<s;i++)
        {
            int r = random_num(0, 50);
            Individual parent1 = population[r];
            r = random_num(0, 50);
            Individual parent2 = population[r];
            Individual offspring = parent1.mate(parent2);
            new_generation.push_back(offspring);
        }
        population = new_generation;

        cout<< "Generation: " << generation << "\t";
        cout<< "String: "<< population[0].chromosome <<"\t";
        cout<< "Fitness: "<< population[0].fitness << "\n";

        generation++;
    }
    cout<< "Generation: " << generation << "\t";
    cout<< "String: "<< population[0].chromosome <<"\t";
    cout<< "Fitness: "<< population[0].fitness << "\n";
}
Output:
Generation: 1 String: tO{"-?=jH[k8=B4]Oe@} Fitness: 18
...
Generation: 31 String: I love Geeksfo0Geeks Fitness: 1
Note: Every time the algorithm starts with random strings, so the output may differ.
As we can see from the output, our algorithm sometimes gets stuck at a local optimum solution. This can be further improved by updating the fitness score calculation algorithm or by tweaking the mutation and crossover operators.
Why use Genetic Algorithms
They are robust
They provide optimisation over a large state space
Unlike traditional AI, they do not break on a slight change in input or in the presence of noise
Applications of Genetic Algorithms
Genetic algorithms have many applications, some of them are:
Recurrent Neural Network
Mutation testing
Code breaking
Filtering and signal processing
Learning fuzzy rule bases, etc.
Genetic Programming
Genetic programming is a technique to create algorithms that can program themselves by simulating biological breeding and Darwinian evolution. Instead of programming a model that can solve a particular problem, genetic programming only provides a general objective and lets the model figure out the details itself. The basic approach is to let the machine automatically test various simple evolutionary algorithms and then "breed" the most successful programs in new generations.
While applying the same natural selection, crossover, mutation and other reproduction approaches as evolutionary and genetic algorithms, genetic programming takes the process a step further by automatically creating new models and letting the system select its own goals.
The entire process is still an area of active research. One of the biggest obstacles to widespread adoption of this genetic machine learning approach is quantifying the fitness function, i.e., to what degree each new program is contributing to reaching the desired goal.
Specify terminals – for example, independent variables of the problem, zero-argument functions, or random constants for each branch of the program that will go through evolution.
Define initial "primitive" functions for each branch of the genetic program.
Choose a fitness measure – this measures the fitness of individuals in the population to determine if they should reproduce.
Specify any special performance parameters for controlling the run.
Select the termination criterion and the method for designating the result of the run.
From there, the program runs automatically, without requiring any training data.
A random initial population (generation 0) of simple programs will be generated based upon the basic functions and terminals defined by the human.
Each program will be executed and its fitness measured from the results. The most successful or "fit" programs will be the breeding stock to birth a new generation, with some new population members directly copied (reproduction), some created through crossover (randomly breeding parts of the programs) and some through random mutations. Unlike evolutionary programming, an additional architecture-altering operation may also be applied, similar to speciation, to create new program structures.
A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization and search problems, based on the theory of natural selection and evolutionary biology.
Vocabulary:
1. Individual: Any possible solution (each individual is a chromosome)
2. Population: Group of all individuals
3. Fitness: Target function that we are optimizing (each individual has a fitness)
It starts from a population of randomly generated individuals and happens in generations. In each generation, the
fitness of every individual in the population is evaluated, multiple individuals are selected (based on their fitness),
and modified to form a new population. The new population is used in the next iteration of the algorithm. The
algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness
level has been reached for the population.
For example, consider a population of giraffes. Giraffes with slightly longer necks could feed on leaves of higher
branches when all lower ones had been eaten off. They had a better chance of survival. Favorable characteristics
propagated through generations of giraffes. Now, evolved species have long necks.
On the other hand, genetic programming is a specialization of genetic algorithms where each individual is a computer program. The main difference between genetic programming and genetic algorithms is the representation of the solution.
The output of a genetic algorithm is a quantity, while the output of genetic programming is another computer program.
Introduction:
Genetic Programming (or GP), introduced by John Koza, is a type of Evolutionary Algorithm (EA), a subset of machine learning. EAs are used to discover solutions to problems humans do not know how to solve directly.
Genetic programming is a systematic method for getting computers to automatically solve a problem, iteratively transforming a population of computer programs into a new generation of programs by applying analogs of naturally occurring genetic operations. The genetic operations include crossover, mutation, reproduction, gene duplication, and gene deletion.
Working:
It starts with an initial set of programs composed of functions and terminals that may be handpicked or randomly generated. The functions may be standard arithmetic operations, programming operations, mathematical functions, or logical functions. The programs compete with each other over given input and expected output data. Each computer program in the population is measured in terms of how well it performs in a particular problem environment. This measure is called a fitness measure. Top-performing programs are picked, and mutation and breeding are performed on them to generate the next generation. The next generation competes in the same way, and the process goes on until the perfect program is evolved.
[Figure] Main Loop of Genetic Programming (source)
Here's a concise explanation: make an initial population; evaluation (assign a fitness value to each program in the population); selection of 'fitter' individuals; variation (mutation, crossover, etc.); iteration; and termination.
Program Representation:
Programs in genetic programming are expressed as syntax trees rather than as lines of code. Trees can be easily evaluated recursively. The tree includes nodes (functions) and links (terminals). The nodes indicate the instructions to execute and the links indicate the arguments for each instruction. For illustration, consider the implementation of the equation x ∗ ((x % y) − sin(x)) + exp(x) shown in the figure below. In this example, the terminal set = {x, y} and the function set = {+, −, ∗, %, sin, exp}.
[Figure] GP syntax tree (source)
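To make the recursive evaluation concrete, here is a small illustrative sketch in Python that represents the expression above as a nested tuple tree and evaluates it recursively (the tree encoding and helper names are assumptions for illustration only):
import math

# each node is (function_name, child, child, ...); leaves are variable names or constants
TREE = ('+',
        ('*', 'x', ('-', ('%', 'x', 'y'), ('sin', 'x'))),
        ('exp', 'x'))

FUNCTIONS = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '%': lambda a, b: a % b,
    'sin': lambda a: math.sin(a),
    'exp': lambda a: math.exp(a),
}

def evaluate(node, variables):
    # leaves are either variable names or numeric constants
    if not isinstance(node, tuple):
        return variables.get(node, node)
    func, *args = node
    return FUNCTIONS[func](*(evaluate(arg, variables) for arg in args))

print(evaluate(TREE, {'x': 2.0, 'y': 3.0}))  # x*((x%y) - sin(x)) + exp(x) at x=2, y=3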
FIVE MAJOR PREPARATORY STEPS FOR GP:
Determining the set of terminals.
Determining the set of primitive functions.
Determining the fitness measure.
Determining the parameters for controlling the run.
Determining the method for designating a result and the criterion for terminating a run.
Now let's look into how genetic operators like crossover and mutation can be applied on a subtree.
The crossover operator is used for the exchange of subtrees between two individuals.
[Figure] Crossover operator for genetic programming.
GP applies point mutation, in which a random node in the tree is chosen and replaced with a different randomly generated subtree.
[Figure] Mutation Operator for Genetic Programming
The various types of Genetic Programming include:
Tree-based Genetic Programming
Stack-based Genetic Programming
Grammatical Evolution
Programming languages commonly used to implement genetic programming include:
1. LISP
2. Matlab
3. Python
4. Java
5. C
Applications:
Intrusion detection systems, cancer research, curve fitting, data modeling, symbolic regression, feature selection, classification, game playing, quantum computing, etc.
What is data visualization?
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to present data to non-technical audiences without confusion.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Something as simple as presenting data in graphic format may seem to have no downsides. But sometimes data can be misrepresented or misinterpreted when placed in the wrong style of data visualization. When choosing to create a data visualization, it's best to keep both the advantages and disadvantages in mind.
Advantages
Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and squares from circles. Our culture is visual, including everything from art and advertisements to TV and movies. Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see something, we internalize it quickly. It's storytelling with a purpose. If you've ever stared at a massive spreadsheet of data and couldn't see a trend, you know how much more effective a visualization can be.
Disadvantages
While there are many advantages, some of the disadvantages may seem less obvious. For example, when viewing a visualization with many different datapoints, it's easy to make an inaccurate assumption. Or sometimes the visualization is just designed wrong so that it's biased or confusing.
Why data visualization is important
The importance of data visualization is simple: it helps people see, interact with, and better understand data.
Whether simple or complex, the right visualization can bring everyone on the same page, regardless of their
level of expertise.
It's hard to think of a professional industry that doesn't benefit from making data more understandable. Every STEM field benefits from understanding data, and so do fields in government, finance, marketing, history, consumer goods, service industries, education, sports, and so on.
While we'll always wax poetic about data visualization (you're on the Tableau website, after all) there are practical, real-life applications that are undeniable. And, since visualization is so prolific, it's also one of the most useful professional skills to develop. The better you can convey your points visually, whether in a dashboard or a slide deck, the better you can leverage that information. The concept of the citizen data scientist is on the rise. Skill sets are changing to accommodate a data-driven world. It is increasingly valuable for professionals to be able to use data to make decisions and use visuals to tell stories of when data informs the who, what, when, where, and how.
While traditional education typically draws a distinct line between creative storytelling and technical analysis, the modern professional world also values those who can cross between the two: data visualization sits right in the middle of analysis and visual storytelling.
As the "age of Big Data" kicks into high gear, visualization is an increasingly key tool to make sense of the trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form that is easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from data and highlighting useful information.
However, it's not simply as easy as just dressing up a graph to make it look better or slapping on the "info" part of an infographic. Effective data visualization is a delicate balancing act between form and function. The plainest graph could be too boring to catch any notice, or it may make a powerful point; the most stunning visualization could utterly fail at conveying the right message, or it could speak volumes. The data and the visuals need to work together, and there's an art to combining great analysis with great storytelling.
Table of Contents
1. What is Data Visualization?
4. Univariate Analysis Techniques for Data Visualization
Distribution Plot
Box and Whisker Plot
Violin Plot
5. Bivariate Analysis Techniques for Data Visualization
Line Plot
Bar Plot
Scatter Plot
By using visual elements like charts, graphs, and maps, data visualization techniques provide an accessible way to see and understand trends, outliers, and patterns in data.
In modern days we have a lot of data in our hands; i.e., in the world of Big Data, data visualization tools and technologies are crucial to analyze massive amounts of information and make data-driven decisions.
Data visualization can also be used to visualize phenomena that cannot be observed directly, such as weather patterns, medical conditions, or mathematical relationships.
So, data visualization is another technique of visual art that grabs our interest and keeps our main focus on the message captured with the help of our eyes.
Whenever we visualize a chart, we quickly identify the trends and outliers present in the dataset.
It is a powerful technique to explore the data with presentable and interpretable results.
In the data mining process, it acts as a primary step in the pre-processing portion.
It supports the data cleaning process by finding incorrect data and corrupted or missing values.
It also helps to construct and select variables, which means we have to determine which variables to include and discard in the analysis.
In the process of data reduction, it also plays a crucial role while combining the categories.
Univariate Analysis: In the univariate analysis, we will be using a single feature to analyze almost all of its
properties.
Bivariate Analysis: When we compare the data between exactly 2 features then it is known as bivariate analysis.
Multivariate Analysis: In the multivariate analysis, we will be comparing more than 2 variables.
NOTE:
In this article, our main goal is to understand the following concepts:
How do we draw inferences from the different data visualization techniques?
In which conditions is one technique more useful than the others?
We are not going to deep dive into the coding/implementation part of the different techniques on a particular dataset; instead we try to answer the above questions and understand only the snippet code, with the help of sample plots for each of the data visualization techniques.
Univariate Analysis Techniques for Data Visualization
1. Distribution Plot
It is one of the best univariate plots to know about the distribution of data.
When we want to analyze the impact on the target variable (output) with respect to an independent variable (input), we use distribution plots a lot.
This plot gives us a combination of both the probability density function (PDF) and a histogram in a single plot.
Implementation:
Python Code:
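The original snippet is not reproduced here; a minimal sketch with seaborn would look like the following, assuming a pandas DataFrame hb with the 'Age' column and the 'SurvStat' survival-status column used in the later plots:
import seaborn as sns
import matplotlib.pyplot as plt

# distribution (histogram + density estimate) of Age, coloured by the class to be predicted
sns.histplot(data=hb, x='Age', hue='SurvStat', kde=True, stat='density')
plt.show()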
From the above distribution plot we can conclude the following observations:
We created a distribution plot on the feature 'Age' (input variable) and used different colors for the survival status (output variable), as it is the class to be predicted.
There is a huge overlapping area between the PDFs for the different combinations.
In this plot, the sharp block-like structures are called histograms, and the smoothed curve is known as the probability density function (PDF).
NOTE:
The probability density function (PDF) of a curve can help us to capture the underlying distribution of that feature, which is one major takeaway from data visualization or exploratory data analysis (EDA).
2. Box and Whisker Plot
This plot can be used to obtain more statistical details about the data.
The straight lines at the maximum and minimum are also called whiskers.
Points that lie outside the whiskers are considered outliers.
The box plot also gives us a description of the 25th, 50th and 75th quartiles.
With the help of a box plot, we can also determine the interquartile range (IQR), where the maximum details of the data will be present. Therefore, it can also give us a clear idea about the outliers in the dataset.
[Figure] General diagram of a box and whisker plot.
Implementation:
The code snippet is as follows:
sns.boxplot(x='SurvStat',y='axil_nodes',data=hb)
[Figure] Box plot of axil_nodes by SurvStat.
From the above box and whisker plot we can conclude the following observations:
We can see how much data is present in the 1st quartile and how many points are outliers, etc.
For class 1, we can see that there is very little or no data present between the median and the 1st quartile.
There are more outliers for class 1 in the feature named axil_nodes.
NOTE:
We can get details about outliers that will help us to prepare the data well before feeding it to a model, since outliers influence a lot of machine learning models.
3. Violin Plot
The violin plot can be considered as a combination of a box plot in the middle and distribution plots (kernel density estimation) on both sides of the data.
This can give us a description of the distribution of the dataset, like whether the distribution is multimodal, its skewness, etc.
It also gives us useful information like a 95% confidence interval.
Fig. General Diagram for a Violin-plot
Implementation:
sns.violinplot(x='SurvStat',y='op_yr',data=hb,size=6)
[Figure] Violin plot of op_yr by SurvStat.
Bivariate Analysis Techniques for Data Visualization
1. Line Plot
This is the plot that you can see in the nook and corners of any sort of analysis between 2 variables.
A line plot is nothing but the values on a series of data points connected with straight lines.
The plot may seem very simple but it has many applications, not only in machine learning but in many other areas.
Implementation:
The line plot is present in the Matplotlib package.
plt.plot(x,y)
[Figure] Sample line plot.
From the above line plot we can conclude the following observations:
Line plots are used for everything from distribution comparison using Q-Q plots to CV (cross-validation) tuning using the elbow method.
They are also used to analyze the performance of a model using the ROC-AUC curve.
2. Bar Plot
This is one of the widely used plots that we would have seen multiple times, not just in data analysis; we use this plot wherever there is a trend analysis in many fields.
Though it may seem simple, it is powerful in analyzing data like sales figures every week, revenue from a product, number of visitors to a site on each day of a week, etc.
Implementation:
The bar plot is present in the Matplotlib package; a minimal sketch is shown below.
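For illustration (the week labels and sales values below are made up):
import matplotlib.pyplot as plt

# hypothetical weekly sales figures
weeks = ['W1', 'W2', 'W3', 'W4']
sales = [120, 150, 90, 180]

plt.bar(weeks, sales)
plt.xlabel('Week')
plt.ylabel('Sales')
plt.show()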
From the above bar plot we can conclude the following observations:
We can visualize the data in a clean plot and convey the details straightforwardly to others.
This plot may be simple and clear, but it's not frequently used in Data Science applications.
3. Scatter Plot
It is one of the most commonly used plots for visualizing simple data in Machine Learning and Data Science.
This plot gives a representation where each point in the dataset is plotted with respect to any 2 to 3 features (columns).
Scatter plots are available in both 2-D and 3-D. The 2-D scatter plot is the common one, where we primarily try to find the patterns, clusters, and separability of the data.
Implementation:
The scatter plot is present in the Matplotlib package.
plt.scatter(x,y)
[Figure] Sample scatter plot.
From the above scatter plot we can conclude the following observations:
The colors are assigned to different data points based on how they are present in the dataset, i.e., the target column representation.
We can color the data points as per their class label given in the dataset.
What is a data type?
A data type is an attribute of a piece of data that tells a device how the end-user might interact with the data.
You can also think of them as categorizations that different coding programs might combine in order to execute
certain functions. Most programming languages including C++ and Java use the same basic data types.
10 data types
Each programming language uses a different combination of data types. Some of these types include:
1. Integer
Integer data types often represent whole numbers in programming. An integer's value moves from one integer to another without acknowledging fractional numbers in between. The number of digits can vary based on the device, and some programming languages may allow negative values.
2. Character
In coding, alphabet letters denote characters. Programmers might represent these data types as (CHAR) or (VARCHAR), and they can be single characters or a string of letters. Characters are usually fixed-length figures that default to 1 octet (an 8-bit unit of digital information) but can increase to 65,000 octets.
3. Date
This data type stores a calendar date with other programming information. Dates are typically a combination of integers or numerical figures. Since these are typically integer values, some programs can store basic mathematical operations like days elapsed since certain events or days away from an upcoming event.
4. Floating-point
Floating-point data types represent fractional numbers in programming. There are two main floating-point data
types, which vary depending on the number of allowable values in the string:
Float: A data type that typically allows up to seven points after a decimal.
Double: A data type that allows up to 15 points after a decimal.
5. Long
Long data types are often 32- or 64-bit integers in code. Sometimes, these can represent integers with 20 digits
in either direction, positive or negative. Programmers use an ampersand to indicate the data type is a long
variable.
6. Short
Similar to the long data type, a short is a variable integer. Programmers represent these as whole numbers, and
they can be positive or negative. Sometimes a short data type is a single integer.
7. String
A string data type is a combination of characters that can be either constant or variable. This often incorporates a sequence of character data types that result in specific commands depending on the programming language. Strings can include both upper and lowercase letters, numbers and punctuation.
8. Boolean
Boolean data is what programmers use to show logic in code. It's typically one of two values, true or false, intended to clarify conditional statements. These can be responses to "if/when" scenarios, where code indicates if a user performs a certain action. When this happens, the Boolean data directs the program's response, which determines the next code in the sequence.
9. Nothing
The nothing data type shows that a code has no value. This might indicate that a code is missing, the programmer started the code incorrectly or that there were values that defy the intended logic. It's also called "null".
10. Void
Similar to the nothing type, the void type contains a value that the code cannot process. Void data types tell a user that the code can't return a response. Programmers might use or encounter the void data type in early system testing when there are no responses programmed yet for future steps.
Data types can vary based on size, length and use depending on the coding language. Here are some examples of the data types listed above that you might encounter when programming:
Integer
Integers are digits that account for whole numbers only. Some integer examples include:
425
65
9
Character
Characters are letters or other figures that programmers might combine in a string, such as single letters, digits or punctuation symbols.
Date
Programmers can include individual dates, ranges or differences in their code. Some examples might be:
2009-09-15
1998-11-30 09:45:87
SYSDATETIME()
Long
Long data types are whole numbers, both positive and negative, that have many place values. Examples include:
-398,741,129,664,271
9,000,000,125,356,546
Short
Short data types can be up to several integers, but they are always less than long data. Examples include:
-27,400
5,428
17
Floating point
Similar but often longer in length, the floating-point double type can provide more accurate values, but it also may require additional memory to process.
String
Strings are a combination of figures that includes letters and punctuation. In some code, this might look like this:
String c = new String("Say Hello!")
Boolean
Boolean data can help guide the logic in a code. Here are some examples of how you might use this:
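A minimal illustrative sketch in Python (the flag name and screens are hypothetical):
# hypothetical flag set elsewhere in the program
user_logged_in = True

if user_logged_in:
    print("Show dashboard screen")
else:
    print("Show login screen")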
Depending on the program, the code may direct the end-user to different screens based on their selection.
Nothing
Nothing means a code has no value, but the programmer coded something other than the digit 0. This is often "Null", "NaN" or "Nothing" in code.
Void
The void data type in coding functions as an indicator that the code might not have a function or a response yet.
What are Interactive Data Visualizations?
Interactive data visualization supports exploratory thinking so that decision-makers can actively investigate intriguing findings. Interactive visualization supports faster decision making, greater data access and stronger user engagement, along with desirable results in several other metrics. Some of the key findings include:
70% of interactive visualization adopters report improved collaboration and knowledge sharing.
64% of interactive visualization adopters report improved user trust in underlying data.
Interactive visualizers are more likely than static visualizers to be satisfied with the ease of use of their analytical tools.
Examples of Interactive Data Visualization
The New Yorker (Interactive Visual Content for Media)
Bloomberg (Interactive Financial Data)
Sales Presentations
Training Modules
Product Collateral
Shareholder Presentations
Educational Content
By interacting with data to put the focus on specific metrics, decision-makers are able to compare specific data points throughout definable timeframes.
Apollo
Keuzestress
Marvel Cinematic Universe
Newsmap
The Big Mac Index
CF Weather Charts
Galaxy of Covers
Techniques
Data visualization is a graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. This blog on data visualization techniques will help you understand detailed techniques and benefits.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Benefits of good data visualization
Our eyes are drawn to colours and patterns. We can quickly identify red from blue, and a square from a circle. Our culture is visual, including everything from art and advertisements to TV and movies.
Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see something, we internalize it quickly. It's storytelling with a purpose. If you've ever stared at a massive spreadsheet of data and couldn't see a trend, you know how much more effective a visualization can be. The uses of data visualization are as follows:
It is a powerful way to explore data with presentable results.
Its primary use is in the pre-processing portion of the data mining process.
It supports the data cleaning process by finding incorrect and missing values.
It is used for variable derivation and selection, which means determining which variables to include in and discard from the analysis.
It also plays a role in combining categories as part of the data reduction process.
Data Visualization Techniques
Box plots
Histograms
Heat maps
Charts
Tree maps
Box Plots
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum"). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
A box plot is a graph that gives you a good indication of how the values in the data are spread out. Although box plots may seem primitive in comparison to a histogram or density plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or datasets. For some distributions/datasets, you will find that you need more information than the measures of central tendency (median, mean, and mode). You need to have information on the variability or dispersion of the data.
List of Methods to Visualize Data
Column Chart: It is also called a vertical bar chart, where each category is represented by a rectangle. The height of the rectangle is proportional to the value being plotted.
Bar Graph: It has rectangular bars in which the lengths are proportional to the values which are represented.
Stacked Bar Graph: It is a bar-style graph that has various components stacked together so that, apart from the bar, the components can also be compared to each other.
Stacked Column Chart: It is similar to a stacked bar graph; however, the data is stacked vertically.
Area Chart: It combines the line chart and bar chart to show how the numeric values of one or more groups change over the progress of a viable area.
Dual Axis Chart: It combines a column chart and a line chart and then compares the two variables.
Line Graph: The data points are connected through a straight line, therefore creating a representation of the changing trend.
Mekko Chart: It can be called a two-dimensional stacked chart with varying column widths.
Pie Chart: It is a chart where various components of a data set are presented in the form of a pie which represents their proportion in the entire data set.
Waterfall Chart: With the help of this chart, the cumulative effect of sequentially introduced positive or negative values can be understood.
Bubble Chart: It is a multi-variable graph that is a hybrid of a scatter plot and a proportional area chart.
Scatter Plot Chart: It is also called a scatter chart or scatter graph. Dots are used to denote values for two different numeric variables.
Bullet Graph: It is a variation of a bar graph. A bullet graph is used to swap dashboard gauges and meters.
Funnel Chart: The chart determines the flow of users through a business or sales process.
Heat Map: It is a technique of data visualization that shows the level of instances as colour in two dimensions.
Five Number Summary of Box Plot
Minimum: Q1 - 1.5*IQR
First quartile (Q1 / 25th percentile): the middle number between the smallest number (not the "minimum") and the median of the dataset
Median (Q2 / 50th percentile): the middle value of the dataset
Third quartile (Q3 / 75th percentile): the middle value between the median and the highest value (not the "maximum") of the dataset
Maximum: Q3 + 1.5*IQR
Interquartile range (IQR): the 25th to the 75th percentile
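These quantities can be computed quickly in Python; a minimal sketch with NumPy on made-up data:
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr   # the "minimum" used by the box plot
upper_whisker = q3 + 1.5 * iqr   # the "maximum" used by the box plot
print(q1, median, q3, iqr, lower_whisker, upper_whisker)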
Histograms
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data.
It is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc. It is an accurate representation of the distribution of numerical data, and it relates to only one variable. It includes bins or buckets: the range of values is divided into a series of intervals, and then we count how many values fall into each interval.
Bins are consecutive, non-overlapping intervals of a variable. As the adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.
In a histogram, the height of the bar does not necessarily indicate how many occurrences of scores there were within each bin. It is the product of the height multiplied by the width of the bin that indicates the frequency of occurrences within that bin. One of the reasons that the height of the bars is often incorrectly assessed as indicating the frequency, and not the area of the bar, is because a lot of histograms often have equally spaced bars (bins), and under these circumstances the height of the bin does reflect the frequency.
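A small illustration in Python (made-up data and equal-width bins, so the bar heights here do reflect frequencies):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=1000)   # made-up continuous data

# 20 equal-width, non-overlapping bins covering the range of the data
plt.hist(data, bins=20, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()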
The major difference is that a histogram is only used to plot the frequency of score occurrences in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, can be used for a lot of other types of variables including ordinal and nominal data sets.
Heat Maps
A heat map is data analysis software that uses colour the way a bar graph uses height and width: as a data visualization tool.
If you're looking at a web page and you want to know which areas get the most attention, a heat map shows you in a visual way that's easy to assimilate and make decisions from. It is a graphical representation of data where the individual values contained in a matrix are represented as colours.
Note that heat maps are useful when examining a large number of values, but they are not a replacement for more precise graphical displays, such as bar charts, because colour differences cannot be perceived accurately.
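A minimal sketch of a heat map in Python with seaborn (the matrix values are random, purely for illustration):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

matrix = np.random.rand(8, 8)          # made-up 8x8 matrix of values
sns.heatmap(matrix, cmap='viridis')    # each cell is coloured by its value
plt.show()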
Charts

Line Chart

The simplest technique, a line plot is used to plot the relationship or dependence of one variable on another. To plot the relationship between two variables, we can simply call the plot function.
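For example, a line plot of one variable against another can be produced with a single call to matplotlib's plot function; the data below is invented for illustration.

import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021, 2022]
sales = [120, 150, 90, 180, 210]   # hypothetical yearly sales figures

plt.plot(years, sales, marker='o')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Line chart: dependence of sales on year')
plt.show()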
Bar Charts

Bar charts are used for comparing the quantities of different categories or groups. Values of a category are represented with the help of bars, which can be configured as vertical or horizontal, with the length or height of each bar representing the value.
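A corresponding bar chart sketch, comparing a few made-up categories.

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
quantities = [23, 45, 12, 36]   # hypothetical quantities per category

plt.bar(categories, quantities)   # vertical bars; plt.barh() would give horizontal bars
plt.xlabel('Category')
plt.ylabel('Quantity')
plt.title('Bar chart comparing categories')
plt.show()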
Pie Chart

It is a circular statistical graph which is divided into slices to illustrate numerical proportion. The arc length of each slice is proportional to the quantity it represents. As a rule, pie charts are used to compare the parts of a whole and are most effective when there are few components and when text and percentages are included to describe the content. However, they can be difficult to interpret because the human eye has a hard time estimating areas and comparing visual angles.
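A pie chart sketch with a small number of components and percentage labels, as recommended above; the shares are invented.

import matplotlib.pyplot as plt

labels = ['Product A', 'Product B', 'Product C']
shares = [45, 30, 25]   # hypothetical shares of a whole

plt.pie(shares, labels=labels, autopct='%1.1f%%')   # arc length is proportional to each share
plt.title('Pie chart of parts of a whole')
plt.show()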
Scatter Charts

Another common visualization technique is the scatter plot: a two-dimensional plot representing the joint variation of two data items. Each marker (a symbol such as a dot, square or plus sign) represents an observation, and the marker position indicates the value for each observation. When you assign more than two measures, a scatter plot matrix is produced: a series of scatter plots displaying every possible pairing of the measures assigned to the visualization. Scatter plots are used for examining the relationship, or correlation, between X and Y variables.
Bubble Charts

It is a variation of the scatter chart in which the data points are replaced with bubbles, and an additional dimension of the data is represented in the size of the bubbles.
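A scatter plot and its bubble-chart variation can both be sketched with matplotlib's scatter function; the third variable mapped to bubble size below is invented for illustration.

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(30)
y = 2 * x + np.random.normal(scale=0.1, size=30)   # roughly correlated with x
sizes = np.random.rand(30) * 300                   # third variable shown as bubble size

plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel('X variable')
plt.ylabel('Y variable')
plt.title('Bubble chart: scatter plot with a size dimension')
plt.show()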
Timeline Charts

Timeline charts illustrate events in chronological order, for example the progress of a project, an advertising campaign, or an acquisition process, in whatever unit of time the data was recorded (week, month, quarter, year). They show the chronological sequence of past or future events on a timescale.
Tree Maps

A treemap is a visualization that displays hierarchically organized data as a set of nested rectangles, parent elements being tiled with their child elements. The sizes and colours of the rectangles are proportional to the values of the data points they represent. A leaf-node rectangle has an area proportional to the specified dimension of the data, and depending on the choice, the leaf node is coloured, sized or both according to the chosen attributes. Treemaps make efficient use of space and can therefore display thousands of items on the screen simultaneously.

The variety of big data brings challenges because semi-structured and unstructured data require new visualization techniques. A word cloud visual represents the frequency of a word within a body of text with its relative size in the cloud. This technique is used on unstructured data as a way to display high- or low-frequency words.
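A short sketch of a word cloud, assuming the third-party wordcloud package is installed (pip install wordcloud); the text is a placeholder.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "big data stream analytics big data velocity volume variety stream stream data"

wc = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.imshow(wc, interpolation='bilinear')   # word size reflects word frequency
plt.axis('off')
plt.show()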
Another visualization technique that can be used for semi-structured or unstructured data is the network
diagram. Network diagrams represent relationships as nodes (individual actors within the network) and ties
(relationships between the individuals). They are used in many applications, for example for analysis of social
networks or mapping product sales across geographic areas.
FAQs Related to Data Visualization

What are the techniques of Visualization?
A: The visualization techniques include Pie and Donut Charts, Histogram Plot, Scatter Plot, Kernel Density Estimation for Non-Parametric Data, Box and Whisker Plot for Large Data, Word Clouds and Network Diagrams for Unstructured Data, and Correlation Matrices.

What are the types of visualization?
A: The various types of visualization include Column Chart, Line Graph, Bar Graph, Stacked Bar Graph, Dual-Axis Chart, Pie Chart, Mekko Chart, Bubble Chart, Scatter Chart, and Bullet Graph.

What are the various visualization techniques used in data analysis?
A: Various visualization techniques are used in data analysis. A few of them include Box and Whisker Plot for Large Data, Histogram Plot, and Word Clouds and Network Diagrams for Unstructured Data, to name a few.
Introduction To Streams Concepts – Stream Data Model and Architecture - Stream Computing - Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Oneness in a Window – Decaying Window - Real time Analytics Platform (RTAP) Applications - Case Studies - Real Time Sentiment Analysis, Stock Market Predictions

Stream in Data Analytics

Introduction to stream concepts:
A data stream is an existing, continuous, ordered (implicitly by entrance time or explicitly by timestamp) chain of items. It is unfeasible to control the order in which items arrive, nor is it feasible to locally capture a stream in its entirety. Data streams involve enormous volumes of data, with items arriving at a high rate.

Data stream –
A data stream is a (possibly unbounded) sequence of tuples. Each tuple is comprised of a set of attributes, similar to a row in a database table.

Examples of stream sources:
1. Sensor Data –
Sensors (for example, temperature or GPS sensors) report readings continuously, producing streams of numeric values. The volumes involved mean we for sure need to think about what can be kept for continued processing and what can only be archived.
2. Image Data –
Satellites often send down to earth streams containing many terabytes of images per day. Surveillance cameras generate images with lower resolution than satellites, but there can be very many of them, each producing a stream of images at intervals of about one second.

3. Internet and Web Traffic –
A switching node in the middle of the internet receives streams of IP packets from many inputs and routes them to its outputs. Websites receive streams of various types; for example, Google receives a hundred million search queries per day.
Characteristics of Data Streams:
1. Large volumes of continuous data, possibly infinite.
2. Steadily changing and requiring a fast, real-time response.
3. The data stream model captures our data processing needs of today nicely.
4. Random access is expensive: a single-scan algorithm (one look at the data) is typically all that is possible.
5. Only a summary of the data seen so far is stored.
6. Most stream data are at a pretty low level or multidimensional in nature, and need multilevel and multidimensional processing.
Applications of Data Streams:
1. Fraud detection
What is streaming data architecture? (Stream data model and architecture in big data)
Before we get to streaming data architecture, it is vital that you first understand streaming data. Streaming data
is a general term used to describe data that is generated continuously at high velocity and in large volumes.
A stream data source is characterized by continuous time-stamped logs that document events in real time.
Examples include a sensor reporting the current temperature, or a user clicking a link on a web page. Stream
data sources include:
Clickstream data from websites and apps
IoT sensors
Therefore, a streaming data architecture is a dedicated network of software components capable of ingesting and processing copious amounts of stream data from many sources. Unlike conventional data architecture solutions, which focus on batch reading and writing, a streaming data architecture ingests data as it is generated in its raw form, stores it, and may incorporate different components for real-time data processing and manipulation.

An effective streaming architecture must account for the distinctive characteristics of data streams, which tend to generate copious amounts of structured and semi-structured data that requires ETL and pre-processing to be useful.
Due to its complexity, stream processing cannot be solved with one ETL tool or database. That’s why
organizations need to adopt solutions consisting of multiple building blocks that can be combined with data
pipelines within the organization’s data architecture.
Although stream processing was initially considered a niche technology, it is hard to find a modern business that
does not have an eCommerce site, an online advertising strategy, an app, or products enabled by IoT.
Each of these digital assets generates real-time event data streams, thus fueling the need to implement a streaming data architecture capable of handling powerful, complex, real-time analytics.

Batch processing vs. real-time stream processing
In batch data processing, data is downloaded in batches before being processed, stored, and analyzed. Stream processing, on the other hand, ingests data continuously, allowing it to be processed simultaneously and in real time.
The complexity of the current business requirements has rendered legacy data processing methods obsolete
because they do not collect and analyze data in real-time. This doesn’t work for modern organizations as they
need to act on data in real-time before it becomes stale.
Benefits of stream data processing
The main benefit of stream processing is real-time insight. We live in an information age where new data is
constantly being created. Organizations that leverage streaming data analytics can take advantage of real-time
information from internal and external assets to inform their decisions, drive innovation and improve their
overall strategy. Here are a few other benefits of data stream processing:
Batch processing tools need to gather batches of data and integrate the batches to gain a meaningful conclusion. By reducing the overhead delays associated with batching events, organizations can gain instant insights from huge amounts of stream data.

Stream processing processes and analyzes data in real time to provide up-to-the-minute analytics and insights. This is very beneficial to companies that need real-time tracking and streaming data analytics of their processes. It also comes in handy in other scenarios such as detection of fraud and data breaches and machine performance analysis.
Batch processing systems may be overwhelmed by growing volumes of data, necessitating the addition of other
resources, or a complete redesign of the architecture. On the other hand, modern streaming data architectures
are hyper-scalable, with a single stream processing architecture capable of processing gigabytes of data per
second [4].
Detecting patterns in time-series data
Detection of patterns in time-series data, such as analyzing trends in website traffic statistics, requires data to
be continuously collected, processed, and analyzed. This process is considerably more complex in batch
processing as it divides data into batches, which may result in certain occurrences being split across different
batches.
Increased ROI

The ability to collect, analyze and act on real-time data gives organizations a competitive edge in their respective marketplaces. Real-time analytics makes organizations more responsive to customer needs, market trends, and business opportunities.

Organizations rely on customer feedback to gauge what they are doing right and what they can improve on. Organizations that respond to customer complaints and act on them promptly generally have a good reputation [5].

Fast responsiveness to customer complaints, for example, pays dividends when it comes to online reviews and word-of-mouth advertising, which can be a deciding factor for attracting prospective customers and converting them into actual customers.
Losses reduction

In addition to supporting customer retention, stream processing can prevent losses by providing warnings of impending issues such as financial downturns, data breaches, system outages, and other issues that negatively affect business outcomes. With real-time information, a business can mitigate, or even prevent, the damage before it spreads.

As data volumes from digital processes and connected devices grow, stream processing becomes a vital necessity, since some of these processes may generate up to a gigabyte of data per second. Stream processing is also becoming a vital component in many enterprise data infrastructures. For example, organizations can use clickstream analytics to track website visitor behaviour and tailor their content accordingly.
Likewise, historical data analytics can help retailers show relevant suggestions and prevent shopping cart
abandonment. Another common use case scenario is IoT data analysis, which typically involves analyzing large
streams of data from connected devices and sensors.
The following are some of the most common challenges in streaming data architecture, along with possible solutions.

Business integration hiccups

Most organizations have many lines of business and application teams, each working concurrently on its own mission and challenges. For the most part, this works fairly seamlessly until various teams need to integrate and manipulate real-time event data streams.

Organizations can federate the events through multiple integration points so that the actions of one or more teams don't inadvertently disrupt the entire system.
Scalability bottlenecks

As an organization grows, so do its datasets. When the current system is unable to handle the growing datasets, operations become a major problem. For example, backups take much longer and consume a significant amount of resources. Similarly, rebuilding indexes, reorganizing historical data, and defragmenting storage become more time-consuming and resource-intensive operations.

To solve this, organizations can check the production environment loads. By test-running the expected load of the system using past data before implementing it, they can find and fix problems [8].
Fault tolerance

These are crucial considerations when working with stream processing or any other distributed system. Since data comes from different sources in varying volumes and formats, an organization's systems must be able to withstand disruptions from any point of failure and effectively store large streams of data.
The following are some of the components of a streaming data architecture:

Message broker (stream processor)

The message broker collects data from a source, also known as a producer, converts it to a standard message format, and then streams it for consumption by other components such as data warehouses and ETL tools, among others.

Despite their high throughput, stream processors don't do any data transformation or task scheduling. First-generation message brokers such as Apache ActiveMQ and RabbitMQ relied on the Message Oriented Middleware (MOM) paradigm. These systems were later replaced by high-performance messaging platforms (stream processors), which are better suited to a streaming paradigm.

Unlike legacy MOM brokers, modern message brokers offer high-performance capabilities, have a huge capacity for message traffic, and are highly focused on streaming, with minimal support for task scheduling and data transformation.
Stream processors can act as a proxy between two applications, whereby communication is achieved through queues. In that case, we can refer to them as point-to-point brokers. Alternatively, if an application is broadcasting a single message or dataset to multiple applications, we can say that the broker is acting in a Publish/Subscribe model.
Stream data processes are vital components of the big data architecture in data-intensive organizations. In most cases, data from multiple message brokers must be transformed and structured before the data sets can be analyzed, typically using SQL-based analytics tools.

This can also be achieved using an ETL tool or other platform that receives queries from users, gathers events from message queues, then generates results by applying the query. Other processes such as additional joins, aggregations, and transformations can also run concurrently. The result may be an action, a visualization, an API call, an alert, or in other cases, a new data stream.

After the stream data is processed and stored, it should be analyzed to give actionable value. For this, you need data analytics tools such as query engines, text search engines, and streaming data analytics tools like Amazon Kinesis and Azure Stream Analytics.

Even with a robust streaming data architecture, you still need streaming architecture patterns to build reliable, secure, scalable applications in the cloud. They include:
Idempotent Producer
A typical event streaming platform cannot deal with duplicate events in an event stream. That’s where the
idempotent producer pattern comes in. This pattern deals with duplicate events by assigning each producer a
producer ID (PID). Every time it sends a message to the broker, it includes its PID along with a monotonically
increasing sequence number.
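A minimal, broker-agnostic sketch of the idea: the receiving side keeps the highest sequence number seen per producer ID and drops anything it has already processed. The class and field names are illustrative, not from any specific platform.

from dataclasses import dataclass

@dataclass
class Message:
    producer_id: str      # PID assigned to the producer
    sequence: int         # monotonically increasing per producer
    payload: dict

class IdempotentReceiver:
    def __init__(self):
        self.last_seen = {}   # producer_id -> highest sequence number processed

    def accept(self, msg: Message) -> bool:
        # drop the message if this producer's sequence number was already handled
        if msg.sequence <= self.last_seen.get(msg.producer_id, -1):
            return False
        self.last_seen[msg.producer_id] = msg.sequence
        return True

receiver = IdempotentReceiver()
print(receiver.accept(Message("producer-1", 0, {"event": "order"})))   # True, first delivery
print(receiver.accept(Message("producer-1", 0, {"event": "order"})))   # False, duplicate dropped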
Event Splitter

Data sources mostly produce messages with multiple elements. The event splitter works by splitting an event into multiple events. For instance, it can split an eCommerce order event into multiple events per order item, making it easy to perform streaming data analytics.
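A small sketch of an event splitter for the eCommerce example, assuming an order event carries a list of items (all field names are illustrative).

def split_order_event(order_event):
    # emit one event per order item so each can be analyzed independently
    for item in order_event["items"]:
        yield {
            "order_id": order_event["order_id"],
            "product": item["product"],
            "quantity": item["quantity"],
        }

order = {
    "order_id": "A-1001",
    "items": [
        {"product": "keyboard", "quantity": 1},
        {"product": "mouse", "quantity": 2},
    ],
}

for event in split_order_event(order):
    print(event)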
Event Grouper

In some cases, events only become significant after they happen several times. For instance, an eCommerce business will attempt parcel delivery at least three times before asking a customer to collect their order from the depot.

The business achieves this by grouping logically similar events and then counting the number of occurrences over a given period.
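A sketch of that grouping-and-counting idea for the delivery example: failed-delivery events are grouped per order, and an action is triggered once the count reaches three. The event format is assumed for illustration.

from collections import defaultdict

failed_attempts = defaultdict(int)   # order_id -> number of failed delivery attempts

def on_delivery_failed(event):
    order_id = event["order_id"]
    failed_attempts[order_id] += 1
    if failed_attempts[order_id] >= 3:
        # after three failed attempts, ask the customer to collect the parcel
        print(f"Order {order_id}: please collect your parcel from the depot")

for e in [{"order_id": "A-1001"}, {"order_id": "A-1001"}, {"order_id": "A-1001"}]:
    on_delivery_failed(e)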
Claim-check pattern
Message-based architectures often have to send, receive and manipulate large messages, such as in video
processing and image recognition. Since it is not recommended to send such large messages directly to the
message bus, organizations can send the claim check to the messaging platform instead and store the message
on an external service.
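A sketch of the claim-check pattern: the large payload goes to an external store, and only a small reference (the claim check) travels through the message bus. The store and queue here are simple in-memory stand-ins.

import uuid

blob_store = {}      # stand-in for external storage (e.g., an object store)
message_queue = []   # stand-in for the message bus

def send_large_message(payload: bytes):
    claim_check = str(uuid.uuid4())
    blob_store[claim_check] = payload                     # store the heavy payload externally
    message_queue.append({"claim_check": claim_check})    # send only the small reference

def receive_message():
    msg = message_queue.pop(0)
    return blob_store[msg["claim_check"]]                 # retrieve the payload using the claim check

send_large_message(b"...many megabytes of video or image data...")
print(len(receive_message()))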
Final thoughts on streaming data architecture and streaming data analytics

As stream data models and architecture become a vital component in the development of modern data platforms, organizations are shifting from legacy monolithic architectures to a more decentralized model to promote flexibility and scalability. The resulting effect is the delivery of robust and expedient solutions that not only improve service delivery but also give an organization a competitive edge.
Stream Computing
Stream computing, the long-held dream of "high real-time computing" and "high-throughput computing", with programs that compute continuous data streams, has opened up a new era of computing driven by big data: datasets that are large, fast, dispersed, unstructured, and beyond the ability of available hardware and software facilities to undertake their acquisition, access, analytics, and application in a reasonable amount of time and space.

Stream computing is a computing paradigm that reads data from collections of software or hardware sensors in stream form and computes continuous data streams, where feedback results should be produced as a real-time data stream as well. A data stream is a sequence of data sets; a continuous stream is an infinite sequence of data sets; and parallel streams have more than one stream to be processed at the same time.

Stream computing is one effective way to support big data by providing extremely low-latency processing with massively parallel processing architectures, and it is becoming the fastest and most efficient way to obtain useful knowledge from big data, allowing organizations to react quickly when problems appear or to predict new trends in the near future.

A big data input stream has the characteristics of high speed, real time, and large volume for applications such as sensor networks, network monitoring, microblogs, web exploration, social networking, and so on.
These data sources often take the form of continuous data streams, and timely analysis of such a data stream is very important, as the life cycle of most of the data is very short.

Furthermore, the volume of data is so high that there is not enough space for storage, and not all data need to be stored. Thus, the storing-then-computing batch model does not fit at all. Nearly all data in big data environments have the features of streams, and stream computing has appeared to solve the dilemma of big data computing by computing data online within real-time constraints. Consequently, the stream computing model will be a new trend for high-throughput computing in the big data era.
Counting Distinct Elements in a Stream

Again the assumption here is that the universal set of all elements is too large to keep in memory, so we'll need to find another way to count how many distinct values we've seen. If we're okay with getting an estimate instead of the exact value, we can use the Flajolet-Martin (FM) algorithm.
Flajolet-Martin Algorithm
In the FM algorithm we hash the elements of a stream into a bit string. A bit string is a sequence of zeros and ones, such as 1011000111010100100. A bit string of length L can hold 2^L possible combinations. For the FM algorithm to work, we need L to be large enough that the bit string has more possible combinations than there are elements in the universal set. This basically means that there should be no collisions when we hash the elements into bit strings.
The idea behind the FM algorithm is that the more distinct elements we see, the higher the likelihood that one
of their hash values will be “unusual”. The specific “unusualness” we will exploit here is that the bit string ends
in many consecutive 0s.
For example, the bit string 1011000111010100100 ends with 2 consecutive zeros. We call this value of 2 the tail length of the bit string. Now let R be the maximum tail length that we have seen among all hashed bit strings of the stream. The estimate of the number of distinct elements using FM is simply 2^R.
#Library function for non-streams
import hashlib
import numpy as np

def flajoletMartin(iterator):
    max_tail_length = 0
    for val in iterator:
        # hash each element and turn the hash into a bit string
        bit_string = bin(hash(hashlib.md5(val.encode('utf-8')).hexdigest()))
        # count the number of trailing zeros (the tail length)
        i = len(bit_string) - 1
        tail_length = 0
        while i >= 0:
            if bit_string[i] == '0':
                tail_length += 1
            else:
                break
            i -= 1
        # keep the longest tail seen so far
        max_tail_length = max(max_tail_length, tail_length)
    return (2**max_tail_length)

# 'words' is assumed to be a list of candidate strings defined earlier in the notebook
testList = []
n = 0
while n < 100000:
    n += 1
    testList.append(np.random.choice(words))

print(flajoletMartin(iter(testList)))
65536
Note that a single hash function can only ever produce an estimate that is a power of 2. If the correct count is between two large powers of 2, for example 7000, it will be impossible to get a good estimate using just one hash function. The usual remedy is to use many different hash functions, group the resulting estimates, take the average within each group, and then take the median of those averages; this way we can get estimates that aren't just powers of 2.
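A sketch of that refinement, reusing the tail-length idea above with several salted hash functions; the number of groups and hashes per group are arbitrary choices for illustration.

import hashlib
import statistics

def tail_length(bits):
    # number of trailing zeros in the bit string
    return len(bits) - len(bits.rstrip('0'))

def fm_single(elements, salt):
    # one FM estimate; the salt makes each call behave like a different hash function
    max_tail = 0
    for val in elements:
        digest = hashlib.md5((salt + val).encode('utf-8')).hexdigest()
        max_tail = max(max_tail, tail_length(bin(int(digest, 16))))
    return 2 ** max_tail

def fm_combined(elements, num_groups=5, hashes_per_group=4):
    # in a real stream all hash functions would be updated in a single pass;
    # here we simply re-scan a small in-memory list for clarity
    averages = []
    for g in range(num_groups):
        estimates = [fm_single(elements, salt=f"{g}-{h}-") for h in range(hashes_per_group)]
        averages.append(sum(estimates) / len(estimates))
    # the median of the group averages smooths out the power-of-2 jumps
    return statistics.median(averages)

print(fm_combined(['a', 'b', 'c', 'd', 'e', 'a', 'b'] * 100))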
Decaying Windows
We have assumed that a sliding window held a certain tail of the stream, either the most recent N elements for fixed N, or all the elements that arrived after some time in the past. Sometimes we do not want to make a sharp distinction between recent elements and those in the distant past, but want to weight the recent elements more heavily. In this section, we consider "exponentially decaying windows," and an application where they are quite useful: finding the most common "recent" elements.

Suppose we have a stream whose elements are the movie tickets purchased all over the world, with the name of the movie as part of the element. We want to keep a summary of the stream that gives the most popular movies "currently."

While the notion of "currently" is imprecise, intuitively we want to discount the popularity of a movie like Star Wars - Episode 4, which sold many tickets, but most of these were sold decades ago. On the other hand, a movie that sold n tickets in each of the last 10 weeks is probably more popular than a movie that sold 2n tickets last week but nothing since then.

One solution would be to imagine a bit stream for each movie, and give it value 1 if the ticket is for that movie, and 0 if not. Pick a window size N, which is the number of most recent tickets that would be considered in evaluating popularity. Then, use the method of Section 4.6 to estimate the number of tickets for each movie, and rank movies by their estimated counts.
This technique might work for movies, because there are only thousands of movies, but it would fail if we were instead recording the popularity of items sold at Amazon, or the rate at which different Twitter users tweet, because there are too many Amazon products and too many tweeters. Further, it only offers approximate answers.
An alternative approach is to redefine the question so that we are not asking for a count of 1's in a window. Rather, let us compute a smooth aggregation of all the 1's ever seen in the stream, with decaying weights, so that the further back in the stream an element is, the less weight it is given. Formally, let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first element to arrive and at is the current element. Let c be a small constant, such as 10^-6 or 10^-9. Define the exponentially decaying window for this stream to be the sum

    Σ_{i=0}^{t-1} a_{t-i} (1 - c)^i

The effect of this definition is to spread out the weights of the stream elements as far back in time as the stream goes. In contrast, a fixed window with the same sum of weights, 1/c, would put equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous elements. The distinction is suggested by Fig. 4.4.
Figure 4.4: A decaying window and a fixed-length window of equal weight

It is much easier to adjust the sum in an exponentially decaying window than in a sliding window of fixed length. In the sliding window, we have to worry about the element that falls out of the window each time a new element arrives. That forces us to keep the exact elements along with the sum, or to use an approximation scheme such as DGIM. However, when a new element a_(t+1) arrives at the stream input, all we need to do is:

1. Multiply the current sum by 1 - c.
2. Add a_(t+1).

The reason this method works is that each of the previous elements has now moved one position further from the current element, so its weight is multiplied by 1 - c. Further, the weight on the current element is (1 - c)^0 = 1, so adding a_(t+1) is the correct way to include the new element's contribution.
Let us return to the problem of finding the most popular movies in a stream of ticket sales. We shall use an exponentially decaying window with a constant c, which you might think of as 10^-9. That is, we approximate a sliding window holding the last one billion ticket sales. For each movie, we imagine a separate stream with a 1 each time a ticket for that movie appears in the stream, and a 0 each time a ticket for some other movie arrives. The decaying sum of the 1's measures the current popularity of the movie.

We imagine that the number of possible movies in the stream is huge, so we do not want to record values for the unpopular movies. Therefore, we establish a threshold, say 1/2, so that if the popularity score for a movie goes below this number, its score is dropped from the counting. For reasons that will become obvious, the threshold must be less than 1, although it can be any number less than 1. When a new ticket arrives on the stream, do the following:

1. For each movie whose score we are currently maintaining, multiply its score by (1 - c).
2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that score. If there is no score for M, create one and initialize it to 1.

It may not be obvious that the number of movies whose scores are maintained at any time is limited. However, note that the sum of all scores is 1/c. There cannot be more than 2/c movies with a score of 1/2 or more, or else the sum of the scores would exceed 1/c. Thus, 2/c is a limit on the number of movies being counted at any time. Of course, in practice, the ticket sales would be concentrated on only a small number of movies at any time, so the number of actively counted movies would be much less than 2/c.
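A small sketch of this decaying-window scoring procedure; the constant c and the ticket stream below are illustrative (a real value of c would be much smaller).

c = 0.1          # decay constant; in practice something like 1e-9
threshold = 0.5  # scores below this are dropped

scores = {}      # movie -> current decayed popularity score

def new_ticket(movie):
    # 1. decay every score we are currently maintaining
    for m in list(scores):
        scores[m] *= (1 - c)
        if scores[m] < threshold:
            del scores[m]          # drop movies that fell below the threshold
    # 2. add 1 for the movie on the new ticket
    scores[movie] = scores.get(movie, 0) + 1

for ticket in ["StarWars", "Dune", "Dune", "Dune", "StarWars", "Dune"]:
    new_ticket(ticket)

print(scores)   # 'Dune' ends up with the higher decayed score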
Solution ideas
This solution idea describes how you can get insights from live streaming data. Capture data continuously from
any IoT device, or logs from website clickstreams, and process it in near-real time.
Architecture

Dataflow

1. Easily ingest live streaming data for an application, by using Azure Event Hubs.
2. Bring together all your structured data using Synapse Pipelines to Azure Blob Storage.
3. Take advantage of Apache Spark pools to clean, transform, and analyze the streaming data, and combine it with structured data from operational databases or data warehouses.
4. Use scalable machine learning/deep learning techniques to derive deeper insights from this data, using Python, Scala, or .NET, with notebook experiences in Apache Spark pools.
5. Apply Apache Spark pool and Synapse Pipelines in Azure Synapse Analytics to access and move data at scale.
6. Build analytics dashboards and embedded reports in a dedicated SQL pool to share insights within your organization, and use Azure Analysis Services to serve this data to thousands of users.
7. Take the insights from Apache Spark pools to Azure Cosmos DB to make them accessible through real-time apps.
Components
Azure Synapse Analytics is the fast, flexible, and trusted cloud data warehouse that lets you scale, compute, and store elastically and independently, with a massively parallel processing architecture.

Synapse Pipelines allows you to create, schedule, and orchestrate your ETL/ELT workflows.

Azure Data Lake Storage: massively scalable, secure data lake functionality built on Azure Blob Storage.

Azure Synapse Analytics Spark pools provide a fast, easy, and collaborative Apache Spark-based analytics platform.

Azure Event Hubs is a big data streaming platform and event ingestion service.

Azure Cosmos DB is a globally distributed, multi-model database service that lets you replicate your data across any number of Azure regions and scale your throughput independently from your storage.

Azure Synapse Link for Azure Cosmos DB enables you to run near real-time analytics over operational data in Azure Cosmos DB, without any performance or cost impact on your transactional workload, by using the two analytics engines available from your Azure Synapse workspace: serverless SQL pools and Spark pools.

Azure Analysis Services is enterprise-grade analytics as a service that lets you govern, deploy, test, and deliver your BI solution with confidence.

Power BI is a suite of business analytics tools that deliver insights throughout your organization. Connect to hundreds of data sources, simplify data prep, and drive ad hoc analysis. Produce beautiful reports, then publish them for your organization to consume on the web and across mobile devices.
Alternatives

Synapse Link is the Microsoft-preferred solution for analytics on top of Azure Cosmos DB data.

Azure IoT Hub can be used instead of Azure Event Hubs. IoT Hub is a managed service hosted in the cloud that acts as a central message hub for communication between an IoT application and its attached devices. You can connect millions of devices and their backend solutions reliably and securely.
Scenario details
This scenario illustrates how you can get insights from live streaming data. You can capture data continuously
from any IoT device, or logs from website clickstreams, and process it in near-real time.
Potential use cases
This solution is ideal for the media and entertainment industry. The scenario is for building analytics from live
streaming data.
Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
Real-Time Analytics Platform (RTAP)

Real-time analytics involves processing data as it arrives, which brings challenges that need to be figured out in order not to interrupt the flow. The time to react for real-time analysis can vary from nearly instantaneous to a few minutes or seconds. The key components of real-time analytics comprise the following:

o Aggregator
o Broker
o Analytics engine
o Stream processor

Momentum is the primary benefit of real-time analysis of data. The shorter the wait between the moment data arrives and is processed and the moment the business is able to utilize the resulting insights, the faster it can make changes and act on a crucial decision.
In the same way, real-time analytics tools allow companies to see how users interact with a product right after its release, so there is no problem in understanding user behaviour and making the necessary adjustments.

Advantages of Real-time Analytics:

Perform immediate adjustments if necessary.

Other Benefits:
Other advantages and benefits include managing data location, detecting irregularities, enhancing marketing and sales, etc.

Real Time Sentiment Analysis
Sentiment analysis is a text analysis tool that uses machine learning with natural language processing (NLP) to automatically read and classify text as positive, negative, neutral, and everywhere in between. It can read all manner of text (online and elsewhere) for opinion and emotion, to understand the thoughts and feelings of the writer.

See the example below from a pre-trained sentiment analyzer, which easily classifies a customer comment as negative with near 100% confidence.

Tag: Negative, Confidence: 100.0%
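As a rough illustration of the same idea with open-source tools (not the classifier used above), NLTK's pre-trained VADER analyzer can score a comment in a few lines; it requires a one-time download of the vader_lexicon resource.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # one-time download of the pre-trained lexicon

analyzer = SentimentIntensityAnalyzer()
comment = "The delivery was late and the support team never answered my emails."

scores = analyzer.polarity_scores(comment)   # returns neg, neu, pos and compound scores
print(scores)
label = "negative" if scores["compound"] < 0 else "positive or neutral"
print("Classified as:", label)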
Sentiment analysis can be put to work on hundreds of pages and thousands of individual opinions in just
seconds, and constantly monitor Twitter, Facebook, emails, customer service tickets, etc., 24/7 and in real time.
Real-time sentiment gives you a window into what your customers and the public at large are expressing about
your brand “right now” for targeted, minute-by-minute analysis, and to follow brand sentiment over time.
Marketing campaign success analysis
Target your analysis to follow marketing campaigns right as they launch and get a solid idea of how your
messaging is working with current and potentially new customers. Find out which demographics respond most
positively or negatively. Follow sentiment as it rises or falls, and compare current campaigns against previous
ones.
Prevention of business disasters

In 2017, United Airlines forcibly removed a passenger from an overbooked flight. Other passengers posted videos of the incident to Facebook, one of which had been viewed 6.8 million times just 24 hours later. After United's CEO described the incident as "reaccommodat[ing] these customers," Twitter exploded in outrage and public shaming of United.

Negative comments on social media can travel around the world in just minutes. Real-time sentiment analysis of Twitter data, for example, will allow you to put out fires from negative comments before they grow out of control, or use positive comments to your advantage. Oftentimes it's helpful just to let your customers know you're listening.
You can similarly follow feedback on new products right as they're released. Influencers (and regular social media users) are eager to be the first commenters upon the release of new products or updates. Follow social media and online reviews to tweak products or beta releases right after release, or stimulate conversations with your customers, so they always know they're important to you.

You can even use social media sentiment analysis for market research to find out what's missing from the market, or for competitive research to exploit the shortcomings of your competition and create new products.
Follow the real-time sentiment of any business as it rises and falls to get up-to-the-minute information on stock price changes. If a new product release is met with enthusiasm across the board, you can expect the stock to rise, while a social media PR crisis can bring even industry giants to their knees.
How to perform real-time sentiment analysis:
1. Decide what you want to achieve
2. Gather your data
3. Clean your data
4. Analyze & visualize sentiments in real-time
5. Act on your results

There are two options when it comes to performing sentiment analysis: build a model or invest in a SaaS tool. Building a model can produce exceptional results, but it is time-consuming and costly.

SaaS tools, on the other hand, are generally ready to put into use right away, much less expensive, and you can still train custom models to the specific language, needs, and criteria of your organization.

MonkeyLearn's powerful SaaS platform offers immediate access to sentiment analysis tools and other text analytics techniques, like the keyword extractor, survey feedback classifier, intent and email classifier, and many more.

And with MonkeyLearn Studio, you can analyze and visualize your results in real time.

Let's take a look at how easy it can be to perform real-time sentiment analysis.
1. Decide what you want to achieve

First, decide what you want to achieve. Do you want to compare sentiment toward your brand against that of your competition? Do you want to regularly mine Twitter or perform social listening to extract brand mentions?

Maybe you need to automatically analyze email exchanges and customer support tickets to get an idea of how well your customer service is working. The use cases for real-time sentiment analysis are practically endless when you have the right tools in place.

2. Gather your data

There are a number of ways to get the data you need, from simply cutting and pasting, to using APIs. Below are some of the most common and easiest to use.
Tools like Zapier easily integrate with MonkeyLearn to pull brand mentions from Twitter or other outlets of your choice.

Web mining and web scraping tools, like Dexi, Content Grabber, and Pattern, allow you to link APIs or extract content directly from the web into CSV or Excel files, and more.

APIs:
The Graph API is best for pulling data directly from Facebook.
Twitter's API allows users access to public Twitter data.
The Python Reddit API Wrapper scrapes data from subreddits, accesses comments from specific posts, and more.
3. Clean your data

Website, social media, and email data often have quite a bit of "noise." This can be repetitive text, banner ads, non-text symbols and emojis, email signatures, etc. You need to first remove this unnecessary data, or it will skew your results.

You can run a spell check or scan documents for URLs and symbols, but you're much better off automating this process, especially for accurate real-time analysis, because time is of the essence, and manual data cleaning will create an information bottleneck.

MonkeyLearn offers several models to make data cleaning quick and easy. The boilerplate extractor extracts only the text you want from HTML, removing unneeded clutter, like templates, navigation bars, ads, etc.

The email cleaner automatically removes email signatures, legal notices, and previous replies to give you only the most recent message in the chain.
And the opinion units extractor breaks up sentences or entire pages of text into individual sentiments or thoughts called "opinion units". It can break down hundreds of pages into thousands of opinion units automatically to prep your data for analysis.

4. Analyze & visualize sentiments in real-time

MonkeyLearn Studio is an all-in-one real-time sentiment analysis and visualization tool. After a simple set-up, you just upload your data and visualize the results for powerful insights.
4.1. Choose a MonkeyLearn Studio template
MonkeyLearn Studio allows you to chain together a number of text analysis techniques, like keyword extraction, aspect classification, intent classification, and more, along with your real-time sentiment analysis, for super fine-grained results.

If you want to learn how to build a custom sentiment analysis model to your specific criteria (and then use it with MonkeyLearn Studio), take a look at this tutorial. You can do it in just a few steps.
Once you're ready for MonkeyLearn Studio, you can choose an existing template or create your own.
4.2. Upload your data
You can upload cleaned text from a CSV or Excel file, connect to integrations with Zendesk, SurveyMonkey, etc.,
or use simple, low-code APIs to extract directly from social media, websites, email, and more.
4.3. Run your analysis

As you can see below, the model automatically tags the statement for Sentiment, Category, and Intent, all working simultaneously.
4.4. Automatically visualize your results with MonkeyLearn Studio

MonkeyLearn Studio's deep learning models are able to chain together a number of text analysis techniques in a seamless process, so you just set it up and let it do the work for you. Once your real-time sentiment analyzer is trained to your criteria, it can perform analysis 24/7 in real time.

Take a look at the MonkeyLearn Studio dashboard below. In this case we ran aspect-based sentiment analysis on customer reviews of Zoom. Each opinion unit is categorized by "aspect" or category (Usability, Support, Reliability, etc.), then each category is run through sentiment analysis to show opinion from positive to negative.
You can see how individual reviews have been pulled by date and time for real-time analysis, and to follow sentiment as it changes over time.

Another analysis, for "intent," shows the reason for the comment. This is more often used to analyze emails and customer service data. In this case, as this is an analysis of customer reviews, most are simply marked as "opinion."

The results are in! With sentiment analysis and MonkeyLearn Studio, you can be confident you're making real-time, data-driven decisions.
Imagine you release a new product. You can perform real-time aspect-based sentiment analysis on Twitter
mentions of your product, for example, to find out what aspect your customers are responding to most
favorably or unfavorably.
Play around with the public dashboard to see how it works: search by date, sentiment, category, etc. With
MonkeyLearn Studio you can perform new analyses and add or remove data directly in the dashboard. No more
uploading and downloading data between applications – it’s all right there.
Stock Market Prediction

Introduction
Stock market prediction and analysis are some of the most difficult jobs to complete. There are numerous causes for this, including market volatility and a variety of other dependent and independent variables that influence the value of a certain stock in the market. These variables make it extremely difficult for any stock market expert to anticipate the rise and fall of the market with great precision. Considered among the most potent tree-based techniques, Random Forest can predict stock prices, as it can also solve regression problems. However, with the introduction of Data Science, Machine Learning, and artificial intelligence and their strong algorithms, recent market research and stock price prediction work has begun to include such approaches in analyzing stock market data; this tutorial focuses on one of them, the long short-term memory (LSTM) network.
In summary, Machine Learning algorithms like regression, classifiers, and support vector machines (SVM) are widely utilized by many organizations in stock market prediction. This article will walk through a simple implementation of analyzing and forecasting stock prices (Microsoft Corporation) in Python using Machine Learning algorithms.
Learning Objectives

In this tutorial, we will learn about the best ways possible to predict stock prices using a long short-term memory (LSTM) network for time series forecasting.
We will learn everything about stock market prediction using LSTM.

Problem Statement for Stock Market Prediction

Let us see the data on which we will be working before we begin implementing the software to anticipate stock market values. In this section, we will examine the stock price of Microsoft Corporation (MSFT) as reported by the National Association of Securities Dealers Automated Quotations (NASDAQ). The stock price data will be supplied as a Comma Separated Values (.csv) file that may be opened and analyzed in Excel or a spreadsheet.

MSFT's stocks are listed on NASDAQ, and their value is updated every working day of the stock market. It should be noted that the market does not allow trading on Saturdays and Sundays; therefore, there is a gap between the two dates. The Opening Value of the stock, the Highest and Lowest values of that stock on the same day, as well as the Closing Value at the end of the day are all indicated for each date.
The Adjusted Close Value reflects the stock's value after dividends have been declared. Furthermore, the total volume of the stock traded in the market is provided. With this information, it is up to the Machine Learning/Data Scientist to look at the data and develop different algorithms that may extract patterns from the historical Microsoft Corporation stock values.

Long short-term memory (LSTM) is a deep learning artificial recurrent neural network (RNN) architecture. Its gates make minor changes to the information flowing through the network by multiplication and addition. Unlike traditional feed-forward neural networks, LSTM has feedback connections. It can handle single data points (such as pictures) as well as full data sequences (such as speech or video).
Program Implementation
We will now go to the section where we will utilize Machine Learning techniques in Python to estimate the stock
value using the LSTM.
Step 1: Importing the Libraries
As we all know, the first step is to import the libraries required to preprocess Microsoft Corporation stock data
and the other libraries required for constructing and visualizing the LSTM model outputs. We’ll be using the Keras
library from the TensorFlow framework for this. All modules are imported from the Keras library.
import numpy as np
import pandas as pd                    # pandas is needed for the DataFrame operations below
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
import matplotlib.dates as mdates
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import linear_model
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.utils import plot_model     # used later to visualize the network
H
Using the Pandas Data Reader library, we will upload the stock data from the local system as a Comma Separated
Value (.csv) file and save it to a pandas DataFrame. Finally, we will examine the data.
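A minimal sketch of that loading step; the file name is an assumption, standing in for the downloaded MSFT history.

#Get the Dataset (file name assumed for illustration)
df = pd.read_csv('MicrosoftStock.csv')
print("Dataframe Shape:", df.shape)
print("Null Value Present:", df.isnull().values.any())
df.head()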
In this step, firstly, we will print the structure of the dataset. We’ll then check for null values in the data frame to
ensure that there are none. The existence of null values in the dataset causes issues during training since they
function as outliers, creating a wide variance in the training process.
>> Dataframe Shape: (7334, 6)
>> Null Value Present: False

Date        Open      High      Low       Close     Adj Close
1990-01-03  0.621528  0.626736  0.614583  0.619792  0.449788

The Adjusted Close Value is the final output value that will be forecasted using the Machine Learning model. This figure indicates the stock's closing price on that particular day of stock market trading.

df['Adj Close'].plot()
Step 5: Setting the Target Variable and Selecting the Features

In the next step, the output column is assigned to the target variable. In this case, it is the Adjusted Close Value of Microsoft stock. Furthermore, we pick the features that serve as independent variables to the target variable (dependent variable). We choose four characteristics to account for training purposes:

Open
High
Low
Volume

#Set Target Variable
output_var = pd.DataFrame(df['Adj Close'])
#Selecting the Features
features = ['Open', 'High', 'Low', 'Volume']
Step 6: Scaling

To decrease the computational cost of the data in the table, we will scale the stock values to values between 0 and 1. As a result, all of the data in large numbers is reduced, and therefore memory consumption is decreased. Also, because the data is not spread out in huge values, we can achieve greater precision by scaling down. To perform this, we will be using the MinMaxScaler class of the scikit-learn library.

#Scaling
scaler = MinMaxScaler()
feature_transform = scaler.fit_transform(df[features])
feature_transform = pd.DataFrame(columns=features, data=feature_transform, index=df.index)
feature_transform.head()

As shown in the above table, the values of the feature variables are scaled down to lower values when compared to the real values given above.
Step 7: Creating a Training Set and a Test Set for Stock Market Prediction

We must divide the entire dataset into training and test sets before feeding it into the training model. The Machine Learning LSTM model will be trained on the data in the training set and tested for accuracy and backpropagation on the test set.

The scikit-learn library's TimeSeriesSplit class will be used for this. We set the number of splits to 10, indicating that roughly 10% of the data will be used as the test set and 90% of the data will be used to train the LSTM model. The advantage of utilizing this time series split is that the split time series data samples are examined at regular time intervals.
#Splitting to Training set and Test set
timesplit = TimeSeriesSplit(n_splits=10)
for train_index, test_index in timesplit.split(feature_transform):
    X_train, X_test = feature_transform[:len(train_index)], feature_transform[len(train_index): (len(train_index)+len(test_index))]
    y_train, y_test = output_var[:len(train_index)].values.ravel(), output_var[len(train_index): (len(train_index)+len(test_index))].values.ravel()

Step 8: Data Processing For LSTM

Once the training and test sets are finalized, we will input the data into the LSTM model. Before we can do that, we must transform the training and test set data into a format that the LSTM model can interpret. As the LSTM needs the data to be provided in 3D form, we first transform the training and test data into NumPy arrays and then restructure them to match the format (Number of Samples, 1, Number of Features). Here, 6667 is the number of samples in the training set (roughly 90% of 7334) and the number of features is 4; therefore, the training set is reshaped to (6667, 1, 4). Likewise, the test set is reshaped.

#Process the data for LSTM
trainX = np.array(X_train)
testX = np.array(X_test)
X_train = trainX.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = testX.reshape(X_test.shape[0], 1, X_test.shape[1])
Step 9: Building the LSTM Model

Finally, we arrive at the point where we construct the LSTM model. In this step, we build a Sequential Keras model with one LSTM layer. The LSTM layer has 32 units and is followed by one Dense layer of one neuron.

We compile the model using the Adam optimizer and Mean Squared Error as the loss function. For an LSTM model, this is a commonly preferred combination. The model is plotted and presented below.
#Building the LSTM Model
lstm = Sequential()
lstm.add(LSTM(32, input_shape=(1, trainX.shape[1]), activation='relu', return_sequences=False))
lstm.add(Dense(1))
lstm.compile(loss='mean_squared_error', optimizer='adam')
plot_model(lstm, show_shapes=True, show_layer_names=True)
Step 10: Training the Stock Market Prediction Model

Finally, we use the fit function to train the LSTM model created above on the training data for 100 epochs with a batch size of 8.
#Model Training
history = lstm.fit(X_train, y_train, epochs=100, batch_size=8, verbose=1, shuffle=False)
Epoch 4/100
834/834 [==============================] - 1s 2ms/step - loss: 21.5447
Epoch 5/100
834/834 [==============================] - 1s 2ms/step - loss: 6.1709
Epoch 6/100
834/834 [==============================] - 1s 2ms/step - loss: 1.8726
Epoch 7/100
834/834 [==============================] - 1s 2ms/step - loss: 0.9380
Epoch 8/100
834/834 [==============================] - 2s 2ms/step - loss: 0.6566
Epoch 9/100
834/834 [==============================] - 1s 2ms/step - loss: 0.5369
Epoch 10/100
834/834 [==============================] - 2s 2ms/step - loss: 0.4761
.
.
.
.
Epoch 95/100
834/834 [==============================] - 1s 2ms/step - loss: 0.4542
Epoch 96/100
834/834 [==============================] - 2s 2ms/step - loss: 0.4553
Epoch 97/100
834/834 [==============================] - 1s 2ms/step - loss: 0.4565
Epoch 98/100
834/834 [==============================] - 1s 2ms/step - loss: 0.4576
Epoch 99/100
834/834 [==============================] - 1s 2ms/step - loss: 0.4588
Epoch 100/100
834/834 [==============================] - 1s 2ms/step - loss: 0.4599
Finally, we can observe that the loss value has dropped steadily over the course of the 100-epoch training process.

Step 11: Making Predictions on the Test Set

Now that we have our model ready, we can use it to forecast the Adjusted Close Value of the Microsoft stock using the model trained with the LSTM network on the test set. This is accomplished by employing the simple predict function on the LSTM model that has been built.
#LSTM Prediction
y_pred= lstm.predict(X_test)
Step 12: Comparing Predicted vs True Adjusted Close Value – LSTM
Finally, now that we’ve projected the values for the test set, we can display the graph to compare both Adj Close’s
true values and Adj Close’s predicted value using the LSTM Machine Learning model.
#Predicted vs True Adj Close Value – LSTM
plt.plot(y_test, label='True Value')
plt.plot(y_pred, label='LSTM Value')
plt.title("Prediction by LSTM")
plt.xlabel('Time Scale')
plt.ylabel('Scaled USD')
plt.legend()
plt.show()
The graph above demonstrates that the extremely basic single-LSTM-network model created above detects some patterns. We may get a more accurate depiction of a specific company's stock value by fine-tuning the model's hyperparameters and adding more LSTM layers.